Quote:
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
All this talk about C++ makes me yearn for simpler days... days when people used pure C. And why not? C is still the way to go for portability; it is not called the portable assembler for nothing.
C++ is a complex language. The compiler is naturally complex too. But C is very simple. Its compiler is naturally simpler too. And that means there are more free or open-source C compilers.
And there are a bunch of them. Some of the modern, free, simple yet good-enough ones:
The standard test to see if a compiler is good-enough is whether it is self-compiling. :lol: And these compilers pass, of course.
Note that LCC is only free for non-commercial use.
LCC and PCC are typical compilers. For example, PCC has 4 distinct stages: preprocessor, compiler, assembler and linker.
TCC is a one-pass compiler with an integrated assembler and linker. It is both small and fast. The compiler on Win32 is just 132 kB. That's how small a C compiler can be.
TCC has three unique selling points:
Of course, TCC trades off runtime performance for its blazing compilation speed. The generated code is slower than GCC's, but it is still much faster than any interpreter.
I am very excited about TCC: it has huge potential to augment scripting languages. Scripting languages are slow, but for performance critical code, a script could generate parameterized C code and execute that instead. That'll solve most performance issues.
(People have done this, but it is not a common technique.)
We need to go back to the very beginning to understand why C++ is the way it is.
C++ was originally intended as a way to bring Object-Oriented Programming (OOP) to C. The first few versions were implemented as a C preprocessor phase, outputting pure C code.
This is where the famous "you only pay for what you use" phrase comes from. If you don't use any C++ features, it should be as efficient as C. Sadly, this principle has not been true for a long time, but the phrase still lives on.
C++ has standard OO features such as encapsulation, single/multiple inheritance, polymorphism, object-centric memory allocation and function overloading.
Other than virtual functions, C++ abhors anything that increases run-time costs, so it has no polymorphic constructors, no multi-methods, no automatic garbage collection, and it started without run-time type information (RTTI). C++ now has RTTI, but it is not fully exposed. Otherwise, reflection and serialization would have been simple.
(Of course, it turned out that exceptions not only require run-time support, they also cannot be expressed entirely in C.)
One additional goal was to create objects that work identically to built-in types. That's why we have operator overloading, references and implicit type conversion functions.
Two examples of such classes: numbers and strings. Complex numbers and matrices are two popular textbook examples. Strings are the holy grail. Unfortunately, none of these classes work seamlessly — even today.
For a short while, a reference-counted string class seemed like a neat idea. It turned out otherwise, due to two problems: (i) it is not multi-threading safe, and (ii) detecting and handling C-O-W (copy-on-write) is a big issue.
Operator overloading was also seen as a cool way to build "extensible" components. The C++ I/O stream is the first example. Unfortunately, you need to write a lot of code to do formatted I/O — which printf() is best at.
Another goal was to build type-safe containers. That's what templates are meant for. A "true" array is one of the goals. Unfortunately, it is next to impossible to implement virtualized arrays due to lack of properties. (Not that people have not tried; they used proxy classes — leading to very convoluted code for such a simple thing.)
A sub-goal is to allow template specialization for increased efficiency. An array of booleans is inefficient, so let's specialize it to be an array of bitfields. This turned out to be a really bad idea, because everyone who declared vector<bool> got bitten by the unexpected "optimization".
To sum up: each feature is quite nice and simple on its own, but the features interact poorly and combine into an ugly monster.
Updated: added C-O-W strings.
I used to like C++ a lot. If you are optimistic, you'll think it inherited many of C's strengths. However, C ended up as a liability: OO does not mix well with manual memory management.
C++ is not an easy language to program in. Sure, you can learn it easily, but you will fall into one of its many traps just as easily — and not know it.
That's where books like Effective C++ and More Effective C++ by Scott Meyers come in. They show you the traps so you can avoid them. These two books are mandatory reading.
Later, Exceptional C++ and More Exceptional C++ came out, covering mainly exception safety. Exceptions are extremely hard to get right, but they are needed if you really want constructors to work the C++ way. (There is no way to signal a failure otherwise.) There are two implications: one, once you use exceptions in one place, you have to think about exceptions in the entire program. Two, exceptions don't play nice with manual memory management.
Then more C++ books, such as Modern C++ Design, appeared. And I found that I could not understand them! Why? They talk about template meta-programming: basically, "abusing" the compile-time template machinery to do fancy stuff.
(Great stuff, except they forgot to mention the downsides: hard to understand, extremely cryptic error messages, very slow compile times.)
This was the turning point. I realized C++ was just too complicated for me. There are just too many gotchas, even when using a subset of it.
My key issues with C++:
So don't use raw pointers and manual memory management? Unfortunately, you cannot avoid the old feature-set entirely, and the new feature-set does not play nice with the old ones.
I finally concluded that C++ itself was making it hard to program. I looked at alternatives. In the end, I settled for plain C for simplicity and speed, and dynamic scripting languages (JavaScript and PHP) for other stuff. And I write other stuff much more often. I became much happier and never looked back.
The C++ FQA Lite (not a typo) explains what's wrong with C++ very well. Google and read it. If you aren't a C++ programmer, it will save you years of wasted experimentation to "get it to work" like other C++ programmers claim they do. (Reality: everyone is struggling.) If you are a C++ programmer, it may let you see the light, provided you aren't too brainwashed.
An interpretive emulator spends much of its time decoding instructions, and it must do so for every instruction executed. What if we decode each instruction just once — will it be faster?
Let's take a look at three MIPS R3000A instructions:
```c
static void instrADD(void)
{
    if(_Rd_ != 0)
        _rRd_ = _rRs_ + _rRt_;
}

static void instrADDIU(void)
{
    if(_Rt_ != 0)
        _rRt_ = _rRs_ + _Imm_;
}

static void instrBEQ(void)
{
    if(_rRs_ == _rRt_)
        doBranch(_BranchTarget_);
}
```
They look simple, but they are full of bitwise operations.
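As a sketch of what those macros hide (the function names here are mine, not the emulator's; the field layout is the standard MIPS I encoding), the per-field extraction looks like this:

```c
#include <stdint.h>

/* Field extraction for a raw 32-bit MIPS I instruction word.
 * Layout: op[31:26] rs[25:21] rt[20:16] rd[15:11] imm[15:0]. */
static uint8_t mipsOp(uint32_t code)  { return (code >> 26) & 0x3f; }
static uint8_t mipsRs(uint32_t code)  { return (code >> 21) & 0x1f; }
static uint8_t mipsRt(uint32_t code)  { return (code >> 16) & 0x1f; }
static uint8_t mipsRd(uint32_t code)  { return (code >> 11) & 0x1f; }
static int32_t mipsImm(uint32_t code) { return (int16_t)(code & 0xffff); } /* sign-extended */
```

Every one of these shifts and masks runs for every instruction the interpretive loop executes.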
A decoded MIPS R3000A instruction will look like this:
```c
typedef struct {
    uint8_t  op;
    uint8_t  rs, rt, rd;
    uint32_t immed;
} PsxDecodedInstr_t;
```
If we decode the instruction as soon as we load it into the code cache, the execution loop will look like this:
```c
while(1) {
    const PsxDecodedInstr_t *instr = psxCpuGetDecodedInstr(psxRegs.pc);
    psxRegs.pc += sizeof(uint32_t);
    ++psxRegs.cycle;
    (*instrFnTbl[instr->op]) (instr);
}
```
ADD, ADDIU and BEQ will look like this:
```c
#define GPR(r) psxRegs.gpr[instr->r]

static void FASTCALL instrADD(const PsxDecodedInstr_t *instr)
{
    GPR(rd) = GPR(rs) + GPR(rt);
}

static void FASTCALL instrADDIU(const PsxDecodedInstr_t *instr)
{
    GPR(rd) = GPR(rs) + instr->immed;
}

static void FASTCALL instrBEQ(const PsxDecodedInstr_t *instr)
{
    if(GPR(rs) == GPR(rt))
        doBranch(instr->immed);
}
```
Essentially, we have replaced a bunch of bitwise operations with a few memory operations.
And the problem? We become memory bound (or more likely, L1 or L2 bound). If we are L1-bound, we could have run 3-4 instructions "for free" in place of the cache miss. If we are L2-bound, we could have run 10-12 instructions!
We are not making use of the "free" execution slots for the pre-decoded instructions. That's why I think it will not be any faster.
Moral of the story: fewer instructions do not necessarily mean faster code. Optimizing for modern CPUs can be tricky.
Answer: very.
| Size | Read | Miss | Hit % |
|---|---|---|---|
| 256 bytes | 157,250 | 29,397 | 25.2% |
| 1 kB | 210,727 | 9,433 | 82.1% |
| 4 kB | 265,368 | 4,253 | 93.6% |
| 16 kB | 230,456 | 2,354 | 95.9% |
The cache is very effective once the working set fits. Note that a cache miss fills an entire 16-byte cache line (four instructions), which is why the miss count is multiplied by 4 when computing the hit rate.
I'm very surprised by the result. I had not expected the code cache to be so effective, considering it is a simple direct-mapped cache. (Most modern caches are 2-way or 4-way set associative.)
How do I vary the code cache size and gather such statistics? Ah, the marvel of emulation. ;-)
With this finding, I'm considering another approach to dynamic recompilation: decode the instructions when filling the code cache. We only need to pay for the decoding once. But I doubt it will be faster; I will explain in another entry.
Everyone takes it for granted that an emulator is slow. And almost everyone would suggest using dynamic recompilation to speed it up.
Dynamic recompilation takes care of the instruction decoding/execution bottleneck, but other bottlenecks remain.
Consider:
Suppose the memory map looks like this:
| Start | End | Region |
|---|---|---|
| 0x0000_0000 | 0x001f_ffff | Memory; cached access |
| 0x1f00_0000 | 0x1f00_ffff | Expansion slot |
| 0x1f80_0000 | 0x1f80_03ff | Scratch pad |
| 0x1f80_1000 | 0x1f80_2fff | H/w registers |
| 0x8000_0000 | 0x801f_ffff | Cached memory mirror |
| 0xa000_0000 | 0xa01f_ffff | Uncached memory mirror |
| 0xbfc0_0000 | 0xbfc7_ffff | BIOS; read-only |
There is only 2 MB of RAM, but virtually the entire 32-bit address space is used. We can't just malloc 4 GB of memory and be done with it. While that would work for the RAM and the BIOS, it still doesn't take care of the memory-mapped I/O and the mirrored memory regions.
A naive memory read will look like this:
```c
uint8_t psxMemRead8(uint32_t addr)
{
    if(IN_RANGE(addr, 0x00000000, 0x001fffff)) return mem[addr];
    if(IN_RANGE(addr, 0x1f000000, 0x1f00ffff)) return expansionMem[addr - 0x1f000000];
    if(IN_RANGE(addr, 0x1f800000, 0x1f8003ff)) return scratchMem[addr - 0x1f800000];
    if(IN_RANGE(addr, 0x1f801000, 0x1f802fff)) return psxHwRead8(addr);
    if(IN_RANGE(addr, 0x80000000, 0x801fffff)) return mem[addr - 0x80000000];
    if(IN_RANGE(addr, 0xa0000000, 0xa01fffff)) return mem[addr - 0xa0000000];
    if(IN_RANGE(addr, 0xbfc00000, 0xbfc7ffff)) return biosMem[addr - 0xbfc00000];
    return 0;
}
```
Given that we need to read each instruction from the emulated memory, you can already tell that this is going to be really slow — especially if we are executing code from the BIOS.
We want to access emulated memory in constant time, and the way to achieve that is to use paging.
Once we set up the paging LUT (Look-Up Table), memory read looks like this:
```c
uint8_t *memReadLut[1 << 20];

uint8_t psxMemRead8(uint32_t addr)
{
    return memReadLut[addr >> 12][addr & 0x0fff];
}
```
Very neat, isn't it? There are two downsides:
Note that while this looks fast, it is still quite slow: we need two bitwise operations and two memory reads to read one emulated memory location. To be really fast, we need to fold the LUT away, and that can only be done by analysing the emulated instructions.
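The post shows the read path but not how the LUT gets filled. Here is a minimal sketch of the setup, with my own assumptions: 4 kB pages, only the RAM, its two mirrors and the BIOS mapped (the sub-page scratch pad and the I/O registers need separate handling), and unmapped pages left NULL.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1 << PAGE_SHIFT)

static uint8_t  mem[0x200000];      /* 2 MB RAM */
static uint8_t  biosMem[0x80000];   /* 512 kB BIOS */
static uint8_t *memReadLut[1 << 20];

/* Point every page of [base, base+size) at the backing store. */
static void mapRegion(uint32_t base, uint8_t *backing, uint32_t size)
{
    for(uint32_t off = 0; off < size; off += PAGE_SIZE)
        memReadLut[(base + off) >> PAGE_SHIFT] = backing + off;
}

static void psxMemInit(void)
{
    mapRegion(0x00000000, mem,     sizeof mem);      /* RAM             */
    mapRegion(0x80000000, mem,     sizeof mem);      /* cached mirror   */
    mapRegion(0xa0000000, mem,     sizeof mem);      /* uncached mirror */
    mapRegion(0xbfc00000, biosMem, sizeof biosMem);  /* BIOS            */
}

uint8_t psxMemRead8(uint32_t addr)
{
    return memReadLut[addr >> PAGE_SHIFT][addr & (PAGE_SIZE - 1)];
}
```

Since all three mirror regions point into the same backing array, mirroring comes for free: a write through one mirror is visible through the others.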
One simple way to allow memory-mapped I/O is to check for it:
```c
uint8_t psxMemRead8(uint32_t addr)
{
    if(IN_RANGE(addr, 0x1f801000, 0x1f802fff))
        return psxHwRead8(addr);
    return memReadLut[addr >> 12][addr & 0x0fff];
}
```
It works, except that it slows down every memory access. We need one comparison (yes, just one) and one branch.
One way to handle all memory transparently is to use a function LUT to handle the different types of memory:
```c
uint8_t (*memRead8FnLut[16]) (uint32_t addr);
uint8_t memReadLut[1 << 20];

uint8_t psxMemRead8(uint32_t addr)
{
    return (*memRead8FnLut[memReadLut[addr >> 12]]) (addr);
}
```
Let's see if you can understand this code. :-D
You can argue that memory access is even slower now, because we need one function call and three memory reads just to read one emulated memory location!
The heart of the interpretive emulator looks like this:
```c
while(1) {
    uint32_t code = psxMemRead(psxRegs.pc);
    psxRegs.pc += sizeof(uint32_t);
    ++psxRegs.cycle;
    execInstr(code);
    handleEvents();
}
```
Events are stuff like timer, DMA and interrupts. For example, the VSync interrupt triggers every 16.68 ms (or 565,045 clock cycles at 33.8688 MHz) for NTSC displays.
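The quoted figures check out; here is the back-of-envelope arithmetic (my own, assuming the 33.8688 MHz CPU clock and the 59.94 Hz NTSC field rate):

```c
/* Cycles between VSync interrupts = CPU clock / field rate.
 * 33,868,800 / 59.94 ~= 565,045 cycles (~16.68 ms). */
static long vsyncCycles(void)
{
    double clock_hz = 33868800.0;  /* 33.8688 MHz */
    double vsync_hz = 59.94;       /* NTSC field rate */
    return (long)(clock_hz / vsync_hz + 0.5);
}
```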
However, checking for events at every instruction is just too much overhead. A good compromise is to check on every taken branch (including calls). This cuts the checking down to 1-in-10 to 1-in-20 instructions — branches are pretty common.
Even so, it is still way too high. A better approach is to execute until the next scheduled event:
```c
while(1) {
    uint32_t cycles = getCyclesToNextEvent();
    cycles = CAP(cycles, CYCLES_LIMIT);
    while(cycles-- > 0) {
        uint32_t code = psxMemRead(psxRegs.pc);
        psxRegs.pc += sizeof(uint32_t);
        ++psxRegs.cycle;
        execInstr(code);
    }
    handleEvents();
}
```
We cap the execution cycles to avoid delaying a new short-lived event. With this, we now only check for events every few hundred instructions.
Just having a dynamic recompiler is not sufficient to make an emulator fast. We need an efficient design for the memory, I/O and events too.
If I were to design a 4-byte CPU instruction set, I would base it on the MIPS instruction set. There are some things I like about it:
What don't I like?
First, I'll get rid of the branch delay slot (BDS). This is one of MIPS's most distinctive features, yet it is also its downfall. It complicates superscalar and out-of-order execution. A good alternative to BDS is a branch predictor.
Second, I feel 32 registers is too many; 16 is a good balance, and we then need only 4 bits per operand. I'll make all of them general-purpose: no hardwired 0, no RA (return address), no SP (stack pointer).
A hardwired 0 is useful, but it ties up valuable encoding space. We'll use a dedicated MOV instruction instead.
I'll make the return address go onto the stack. Most functions need to save it there anyway; only leaf functions don't.
It is not necessary to access SP often, so let's make it non-general purpose. We'll use separate instructions to access it. This makes more sense than tying up one valuable register. We will still allow indirect SP addressing, though.
Third, we should be able to load a 32-bit immediate directly. This means some instructions must be at least 5 bytes long. And once we consider variable length instructions, we may want to allow 2, 4 and 6-byte instructions. And that usually implies 2-operand mode so that they can be encoded in 2 bytes.
Other nice-to-have:
The end result will be very un-MIPS-like, looking closer to ColdFire or ARM Thumb-2!
We know that main memory is slow compared to the CPU itself. But just how slow?
Stats for a Core i5 650 (3.6 GHz):
| Level | Size | Cycles |
|---|---|---|
| L1 | 32 kB | 4 |
| L2 | 256 kB | 7-10 |
| L3 | 4 MB (shared) | 40-50 |
| Mem | >1 GB | 250-350 |
I'm so out of touch. I thought main memory takes only 50 cycles at most.
The huge differential is why simple linear data structures win over complex linked-list data structures. Not surprisingly, the resizable array is now a very popular data structure.
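A minimal growable-array sketch (names are mine, not from any particular library): doubling the capacity keeps appends amortized O(1), and the elements stay contiguous — which is exactly what the cache likes.

```c
#include <stdlib.h>

typedef struct {
    int    *data;
    size_t  len, cap;
} IntVec;

/* Append one element, doubling the backing store when full. */
static void vecPush(IntVec *v, int x)
{
    if(v->len == v->cap) {
        size_t newCap = v->cap ? v->cap * 2 : 8;
        int *p = realloc(v->data, newCap * sizeof *p);
        if(p == NULL)
            abort();  /* out of memory; real code would report it */
        v->data = p;
        v->cap  = newCap;
    }
    v->data[v->len++] = x;
}
```

Iterating over `v.data[0..len)` is then a pure linear scan — no pointer chasing, no cache misses beyond the compulsory ones.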
I'm also somewhat surprised that the L1 cache is slower than registers. I took it for granted it would take just 1 cycle to access it. Apparently it is not the case — for CPUs with complex micro-architectures, at least.
This means pure register code is most likely faster. This can be difficult to achieve on x86 with just 7 general-purpose registers (GPR). (No, SP is not a GPR.) However, it should be relatively simple on x86-64 with 15 GPRs.
Taking a weekend plane flight from Singapore to KL and back is surprisingly expensive: S$183 all in. An equivalent bi-directional bus trip is S$100.
In contrast, a Thurs-Sat flight can cost as little as S$48! (All-in.)
But there is another method: driving. Petrol costs S$22 and toll S$20. While it may not sound exactly cheap (S$88 all-in), it scales really well: the cost remains the same for 2-5 people!
But a plane flight is faster, right? 45 minutes vs 3.5 hours.
But that is deceptive because of all the waiting time. You need to check in an hour earlier, then it takes 10-15 minutes to board the plane, and finally it takes another 10-15 minutes before the plane takes off.
Plus, you most likely need to take a cab four times (to and from the airport at both ends). The cab fare from LCCT/KLIA to KL alone can cost S$32.
Conclusion: driving is still the better option.
In 2006, people were ready for a change. People turned up in record numbers at opposition rallies. It was not reported in the media, of course, but the photos circulated freely online. Some people were awed by what they saw and wondered if the photos were doctored. I knew for a fact that they weren't — because I was there.
I still remember some memorable quotes:
In the end, nothing happened because the people bought PAP's promises of change hook, line and sinker.
But the next four years got worse for the citizens. Record immigration. Record cost-of-living. Singapore changed from a-good-place-to-stay to a-good-place-to-stay — if you are rich.
Let's see if the people are fed up this time.
The C switch statement has a reputation as syntactic sugar for nested if-else statements. It is not hard to see why:
```c
switch(v) {
case 0: fn0(); break;
case 1: fn1(); break;
case 2: fn2(); break;
}
```
is equivalent to:
```c
if(v == 0)
    fn0();
else if(v == 1)
    fn1();
else if(v == 2)
    fn2();
```
The problem with nested ifs is that the running time is linear in the number of cases. Thus, a common "optimization" is to replace a switch with a call table:
```c
void (*call_tbl[]) (void) = { fn0, fn1, fn2 };

(*call_tbl[v]) ();
```
That should be faster, right?
That may have been true 15 years ago, but today's compilers can optimize the switch statement rather aggressively:
The moral of the story: take a look at the compiler output before you decide a call table is "faster" than a switch. It may not be.
I have forgotten a lot about C.
I wanted to hold a function pointer in a data pointer. The compiler flags a warning. It works, of course, but I don't want that warning. Typecasting can get rid of it, but take note: typecasting is evil. I'd rather have the warning than hide it behind a cast.
So I googled to find the proper solution. And I found that it is actually undefined behaviour.
Now, I'm more surprised that I forgot data and function pointers are distinct than that the conversion is undefined behaviour. I vaguely remember that I once knew, but now I don't.
Since I want to conform to Standard C, I made it a union of a data pointer and a function pointer. But I still have a problem: I'm unable to initialize it statically, because in C89 you can only initialize the first field of a union. I'll still get warnings half the time.
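For what it's worth, C99 designated initializers lift that C89 restriction — a sketch, with made-up names:

```c
#include <stddef.h>

typedef void (*fn_t)(void);

typedef union {
    void *data;   /* first member: the only one C89 can initialize */
    fn_t  fn;
} AnyPtr;

static void hello(void) {}

static AnyPtr a = { NULL };         /* C89 style: initializes .data only */
static AnyPtr b = { .fn = hello };  /* C99: picks any member cleanly     */
```

Reading `b.fn` back gives `hello` without a single cast, and no conversion between data and function pointers ever takes place.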
Argh!
C is supposed to be a portable language, but in practice, C code is seldom directly portable. In its quest to be efficient, the language leaves many behaviours implementation-defined. (As opposed to undefined behaviour, which the Standard does not constrain at all.)
Consider the following:
Will your program still run correctly?
The good news is, most modern CPU architectures are similar and most C compilers compile similarly, so most C code still ports easily. For example, NULL is indeed numeric 0 on most general-purpose CPUs, and all modern CPUs use two's complement arithmetic.
So what are the main gotchas? Big vs little endian and 32-bit vs 64-bit platforms.
Padding in structures is usually not a problem until we need to change the padding for just some structures. It can be worked around, but we need to use compiler-specific techniques, which in turn requires using the preprocessor to keep the code portable.
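A sketch of the usual preprocessor dance (the PACKED macro name is mine; the pragmas are the common GCC/Clang and MSVC spellings):

```c
#include <stdint.h>
#include <stddef.h>

#if defined(_MSC_VER)
#  pragma pack(push, 1)
#  define PACKED
#elif defined(__GNUC__)
#  define PACKED __attribute__((packed))
#else
#  define PACKED  /* unknown compiler: hope the default layout matches */
#endif

/* A wire-format header that must be exactly 5 bytes: no padding
 * between the 1-byte tag and the 4-byte value. */
typedef struct {
    uint8_t  tag;
    uint32_t value;
} PACKED WireHeader;

#if defined(_MSC_VER)
#  pragma pack(pop)
#endif
```

Without the packing directives, most compilers would place `value` at offset 4, making the struct 8 bytes — fine in memory, wrong on the wire.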
I don't claim nor want to write a perfectly portable C program. However, I also don't want to restrict myself needlessly — which many C programmers do by using wrong assumptions.
sizeof doesn't need parentheses, as it is an operator, not a function call. I use them anyway, because I find the code easier to read.
Function pointers are implicitly dereferenced, so a = (*fn_ptr) (b, c); can be written as a = fn_ptr(b, c);. I still use the first form exclusively, though, because I like to distinguish between direct and indirect function calls. Call me old-fashioned.
Some programmers like to put parentheses around return values. I absolutely hate it. I can't stand this: return (0);. Some of them thought the parentheses were necessary. Some put them in "to be safe". Safe from what? It is a carry-over from their usual overuse of parentheses in normal expressions. Imagine, even this needs parentheses: a = (b * c) + d; — which just leaves me shaking my head. My response is always the same: learn the C operator precedence table.
I'm referring to the famous 70-minute review of The Phantom Menace by RedLetterMedia, of course.
He has since released his reviews for episodes 2 and 3. They are still pretty good, but they are not as brilliant as the first one. I would rate the first one A+, the second and third B+.
It was only after I watched all three reviews and saw the actual behind-the-scenes footage that it dawned on me why the Star Wars prequels are so bad: almost every scene is a green-screen shot.
There is no variety. I think all of us detect it subconsciously. Every actor is always relaxed, always standing or strolling on flat ground. It's terrible.
For the record, I never watched the third prequel. (I gave up after watching the second movie.) It was said to be the least bad of the three. Looking at the review, I think it is still very bad. Basically, the prequels are soulless; they are dry and boring.
In Budget 2011, the Singapore Government increased the employer CPF contribution by 0.5%. It goes into the Special Account (SA). While this sounds nice in a headline, it is useless to the employee: the SA is money we can almost never touch. As such, it is like a tax (if you don't already consider CPF a tax of some sort).
Whenever the Government tops up the CPF, it is almost always the SA. You can only get money out of it after the official retirement age, and even then only in monthly payouts. I'm always amused by it.
The CPF contribution cap is now raised to $5,000. I remember the rationale for lowering it from $6,000 to $4,500 (progressively) was to align it with the 80th percentile income. This basically tells us that the 80th percentile income is now $5,000.
As part of Budget 2011, the Singapore Government finally scrapped the TV/radio license! Finally, I can put the radio back in my car and buy a TV.
Call me cheap, but I'm against paying $26 for the car radio license and $110 for the TV license — for things I don't use. I neither listen to the radio nor watch TV. However, I do listen to CDs and watch DVDs.
The Government says it will lose $120 million. That's barely 1.1 million households. Strange that Singapore has imported so many people and the household number remains about the same. Hmm..
The "high profile" security firm HBGary was hacked after its CEO claimed he was about to expose Anonymous, the group that was DoS'ing organizations that cut off WikiLeaks. And how?
Turns out to be easy:
It all starts with the website allowing SQL injection — the mother of all Internet security loopholes.
What does this tell us?
Security is hard, because people know what to do, but they still don't do it.
My comments:
Branching in a branch delay slot (BDS) is undefined behaviour for the MIPS R3000 CPU. And most MIPS docs just leave it at that — even the famed See MIPS Run.
The good news is, people have done experiments and found the exact behaviour:
| Case | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| | jmp A | jmp A | jmp A | jmp A |
| | 1 | jmp B | jmp B | jmp B |
| | X | X | X | X |
| | X | X | X | X |
| A: | 2 | 1 | jmp C | jmp C |
| | 3 | X | X | X |
| | 4 | X | X | X |
| | 5 | X | X | X |
| B: | | 2 | 1 | jmp D |
| | | 3 | X | X |
| | | 4 | X | X |
| | | 5 | X | X |
| C: | | | 2 | 1 |
| | | | 3 | X |
| | | | 4 | X |
| | | | 5 | X |
| D: | | | | 2 |
| | | | | 3 |
| | | | | 4 |
| | | | | 5 |

(Numbers show the execution order; X marks instructions that are never executed.)
The first one is legal code. The other three show the effect of a taken branch in the BDS. (I think the pattern continues.)
Case 2 can be used to execute a single instruction, but with a caveat: if an interrupt occurs in the BDS, the program will crash because the return address is adjusted wrongly — when the BDS flag is set, the CPU automatically subtracts 4 from the PC to re-execute the branch.
The table is only correct for absolute jumps. For relative jumps, we need to offset by +4 as the PC is incorrect at the point of the delayed branch:
| | Case |
|---|---|
| | beq A |
| | beq B |
| | X |
| | X |
| A: | 1 |
| | X |
| | X |
| | 2 |
| B: | 3 |
| | 4 |
A branch in the BDS is more tricky than it seems.
This is the first design to really catch my attention. And the best thing is that they are budget-class rings — although they still exceed the original budget by 50%.
Inflation is showing up slowly. Even official statistics show it — despite Governments' best statistical tricks.
Last weekend, I paid $4.50 for two slices of bread with kaya, two semi-boiled eggs and a cup of tea. This is an increase of 30 cents. I believe I paid $3.50 just less than a decade ago. (Note: same store. I know you can get the same deal for $3–$3.50 elsewhere today.)
Won't be going back soon. Most likely I'll prepare my own breakfast in the future. (My budget for breakfast is $1.50. :-D)
I am waiting eagerly for the first country to raise their interest rate. This is the standard method to counter inflation, yet no one can do it due to hot money — I don't understand this part very well.
Anyway, I wonder who will do it first? A couple of countries have revolted. It won't be long.
First, you will do everything possible. If that doesn't work, you will do the impossible, then the unthinkable, then the unimaginable.
Someone said this regarding "Helicopter" Ben, but I think it's a cool quote.
It is something to keep in mind when facing a difficult task or a task with no answers.