Quote:
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
All this talk about C++ makes me yearn for simpler days... days when people used pure C. And why not? C is still the way to go for portability; it is not called the portable assembler for nothing.
C++ is a complex language. The compiler is naturally complex too. But C is very simple. Its compiler is naturally simpler too. And that means there are more free or open-source C compilers.
And there are a bunch of them. Some of the modern, free, simple yet good-enough ones:
The standard test to see if a compiler is good-enough is whether it is self-compiling. :lol: And these compilers pass, of course.
Note that LCC is only free for non-commercial use.
LCC and PCC are typical compilers. For example, PCC has 4 distinct stages: preprocessor, compiler, assembler and linker.
TCC is a one-pass compiler with an integrated assembler and linker. It is both small and fast. The compiler on Win32 is just 132 kB. That's how small a C compiler can be.
TCC has three unique selling points:
Of course, TCC trades off runtime performance for its blazing compilation speed. The generated code is slower than GCC's, but it is still much faster than any interpreter.
I am very excited about TCC: it has huge potential to augment scripting languages. Scripting languages are slow, but for performance critical code, a script could generate parameterized C code and execute that instead. That'll solve most performance issues.
(People have done this, but it is not a common technique.)
We need to go back to the very beginning to understand why C++ is the way it is.
C++ was originally intended as a way to bring Object-Oriented Programming (OOP) to C. The first few versions were implemented as a C preprocessor phase, outputting pure C code.
This is where the famous "you only pay for what you use" phrase comes from. If you don't use any C++ features, it should be as efficient as C. Sadly, this principle has not been true for a long time, but the phrase still lives on.
C++ has standard OO features such as encapsulation, single/multiple inheritance, polymorphism, object-centric memory allocation and function overloading.
Other than virtual functions, C++ abhors anything that increases run-time costs, so it has no polymorphic constructors, no multi-methods, no automatic garbage collection, and it started without run-time type information (RTTI). C++ now has RTTI, but it is not fully exposed. Otherwise, reflection and serialization would have been simple.
(Of course, it turned out that exceptions not only require run-time support, they also cannot be expressed entirely in C.)
One additional goal was to create objects that work identically to built-in types. That's why we have operator overloading, references and implicit type conversion functions.
Two examples of such classes: numbers and strings. Complex numbers and matrices are two popular textbook examples. Strings are the holy grail. Unfortunately, none of these classes work seamlessly — even today.
For a short while, a reference-counted string class seemed like a neat idea. It turned out otherwise, due to two problems: (i) it is not multi-threading safe, and (ii) detecting and handling C-O-W (copy-on-write) is a big issue.
Operator overloading was also seen as a cool way to build "extensible" components. The C++ I/O stream is the first example. Unfortunately, you need to write a lot of code to do formatted I/O — which printf() is best at.
Another goal was to build type-safe containers. That's what templates are meant for. A "true" array is one of the goals. Unfortunately, it is next to impossible to implement virtualized arrays due to lack of properties. (Not that people have not tried; they used proxy classes — leading to very convoluted code for such a simple thing.)
A sub-goal is to allow template specialization for increased efficiency. An array of booleans is inefficient, so let's specialize it to be an array of bitfields. This turned out to be a really bad idea, because everyone who declared vector<bool> got bitten by the unexpected "optimization".
To sum up: each feature is quite nice and simple on its own, but the features interact poorly and combine into an ugly monster.
Updated: added C-O-W strings.
I used to like C++ a lot. If you are optimistic, you'll think it inherited many of C's strengths. However, C ended up as a liability: OO does not mix well with manual memory management.
C++ is not an easy language to program in. Sure, you can learn it easily, but you will fall into one of its many traps just as easily — and not know it.
That's where books like Effective C++ and More Effective C++ by Scott Meyers come in. They show you the traps so you can avoid them. These two books are mandatory reading.
Later, Exceptional C++ and More Exceptional C++ came out, covering mainly exception safety. Exceptions are extremely hard to get right, but they are needed if you really want constructors to work the C++ way. (There is no way to signal a failure otherwise.) There are two implications: one, once you use exceptions in one place, you have to think about exceptions in the entire program. Two, exceptions don't play nice with manual memory management.
Then more C++ books, such as Modern C++ Design, appeared. And I found that I could not understand them! Why? They talk about template meta-programming: basically, "abusing" the compile-time template machinery to do fancy stuff.
(Great stuff, except they forgot to mention the downsides: hard to understand, extremely cryptic error messages, very slow compile times.)
This was the turning point. I realized C++ was just too complicated for me. There are just too many gotchas, even when using a subset of it.
My key issues with C++:
So don't use raw pointers and manual memory management? Unfortunately, you cannot avoid the old feature-set entirely, and the new feature-set does not play nice with the old ones.
I finally concluded that C++ itself was making it hard to program. I looked at alternatives. In the end, I settled for plain C for simplicity and speed, and dynamic scripting languages (JavaScript and PHP) for other stuff. And I write other stuff much more often. I became much happier and never looked back.
The C++ FQA Lite (not a typo) explains what's wrong with C++ very well. Google and read it. If you aren't a C++ programmer, it will save you years of wasted experimentation to "get it to work" like other C++ programmers claim they do. (Reality: everyone is struggling.) If you are a C++ programmer, it may let you see the light, provided you aren't too brainwashed.
An interpretive emulator spends much of its time decoding instructions, and it must do so for every instruction executed. What if we decode each instruction just once — will it be faster?
Let's take a look at three MIPS R3000A instructions:
```c
static void instrADD(void)
{
    if(_Rd_ != 0)
        _rRd_ = _rRs_ + _rRt_;
}

static void instrADDIU(void)
{
    if(_Rt_ != 0)
        _rRt_ = _rRs_ + _Imm_;
}

static void instrBEQ(void)
{
    if(_rRs_ == _rRt_)
        doBranch(_BranchTarget_);
}
```
They look simple, but they are full of bitwise operations.
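As a sketch of what those macros hide (the function names here are mine, not the emulator's; the field layout is the standard MIPS I encoding), the per-field extraction looks like this:

```c
#include <stdint.h>

/* Field extraction for a raw 32-bit MIPS I instruction word.
 * Layout: op[31:26] rs[25:21] rt[20:16] rd[15:11] imm[15:0]. */
static uint8_t mipsOp(uint32_t code)  { return (code >> 26) & 0x3f; }
static uint8_t mipsRs(uint32_t code)  { return (code >> 21) & 0x1f; }
static uint8_t mipsRt(uint32_t code)  { return (code >> 16) & 0x1f; }
static uint8_t mipsRd(uint32_t code)  { return (code >> 11) & 0x1f; }
static int32_t mipsImm(uint32_t code) { return (int16_t)(code & 0xffff); } /* sign-extended */
```

Every one of these shifts and masks runs for every instruction the interpretive loop executes.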
A decoded MIPS R3000A instruction will look like this:
```c
typedef struct {
    uint8_t  op;
    uint8_t  rs, rt, rd;
    uint32_t immed;
} PsxDecodedInstr_t;
```
If we decode the instruction as soon as we load it into the code cache, the execution loop will look like this:
```c
while(1) {
    const PsxDecodedInstr_t *instr = psxCpuGetDecodedInstr(psxRegs.pc);
    psxRegs.pc += sizeof(uint32_t);
    ++psxRegs.cycle;
    (*instrFnTbl[instr->op]) (instr);
}
```
ADD, ADDIU and BEQ will look like this:
```c
#define GPR(r) psxRegs.gpr[instr->r]

static void FASTCALL instrADD(const PsxDecodedInstr_t *instr)
{
    GPR(rd) = GPR(rs) + GPR(rt);
}

static void FASTCALL instrADDIU(const PsxDecodedInstr_t *instr)
{
    GPR(rd) = GPR(rs) + instr->immed;
}

static void FASTCALL instrBEQ(const PsxDecodedInstr_t *instr)
{
    if(GPR(rs) == GPR(rt))
        doBranch(instr->immed);
}
```
Essentially, we have replaced a bunch of bitwise operations with a few memory operations.
And the problem? We become memory bound (or more likely, L1 or L2 bound). If we are L1-bound, we could have run 3-4 instructions "for free" in place of the cache miss. If we are L2-bound, we could have run 10-12 instructions!
We are not making use of the "free" execution slots for the pre-decoded instructions. That's why I think it will not be any faster.
Moral of the story: fewer instructions do not necessarily mean faster code. Optimizing for modern CPUs can be tricky.
Answer: very.
| Size | Read | Miss | Hit % |
|---|---|---|---|
| 256 bytes | 157,250 | 29,397 | 25.2% |
| 1 kB | 210,727 | 9,433 | 82.1% |
| 4 kB | 265,368 | 4,253 | 93.6% |
| 16 kB | 230,456 | 2,354 | 95.9% |
The cache is very effective once the working set fits. Note that a cache miss fills an entire 16-byte cache line (four instructions), which is why the miss count is multiplied by 4 when computing the hit rate.
I'm very surprised by the result. I had not expected the code cache to be so effective, considering it is a simple direct-mapped cache. (Most modern caches are 2-way or 4-way set associative.)
How do I vary the code cache size and gather such statistics? Ah, the marvel of emulation. ;-)
With this finding, I'm considering another approach to dynamic recompilation: decode the instructions when filling the code cache. We only need to pay for the decoding once. But I doubt it will be faster; I will explain in another entry.
Everyone takes it for granted that an emulator is slow. And almost everyone would suggest using dynamic recompilation to speed it up.
Dynamic recompilation takes care of the instruction decoding/execution bottleneck, but other bottlenecks remain.
Consider:
Suppose the memory map looks like this:
| Start | End | Region |
|---|---|---|
| 0x0000_0000 | 0x001f_ffff | Memory; cached access |
| 0x1f00_0000 | 0x1f00_ffff | Expansion slot |
| 0x1f80_0000 | 0x1f80_03ff | Scratch pad |
| 0x1f80_1000 | 0x1f80_2fff | H/w registers |
| 0x8000_0000 | 0x801f_ffff | Cached memory mirror |
| 0xa000_0000 | 0xa01f_ffff | Uncached memory mirror |
| 0xbfc0_0000 | 0xbfc7_ffff | BIOS; read-only |
There is only 2 MB of RAM, but virtually the entire 32-bit address space is used. We can't just malloc 4 GB of memory and be done with it. While that would work for the RAM and the BIOS, it still doesn't take care of the memory-mapped I/O and the mirrored memory regions.
A naive memory read will look like this:
```c
uint8_t psxMemRead8(uint32_t addr)
{
    if(IN_RANGE(addr, 0x00000000, 0x001fffff)) return mem[addr];
    if(IN_RANGE(addr, 0x1f000000, 0x1f00ffff)) return expansionMem[addr - 0x1f000000];
    if(IN_RANGE(addr, 0x1f800000, 0x1f8003ff)) return scratchMem[addr - 0x1f800000];
    if(IN_RANGE(addr, 0x1f801000, 0x1f802fff)) return psxHwRead8(addr);
    if(IN_RANGE(addr, 0x80000000, 0x801fffff)) return mem[addr - 0x80000000];
    if(IN_RANGE(addr, 0xa0000000, 0xa01fffff)) return mem[addr - 0xa0000000];
    if(IN_RANGE(addr, 0xbfc00000, 0xbfc7ffff)) return biosMem[addr - 0xbfc00000];
    return 0;
}
```
Given that we need to read each instruction from the emulated memory, you can already tell that this is going to be really slow — especially if we are executing code from the BIOS.
We want to access emulated memory in constant time, and the way to achieve that is to use paging.
Once we set up the paging LUT (Look-Up Table), memory read looks like this:
```c
uint8_t *memReadLut[1 << 20];

uint8_t psxMemRead8(uint32_t addr)
{
    return memReadLut[addr >> 12][addr & 0x0fff];
}
```
Very neat, isn't it? There are two downsides:
Note that while this looks fast, it is still quite slow: we need two bitwise operations and two memory reads to read one emulated memory location. To be really fast, we need to fold the LUT away, and that can only be done by analysing the emulated instructions.
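The post shows the read path but not how the LUT gets filled. Here is a minimal sketch of the setup, with my own assumptions: 4 kB pages, only the RAM, its two mirrors and the BIOS mapped (the sub-page scratch pad and the I/O registers need separate handling), and unmapped pages left NULL.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1 << PAGE_SHIFT)

static uint8_t  mem[0x200000];      /* 2 MB RAM */
static uint8_t  biosMem[0x80000];   /* 512 kB BIOS */
static uint8_t *memReadLut[1 << 20];

/* Point every page of [base, base+size) at the backing store. */
static void mapRegion(uint32_t base, uint8_t *backing, uint32_t size)
{
    for(uint32_t off = 0; off < size; off += PAGE_SIZE)
        memReadLut[(base + off) >> PAGE_SHIFT] = backing + off;
}

static void psxMemInit(void)
{
    mapRegion(0x00000000, mem,     sizeof mem);      /* RAM             */
    mapRegion(0x80000000, mem,     sizeof mem);      /* cached mirror   */
    mapRegion(0xa0000000, mem,     sizeof mem);      /* uncached mirror */
    mapRegion(0xbfc00000, biosMem, sizeof biosMem);  /* BIOS            */
}

uint8_t psxMemRead8(uint32_t addr)
{
    return memReadLut[addr >> PAGE_SHIFT][addr & (PAGE_SIZE - 1)];
}
```

Since all three mirror regions point into the same backing array, mirroring comes for free: a write through one mirror is visible through the others.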
One simple way to allow memory-mapped I/O is to check for it:
```c
uint8_t psxMemRead8(uint32_t addr)
{
    if(IN_RANGE(addr, 0x1f801000, 0x1f802fff))
        return psxHwRead8(addr);
    return memReadLut[addr >> 12][addr & 0x0fff];
}
```
It works, except that it slows down every memory access. We need one comparison (yes, just one) and one branch.
One way to handle all memory transparently is to use a function LUT to handle the different types of memory:
```c
uint8_t (*memRead8FnLut[16]) (uint32_t addr);
uint8_t memReadLut[1 << 20];

uint8_t psxMemRead8(uint32_t addr)
{
    return (*memRead8FnLut[memReadLut[addr >> 12]]) (addr);
}
```
Let's see if you can understand this code. :-D
You can argue that memory access is even slower now, because we need one function call and three memory reads just to read one emulated memory location!
The heart of the interpretive emulator looks like this:
```c
while(1) {
    uint32_t code = psxMemRead(psxRegs.pc);
    psxRegs.pc += sizeof(uint32_t);
    ++psxRegs.cycle;
    execInstr(code);
    handleEvents();
}
```
Events are stuff like timer, DMA and interrupts. For example, the VSync interrupt triggers every 16.68 ms (or 565,045 clock cycles at 33.8688 MHz) for NTSC displays.
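The quoted figures check out; here is the back-of-envelope arithmetic (my own, assuming the 33.8688 MHz CPU clock and the 59.94 Hz NTSC field rate):

```c
/* Cycles between VSync interrupts = CPU clock / field rate.
 * 33,868,800 / 59.94 ~= 565,045 cycles (~16.68 ms). */
static long vsyncCycles(void)
{
    double clock_hz = 33868800.0;  /* 33.8688 MHz */
    double vsync_hz = 59.94;       /* NTSC field rate */
    return (long)(clock_hz / vsync_hz + 0.5);
}
```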
However, checking for events at every instruction is just too much overhead. A good compromise is to check on every taken branch (including calls). This cuts the checking down to 1-in-10 to 1-in-20 instructions — branches are pretty common.
Even so, it is still way too high. A better approach is to execute until the next scheduled event:
```c
while(1) {
    uint32_t cycles = getCyclesToNextEvent();
    cycles = CAP(cycles, CYCLES_LIMIT);
    while(cycles-- > 0) {
        uint32_t code = psxMemRead(psxRegs.pc);
        psxRegs.pc += sizeof(uint32_t);
        ++psxRegs.cycle;
        execInstr(code);
    }
    handleEvents();
}
```
We cap the execution cycles to avoid delaying a new short-lived event. With this, we now only check for events every few hundred instructions.
Just having a dynamic recompiler is not sufficient to make an emulator fast. We need an efficient design for the memory, I/O and events too.
If I were to design a 4-byte CPU instruction set, I would base it on the MIPS instruction set. There are some things I like about it:
What don't I like?
First, I'll get rid of the branch delay slot (BDS). This is one of MIPS's most distinctive features, yet it is also its downfall. It complicates superscalar and out-of-order execution. A good alternative to BDS is a branch predictor.
Second, I feel 32 registers is too many; 16 is a good balance, and we then need only 4 bits per operand. I'll make all of them general-purpose: no hardwired 0, no RA (return address), no SP (stack pointer).
A hardwired 0 is useful, but it ties up valuable encoding space. We'll use a dedicated MOV instruction instead.
I'll make the return address go onto the stack. Most functions need to save it there anyway; only leaf functions don't.
It is not necessary to access SP often, so let's make it non-general purpose. We'll use separate instructions to access it. This makes more sense than tying up one valuable register. We will still allow indirect SP addressing, though.
Third, we should be able to load a 32-bit immediate directly. This means some instructions must be at least 5 bytes long. And once we consider variable length instructions, we may want to allow 2, 4 and 6-byte instructions. And that usually implies 2-operand mode so that they can be encoded in 2 bytes.
Other nice-to-have:
The end result will be very un-MIPS-like, looking closer to ColdFire or ARM Thumb-2!
We know that main memory is slow compared to the CPU itself. But just how slow?
Stats for a Core i5 650 (3.6 GHz):
| Level | Size | Cycles |
|---|---|---|
| L1 | 32 kB | 4 |
| L2 | 256 kB | 7-10 |
| L3 | 4 MB (shared) | 40-50 |
| Mem | >1 GB | 250-350 |
I'm so out of touch. I thought main memory takes only 50 cycles at most.
The huge differential is why simple linear data structures win over complex linked-list data structures. Not surprisingly, the resizable array is now a very popular data structure.
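A minimal growable-array sketch (names are mine, not from any particular library): doubling the capacity keeps appends amortized O(1), and the elements stay contiguous — which is exactly what the cache likes.

```c
#include <stdlib.h>

typedef struct {
    int    *data;
    size_t  len, cap;
} IntVec;

/* Append one element, doubling the backing store when full. */
static void vecPush(IntVec *v, int x)
{
    if(v->len == v->cap) {
        size_t newCap = v->cap ? v->cap * 2 : 8;
        int *p = realloc(v->data, newCap * sizeof *p);
        if(p == NULL)
            abort();  /* out of memory; real code would report it */
        v->data = p;
        v->cap  = newCap;
    }
    v->data[v->len++] = x;
}
```

Iterating over `v.data[0..len)` is then a pure linear scan — no pointer chasing, no cache misses beyond the compulsory ones.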
I'm also somewhat surprised that the L1 cache is slower than registers. I took it for granted it would take just 1 cycle to access it. Apparently it is not the case — for CPUs with complex micro-architectures, at least.
This means pure register code is most likely faster. This can be difficult to achieve on x86 with just 7 general-purpose registers (GPR). (No, SP is not a GPR.) However, it should be relatively simple on x86-64 with 15 GPRs.
Taking a weekend plane flight from Singapore to KL and back is surprisingly expensive: S$183 all in. An equivalent bi-directional bus trip is S$100.
In contrast, a Thurs-Sat flight can cost as little as S$48! (All-in.)
But there is another method: driving. Petrol costs S$22 and toll S$20. While it may not sound exactly cheap (S$88 all-in), it scales really well: the cost remains the same for 2-5 people!
But a plane flight is faster, right? 45 minutes vs 3.5 hours.
But that is deceptive because of all the waiting time. You need to check in an hour earlier, then it takes 10-15 minutes to board the plane, and finally it takes another 10-15 minutes before the plane takes off.
Plus, you most likely need to take a cab four times (to and from the airport at both ends). The cab fare from LCCT/KLIA to KL alone can cost S$32.
Conclusion: driving is still the better option.
In 2006, people were ready for a change. People turned up in record numbers at opposition rallies. It was not reported in the media, of course, but the photos circulated freely online. Some people were awed by what they saw and wondered if the photos were doctored. I knew for a fact that they weren't — because I was there.
I still remember some memorable quotes:
In the end, nothing happened because the people bought PAP's promises of change hook, line and sinker.
But the next four years got worse for the citizens. Record immigration. Record cost-of-living. Singapore changed from a-good-place-to-stay to a-good-place-to-stay — if you are rich.
Let's see if the people are fed up this time.
The C switch statement has a reputation as syntactic sugar for nested if-else statements. It is not hard to see why:
```c
switch(v) {
case 0: fn0(); break;
case 1: fn1(); break;
case 2: fn2(); break;
}
```
is equivalent to:
```c
if(v == 0)
    fn0();
else if(v == 1)
    fn1();
else if(v == 2)
    fn2();
```
The problem with nested ifs is that the running time is linear in the number of cases. Thus, a common "optimization" is to replace a switch with a call table:
```c
void (*call_tbl[]) (void) = { fn0, fn1, fn2 };

(*call_tbl[v]) ();
```
That should be faster, right?
That may have been true 15 years ago, but today's compilers can optimize the switch statement rather aggressively:
The moral of the story: take a look at the compiler output before you decide a call table is "faster" than a switch. It may not be.
I have forgotten a lot about C.
I wanted to hold a function pointer in a data pointer. The compiler flags a warning. It works, of course, but I don't want that warning. Typecasting can get rid of it, but take note: typecasting is evil. I'd rather have the warning than hide it behind a cast.
So I googled to find the proper solution. And I found that it is actually undefined behaviour.
Now, I'm more surprised that I forgot data and function pointers are distinct than that the conversion is undefined behaviour. I vaguely remember that I once knew, but now I don't.
Since I want to conform to Standard C, I made it a union of a data pointer and a function pointer. But I still have a problem: I'm unable to initialize it statically, because in C89 you can only initialize the first field of a union. I'll still get warnings half the time.
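For what it's worth, C99 designated initializers lift that C89 restriction — a sketch, with made-up names:

```c
#include <stddef.h>

typedef void (*fn_t)(void);

typedef union {
    void *data;   /* first member: the only one C89 can initialize */
    fn_t  fn;
} AnyPtr;

static void hello(void) {}

static AnyPtr a = { NULL };         /* C89 style: initializes .data only */
static AnyPtr b = { .fn = hello };  /* C99: picks any member cleanly     */
```

Reading `b.fn` back gives `hello` without a single cast, and no conversion between data and function pointers ever takes place.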
Argh!
C is supposed to be a portable language, but in practice, C code is seldom directly portable. In its quest to be efficient, the language leaves many behaviours implementation-defined. (As opposed to undefined behaviour, which the Standard does not constrain at all.)
Consider the following:
Will your program still run correctly?
The good news is, most modern CPU architectures are similar and most C compilers compile similarly, so most C code still ports easily. For example, NULL is indeed numeric 0 on most general-purpose CPUs, and all modern CPUs use two's complement arithmetic.
So what are the main gotchas? Big vs little endian and 32-bit vs 64-bit platforms.
Padding in structures is usually not a problem until we need to change the padding for just some structures. It can be worked around, but we need to use compiler-specific techniques, which in turn requires using the preprocessor to keep the code portable.
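A sketch of the usual preprocessor dance (the PACKED macro name is mine; the pragmas are the common GCC/Clang and MSVC spellings):

```c
#include <stdint.h>
#include <stddef.h>

#if defined(_MSC_VER)
#  pragma pack(push, 1)
#  define PACKED
#elif defined(__GNUC__)
#  define PACKED __attribute__((packed))
#else
#  define PACKED  /* unknown compiler: hope the default layout matches */
#endif

/* A wire-format header that must be exactly 5 bytes: no padding
 * between the 1-byte tag and the 4-byte value. */
typedef struct {
    uint8_t  tag;
    uint32_t value;
} PACKED WireHeader;

#if defined(_MSC_VER)
#  pragma pack(pop)
#endif
```

Without the packing directives, most compilers would place `value` at offset 4, making the struct 8 bytes — fine in memory, wrong on the wire.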
I don't claim nor want to write a perfectly portable C program. However, I also don't want to restrict myself needlessly — which many C programmers do by using wrong assumptions.
sizeof doesn't need parentheses, as it is an operator, not a function call. I use them anyway, because I find the code easier to read.
Function pointers are implicitly dereferenced, so a = (*fn_ptr) (b, c); can be written as a = fn_ptr(b, c);. I still use the first form exclusively, though, because I like to distinguish between direct and indirect function calls. Call me old-fashioned.
Some programmers like to put parentheses around return values. I absolutely hate it. I can't stand this: return (0);. Some of them thought the parentheses were necessary. Some put them in "to be safe". Safe from what? It is a carry-over from their usual overuse of parentheses in normal expressions. Imagine, even this needs parentheses: a = (b * c) + d; — which just leaves me shaking my head. My response is always the same: learn the C operator precedence table.
I'm referring to the famous 70-minute review of The Phantom Menace by RedLetterMedia, of course.
He has since released his reviews for episodes 2 and 3. They are still pretty good, but they are not as brilliant as the first one. I would rate the first one A+, the second and third B+.
It was only after I watched all three reviews and saw the actual behind-the-scenes footage that it dawned on me why the Star Wars prequels are so bad: almost every scene is a green-screen shot.
There is no variety. I think all of us detect it subconsciously. Every actor is always relaxed, always standing or strolling on flat ground. It's terrible.
For the record, I never watched the third prequel. (I gave up after watching the second movie.) It was said to be the least bad of the three. Looking at the review, I think it is still very bad. Basically, the prequels are soulless; they are dry and boring.
In Budget 2011, the Singapore Government increased the employer CPF contribution by 0.5%. It goes into the Special Account (SA). While this sounds nice in a headline, it is useless to the employee: the SA is money we can almost never touch. As such, it is like a tax (if you don't already consider CPF a tax of some sort).
Whenever the Government tops up the CPF, it is almost always the SA. You can only get money out of it after the official retirement age, and even then only in monthly payouts. I'm always amused by it.
The CPF contribution cap is now raised to $5,000. I remember the rationale for lowering it from $6,000 to $4,500 (progressively) was to align it with the 80th percentile income. This basically tells us that the 80th percentile income is now $5,000.
As part of Budget 2011, the Singapore Government finally scrapped the TV/radio license! Finally, I can put the radio back in my car and buy a TV.
Call me cheap, but I'm against paying $26 for the car radio license and $110 for the TV license — for things I don't use. I neither listen to the radio nor watch TV. However, I do listen to CDs and watch DVDs.
The Government says it will lose $120 million. That's barely 1.1 million households. Strange that Singapore has imported so many people and the household number remains about the same. Hmm..
The "high profile" security firm HBGary was hacked after its CEO claimed he was about to expose Anonymous, the group that was DoS'ing organizations that cut off WikiLeaks. And how?
Turns out to be easy:
It all starts with the website allowing SQL injection — the mother of all Internet security loopholes.
What does this tell us?
Security is hard, because people know what to do, but they still don't do it.
My comments:
Branching in a branch delay slot (BDS) is undefined behaviour for the MIPS R3000 CPU. And most MIPS docs just leave it at that — even the famed See MIPS Run.
The good news is, people have done experiments and found the exact behaviour:
| Case | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| | jmp A | jmp A | jmp A | jmp A |
| | 1 | jmp B | jmp B | jmp B |
| | X | X | X | X |
| | X | X | X | X |
| A: | 2 | 1 | jmp C | jmp C |
| | 3 | X | X | X |
| | 4 | X | X | X |
| | 5 | X | X | X |
| B: | | 2 | 1 | jmp D |
| | | 3 | X | X |
| | | 4 | X | X |
| | | 5 | X | X |
| C: | | | 2 | 1 |
| | | | 3 | X |
| | | | 4 | X |
| | | | 5 | X |
| D: | | | | 2 |
| | | | | 3 |
| | | | | 4 |
| | | | | 5 |

(Numbers show the execution order; X marks instructions that are never executed.)
The first one is legal code. The other three show the effect of a taken branch in the BDS. (I think the pattern continues.)
Case 2 can be used to execute a single instruction, but with a caveat: if an interrupt occurs in the BDS, the program will crash because the return address is adjusted wrongly — when the BDS flag is set, the CPU automatically subtracts 4 from the PC to re-execute the branch.
The table is only correct for absolute jumps. For relative jumps, we need to offset by +4 as the PC is incorrect at the point of the delayed branch:
| | Case |
|---|---|
| | beq A |
| | beq B |
| | X |
| | X |
| A: | 1 |
| | X |
| | X |
| | 2 |
| B: | 3 |
| | 4 |
A branch in the BDS is more tricky than it seems.
This is the first design to really catch my attention. And the best thing is that they are budget-class rings — although they still exceed the original budget by 50%.
Inflation is showing up slowly. Even official statistics show it — despite Governments' best statistical tricks.
Last weekend, I paid $4.50 for two slices of bread with kaya, two semi-boiled eggs and a cup of tea. This is an increase of 30 cents. I believe I paid $3.50 just less than a decade ago. (Note: same store. I know you can get the same deal for $3–$3.50 elsewhere today.)
Won't be going back soon. Most likely I'll prepare my own breakfast in the future. (My budget for breakfast is $1.50. :-D)
I am waiting eagerly for the first country to raise their interest rate. This is the standard method to counter inflation, yet no one can do it due to hot money — I don't understand this part very well.
Anyway, I wonder who will do it first? A couple of countries have revolted. It won't be long.
First, you will do everything possible. If that doesn't work, you will do the impossible, then the unthinkable, then the unimaginable.
Someone said this regarding "Helicopter" Ben, but I think it's a cool quote.
It is something to keep in mind when facing a difficult task or a task with no answers.