My Rambling Thoughts

Quote:

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

Brian W. Kernighan

Quote:

If you ride a motorcycle often, you will be killed riding it. That much is as sure as night follows day. Your responsibility is to be vigilant and careful as to continue to push that eventuality so far forward that you die of old age first.

Unknown source

News:

Date: . Source: .

The annual $10 million ritual

finance
Queuing for the $10 mil Toto

Queuing for hope.

Lottery is a tax on the math-challenged, but I do think it is worth buying once in a while — when I perceive the odds are better.

(The $10 million draw may seem a lot, but it usually ends up $2 to $3 million due to the shared pot. In other words, don't go heavy on this draw.)

I don't buy often, though. My annual gambling budget is S$30 — $20 for 4D and $10 for Toto.

And that means I can't visit the IR! :cry:

Emulating the PSX VSync counter correctly

programming

PCSX treats the screen as 386 x 262 at 60 Hz, with VSync lasting 22 scan lines. As a result, it programs these values into the counters:

Counter Rate
Pixel clock 5 VSync / (386 * 262)
Horizontal scan 1973 VSync / 262
Total vertical scan, NTSC 564,480 33,868,800 / 60
Vertical blanking 47,388 22 HSync

This is pretty straightforward. PCSX uses 517,092 for VSync instead of the correct 517,080 because it divides dumbly. However, it has a BIAS value of 2 that makes all the values 2x smaller.

The correct values?

I'm not able to find the PSX's video clock frequency. Should it drive the digital display at 320x240 (with suitable overscan) or the analogue NTSC at 4.5 MHz?

A quick google search shows that NTSC is 286 x 262.5 at 59.94 Hz, with vertical blanking lasting 19.5 scan lines. (This is looking at the NTSC interlaced fields. If we look at frames, then it is 286 x 525 at 29.97 Hz, with 39 scan lines.)

PCSX should program these values instead:

Counter Rate
Pixel clock 7.5264075 VScan / (286 * 262.5)
Horizontal scan 2152.5526 VScan / 262.5
Total vertical scan, NTSC 565,045.05 33,868,800 / 59.94
Vertical blanking 41,974.775 19.5 HSync

Fractional bits must be used especially at low multipliers to retain accuracy.

The correct reaction in a fish fight

So a fight broke out in a wet market because a woman accused a fishmonger of molesting her and the fishmonger's wife took offence.

How did the shoppers react?

Staring at a fight

Quote of the year:

I like this picture the best. It shows Singaporeans gawking as one united people regardless of race, language or religion with the SAF citizen-soldier head and shoulders above everyone else.

:clap:

Emulating the PSX root counters

programming

The PSX has 4 root counters (timers):

  • Counter 0 is the pixel clock
  • Counter 1 is the horizontal retrace clock
  • Counter 2 is 1/8 system clock
  • Counter 3 is the vertical retrace (non-programmable)

PCSX keeps a CPU cycle count. It basically assumes 1-cycle per instruction.

When a timer event has occurred, PCSX records the current cycle and the target cycle to trigger the next event. Because the counters count in their own units, PCSX uses a multiplier to convert the target count to cycles.

Typical multipliers:

System clock 1
1/8 system clock 8
Pixel clock VSync multiplier / (386 * 262)
Horizontal retrace VSync multiplier / 262
Vertical retrace (NTSC) 33,868,800 / 60 - 22 HSync

We can tell that the video screen size is 386x262 including the overscan.

The VSync multiplier is fixed. On NTSC, it is 517,092. It means the VSync will be triggered every so many CPU cycles. It is equal to 240 HSync. The VSync "retrace" lasts for 22 HSync.

When a timer event is triggered, PCSX sets the corresponding counter interrupt bit at address 0x1f801070 which subsequently triggers psxException() in R3000A.c.

The SPU counter

PCSX creates a 5th counter for asychronous SPU with a multiplier of 49,152, meaning it will be called 689 times per second. It is only used if the SPU plug-in supports SPU_async().

Optimization

When a counter is updated, PCSX will look for the first counter event to be triggered and check against that. This allows it to do just one comparison instead of 4 (or 5). Every cycle counts because timer events are checked on every taken branch (something like 1 in 10-20 instructions).

Hacks

There are two configurable hacks to make the timers trigger faster.

What is BIAS?

In theory, everything should work fine. However, PCSX uses a divisor of 2 (the so-called BIAS) when it converts timer counts to CPU cycles. This has the effect of triggering the timer events 2 times faster. Strangely, everything still seems to work fine.

Improvement

PCSX handles the root counters well, except for the fact that it does not take care to preserve as much precision as possible during conversion of units. The multipliers are also integers, so errors can add up over time.

For example, if PCSX calculates the VSync interval accurately, it should be triggered every 517,080 cycles. It is 12 cycles off every second.

A slight improvement to PCSX memory access

programming

A page entry has the value 0 for invalid memory regions. I would just point these pages to a dummy page:

pg = addr >> 16;
pg_base = mem_r_lut[pg];
return pg_base[addr & 0xffff];

There is no special case. It works because we can map the write LUT invalid pages elsewhere, so our dummy read page is untouched.

If we need to detect h/w registers and special memory regions:

pg = addr >> 16;
pg_base = mem_r_lut[pg];

if((UINT) pg_base >= 8)
  return pg_base[addr & 0xffff];
else
  return (*sp_mem_r_handlers[(UINT) pg_base]) (addr);

A LUT entry can either be a valid page or 0-7 for special memory handlers. (This is no faster than PCSX, but is more generic.)

There are still two problems with this modified code:

  • The pages are 64 kB. It is different from PSX's native 4 kB pages
  • It invalidates the code memory on all writes

I would do it differently if I were to handle the PSX memory map from scratch.

The PCSX memory map

programming

Memory access can make or break an emulator. Emulators must be fast to be useful. If resources are constrained, then fast means being efficient.

Memory access is typically a 1-cycle event on a 32-bit CPU — we can assume a cache hit rate of 0.95. In other words, we have to be as efficient as possible.

We have already seen PSX's huge and non-contiguous 4 GB memory map. Let's see how PCSX handles it.

PCSX first allocates 2 MB for the RAM, 512 kB for the BIOS and 64 kB each for the parallel ports (unused) and the h/w registers.

It then allocates two 64k-entry LUTs (Look-Up Tables) — 256 kB each. Each entry points to a 64 kB page or is NULL for an invalid memory region.

One LUT is used for read access and the other write access. This allows memory regions to be made read-only. However, this is only used for the BIOS.

The LUTs are very sparse: the read LUT only has 394 entries, and the write LUT has 286 entries (8 fewer).

Read access

Memory access is straightforward. Use the top 16 bits as the index into the LUT, get the page's base address and then add the bottom 16-bit offset to it. This is basically software-based paging.

pg = addr >> 16;

// optional if known/assumed to be memory
if(pg == 0x1f80)
  return read_hw_reg(addr);

pg_base = mem_r_lut[pg];
if(pg_base != NULL)
  return pg_base[addr & 0xffff];
else
  return 0;

Two conditionals are needed. In many cases, PCSX assumes it must be a memory access, so it skips the first check. That still leaves one check.

The recompiler will fold away the page lookup for constant memory locations. However, an unscientific profiling shows that only 27% of the LW instruction can be folded away.

Write access

Write is very similar, but PCSX needs to clear the recompiler's code memory and handle some special memory addresses.

pg = addr >> 16;

// optional if known/assumed to be memory
if(pg == 0x1f80)
  write_hw_reg(addr);

pg_base = mem_w_lut[pg];
if(pg_base != NULL) {
  pg_base[addr & 0xffff] = value;
  invalidate_code(addr);
  }
else {
  handle_special_mem(addr);
  }

My take

PCSX's method is acceptable, but it can be made slightly more efficient. A known mem read can be done without any conditional checks. However, the main problem is that a mem write always writes to two memory locations.

All you need to know about the current un-"crisis"

finance

From an online forum:

Greece, those clowns racked up $350 billion of external public debt with a population of 11 million?!

Holy calamari Batman!! That's $31,000 per Greek. What *******s!!

Who could be so foolish?

Wait, lemme check OUR numbers... $13 trillion / 300 million = $43,000 per American.

Holy flat screen Batman!!

And that doesn't even take into account the 3 trillion we have parked off balance sheet at the FED in some Enron-on-steroids scheme.

Haircuts are coming. A little at a time, or all at once — but they are coming. Doesn't mean the worlds gotta end, but I don't know if there is enough duct tape around to keep this game going.

Sit back and relax. Do not panic, everything is under control. :-D

The most common instructions

programming

At a random point in time in PCSX:

ld 16.5%
st 8.0%
br 12.9%
jal 0.76%
lui 5.9%
arth 12.8%
bit 6.2%
shift 4.1%
nop 20.8%
Others 11.8%

Load/store make up almost 25% of the instructions. Fast memory access is vital.

br, used for conditionals and loops, is very common.

jal are function calls. Not that common.

lui is an indication that a 32-bit integer is needed. It is quite a common occurrence, but MIPS has no direct way to load it.

arth is add, sub. bit is and, or, nor. shift is shift left and right.

nop is surprisingly common. It must be due to the load and branch delay slot.

The takeaway

Basically, a few key instructions make up the bulk of the instruction stream.

I feel the MIPS architecture has two major flaws:

  • It can only load a 16-bit signed integer
  • It has a branch delay slot, which are often filled with nop

The net effect is that some common operations are effectively 2 cycles and 8 bytes.

However, it is not fair to criticise the MIPS architecture because it dates back to 1981! The processor complexity back then was different. Too bad we still have to live with the flaws 30 years down the road.

How fast can you emulate the PSX?

programming

Data gathered with an instrumented PCSX. I implemented simple counters that are designed to have the least possible overhead.

I used the interpreter mode as my counting method is not accurate with the recompiler (when its peephole optimizations work).

Test instr COP0 GTE GPU CD SPU MDEC
FF 6 normal 8.1M 1.4k 0 2.5k 2.8k 1.0k 0
FF 7 normal 16.5M 2.0k 460k 9.9k 0 425 0
FF 8 normal 13.7M 2.2k 271k 539 0 131 0
FF 6 movie 26.7M 2.2k 0 6.2k 6.1k 2.2k 47
FF 7 movie 27.9M 8.6k 183k 14.5k 9.0k 0 49
FF 8 movie 29.1M 7.9k 5.2k 16.2k 48.4k 1.8k 45

instr is the number of MIPS instructions per second, including COP0 and GTE (COP2).

GPU, CD, SPU and MDEC counts the number of requests/transactions per second.

The interpreter can emulate around 32M instructions/s on my CULV 1.2 GHz Core2Duo CPU. A drop indicates how heavily the other sub-systems are using the CPU.

COP0 is Memory Management and System Interrupt related. They are not as rare as I thought they would be.

GTE (COP2): complex and slow 3D instructions. Expect a slowdown when they show up.

GPU: will almost always be high, especially for a 2D scene, but it is usually fine because we have a separate GPU/core to take care of it.

CD: non-stop usually indicates streamed music. Random and one-off numbers usually means loading.

SPU: will almost always be present. Low throughput. Complexity is unknown.

MDEC: usually shows up during movies only. Contrary to my expectations, it does not seem to be computationally intensive at all.

When the sub-systems don't seem to be heavily loaded and yet the #instructions is low, it usually means there is some heavy memory transfer. For example, for FF 6 normal, I suspect the game is blitting to the screen on every VSync... my system can only manage 20-25 fps for it.

An overview of PCSX file structure

programming

PCSX is a very simple program. It is written in a very modular fashion: each source file corresponds to one aspect or sub-system of the PSX. The core emulator is implemented in PsxHw.c, PsxInterpreter.c and PsxMem.c.

Once the user starts a game, PCSX executes psxCpu->Execute(), an infinite loop, on the main thread and never returns. The way it works is that the emulator engine calls SysUpdate() to handle the GUI on every emulated VSync (60 Hz for NTSC).

PsxMem.c allocates memory for the various PSX memory regions and creates two 64k-entry LUTs (Look-Up Tables) to map the huge non-contiguous PSX memory. It uses one for read access and the other for write access. An entry is either 0 or points to a 64 kB page.

PsxInterpreter.c emulates the MIPS R3000 instructions. It does not emulate load delay slot. It emulates branch delay slot for load and co-processor instructions. GTE instructions are handled in gte.c. The core interrupt handling is done in R3000A.c. The tests are done in psxBranchTest() on every taken branch.

PsxHw.c handles the h/w registers. It maps GPU and SPU registers directly to the plug-in function calls. Their DMAs are handled in psxDma.c.

MDEC is handled in Mdec.c.

CD-ROM requires additional handling in CdRom.c, which interfaces to the CD-ROM plug-in. It also uses Decode_XA.c to decode XA data.

Serial I/O is handled in Sio.c. It handles memory cards internally and interfaces to the Pad plug-in.

PsxCounters.c implements root counters. They are updated in psxRcntUpdate(), called by psxBranchTest().

plugins.c loads the plug-ins. The functions are stored in function pointers and used by other files directly.

PsxBios.c contains an emulated BIOS.

Misc.c contains the CD-ROM and EXE file loader, and the core save state functionality, among others.

The recompiler

iR3000A.c is a direct replacement for PsxInterpreter.c. iGte.h replaces gte.c.

ix86.c contains the x86 code emitter functions.

The recompiler allocates its own PSX code memory (2.5 MB) and uses a 64k-entry LUT to map the non-contiguous PSX memory. The code memory points into a 8 MB native instruction store. When the store is filled up (down to the last 64 kB), it is reset and all recompiled instructions are lost.

The recompiler recompiles the instructions standalone up to a branch or 500 instructions, whichever comes first. It does some peephole optimization and constant propagation within a basic block.

The recompiler allows the code memory to be invalidated using psxCpu->Clear(). This is called by DMA writes.

psxMemWrite8/16/32() also zeroes out the code memory directly to support self-modifying code. However, this is likely not to work due to the way the recompiled instructions are executed.

Studying an emulator

programming

I am looking at PCSX, an open source PSX emulator. Its last official release was May 2003. Its team moved on to PCSX2, a PS2 emulator. A few groups forked the project, but they made very few changes to the core emulation.

The forked projects:

  • Pcsx-df: focusing on making Pcsx run better on Linux
  • Pcsx-Reloaded: based on Pcsx-df, backported to Windows
  • Pcsx-rr (Re-Recording): focusing on recording
  • Pcsx-r (Revival): announced but no releases

I have been thinking how to start. I can think of a few things to do:

  • Profile. I want to know how the sub-systems (CPU, GTE, GPU, SPU, CDROM, MDEC) interact
  • Measure. I want to know the effectiveness of the recompiler's optimizations
  • Crash report. I want a memory/register dump so that other people can report issues that they encountered

Other PSX emulators

There are two better PSX emulators: ePSXe and pSX. Both are closed source.

ePSXe 1.7.0 was last released in May 2008, a full 5 years after its 1.6.0 release in 2003. It uses the same PSEmu Pro based plug-in system as PCSX.

pSX 1.13 was last released in August 2007. It does not use any plug-ins.

Not many people are interested in PSX emulation any more. Its hay day is over.

A look at the PSX specs

The Sony Playstation 1 specs, mainly taken from Wiki:

CPU MIPS R3000A (33.8688 MHz). 30 MIPS.
Bus 132 MB/s.
GTE Geometry Transformation Engine. 66 MIPS. 1.5M flat shaded or 500k textured polygon/s.
MDEC Data Decompression Engine. Decompressing images and video. 80 MIPS. Capable of 320x240 30 fps or 640x480 15 fps.
GPU Graphics Processing Unit. 360,000 flat-shaded or 180,000 texture-mapped and light-sourced polygon/s. 1 MB VRAM.
SPU Sound Processing Unit. ADPCM source. Max 24 channels. Max sampling rate 44.1 kHz. 512 kB RAM.
CD-ROM XA Mode 2, CD-DA, 300 kB/s.
Memory card 128 kB.

Very impressive for a 1994 console.

The PSX can be emulated properly in a present day system (1.5 GHz onwards); it is a matter of efficiency.

We can get by without a powerful graphics card, because the 3D processing is done using the GTE, which is a co-processor to the CPU.

The bane of emulators: non-contiguous memory map

programming

We can recompile instructions from one CPU architecture to another, achieving near native speed. Are we done? No, there is an even harder problem to solve: memory access.

There are three things that make it difficult to emulate a memory map:

  • A huge and non-contiguous memory map
  • Memory-mapped hardware registers
  • Mirrored memory regions

An emulator needs to do these:

  • Provide fast access to RAM/BIOS
  • Make the BIOS read only (optional?)
  • "Trap" on h/w registers
  • Handle invalid memory accesses
  • Fold mirrored regions

If the emulator does dynamic recompilation, it also needs to detect any changes to the code memory. This means additional checks.

It is easier to understand if we look at a real system, because memory maps are usually very system specific. Take for example Sony Playstation 1's memory map:

0x0000_0000 0x001f_ffff 2 MB RAM
0x1f00_0000 0x1f00_ffff 64 kB Parallel port
0x1f80_0000 0x1f80_03ff 1 kB Scratchpad
0x1f80_1000 0x1f80_2fff 8 kB H/W registers
0x8000_0000 0x801f_ffff 2 MB RAM mirror (cached)
0xa000_0000 0xa01f_ffff 2 MB RAM mirror (uncached)
0xbfc0_0000 0xbfc7_ffff 512 kB BIOS

It is trivial to map a sparse 4 GB memory region in a 64-bit OS (provided it allows multiple mapping to the same underlying page — which we need for the mirrored regions).

We cannot do that on a 32-bit OS; we need to do everything in software. We have to define at least 3 regions, check which region is being accessed and handle it appropriately, so a general purpose memory access will always be at sub-native speed.

Official vs real memory map

The above is the "official" memory map. However, PCSX, an open source PSX emulator, mirrors the first 2 MB memory at 0x0020_0000, 0x0040_000 and 0x0060_0000 too.

It also makes the BIOS read only.

There must be a reason why they are doing these.

A new cube, a new way to work

My department is shifting to another building. The new cubes will be smaller — 50% smaller.

(My current cube is 48 sq feet, already downsized from its original 64 sq feet. The new cube is a mere 24 sq feet.)

I decided to change the way I work after looking at the mockup cube. I have 3 PCs, 2 LCD monitors and one set of keyboard/mouse. One of the PCs is already headless and I VPN in 99% of the time.

Now, I'll make the secondary PC headless too and use Remote Desktop to connect to it. I can remove one monitor and reclaim some desk space.

I will also be able to place the two headless PCs further away — perhaps not even in my cube!

Jams everywhere

transport

I left home at 7:45 am. Just before I joined PIE, I saw the EMAS display "Massive Jam on PIE (to Tuas)". The EMAS is usually right and I do not want to be in a MASSIVE jam. I decided to take Farrer road instead.

Bad choice! Farrer road had the worse jam I have ever seen. After being stuck in the jam for a while and still at the peripheral of Toa Payoh, I decided to come back to Toa Payoh for my breakfast. Time: 8:30 am.

I'm going to be late for work.

I hope this is not a new trend. Just last week, I wanted to "detour" to Boon Lay to pack my lunch before I head for work. I left home at 8:15 am. I found a massive jam waiting for me at Eng Neo that supposedly originated at Clementi Ave 6. That's 10 km of jam! I decided to abort my plan and exit at Eng Neo. It still took me until 9:50 am to reach office!

There are a few reasons why I like to go to office early. This is one of them: avoiding jams. Jams are such a huge waste of time. Peak hours always have some sort of jam. However, there is also a good chance of a super-jam, usually due to an accident.

However, I can't seem to wake up as early these days. Some of my colleagues are already remarking that I'm coming to office later and later. (I'm known to reach the office early.)

Handling the MIPS branch instruction

programming

(Note: the info here is based on an open source MIPS R3000A emulator.)

It is pretty easy to emulate the MIPS branch instructions, even with the branch delay slot. However, we need to handle the special case where (a) the branch is taken and (b) the delay instruction writes to a register used in the first target instruction.

First, we check if the delay instruction matches:

Delay Instr Cond
COP0 op == 0x10
COP2: MFC2 or CFC2 op == 0x12 /*COP2*/ && func == 0 && (rs == 0 /*MFC2*/ || rs == 2 /*CFC2*/)
LB, LH, LWL, LW, LBU, LHU, LWR op >= 0x20 && op <= 0x26
LWC2 op == 0x32

If so, we handle them specially if reg (the rt register in the delay instruction) is used in the target instruction:

Target Instr Cond Effect
SLL (excluding NOP), SRL, SRA rd == reg && rt == reg Delay R/W
rt == reg Delay R
rd == reg Delay W
JR rs == reg Delay R
JALR rd == reg && rs == reg Delay R/W
rs == reg Delay R
rd == reg Delay W
SLLV, SRLV, SRAV
ADD, ADDU, SUB, SUBU
AND, OR, XOR, NOR
SLT, SLTU
rd == reg && (rt == reg || rs == reg) Delay R/W
rt == reg || rs == reg Delay R
rd == reg Delay W
MFHI, MFLO rd == reg Delay W
MTHI, MTLO rs == reg Delay R
MULT, MULTU, DIV, DIVU rt == reg || rs == reg Delay R
BLTZ, BGEZ, BLTZAL, BGEZAL rs == reg Delay R
JAL reg == 31 Delay W
BEQ, BNE rs == reg || rt == reg Delay R
BLEZ, BGTZ rs == reg Delay R
ADDI, ADDIU, SLTI, SLTIU
ANDI, ORI, XORI
rt == reg && rs == reg Delay R/W
rs == reg Delay R
rt == reg Delay W
LUI rt == reg Delay W
COP0: MFC0 or CFC0 rt == reg Delay W
COP0: MTC0 or CTC0 rt == reg Delay R
COP2: MFC2, CFC2, MTC2 or CTC2 rt == reg Delay R
LWL, LWR rt == reg Delay R/W
rs == reg Delay R
LB, LH, LW, LBU, LHU rt == reg && rs == reg Delay R/W
rs == reg Delay R
rt == reg Delay W
SB, SH, SWL, SW, SWR rt == reg || rs == reg Delay R
LWC2, SWC2 rs == reg Delay R
Delay R The branch delay instruction is executed. The new rd is saved and restored to its original value.

The first target instruction is executed.

The rd from the branch delay instruction is restored.

Delay W Normal behaviour
Delay R/W The branch delay instruction is skipped

Branch delay slots

programming

The MIPS architecture has the branch delay slot feature. It was thought to be a good idea in the 80s to reduce pipeline stalls. Unfortunately, it was not as useful as it seemed and it contained nop most of the time. (It is quite bad as it is a 4-byte penalty.)

Also, it did not stand up to the test of time and now it is just a pain in the ass, but we are stuck with it for backward compatibility.

How it works:

  instr1
  beq label1
  instr2
  instr3

label1:
  instr4

Branch taken:

  instr1
  instr2
  instr4

Branch not taken:

  instr1
  instr2
  instr3
  instr4

Notice that instr2 is always executed. I usually treat the code like this (for a CPU without branch delay slot):

  instr1
  instr2
  beq label1
  instr3

label1:
  instr4

This is not true if there are dependencies between instr2 and instr4, but compilers usually avoid this case.

Enter the dynamic recompiler!

programming

(Note: the info here is based on an open source MIPS R3000A emulator.)

We need to do dynamic recompilation to get some decent speed. We can simply translate

addi rs rt imm

to

mov eax, vm.regs[rs]
add eax, imm
mov vm.regs[rt], eax

(MASM syntax; I'm very old school.)

This assumes vm.regs[] evaulates to an absolute memory location.

We don't even need the dispatcher If we put this in an execution buffer. We only need 3 instructions to emulate addi.

Now we are talking.

(How about vm.pc and vm.cycles? We can update them just once per basic block.)

Where's the dynamic compilation?

We can emit 10 different code sequences if we want to:

1 rt == 0 (nop)
2 rs.state == CONST rt.state = CONST;
rt.val = rs.val + imm
3 rt == rs, imm = 1 inc vm.regs[rt]
4 rt == rs, imm = -1 dec vm.regs[rt]
5 rt == rs, imm = 0 (nop)
6 rt == rs, else add vm.regs[rt], imm
7 imm = 1 mov eax, vm.regs[rs]
inc eax
mov vm.regs[rt], eax
8 imm = -1 mov eax, vm.regs[rs]
dec eax
mov vm.regs[rt], eax
9 imm = 0 mov eax, vm.regs[rs]
mov vm.regs[rt], eax
10 else mov eax, vm.regs[rs]
add eax, imm
mov vm.regs[rt], eax

Many of the cases are not really necessary anymore: we can remove the special handling for imm = +/-1. (They save space, but are not any faster and may in fact be slower.)

Note that the operation is folded in some cases. If we want to get fanciful, we can even track register aliasing (case 9), or even real register allocation (eax has the value of vm.regs[rt] in some cases).

In general, I prefer a dumb recompiler and a separate flow analyzer that is able to work across emulated instructions.

A look at a real emulator

programming

The dispatcher:

while(vm.run_flag) {
  vm.code = mem(vm.pc);
  vm.pc += 4;
  ++vm.cycle;
  (*op_tbl[vm.code >> 26]) ();
  }

The opcode table:

void (*op_tbl[64]) () = {
  ...
  instr_addi, /* entry 8 */
  ...
  };

And the instruction:

// Rt = Rs + Im
void instr_addi() {
  int rt = (vm.code >> 16) & 0x1f;

  if(rt == 0)
    return;

  int rs = (vm.code >> 21) & 0x1f;
  int imm = (S16) vm.code;

  vm.regs[rt] = vm.regs[rs] + imm;
  }

This code is extracted from an open source MIPS R3000A emulator.

I estimated 25 real instructions are needed to emulate addi, not to mention a ton of memory accesses. The dispatcher itself already needs around 12 instructions. If the original CPU runs at 33.9 MHz, it takes a 850 MHz CPU to emulate it at full speed. And addi is one of the simpler instructions. Other instructions require even more decoding.

Also, this is just the CPU. We need to emulate other parts of the system too. Obviously, we need to do much better than this.

The simplest virtual machine

programming

Want to create your own virtual machine?

#define INSTR_HLT 0
#define INSTR_MOV 1

int mem[1024];
int *pc = mem;
int run_flag = 1;

void exec() {
  int src, dest;

  while(run_flag) {
    switch(*pc++) {
    case INSTR_MOV:
      dest = *pc++;
      src = *pc++;
      mem[dest] = mem[src];
      break;

    case INSTR_HLT:
      run_flag = 0;
      break;
    }
    }
  }

This is of course terribly inefficient, but it'll work. This is just to show what a virtual machine consists of:

  • Memory to store instructions and data
  • An infinite loop
  • Dispatching instructions

Do we need to use a switch statement? No, the dispatcher can look like this:

  while(run_flag)
    (*instr_tbl[*pc++]) ();

Such VMs are never efficient. Main reasons:

  • many real instructions to execute one virtual instruction
  • many memory accesses (virtual registers and the like)
  • many branches/calls — a killer even with branch prediction

We'll take a close look at performance in the future.

Now, why do we want to write our own VM?

Simple: to get true write-once-run-everywhere functionality!

Plus, it is a way to add very versatile scripting to our application.

Reward points make a difference

transport

A comparison of the rewards from the big three oil companies.

Assumptions:

  1. Pump formula 98
  2. $1.93/litre (3 Feb 2010)
  3. 12% discount (5% station + 7% credit card)
  4. Spend $1,000
  5. The reward points is used to redeem petrol voucher
Caltex 1 Thanks! point per litre.

1400 points for $40 petrol voucher.

589 points ($1000 / (1.93 * 0.88))

Equivalent to $16.80 petrol voucher.

ExxonMobil 1 Smiles point for every $1.25 spent.

750 points for $30 petrol voucher.

800 points ($1000 / 1.25)

Equivalent to $32 petrol voucher.

Shell 1 Escape point per litre.

600 points for $20 petrol voucher.

589 points ($1000 / (1.93 * 0.88))

Equivalent to $19.60 petrol voucher.

As you can see, ExxonMobil has the best reward based on the current assumptions. The higher the price, the better the reward. However, if the price is very low, say $1/litre, then Shell has the best reward, and Caltex and ExxonMobil are about the same.

The massive XBox Live ban hammer

Traditionally, Microsoft bans modded XBox 360 every November. Last year, they started on 30 October and lasted a few weeks. The ban was slightly different from previous years.

First, it was on a much larger scale. A news article said 600,000 consoles were banned in North America and another 400,000 worldwide. This is 5% of their Live membership.

Second, the consoles were not just banned from XBox Live, but also access to their own HD! (You still can play games using the DVD drive.) The functionality was much reduced.

Tip: never buy a second hand XBox 360.

In previous years, Microsoft targetted several aspects:

  • Out-of-band data: some data do not exist on a DVD-R
  • Behaviour: hacked firmware did not respond identically under some circumstances
  • Timing: hacked firmware responded too fast for certain commands

The latest hacked firmware resolved all that and was thought to be undetectable.

Until now. So many consoles were banned that people began to suspect the firmware was detectable after all. It is still unknown how Microsoft did it, but these facts are clear:

  • No unmodded consoles were banned
  • Modded with playing just one copied game was banned
  • Modded with no copied game was banned

(There are very few samples for the last case, so its accuracy is in doubt.)

The detection actually started several months back. It was done even in offline use. Once the console was flagged, it would be banned. Nothing could save it (using the original flash, playing only original games).

The best recourse is not to connect to XBox Live so that the console would not receive the ban signal that would render the HD useless. Keep the console for offline use only. But how to tell if the console is banned or not? Apparently you can call and ask for warranty support. They'll tell you if your console is tampered with or not — tampered consoles have their warranty voided.

This will work until the next "wave" of games that require the latest update to run. The update contains the blacklist, so the marked consoles will fail to work then.

So, can Microsoft detect the firmware?

My guess is yes, Microsoft can detect the firmware reliably. The simplest I can think of is the timing changes. A big change would indicate a modded firmware. Is it sufficient to get you banned? Yes, because a modded firmware is considered tampering.

Originally, I thought Microsoft wanted to be really sure, so the second flag (at least one copied game) must be tripped. But no, just the firmware itself was sufficient: some people were banned just by swapping drives — of different makes with original firmware. Unfortunately for them, this is easily detectable.

How would Microsoft detect copied games?

There were bad game rips that gave the game away, but some people had religiously followed the "stealth" instructions and still got banned. Plus, some people claimed to be banned playing just one game. It is clear: Microsoft is able to detect a single copied game.

My guess: perhaps there is a unique serial no. embedded in the out-of-band data. (DVDs are known to have that.) If more than one is in use at the same time, it is obvious the game is copied. (A non-unique serial no. will work too, just not as well.)

Microsoft wins this round.