My Rambling Thoughts

What if the IBM PC has paging?

I was thinking about bankable memory when I wondered: what if we could bank the entire address space?

Wow, that's half a Memory Management Unit (MMU) there!

We insert an MMU between the processor (8088 or 8086) and memory. It maps each "logical" address to a "physical" address.

Once we have an MMU, we are no longer constrained by the 8086's 20-bit address space. We can only access a 1 MB window at a time, but there can be much more memory behind it.

If we use 2 bytes per page entry and a 16 kB page size, we only need 64 entries to map the 1 MB address space. If we use 12 bits for the upper address (leaving 4 bits for other purposes), the total addressable memory is 64 MB (2^12 * 16 kB), which was pretty respectable even in the late 90s.

The total page table size, if we map all 64 MB, is 8 kB (2^12 entries * 2 bytes), which is still very manageable.

This is even better than what the 80286 can achieve — a mere 16 MB! (Excluding its use of segmentation.)

With paging, we have complete freedom to map any address — at multiples of 16k — anywhere. We can even hide it, e.g. the pesky video memory at 0xA0000.
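Here is a minimal sketch of the translation step, in Python purely for illustration. The entry layout (low 12 bits hold the physical page number) is my assumption, and the names `translate`, `FRAME_MASK` and the sample mapping are hypothetical.

```python
PAGE_SHIFT = 14                          # 16 kB pages -> 14 offset bits
PAGE_SIZE = 1 << PAGE_SHIFT
NUM_ENTRIES = (1 << 20) >> PAGE_SHIFT    # 1 MB window / 16 kB pages
FRAME_MASK = 0x0FFF                      # assumed: low 12 bits = physical page number

def translate(page_table, logical):
    """Map a 20-bit logical address to a physical address (up to 26 bits)."""
    index = (logical >> PAGE_SHIFT) & (NUM_ENTRIES - 1)
    offset = logical & (PAGE_SIZE - 1)
    frame = page_table[index] & FRAME_MASK
    return (frame << PAGE_SHIFT) | offset

# Start with an identity map, then point the page at 0xA0000 somewhere else.
table = list(range(NUM_ENTRIES))
table[0xA0000 >> PAGE_SHIFT] = 0x0FFF    # last 16 kB page of the 64 MB space
```

With this table, logical addresses below 0xA0000 come out unchanged, while anything in the 16 kB page at 0xA0000 lands at the very top of the 64 MB physical space.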

Why 16 kB?

Most processors opt for a 4 kB page size, especially those designed in the 80s. In hindsight, it's kind of small and resulted in huge page tables. Intel introduced 4 MB pages with the Pentium.

16 kB is a good sweet spot.

Why is it only half an MMU?

Note that the address translation happens without any support from the processor!

This means we are unable to handle page faults transparently. If the processor wants to read from a not-present page, we cannot tell the processor to abort the current instruction, service the page fault exception, then restart the instruction.

Accessing a not-present page will just return 0. Writes to read-only pages will be ignored.
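To make that behaviour concrete, here is a small simulation. It is only a sketch: the flag positions (`PRESENT` in bit 15, `WRITABLE` in bit 14) are my own pick for the 4 spare bits, and `SimpleMMU` is a hypothetical name.

```python
PAGE_SHIFT = 14
PRESENT = 0x8000     # hypothetical flag bit
WRITABLE = 0x4000    # hypothetical flag bit
FRAME_MASK = 0x0FFF

class SimpleMMU:
    def __init__(self, num_frames):
        self.table = [0] * 64                        # every page starts not-present
        self.ram = bytearray(num_frames << PAGE_SHIFT)

    def map(self, page, frame, writable=True):
        self.table[page] = PRESENT | (WRITABLE if writable else 0) | frame

    def _physical(self, logical):
        entry = self.table[(logical >> PAGE_SHIFT) & 63]
        phys = ((entry & FRAME_MASK) << PAGE_SHIFT) | (logical & 0x3FFF)
        return entry, phys

    def read(self, logical):
        entry, phys = self._physical(logical)
        return self.ram[phys] if entry & PRESENT else 0   # no fault, just 0

    def write(self, logical, value):
        entry, phys = self._physical(logical)
        if entry & PRESENT and entry & WRITABLE:
            self.ram[phys] = value & 0xFF                 # otherwise silently dropped
```

The processor never learns that anything went wrong, which is exactly why this is only half an MMU.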

If we want full MMU support, we need a second processor to resolve the page fault exception while the first processor waits for the memory access to complete (the memory would appear to be super-slow).

This has been done before, just not on 8086.


We can still have "pageable" memory, just that it is not fully transparent.

Before we access a 16 kB segment, we need to 'lock' it. The MMU will then bring the segment in as needed. We 'unlock' it after use so that the MMU can discard it or swap it to disk when memory is tight.

Kernel mode

Because we can set pages to not-present, we can now achieve memory protection.

We can have "User mode" where the not-present bit takes effect, and a "Kernel mode" where pages are always present. (The kernel will use a separate map to know if the pages actually exist or not.)

To switch from user to kernel mode, we trigger an interrupt via an MMU I/O port (interrupts must be enabled on the processor). The MMU will disable checking of the not-present bit and the ISR will run in kernel mode.

Before the interrupt handler returns, it re-enables checking of the not-present bit in the MMU, i.e. back to user mode.

This means we need to protect the interrupt table. Other programs can hook to interrupt handlers (they will run in user mode), just not to the MMU ones.

(It is best to have a separate bit for kernel/user mode, though.)

Turning paging to 11

When 64 MB is insufficient, we'll switch to 4-bytes per page and use the upper 24 bits. We get a total addressable memory of 256 GB (2^24 * 16 kB), which is better than most 32-bit processors (4 GB) [*] and somewhat competitive with 64-bit processors!

[*] Pentium Pro has 36-bit address bus that allows it to access 64 GB memory in PAE (Physical Address Extension) mode.

Even today, entry-level PCs and notebooks come with 8 GB of memory and mid-range ones come with 16 GB.

However, our page table now balloons to 64 MB! :horror: But of course, we don't have to map the entire space. If we only have 16 GB memory, we only need to map that. The page table is just 4 MB.
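The arithmetic can be checked with a few lines of Python. This is just a sketch; `page_table_bytes` is a hypothetical helper that assumes one table entry per 16 kB page.

```python
PAGE_SIZE = 16 * 1024
MB, GB = 1 << 20, 1 << 30

def page_table_bytes(memory_bytes, entry_bytes):
    """Size of a table with one entry per 16 kB page of memory."""
    return (memory_bytes // PAGE_SIZE) * entry_bytes

full_space = page_table_bytes(256 * GB, 4)   # all 2^24 pages
installed = page_table_bytes(16 * GB, 4)     # only 16 GB actually present
print(full_space // MB, "MB vs", installed // MB, "MB")
```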

Reimagining the CGA. What if?

The CGA is remembered for its very limited color capabilities in graphics mode. It can do:

  • 80x25 16-color text mode
  • 320x200 4-color
  • 640x200 2-color

Plus two undocumented modes:

  • 160x100 16-color
  • 160x200 16-color (composite display only)

The limitation is the 16 kB RAM. (640 x 200 / 8 = 16 kB.)

The CGA monitor takes in 4-bit TTL color inputs. It can display 16 colors in all modes. We can achieve 640x200 16-color if we have enough RAM: 62.5 kB.

64 kB was a tall order in 1981, but 32 kB was perhaps conceivable. The CGA card could ship with 16 kB and be expandable to 32 kB. However, the card has to be redesigned as it was a full-length card and it was already full of chips.

32 kB will enable 320x200 16-color, most often used by games.
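The memory figures above all come from the same formula. A quick sketch in Python (the helper name `framebuffer_bytes` is mine, and it assumes packed pixels with no padding):

```python
def framebuffer_bytes(width, height, colors):
    """Bytes needed for one screen, assuming packed pixels and no padding."""
    bits_per_pixel = (colors - 1).bit_length()   # 2 -> 1, 4 -> 2, 16 -> 4
    return width * height * bits_per_pixel // 8

for mode in [(640, 200, 2), (320, 200, 4), (320, 200, 16), (640, 200, 16)]:
    size = framebuffer_bytes(*mode)
    print(mode, size, "bytes =", size / 1024, "kB")
```

The two standard graphics modes both need 16,000 bytes, which is why they just fit in the 16 kB (16,384 bytes) on the card.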

Let's take a look at the CGA monitor first.

The CGA monitor

The CGA monitor has a horizontal sync rate of 15.75 kHz, meaning it scans 15,750 horizontal lines per second.

The refresh rate is 60 Hz, so the monitor can display 262 lines per refresh. 16 of that is used for vertical retrace.

CGA uses a dot clock of 14.318 MHz, that means 909 pixels/line (14.318 MHz / 15.75 kHz). We need 80 for horizontal retrace.

The above means the biggest image we can display is 829x246. This is much higher than 640x200!

Plus, if it operated in interlaced mode (30 Hz), it could display 829x492!
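The numbers above, worked through in Python as a sanity check. The retrace allowances (16 lines, 80 pixels) are simply the figures I used above, not measured values.

```python
H_SYNC_HZ = 15_750          # horizontal scan rate
REFRESH_HZ = 60
DOT_CLOCK_HZ = 14_318_000
V_RETRACE_LINES = 16        # allowance used in the text
H_RETRACE_PIXELS = 80       # allowance used in the text

lines_per_frame = H_SYNC_HZ // REFRESH_HZ        # 262.5 rounded down
pixels_per_line = DOT_CLOCK_HZ // H_SYNC_HZ      # about 909
max_width = pixels_per_line - H_RETRACE_PIXELS
max_height = lines_per_frame - V_RETRACE_LINES
interlaced_height = max_height * 2               # two fields per frame at 30 Hz

print(f"{max_width}x{max_height}, interlaced {max_width}x{interlaced_height}")
```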

(From what I read, the original CGA monitor did not interlace correctly — it did not half-offset interlaced rows.)

This is even higher resolution than the highest VGA resolution of 640x480!

So much potential gone to waste.

Can a CGA monitor really display 829 pixels horizontally?

Note that a CRT has no concept of a pixel. The electron gun just sweeps across the screen and the controller "changes the colors" as required.

A CRT does, however, have a dot pitch: the spacing between two dots of the same color.

Let's say we have a 12" monitor, displayable 11" (it's usual to be smaller). The width is 8.8" or 224mm. If the dot-pitch is 0.3mm, we get a maximum of 746 resolvable dots. (There are several asterisks to this calculation.)
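The same back-of-the-envelope calculation as a Python sketch. `resolvable_dots` is a hypothetical helper; it assumes a 4:3 tube and rounds the width to whole millimetres, as I did above.

```python
MM_PER_INCH = 25.4

def resolvable_dots(viewable_diagonal_in, dot_pitch_mm, aspect=(4, 3)):
    """Rough upper bound on horizontal dots, ignoring the asterisks."""
    w, h = aspect
    diagonal_units = (w * w + h * h) ** 0.5           # 5 for a 4:3 tube
    width_in = viewable_diagonal_in * w / diagonal_units
    width_mm = round(width_in * MM_PER_INCH)           # 223.5 mm rounds to 224
    return int(width_mm / dot_pitch_mm)

print(resolvable_dots(11, 0.3))
```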

Generally, we do not want to exceed the maximum resolvable dots — else not all pixels show up properly.

Does CGA really waste so much resolution?

It is hard to believe, but yes. CGA has a huge overscan border whose color can be set. It was almost always noted as a "feature". On VGA, the border is just 1 character wide.

It did not dawn on me until now that the CGA must still be actively "drawing" the overscan borders.

If you calculate the difference between 829x246 and 640x200, the horizontal borders are 94 pixels on each side and the vertical borders are 23 pixels on each side.

80x25 text mode

The 80x25 text mode was so ingrained that it remained unchanged until we switched to using graphics mode (i.e. Windows 3.1). Even then, windowed consoles default to 80x25.

CGA could support maybe 29 rows? EGA supports 43 rows and VGA 50 rows by using 8x8 fonts. But they never took off.

Columns-wise, a holy grail was 132 columns. EGA could not do it, but I vaguely recall VGA could (as could some Super EGA cards). But 90 or 100-columns should be within reach.

80-columns was a holdover from punch card days and there was no reason to stick to it. We still suffer from it even today — some coding standards limit line length to 80-columns! I'm old-fashioned but I have to say this is too archaic. :lol:

If CGA could support variable text mode (*), programs would need to query the screen size and handle accordingly.

(*) Not counting 40x25 since that was intended for composite displays (aka TVs).

4-color palette?

The 320x200 4-color mode has a choice of two palettes, plus intensity. There is also an "undocumented" palette, so altogether three choices:

Mode 4

Palette 0:

  1010  Bright Green
  1100  Bright Red

Palette 1:

  1011  Bright Cyan
  1101  Bright Magenta
  1111  Bright White

Mode 5

  101-  Bright Cyan
  110-  Bright Red
  111-  Bright White

0 is normally black, but it is really the "overscan" color. In 320x200 graphics mode, the overscan color is used for the background color. In 640x200 graphics mode, it is used as the foreground color.

The first question on everyone's mind was, could IBM have put in some palette system so that we could choose any 4 out of 16 colors?

They could (with additional h/w), but ultimately, 4 colors is too few (plus CGA resolution is too coarse to dither). We need at least 8 colors.

In an alternate universe, the CGA could support:

  • 320x200 4-colors with 16 kB (default)
  • 320x200 8-colors with 24 kB (8 kB expansion RAM)
  • 320x200 16-colors with 32 kB (16 kB expansion RAM)

With 32 kB of RAM, the CGA could do 640x200 4-colors and we have the same palette problem again. :-P

VBLANK interrupt

A VBLANK interrupt was left out of the CGA. It is very useful for knowing when to flip pages, for example for double-buffering.

CGA does not have enough memory for two pages in graphics mode, though.

EGA has VBLANK interrupt, but it was not well documented and the functionality never caught on (in a standard way).

The CGA snow

The CGA is also infamous for its "snow" during 80x25 text mode. In this mode, there is not enough bandwidth for both CPU and screen refresh, so if the CPU wants to update the active video buffer, the screen "glitches" as it draws random bits (why can't it be zero?).

The workaround is to update the active video buffer only during horizontal sync or vertical sync. Obviously this makes screen updating very slow.

Another way is to blank the screen. The BIOS does this when it wants to scroll the page.

Yet another way is to use page flipping. The VBLANK interrupt would have been useful here.

I don't understand why the CGA controller cannot tell the CPU to wait. The CPU should be able to access the video buffer 7 out of every 8 pixels. For the last one, it needs to wait for the CGA controller to read two bytes. It cannot be that long, can it?

Reimagining the IBM PC. What if?

The IBM PC had several "mistakes" other than the 640 kB memory barrier. Of course, this is in hindsight.

IRQ lines

One common lament is the lack of IRQs, or IRQ conflicts. The IBM PC has 8 IRQ lines; the AT increases this to 15.

  IRQ  Device
  0    System timer
  1    Keyboard controller
  6    Floppy disk controller

If your PC is fully decked out, you only have one IRQ line left — IRQ 2. But generally people only have one printer (LPT1) and perhaps one modem (COM1), so LPT2 and COM2 are free.

Each device (that can generate interrupt) needs its own IRQ line because they cannot be shared. This is the main issue. What if they can be shared?

First, it should be possible to share IRQs for like-devices. LPT1/LPT2 and COM1/COM2 can use just one IRQ line. The ISR must poll all four of them. If this was done, then there would be 4 free IRQs.

Also, the Keyboard Controller has very low interrupt rate (at most 15 chars/s), so it can be shared with the above.

The IRQ assignment could look like this:

  IRQ  Device
  0    System timer
  3    Keyboard Controller/COM1/COM2
  6    Floppy disk controller

We don't even need 7 new IRQ lines on AT:

  IRQ  Device
  0    System timer/RTC
  3    Keyboard Controller/Mouse/COM1/COM2
  6    Disk controller (Floppy, HDD pri/sec)
  7    FP unit

Edge vs level triggering

IRQ lines cannot be shared because IBM chose to use edge-triggered interrupts. If two devices raise the same interrupt at the same time, the ISR does not know which one triggered it.

If IBM had chosen to use level-triggered interrupts, the device would hold the interrupt line high. This allows the ISR to know which device triggered it.
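A rough sketch of how a shared, level-triggered ISR could work, simulated in Python. Everything here is hypothetical: the `Device` class, its `pending` status bit and the wired-OR `irq_line` are stand-ins for real hardware.

```python
class Device:
    def __init__(self, name):
        self.name = name
        self.pending = False      # the device's own interrupt-pending status bit

    def service(self):
        self.pending = False      # servicing the device drops its request

def irq_line(devices):
    """Level-triggered shared line: asserted while any device is pending."""
    return any(d.pending for d in devices)

def shared_isr(devices):
    """Poll every device on the line and service the ones that are pending."""
    serviced = []
    while irq_line(devices):      # the line stays asserted until all are done
        for d in devices:
            if d.pending:
                d.service()
                serviced.append(d.name)
    return serviced
```

Because the line stays asserted until every request is cleared, a second device that raises its interrupt while the first is being serviced is not lost.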

DMA channels

The IBM PC has four 8-bit DMA channels; the AT increases this to 7 (the new ones are 16-bit). Unlike IRQ lines, there is no contention because few devices use DMA (or at least system DMA). On the IBM PC:

  Channel  Use
  0        Memory refresh
  2        Floppy drive

To be useful, DMA must out-perform the processor for 16-bit transfers. (Hint: on the AT, it doesn't.)

Bus speed

The expansion bus runs at the same speed as the CPU. This was not an issue on the IBM AT, which topped out at 8 MHz. But later, when IBM AT-compatibles increased the speed to 12 MHz, expansion cards started to fail.

The solution was to fix the bus speed to 8 MHz.

This was not IBM's fault. By this time, it had moved to its new patented MCA bus for its PS/2 line.

The AT bus has a theoretical bandwidth of 5.33 MB/s (8 MHz * 16-bits / 3 clocks/transfer). However, the data must go somewhere, say memory, so that takes another 2 - 3 cycles plus wait states, plus instruction execution time, so the max is around 1.78 MB/s.

This severely throttled the performance of video games and graphical applications in the early 90s. To refresh 320x200 256-colors at 60 Hz, we need a bandwidth of 3.66 MB/s.
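Here is the bandwidth arithmetic as a Python sketch. The figure of 9 clocks per word for the realistic case is my assumption, chosen because it reproduces the 1.78 MB/s estimate. Note that the bus figures use decimal megabytes while the 3.66 figure is in units of 2^20 bytes.

```python
MIB = 1 << 20

def bus_bandwidth(clock_hz, bus_bits, clocks_per_transfer):
    """Bytes per second for back-to-back transfers."""
    return clock_hz * (bus_bits // 8) / clocks_per_transfer

def refresh_bandwidth(width, height, bytes_per_pixel, fps):
    """Bytes per second needed to rewrite the whole screen every frame."""
    return width * height * bytes_per_pixel * fps

peak = bus_bandwidth(8_000_000, 16, 3)        # theoretical AT bus peak
realistic = bus_bandwidth(8_000_000, 16, 9)   # assumed ~9 clocks per word
needed = refresh_bandwidth(320, 200, 1, 60)   # 256 colors = 1 byte per pixel

print(round(peak / 1e6, 2), round(realistic / 1e6, 2), round(needed / MIB, 2))
```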

Even with hindsight, it is difficult to create a PC bus that is forward-looking. It has to support:

  Speed:          4.77 MHz | 8 MHz | 8 MHz | Sys speed/n | Sys speed/n
  Bus mastering:  N | N | Y | Y

Instead of fixing the bus speed to 8 MHz, there should be options to run the bus at a divisor of system speed, e.g. /1 to /4. This is especially useful for video cards when system speed goes above 20 MHz. Most other cards can run at system speed /2 or /3. It would be even better if this were configurable per slot!

(The fastest system speed was 40 MHz, IIRC. CPUs ran with a multiplier, e.g. the 486DX2-66 ran at 66 MHz with system speed of 33 MHz.)

INT 13h disk addressing

From the late 80s to the late 90s, there was a lot of confusion about disk limits: 504 MB, 2 GB, 8 GB and 128 GB. If your HDD exceeded one of these sizes, you had better make sure your BIOS supported it.

(There are other limits — imposed by DOS due to FAT format.)

How did these limits come about?

INT 13h Disk Services specify the sector using CHS addressing (Cylinder-Head-Sector).

  Interface   Cylinders  Heads    Sectors  Limit
  BIOS        1024       255 (*)  63       7.84 GB
  ATA         1024       16       256      2 GB
  Effective   1024       16       63       504 MB

(*) The limit should be 256, but DOS will not boot up.

This is the first limit: 504 MB.
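The 504 MB figure falls out of taking the smaller value of each field. A quick Python check, using the CHS figures from the table above (the helper name `chs_capacity` is mine):

```python
SECTOR_BYTES = 512
MB, GB = 1 << 20, 1 << 30

def chs_capacity(cylinders, heads, sectors):
    """Total bytes addressable with the given CHS geometry limits."""
    return cylinders * heads * sectors * SECTOR_BYTES

BIOS = (1024, 255, 63)
ATA = (1024, 16, 256)
effective = tuple(min(b, a) for b, a in zip(BIOS, ATA))   # both must agree

print(effective, chs_capacity(*effective) // MB, "MB")
```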

Once your disk exceeds 504 MB, you need a BIOS that translates "logical" CHS to physical CHS. This allows disks up to 2 GB to be used. But the IDE Cylinder register is a 16-bit register, so we can go much bigger:

  Interface       Cylinders  Heads  Sectors  Limit
  BIOS (logical)  1024       255    63       7.84 GB
  IDE             65536      16     256      128 GB

Now we are limited by the BIOS interface of 7.84 GB.

INT 13h Extensions used 64-bit LBA (Logical Block Addressing). At the device level, it is either 28-bits (128 GB) or 48-bits (128 PB).

There is really no advantage of LBA-28 over P-CHS except it is simpler.

We have long exceeded 128 GB, but it'll take some time to hit 128 PB. Disk sizes today are 1 - 8 TB.

I rambled so long that I forgot what I was going to say.

The original INT 13h Disk Services use three 8-bit registers for a total of 24-bits, that's why we have the 7.84 GB limit.

There are two changes we can make:

  • use two 16-bit registers
  • use >512 bytes/sector

It was unfortunate that 512 bytes/sector was so ingrained that it could not be changed even after 2 decades!

  Sector size  Disk size
  512 bytes    2 TB
  2 kB         8 TB
  4 kB         16 TB
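These sizes follow from a 32-bit sector number (two 16-bit registers). A small Python sketch; `max_disk_bytes` is a hypothetical helper, and it reproduces the LBA-28 and LBA-48 limits mentioned earlier as well.

```python
GB, TB, PB = 1 << 30, 1 << 40, 1 << 50

def max_disk_bytes(address_bits, sector_bytes=512):
    """Largest disk addressable with the given sector-number width."""
    return (1 << address_bits) * sector_bytes

for sector in (512, 2048, 4096):
    print(sector, "bytes/sector ->", max_disk_bytes(32, sector) // TB, "TB")
```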

Async disk access

The IBM AT BIOS has support for async disk access, but I doubt anyone took advantage of it. Not DOS, certainly.

The 640 kB barrier

Ah, the infamous 640 kB memory limit that plagued PC programmers for one and a half decades. :lol:

IBM carved up the 1 MB addressable memory into three parts:

  • Memory (640 kB) [0x00000 to 0x9FFFF]
  • Video memory (128 kB) [0xA0000 to 0xBFFFF]
  • ROM (256 kB) [0xC0000 to 0xFFFFF]

The first PC was shipped with either 16 kB or 64 kB memory. 640 kB was 10x that. Later, they were shipped with 256 kB on-board memory and a 384 kB memory expansion board was offered.

The MDA (Monochrome Display Adapter) has 4 kB video memory at 0xB0000. The CGA (Color Graphics Adapter) has 16 kB video memory at 0xB8000. In other words, from day 1, it was only possible to increase contiguous memory by 64 kB (to 0xAFFFF).

The EGA (Enhanced Graphics Adapter) was introduced in 1984 and has 64 kB video memory at 0xA0000 (in graphics mode). By then, the possibility to increase contiguous memory closed forever.

How can things be different in an alternate universe?

Large ROM space

The first "mistake" was the large space — four 64 kB segments — reserved for ROM. The last segment 0xF0000 was reserved for system BIOS, the first three segments (192 kB) were for expansion ROMs. The assumption was that some (most?) expansion cards would come with ROM. It turned out that only two were needed: video and hard disk. Everything else could be loaded as drivers.

The system BIOS scans for expansion ROMs starting from 0xC0000 in 2 kB intervals to 0xEF800 (best not to use 0xF0000).

EGA has 16 kB ROM (at address 0xC0000) to extend system video BIOS. If it kept 100% CGA compatibility, the system BIOS could control it and the extended functionality be put in a driver.

Alternate memory map:

  • System ROM (32 kB) [0xF8000 to 0xFFFFF]
  • Expansion ROM (64 kB) [0xE8000 to 0xF7FFF]
  • Video memory (32 kB) [0xE0000 to 0xE7FFF]
  • Memory (896 kB) [0x00000 to 0xDFFFF]

This gives an additional contiguous 256 kB.
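A quick way to check that a proposed map has no gaps or overlaps, sketched in Python. The helper `region_sizes_kb` is hypothetical; the addresses are the ones from the alternate map above.

```python
KB = 1024
alternate_map = [             # (name, first address, last address)
    ("Memory",        0x00000, 0xDFFFF),
    ("Video memory",  0xE0000, 0xE7FFF),
    ("Expansion ROM", 0xE8000, 0xF7FFF),
    ("System ROM",    0xF8000, 0xFFFFF),
]

def region_sizes_kb(regions):
    """Return {name: size in kB}, checking the regions tile the 1 MB space."""
    expected_start = 0
    sizes = {}
    for name, first, last in regions:
        assert first == expected_start, f"gap or overlap before {name}"
        sizes[name] = (last - first + 1) // KB
        expected_start = last + 1
    assert expected_start == 1 << 20, "map does not end at 1 MB"
    return sizes

sizes = region_sizes_kb(alternate_map)
print(sizes, "extra contiguous memory:", sizes["Memory"] - 640, "kB")
```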

Fixed video memory address

The second "mistake" was the fixed video memory address.

EGA needed 28,000 bytes in its highest resolution 640x350 graphics mode (each pixel is 1-bit due to planar mode).

VGA (Video Graphics Array), introduced in 1987, needed 38,400 bytes in 640x480 mode (again, planar mode) and a whopping 64,000 bytes in its 320x200 256-color mode.

There are two possible ways:

  • Make the base address selectable using DIP switch
  • Make the buffer bankable

The second one can free up the entire video memory space if there is no need to access video memory.

Revised memory map:

  • System ROM (16 kB) [0xFC000 to 0xFFFFF]
  • Expansion ROM (48 kB) [0xF0000 to 0xFBFFF]
  • Bankable memory (64 kB) [0xE0000 to 0xEFFFF]
  • Memory (896 kB) [0x00000 to 0xDFFFF]

The maximum memory remains the same, but now we have a more general-purpose bankable memory segment that can be used for Expanded Memory (EMS), for example.

Meet the 896 kB limit

After we overcome the 640 kB limit, we now have a new limit of 896 kB, or if we squeeze a little more, 960 kB. But if you need >640 kB (*), you probably need a lot more — 1 to 2 MB, not a measly 256 kB. It is unavoidable to use either EMS (Expanded Memory) or XMS (Extended Memory).

(*) By the time DOS and standard drivers are loaded, you have around 530 kB to 580 kB of free memory (IIRC).

Conversely, if we need to shrink the program to make it work in 896 kB, we can probably shrink it even more to fit in 640 kB, perhaps even in 256 kB — widely considered to be the minimum memory configuration.

8086 1-byte opcodes: all have to go

The 8086 has 95 1-byte opcodes. I'll keep only 8:

  • CS:, DS:, ES:, SS:
  • NOP
  • INT 3

NOP and INT 3 must be 1-byte. The other six are frequently used. Everything else must go (i.e. become 2-byte opcode).

It is kind of a waste to use 4 slots on segment prefixes when they are going away on 32-bit architecture... (or do they? They will truly go away in 64-bit architecture.)

Some multi-byte instructions that favour registers must also go:

  • MOV reg, immed
  • ADD/SUB/... ax, immed
  • TEST ax, immed
  • MOV ax, [disp]
  • MOV [disp], ax

Specialized instructions

A few come to mind: MUL/IMUL, DIV/IDIV, SHL/ROL family, IN/OUT, CBW/CWD, JCXZ, LOOP, XLAT.

MUL/IMUL puts the result in DX:AX. It "wastes" DX even if we are not interested in it. Can generalize it partially.

DIV/IDIV divides DX:AX and puts the remainder in DX. It is common for the starting DX to be zero (i.e. divide 16-bit by 16-bit). Can generalize it partially.

SHL/ROL uses CL. Shifts are usually constant. RCL/RCR only makes sense with 1.

IN/OUT uses AX and DX. Ports are usually accessed in a group. If we allow IN reg, [reg+disp], we just need to assign the base port address once.

CBW uses AX. CWD uses DX:AX. Can generalize (partially) to use other registers.

JCXZ and LOOP use CX. Can generalize to allow other registers.

XLAT uses BX and AL. Can generalize to use other registers.

Immediate operand

It turns out that it is common to operate on small numbers.

We can allow OP reg, i5s in 2-bytes and it would be much more useful. This gives registers an advantage over memory operands.

With this, there is no need for INC/DEC instructions.
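To show what a 5-bit signed immediate buys us, here is a small Python sketch of the encoding. It assumes plain two's complement, and the function names are hypothetical; the range works out to -16..15, so INC and DEC become ADD reg, 1 and ADD reg, -1.

```python
def encode_i5s(value):
    """Pack a small signed integer into a 5-bit two's-complement field."""
    if not -16 <= value <= 15:
        raise ValueError("does not fit in a 5-bit signed immediate")
    return value & 0x1F

def decode_i5s(field):
    """Sign-extend a 5-bit field back to a signed integer."""
    field &= 0x1F
    return field - 0x20 if field & 0x10 else field

print(encode_i5s(1), encode_i5s(-1), decode_i5s(0x10))
```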


PUSH i8s and PUSH i16 are also useful.

Instead of 1-byte PUSH reg, we can have 2-byte PUSH reg-mask. Normally we want to save a group of registers.

SP is no longer a GPR. We have explicit MOV and ADD for it. Other operations do not make sense for it.


JS/JNS (Jump if Sign) and JP/JNP (Jump if Parity) are useless. Remove them totally.

Allow all conditional jumps to have 16-bit offsets.

Near and far calls

I'd love to have a single set of RET that can handle both near and far returns, but unfortunately I am not able to think of a way.

We can differentiate near and far CALLs if we restrict call targets to start at even addresses. In that case, an even target is a near call and an odd target is a far call.
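A tiny sketch of how a decoder could tell the two apart, in Python. This only illustrates the even/odd idea; `classify_call` is a hypothetical name, and how the segment for a far call is supplied is left open.

```python
def classify_call(target_field):
    """Return (kind, offset) for a 16-bit call target field."""
    if target_field & 1:
        # Odd: the low bit is only a marker, the real target is still even.
        return "far", target_field & 0xFFFE
    return "near", target_field

print(classify_call(0x1234), classify_call(0x1235))
```

Both calls above reach offset 0x1234; only the low bit tells the processor whether a segment is involved.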


Remove REP and REPNE. Strings must be manually looped. These two are "signature" x86 instructions.

LODS, STOS, SCAS can use registers other than AX.

I'm thinking whether we need MOVS. It can be implemented as:

  JCXZ skip
@@: LODSW
  STOSW
  LOOP @b
skip:

REP MOVSW is tempting because it looks like it is the fastest way to move memory. It was, but there are faster ways now.

In fact, do we even need LODS, STOS and LOOP?

  JCXZ skip
@@:
  MOV ax, [si]
  ADD si, 2
  MOV [di], ax
  ADD di, 2
  DEC cx
  JNZ @b
skip:

(Assuming forward direction.)


Add an additional byte after ESC to increase co-processor opcode space. The 8087 is stack-based, most operations work on the ToS (top-of-stack), making it hard to pipeline in the future (starting with Pentium). Make it register-based.

Instruction formats

Make the formats oriented for "fast" decoding:

1  op
2  ext op
   ext op reg
   op reg, i5s
   op reg, r/m
   op r/m
   op i8s
   branch rel8
   op reg, r/m+d8
   op r/m+d8
   op i16
   ext branch rel8
   branch rel16
   op reg, r/m+d16
   op r/m+d16
3  ext op d8
   ext op reg, d8
   op r/m, i8s
   esc op r/m
   op r/m+d8, i8s
   esc op r/m+d8
   op r/m+d16, i8s
   esc op r/m+d16
4  op r/m, i16
   op r/m+d8, i16
   op r/m+d16, i16
5  op i32

(This covers 99% of the instructions.)


Characteristics of RISC:

  • Register-based operations, with explicit operands (e.g. d = s1 op s2)
  • Memory access limited to MOV, mostly
  • Large register bank
  • Fixed length instruction

8086, like most (all?) CISC, can operate on memory directly, i.e.

ADD ax, mem
ADD mem, ax

Reading from memory and operating on it is fine, but operating on memory directly? It is fine for a single operation, but a series of operations will have unnecessary read/writes. For example:

ADD mem, 3
AND mem, ~0x03
SHL mem, 1

(3 reads and 3 writes)


MOV ax, mem
ADD ax, 3
AND ax, ~0x03
SHL ax, 1
MOV mem, ax

(1 read and 1 write)
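The difference in memory traffic can be counted with a toy model in Python. This is only a sketch: `CountingMemory` is a hypothetical stand-in for a single 16-bit memory word, and the two functions mirror the two instruction sequences above.

```python
class CountingMemory:
    """One 16-bit memory word that counts its reads and writes."""
    def __init__(self, value):
        self.value, self.reads, self.writes = value, 0, 0

    def load(self):
        self.reads += 1
        return self.value

    def store(self, value):
        self.writes += 1
        self.value = value & 0xFFFF

def memory_operand_version(mem):
    mem.store(mem.load() + 3)        # ADD mem, 3
    mem.store(mem.load() & ~0x03)    # AND mem, ~0x03
    mem.store(mem.load() << 1)       # SHL mem, 1

def register_version(mem):
    ax = mem.load()                  # MOV ax, mem
    ax = ((ax + 3) & ~0x03) << 1     # ADD / AND / SHL on the register
    mem.store(ax)                    # MOV mem, ax

a, b = CountingMemory(5), CountingMemory(5)
memory_operand_version(a)
register_version(b)
print((a.reads, a.writes), (b.reads, b.writes), a.value == b.value)
```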