Tales of Japan first impressions

Jun 2026

Folk Tales of Japan

Given that the books were sold by Amazon.jp, I thought they would be shipped by express delivery. :lol: But then, it was 'free shipping'.

The books are smaller than I thought. I thought they were 4 x 6" — close, 4.4 x 7".

Each story is abbreviated from the original, then he gives his commentary.

I'll say it is acceptable, but that's all.

Should I have bought all four books? No. This is why I wanted to buy one book to try first. :lol:

S$32 for one book vs S$88 for all four. What's your choice?

What-if redesign of 8087

Jun 2026

First, register-based instead of stack-based.

There are 8 FP registers. Each is 48-bit (4-bit type, 12-bit exponent and 32-bit mantissa in two's-complement representation and no implicit bit). The type indicates if it is a normal number, denormal, infinity, NaN, integer or pointer. This is obviously not IEEE 754 format.
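To make the 48-bit layout concrete, here is a rough Python sketch of packing and unpacking a register. The field order (type on top, then exponent, then mantissa), the type codes and the signed exponent are my assumptions for illustration; nothing above fixes them.

```python
# Hypothetical layout: bits 47..44 type, 43..32 exponent, 31..0 mantissa.
# Type codes below are made up for this sketch.
TYPE_NORMAL, TYPE_DENORMAL, TYPE_INF, TYPE_NAN, TYPE_INT, TYPE_PTR = range(6)

def sign_extend(value, bits):
    """Interpret an unsigned field of the given width as two's complement."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def pack_fp48(type_code, exponent, mantissa):
    """Pack a type, a signed 12-bit exponent and a signed 32-bit mantissa."""
    assert 0 <= type_code < 16
    assert -(1 << 11) <= exponent < (1 << 11)
    assert -(1 << 31) <= mantissa < (1 << 31)
    return (type_code << 44) | ((exponent & 0xFFF) << 32) | (mantissa & 0xFFFFFFFF)

def unpack_fp48(word):
    """Return (type, exponent, mantissa) from a 48-bit word."""
    return (
        (word >> 44) & 0xF,
        sign_extend((word >> 32) & 0xFFF, 12),
        sign_extend(word & 0xFFFFFFFF, 32),
    )
```

The round trip `unpack_fp48(pack_fp48(t, e, m))` gives back `(t, e, m)` for any in-range fields.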

Just like the redesigned 8086, it allows memory operands as source and MOV is used to write to memory.

It supports 32-bit single-precision float directly. There is no need to move it to a register first.

; fp src
fadd fp0, fp1       ; 48f + 48f -> 48f
fmov a, fp0         ; 48f -> 32f, default rounding

; mem src
fadd fp0, b         ; 48f + 32f -> 48f
fmov.rz a, fp0      ; 48f -> 32f, truncate

Operations can be rounded to single-precision directly. This allows repeatable calculations in case of register spillage.

; a = b + c + d
fmov fp0, b         ; 32f -> 48f
fadd fp0, c         ; 48f + 32f -> 48f
fadd fp0, d         ; 48f + 32f -> 48f
fmov a, fp0         ; 48f -> 32f, default rounding

; a = b + c + d (single precision)
fmov fp0, b         ; 32f -> 48f
fadd.r fp0, c       ; 48f + 32f -> 48f (default rounding to 32f)
fadd.r fp0, d       ; 48f + 32f -> 48f (default rounding to 32f)
fmov a, fp0         ; 48f -> 32f, default rounding
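Why per-operation rounding matters can be seen with a small Python experiment. This is only a sketch: Python's 64-bit float stands in for the wider 48-bit register (an assumption), and the helper names are mine.

```python
import struct

def to_f32(x):
    """Round a 64-bit Python float to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_wide(b, c, d):
    # Intermediates stay in the wider format; round once on the final store.
    return to_f32(b + c + d)

def sum_single(b, c, d):
    # Round after every operation, in the spirit of fadd.r.
    return to_f32(to_f32(b + c) + d)

b, c, d = 1.0, 2.0 ** -24, 2.0 ** -24
print(sum_wide(b, c, d) == sum_single(b, c, d))
```

With these inputs the two results differ in the last bit. The per-operation version gives the same answer whether or not an intermediate is spilled to a 32-bit memory slot, which is the repeatability being asked for.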

Conversion operations can specify rounding mode. This is useful for float-to-int conversion (truncate towards zero) when the default is round-to-nearest.

fcnvint dx:ax, fp0     ; convert to 32-bit int (implicit .rz)
fcnvint.r dx:ax, fp0   ; convert to 32-bit int w/ default rounding

Rounding modes are: nearest even (.re), zero (.rz), up (.ru), down (.rd). .r uses the default (normally set to .re).
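The four modes behave differently on halfway and negative values. A quick Python model, where the function name just mirrors the mnemonic above and is not a real API:

```python
import math

def fcnvint(x, mode="rz"):
    """Convert a float to an int using one of the four rounding suffixes."""
    if mode == "re":
        return round(x)        # Python's round() is round-half-to-even
    if mode == "rz":
        return math.trunc(x)   # towards zero
    if mode == "ru":
        return math.ceil(x)    # towards +infinity
    if mode == "rd":
        return math.floor(x)   # towards -infinity
    raise ValueError(f"unknown rounding mode: {mode}")
```

For -2.5 the modes give -2 (.re), -2 (.rz), -2 (.ru) and -3 (.rd); for 2.5 they give 2, 2, 3 and 2.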

Only basic arithmetic operations are provided, as well as FMULADD (this must be in, no matter what!), FRECIPROCAL and FSQRT.

Special values

Denormals use the full mantissa. The difference from normal numbers is that the top bit is 0. Exponent is set to -2^11.

Infinity uses the top bit of the mantissa as the sign. The rest of the mantissa is not used, but should be set to 0. Exponent is set to 2^11 - 1.

For NaN, exponent and mantissa are not used, but they should be set to 0.

Integer and Pointer types are for "NaN-boxing". Integer allows 44-bit integer to be stored in exponent and mantissa. Pointer allows 44-bit pointer. It is separate so that second-level decoding is not needed.
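Boxing an integer then becomes a tag check plus a mask. A minimal sketch, assuming the type sits in the top 4 bits, the integer is signed two's complement, and 4 is the Integer type code (all three are my assumptions):

```python
TYPE_INT = 4                      # hypothetical type code
PAYLOAD_MASK = (1 << 44) - 1      # exponent + mantissa = 44 bits

def box_int(value):
    """Store a signed 44-bit integer in the exponent and mantissa fields."""
    assert -(1 << 43) <= value < (1 << 43)
    return (TYPE_INT << 44) | (value & PAYLOAD_MASK)

def unbox_int(word):
    """Recover the integer; only valid if the type field says Integer."""
    assert (word >> 44) == TYPE_INT
    payload = word & PAYLOAD_MASK
    sign = 1 << 43
    return (payload ^ sign) - sign
```

Because Integer and Pointer have their own type codes, the check is a single compare on the top 4 bits, with no second-level decode of the payload.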

Double precision

The problem with floating-point math is that we cannot use single precision floating-point instructions as building blocks to obtain higher precision floating-point numbers — well, we can, but they are not in double precision format.

We either need to support it natively, or have floating-point emulation friendly instructions such as CMOV (conditional move), CNTLZ (count leading zeros) and SHLD/SHRD (shift left/right double).

An eye on the future

Vectorization, different formats (64-bit double precision, various 16-bit half-prec, various 8-bit, Posits).

Revelation on interstellar travel

Jun 2026

After listening to many AI Richard Feynman videos on why aliens cannot visit us due to the immense distances involved (where even light speed is slow), it suddenly dawned on me that we are boxing ourselves into 'reality', and that is why it is impossible.

Long story short, we have not found aliens because we are doing it wrong. In our reality, it seems obvious to use radio waves and use HI line as indicator of intelligent life.

But no advanced civilizations do it this way. It is simply too inefficient — primitive, even.

What is the way to communicate — and travel? It is through higher dimensions. If a civilization does not know how to do this, they are not qualified. Such as us.

Once we break through, we may find the Universe to be quite boisterous!

In unrelated news, the third part of FF7 Remake was announced: FF7 Revelation. Many people were sure it was going to be called Reunion.

Book time!

Jun 2026

Folk Tales of Japan

I happened upon some skits on YouTube by Kyota Ko and they were entertaining — and educational. He did the skits to encourage people to buy his books. I intended to pick up one book — maybe two — first, but ended up buying all four.

Murdoku: 80 Murder Mystery Logic Puzzles (Vol 1)

While on Amazon.sg, this book was recommended to me. It is a logic puzzle book. It merges Sudoku-style number placement with logic grid mysteries. It piqued my curiosity. Let's buy one and see first. I bought this from Blackwell because it was cheaper there.

(This is the UK edition. The US edition has purple cover and is OOS.)

Murdle Murder Mystery Logic Puzzles (Book 1)

Also recommended at the same time, this is a collection of grid-based murder-mystery logic puzzles. Again, let's try one book first.

A Thousand Miles of Wind, the Sky at Dawn: Part 1 (Book 5)

Twelve Kingdoms is being re-translated by Seven Seas Entertainment. I knew about this, but I didn't plan to buy — I already have the Tokyopop edition.

A Thousand Miles of Wind, the Sky at Dawn: Part 1 was just released — Jun 2026! This is the most 'happening' arc. I decided to buy it and try. If it is done well, I'll buy Part 2 as well (Sep 2026). I bought this from Blackwell.

I may be interested to get Shadow of the Moon, Shadow of the Sea, but only Part 2. It covers the second half of the protagonist's journey in the unfamiliar fantasy world.

Brief notes on 8087

Jun 2026

The 8087 was designed by a numerical analysis expert and served as working proof for the IEEE 754 floating-point spec. It was revolutionary. It was released in 1980; the spec was ratified in 1985.

Before the 8087, floating-point math had proprietary formats (limiting inter-operability), lacked accuracy and consistency (rounding and precision) and was mostly emulated (so ultra-slow).

The 8087 supports IEEE 754 single precision (32-bit) and double precision (64-bit) formats. Internally, it uses a stack-based 8-deep set of 80-bit FP registers. Each FP register has 1 sign bit, 15-bit exponent and 64-bit mantissa (no implicit bit).
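For reference, the value of a normal 80-bit number can be computed like this. It is a sketch for normal numbers only (no denormals, infinities or NaNs), and the function name is mine:

```python
import math

def decode_x87(sign, exponent, mantissa):
    """Value of a normal 80-bit extended number.

    sign:     0 or 1
    exponent: 15-bit biased exponent (bias 16383)
    mantissa: 64-bit integer; bit 63 is the explicit integer bit
    """
    # mantissa / 2**63 is the significand in [1, 2) for normal numbers.
    # Converting to a Python float keeps only 53 bits, so low bits may be lost.
    value = math.ldexp(mantissa, exponent - 16383 - 63)
    return -value if sign else value
```

For example, 1.0 is sign 0, exponent 3FFFh and mantissa 8000000000000000h. The leading 1 is stored, unlike in the 32-bit and 64-bit formats.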

Most 8087 instructions operate on ToS (Top-of-Stack). Programmers were used to operands and were unfamiliar with stack-based operations. It was a struggle to write efficient code.

The biggest issue with 8087 is its buggy stack architecture. Due to misalignment between the design and hardware teams, the hardware does not automatically spill an overflowed stack to a memory-based virtual stack. It is handled as an exception, which is complex and slow. Software works around it by not overflowing the stack in the first place.

Because of this, it gives unpredictable, inconsistent results depending on whether the calculations are done entirely in 80-bit registers or spilled into 64-bit FP in memory (with less precision) midway, which in turn depends on the compiler and optimization level.

At one point, it was thought the reliance on ST(0) made it impossible to pipeline FP operations — because they all used ST(0). But Pentium proved it was possible to do register renaming with FXCH and achieved pipelined FP operations. It was a breakthrough. From that point, x86 FP became competitive in speed with RISC CPUs.

The second issue is that explicit synchronization is needed using FWAIT. This is needed 99% of the time, so assemblers insert FWAIT automatically before each FP instruction. This is not needed from 80287 onwards as the CPU waits for the FPU automatically.

8087 runs in parallel to 8086. It is slow compared to integer operations, e.g. FADD takes 70 – 100 cycles, so it is possible to run many x86 instructions before executing the next FP operation. Question is, how many programs made use of this?

The third is emulation. Unlike 80286, 8086 does not raise a Coprocessor Absent exception that would have allowed transparent software emulation. This means the executable does not contain FP instructions directly, but must use emulator-transformed code that calls emulated functions if 8087 is absent, or is modified to 8087 code if present. This is an excellent use of self-modifying code.

The technique is pretty clever. The compiler emits actual 8087 code and marks each instruction as requiring a fixup (relocation). The linker then transforms them into emulated calls using the fixup (which is an addition) if emulator support is needed.

(This technique does require FWAIT before each FP instruction to patch properly.)
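Here is how the addition trick can work on the raw bytes. As far as I recall, the classic DOS toolchains turned FWAIT + ESC opcode (9Bh, D8h to DFh) into INT 34h to 3Bh (CDh, 34h to 3Bh) by adding 5C32h to the 16-bit word; treat that constant and the helper below as my assumptions, not a spec.

```python
FIXUP = 0x5C32  # assumed constant: turns the word 9B Dx into CD 3x

def apply_fixup(code, offset):
    """Add FIXUP to the little-endian 16-bit word at offset, in place."""
    word = int.from_bytes(code[offset:offset + 2], "little")
    word = (word + FIXUP) & 0xFFFF
    code[offset:offset + 2] = word.to_bytes(2, "little")

# fwait; fld dword [1234h]
code = bytearray([0x9B, 0xD9, 0x06, 0x34, 0x12])
apply_fixup(code, 0)
print(code.hex())
```

The first two bytes become an INT instruction and the ModRM and displacement bytes are left alone, so the emulator's interrupt handler can decode the operand from the bytes that follow. This also shows why the FWAIT byte has to be there: the patch needs two bytes to work on.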

Side note. 8087 also supports 64-bit integer and 18-digit BCD operations. These are obsolete today.

SafariUni 5x25 bino

Jun 2026

Safari Uni (观穹) 5x25 bino

Despite a different housing, it is unmistakably the same bino as VisionKing. They even have the same pouch! VK has "VisionKing" custom label on the pouch. This does not have custom label.

(VK has custom labelled shipping box and packing tape. Now, that's professional. I didn't expect this for a low-end bino.)

Brightness is fine. No noticeable dimness in day time. Same brightness as VK.

Zone of focus in the middle only, as expected. It is sharper than VK and has a crisp snap-to-focus.

All lens surfaces seem coated, unsure if prism is coated — can see one white and some cream/yellow (single coated) reflection like VK. No phase coating.

Bino focusing knob claims 15° and FMC.

Twist-up eyecups have click-stops.

Measured weight is 577 g w/ front caps and w/o strap. This is heavier than all my other binos!

Claims min focus distance is 1.9 m, actual is ~3 m. This is worse than VK!

I wonder if there are two variants depending on the batch. The min focus distance is either 1.9 m or 3 m. Safari Uni got the 1.9 m one in their first batch, so they advertised as such. VK was advertised with 3 m, but mine was ~2 m.

But it's fine, no one uses this bino for macro viewing. :lol: (Note that this is a valid use-case for binos.)

VisionKing 5x25 bino

Jun 2026

VisionKing (视界王) 5x25 bino

Brightness is fine. No noticeable dimness in day time. Looking at the sky through the objective lens at an arm's length, it is as bright as Svbony SV202. At night, seems brighter than SV202! This is possible, 5 mm exit pupil (25 / 5) vs 4 mm (32 / 8).
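The exit pupil arithmetic, spelled out in Python. The area ratio is only a rough upper bound on the brightness difference: it assumes the eye's pupil opens to at least 5 mm and ignores coating and transmission differences.

```python
def exit_pupil_mm(aperture_mm, magnification):
    """Exit pupil diameter = objective aperture / magnification."""
    return aperture_mm / magnification

vk = exit_pupil_mm(25, 5)     # VisionKing 5x25
sv = exit_pupil_mm(32, 8)     # Svbony SV202 8x32
area_ratio = (vk / sv) ** 2   # light through the exit pupil scales with its area

print(vk, sv, area_ratio)
```

So at night, with a dark-adapted eye, the 5x25 can deliver roughly one and a half times the light of the 8x32 at the eye. In daytime the eye's pupil is smaller than both exit pupils, which fits the two looking equally bright.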

Zone of focus in the middle only, as expected. It is not super sharp, but is acceptable. A bit hard to achieve optimal focus due to 'mushy' focus — the focus zone is blurry.

All lens surfaces seem coated, even the objective glass. Prism is said to be uncoated. Torchlight test shows white (no coating) and cream/yellow (single coating) reflections among purple and green (multi-coating). No phase coating.

Bino focusing knob claims 15.8° and does not claim MC now. Box says FC.

Twist-up eyecups are friction-based, no clicks and a bit loose.

Measured weight is 540 g w/ front caps and w/o strap. This is heavier than most of my binos!

Claims min focus distance is 3 m, actual is ~2.1 m!

Greenfield 8086: accessing flat memory

Jun 2026

Short of making 8086 into a 24-bit CPU, we need to add instructions to manipulate 24-bit addresses. Taking cue from pointer arithmetic, not many are needed:

MOV.A, ADD.A, SUB.A, CMP.A
ADC.A, SBB.A
AND.A, OR.A
LEA
PUSH.A, POP.A
MOV.H

MOV, ADD, SUB and CMP are basic operations.

ADC and SBB can be used to construct >24-bit pointers, but I don't know if they will ever be used.

Bitwise AND and OR are useful.

LEA is of course essential. There is no 16-bit variant, only 24-bit.

PUSH.A pushes 24-bit value onto the stack (as two 16-bit values) and POP.A pops two 16-bit values.

MOV.H moves a value into the upper bits of another register and vice-versa. This allows more extensive, though lengthier, operations using standard 16-bit instructions. Technically, with MOV.H, we do not need any 24-bit specific instructions, but the operations will be lengthier.

Adding to a pointer:

; w/ 24-bit instructions
add.a si, 4

; w/ 16-bit instructions
mov.h ax, si
add si, 4
adc ax, 0
mov.h si, ax

Calculating diff:

; Calculate si - di

; w/ 24-bit instructions
mov.a ax, si
sub.a ax, di   ; ax (24-bit) contains the diff

; w/ 16-bit instructions
mov ax, si
mov.h dx, si
mov.h cx, di
sub ax, di
sbb dx, cx     ; dx:ax contains the diff
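To check that the 16-bit sequences really match the 24-bit instructions, here is a small Python model. The function names are mine, and MOV.H is modelled as moving the upper 8 bits of a 24-bit register in and out.

```python
MASK24 = 0xFFFFFF

def add24_split(ptr, imm):
    """Add a 16-bit immediate to a 24-bit pointer using 16-bit steps."""
    lo, hi = ptr & 0xFFFF, (ptr >> 16) & 0xFF    # mov.h ax, si
    total = lo + imm                              # add si, imm
    lo, carry = total & 0xFFFF, total >> 16
    hi = (hi + carry) & 0xFF                      # adc ax, 0
    return (hi << 16) | lo                        # mov.h si, ax

def sub24_split(a, b):
    """Subtract two 24-bit pointers using sub on the low words and sbb on the high bits."""
    lo = (a & 0xFFFF) - (b & 0xFFFF)              # sub ax, di
    borrow = 1 if lo < 0 else 0
    hi = (a >> 16) - (b >> 16) - borrow           # sbb dx, cx
    return ((hi & 0xFF) << 16) | (lo & 0xFFFF)
```

Both agree with plain `(a + b) & MASK24` and `(a - b) & MASK24`, including when the low word carries or borrows into the upper 8 bits.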

Stack operations

Remember I said SP should not be a GPR? It means we need SP specific instructions as well. Only a few are needed:

MOV.A SP, r/m | immed
MOV.A r/m, SP
ADD.A SP, r/m | +/-immed
SUB.A SP, r/m

SP can be used directly to reference stack variables:

push arg
call proc
...

proc:
sub.a sp, N    ; reserve space for local vars

mov ax, [sp + N + 4]   ; arg
mov bx, [sp + N - N]   ; first local var
mov cx, [sp + N - 2]   ; last local var
...

add.a sp, N    ; epilogue
ret 2

Or via a frame pointer:

push arg
call proc
...

proc:
push.a fx
mov.a fx, sp
sub.a sp, N    ; reserve space for local vars

mov ax, [fx + 8]       ; arg
mov bx, [fx - N]       ; first local var
mov cx, [fx - 2]       ; last local var
...

mov.a sp, fx   ; epilogue
pop.a fx
ret 2

Any register that allows register indirect addressing can be used — memory is flat!

If we use a dedicated frame pointer, we end up with 7 GPRs. If we don't, then it is just as difficult to get stack trace and unwind the stack as no frame pointer — cos we don't know which is the frame pointer!

Unwinding the stack without frame pointer is a big issue on modern CPUs. Metadata is needed.

Using Return Stack

It is not common to have two explicit stacks, one solely for return address, the other for data, though modern CPUs use shadow stack or Return-Address Stack (RAS) to prevent Return Oriented Programming (ROP).

I don't see why not; it seems trivial to support it. Stack buffer overflow, whether unintentional or malicious, is a never-ending source of bugs. We need a separate SP (let's call it RSP), a couple of instructions, and a change to CALL/RET to use it.

New instructions needed:

MOV.A RSP, r/m
MOV.A r/m, RSP

These are privileged instructions. The Return Stack should be on special protected pages if MMU is present. There are no instructions to PUSH/POP nor manipulate the Return Stack. It is purely for CALL and RET.

To help to unwind the stack, we push SP onto the Return Stack too, so each CALL uses 8 bytes (4 for return address and 4 for SP).

CALL and RET must be paired. There is no need to restore SP because it is done automatically.

push arg
call proc
...

proc:
sub.a sp, N    ; reserve space for local vars

mov ax, [sp + N]       ; arg
mov bx, [sp + N - N]   ; first local var
mov cx, [sp + N - 2]   ; last local var
...

ret 2          ; no epilogue needed, will restore SP
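A toy Python model of the idea, just to show SP coming back without an epilogue. The class and its methods are made up for this sketch; the data stack itself is not modelled, only SP.

```python
class Machine:
    """Tracks SP and a separate Return Stack of (return address, saved SP)."""

    def __init__(self, sp):
        self.sp = sp
        self.return_stack = []

    def push(self):
        self.sp -= 2                      # push one 16-bit value

    def call(self, return_address):
        # CALL saves the return address and the current SP together.
        self.return_stack.append((return_address, self.sp))

    def ret(self, arg_bytes=0):
        # RET restores SP from the Return Stack, then pops the arguments.
        return_address, saved_sp = self.return_stack.pop()
        self.sp = saved_sp + arg_bytes
        return return_address

m = Machine(sp=0x1000)
m.push()                # push arg
m.call(0x123456)        # call proc
m.sp -= 10              # sub.a sp, N
target = m.ret(2)       # ret 2
```

After `ret 2`, SP is back to its value before `push arg`, no matter what the callee did to SP in between.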

One downside is that we can no longer RET to an arbitrary address — but this is the whole point!

push.a ax      ; no longer allowed
ret

Efficiency

Pushing four 16-bit words on each CALL on a 16-bit CPU is inefficient. Even functions that do not use the stack pay this penalty.

Since the address space is 24-bit, we use the upper 8 bits to store the stack size in words. With 255 reserved as an escape value, this allows up to 508 bytes (254 words) of local variables. If more is needed, we can use 255 to mean the lower 24 bits hold SP; we push two additional words in this case.

Thus, we push only two 16-bit words on the Return Stack for each CALL in most cases. If the function uses ENTER, it modifies the top 8 bits of the return address or pushes two additional 16-bit words.
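One possible reading of this encoding, as a Python sketch. Each Return Stack entry is modelled as one 32-bit value (two 16-bit words). Putting the escape entry on top of the return address is my assumption, as are the function names.

```python
ESCAPE = 255   # upper 8 bits = 255: lower 24 bits hold the saved SP

def entries_after_enter(return_address, frame_bytes, sp_at_call):
    """Return Stack entries for one CALL followed by ENTER frame_bytes."""
    words = frame_bytes // 2
    if words < ESCAPE:
        # ENTER only patches the top 8 bits of the return address entry.
        return [(words << 24) | return_address]
    # Frame too big: ENTER pushes one extra entry carrying SP.
    return [return_address, (ESCAPE << 24) | sp_at_call]

def ret(entries, current_sp):
    """Pop one call's entries; return (return address, restored SP)."""
    top = entries.pop()
    if top >> 24 == ESCAPE:
        restored_sp = top & 0xFFFFFF
        return_address = entries.pop() & 0xFFFFFF
    else:
        restored_sp = (current_sp + 2 * (top >> 24)) & 0xFFFFFF
        return_address = top & 0xFFFFFF
    return return_address, restored_sp
```

In the common case RET recomputes SP from the current SP plus the stored size, which is exactly why the scheme breaks if the function has PUSHed something in the meantime.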

Revamped code:

push arg
call proc

proc:
enter N        ; reserve space for local vars, updates Return Stack

mov ax, [sp + N]       ; arg
mov bx, [sp + N - N]   ; first local var
mov cx, [sp + N - 2]   ; last local var
...

ret 2          ; no epilogue needed, will restore SP

Unfortunately, this scheme does not work if we PUSH onto the stack. It is possible to make it work. This is left as an exercise for the reader. :-P

In the future, for 32-bit CPU, we will always push two 32-bit words (Return Address and SP) on the Return Stack.

Greenfield 8086: flat memory model

Jun 2026

The first thing I'm going to get rid of is the 8086's segmented memory model — its defining characteristic!

The segmented memory model worked well in the 60s and 70s, simplifying code/data relocation and offering a cheap way to provide protection in multi-process environments.

It works well as long as your data fits within a segment. Once exceeded, it is painful.

With hindsight, we can see segmentation falling out of favour with paging being the choice of memory management.

8086 also has a bigger address space (20-bit) than its register size (16-bit), so it is difficult to address the entire space.

80286 — nearly flat memory

80286 uses segment selectors instead of physical segments in Protected Mode. The segment registers index into a Descriptor Table that contains the base of the segment, among others.

If the upper bits of a logical address go directly into the lower bits of the selector, we can access >64 kB almost seamlessly.

But Intel put 3 control bits at the bottom, so complicated pointer arithmetic was needed again (need to +8 to increment to next selector).

; ideal sel:ofs
0000:ffff + 1 -> 0001:0000

; 286
0000:ffff + 1 -> 0008:0000
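The same comparison in Python. This assumes descriptor N has been set up to map the N-th 64 kB block, which is a convention for this sketch and not something the 286 enforces.

```python
def ideal_far_pointer(linear):
    """Selector is simply the upper bits of the linear address."""
    return linear >> 16, linear & 0xFFFF

def i286_far_pointer(linear, table_indicator=0, rpl=0):
    """286 selector: index in bits 15..3, TI in bit 2, RPL in bits 1..0."""
    index = linear >> 16
    selector = (index << 3) | (table_indicator << 2) | rpl
    return selector, linear & 0xFFFF
```

Crossing a 64 kB boundary bumps the ideal selector by 1 but the 286 selector by 8, so a carry out of the offset cannot simply be added into the selector.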

It is not 100% seamless — a single element cannot span segments, so >64 kB data has to be accessed carefully. Maybe this was why Intel purposely made selectors non-contiguous — you needed special handling code anyway.

Anyway, this is water under the bridge.

Other ways

The alternative to segmentation is flat memory. We need to either widen register size or use paired-registers.

Instead of 16-bit registers, we will have 20-bit registers. This allows us to put a full address in a register and dereference it directly. But this raises a question. How do we manipulate these 20-bit registers? Does it mean it is a 20-bit CPU now?

The other approach is to use paired registers. This is a common approach. 8-bit CPUs pair two 8-bit registers to access 16-bit memory space. But the 8086 has only 8 registers. Pairing them means we only have 4 pairs. Typically we need 2 – 3 pointers at the same time, so we have only 2 – 4 registers left.

I'll go for the widened register approach. Instead of 20-bit, let's go for 24-bit — giving 16 MB address space. All registers are 24-bit. The CPU remains 16-bit. Most instructions manipulate 16-bit data, but some manipulate 24-bit — for pointer arithmetic.

With linear addressing, there are no more memory models. All pointers are FAR, indirect jumps/calls are FAR and all function returns are FAR. We are free from the 64 kB barrier.

The stack is also free of its 64 kB limit (though stacks seldom grow this big), but more importantly, any register can now reference the stack directly.

This does increase pointer size from 2 to 4 bytes. This makes the ISA unsuitable for systems with 64 kB or less since they only need 2-byte offsets. Once we get above a threshold, say 128 kB, the overhead no longer matters.

Another con is that we need a big relocation table. In the absence of paging and running each process in its own isolated memory space, all global code/data references need to be relocated.

A possibility is to use a Global Offset Table (GOT). This makes the code PIC (Position-Independent Code) and makes it reusable in multi-process environments.

What-if evolution of 8086 instruction set

Jun 2026

The 8086 instruction set has very nice orthogonal instructions, but it also has a bunch of short instructions that use up a lot of valuable one-byte opcode space.

When Intel created the 32-bit 80386, there was really no reason to keep the same encodings — the object code needed to be regenerated anyway.

And when AMD defined the 64-bit instruction set (Intel was not interested at this point because they were creating 64-bit Itanium), they also missed the chance to redefine the encodings.

What if we could go back to the very beginning and define the instruction set properly without considering backwards compatibility with 8080/8085?

Key questions:

1-byte instructions?
Load-store or reg-mem ALU operations?
How many registers?
Memory addressing modes
Immediate operand?
Displacement addressing with immediate operand?
Misaligned mem access
String instructions
Size of address space
20-bit pointer arithmetic
Segmented or linear addressing
Memory-mapped or port-mapped I/O

Note that this is a 16-bit processor. We need to keep in mind future extensions like floating-point math, 32-bit and 64-bit modes, and vector instructions.

1-byte instructions?

x86 is CISC, so it will have variable-length instructions. The question is, do we want 1-byte instructions?

8086 was designed in the mid-70s and released in 1978. At that time, microcomputers had 4 kB of memory. The IBM PC shipped in 1981 with 16 kB, expandable to 64 kB on the motherboard. By 1985, it was common to fill up the entire 640 kB RAM space.

1-byte instructions are essential with 4 kB memory, not so much with 640 kB. Most x86 1-byte instructions are not high-occurrence instructions either.

Removing 1-byte instructions frees up valuable opcode space that can be used for defining shorter multi-byte instructions.

Load-store or reg-mem ALU operations?

Load-store architecture is a RISC characteristic. CISC generally allows ALU operations directly on memory.

There are 3 kinds of mem ALU operations: reg-to-mem, mem-to-reg and mem-to-mem.

x86 supports the first two. It turns out that reg-to-mem (i.e. mem = mem op reg) is not good for future superscalar execution.

; load-store
mov ax, [mem]
add ax, bx
mov [mem], ax

; mem-to-reg
add bx, [mem]
mov [mem], bx

; reg-to-mem
add [mem], bx

How many registers?

RISC likes to have 32 GPRs, but they are an overkill — kills interrupt and context switch performance. 16 GPRs is generally sufficient, especially with register renaming. I think 8 is enough, but they must really be general-purpose!

The 8086 has 8 GPRs, but it really only has 6 — SP is not a GPR and BP is needed to access stack variables.

x86-64 uses REX prefix to expand the number of registers, among others. This is a very useful technique that can be added later.

Memory addressing modes

8086 has very limited memory addressing modes. Only 4 registers can be used for register-indirect, and only in limited ways.

80386 revamps this with SIB (scale-index-base) which is super flexible — but it is not needed most of the time.

; x86
[bx]
[bx + disp]
[bx + si]
[bx + si + disp]

; SIB
[bx * 2]
[bx * 2 + disp]
[si]
[si + disp]
[bx * 2 + si]
[bx * 2 + si + disp]

Two are sufficient: register-indirect and register-indirect with offset.

With hindsight, it is very useful to have PC-relative addressing. This is used by x86-64, ARM and MIPS. It enables loading big literals (e.g. 64-bit) without making the instructions super long. It also enables PIC (Position-Independent Code).

Immediate operand?

Immediate operand increases instruction size. Do we support 1-byte and 2-byte immediates? Do we support 4-byte and 8-byte immediates in the future?

The 8086 has 1-byte and 2-byte immediates. 80386 supports 1-byte and 4-byte immediates. (2-byte is supported via a size prefix, making it 3 bytes.)

Displacement addressing with immediate operand?

Displacement addressing has one optional offset (1 – 2 bytes). Immediate has 1 – 2 bytes. Together they add up to 4 additional bytes to the instruction in 16-bit mode, but in 32-bit mode, it is up to 8 additional bytes. It makes for very long instructions.

Example:

mov [mem], imm

Generally, 64-bit CPUs do not have 64-bit displacement nor immediate — they make the instruction too long and they are not often used.

Misaligned mem access

As CISC, misaligned mem access is a given. However, there are times we want to enforce word-aligned access, for example, the stack, jump and call targets.

String instructions

By strings, I mean the famous LODS, STOS, SCAS, MOVS and CMPS. STOS and MOVS are especially useful when paired with REP.

They are great for manipulating strings in memory constrained systems, but we are way past that.

First, REP can be replaced by a tight loop. Next, with hindsight, MOVS is the only remaining useful instruction. The others can be written using simple instructions. For example, LODS is:

mov ax, [si]
add si, 2

Size of address space

The 8086 uses 16-bit offsets, extended to a 20-bit address space with segment registers. In the mid-70s, 1 MB address space was unimaginable. But by mid-80s, the IBM PC had already reached its limit (640 kB conventional memory).

We will widen the address space to 24-bit. 16 MB was pretty big even in the early 90s. Windows 95 required only 4 MB of memory, though it ran better with 8 MB.

Does this mean we shift the segment register by 8 bits? We want to get rid of segmentation...

20-bit pointer arithmetic

It is difficult to do 20-bit seg:ofs pointer arithmetic and comparison.

Pointer addition (up to 65536 - 16):

; convert es:di to normalized pointer
mov dx, es
mov ax, di
shr ax, 4
add dx, ax
mov es, dx
and di, 0fh      ; es:di is now normalized

; inc pointer by 4 (any value up to 65536 - 16)
add di, 4        ; es:di is incremented, but it is not normalized

Pointer addition (any value):

; es:di -> linear pointer in dx:ax
mov ax, es
mov dx, ax
shl ax, 4
shr dx, 12       ; dx:ax now contains 20-bit linear seg addr
add ax, di
adc dx, 0

; add 32-bit ofs in cx:bx
add ax, bx
adc dx, cx

; dx:ax -> normalized pointer in es:di
mov di, ax
shr ax, 4        ; btm 12-bits of seg
shl dx, 12       ; top 4-bits of seg
or ax, dx        ; combine them
mov es, ax
and di, 0fh      ; es:di is now normalized

Pointer subtraction:

; ds:si -> linear pointer in dx:ax
mov ax, ds
mov dx, ax
shl ax, 4
shr dx, 12       ; dx:ax now contains 20-bit linear seg addr
add ax, si
adc dx, 0

; es:di -> linear pointer in cx:bx
mov bx, es
mov cx, bx
shl bx, 4
shr cx, 12       ; cx:bx now contains 20-bit linear seg addr
add bx, di
adc cx, 0

; find the diff in dx:ax (ds:si - es:di)
sub ax, bx
sbb dx, cx
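For comparison, here are the same three operations in Python, where the linear address fits in one integer. The function names are mine, and the 20-bit wraparound mirrors the 8086 address bus.

```python
MASK20 = 0xFFFFF

def to_linear(seg, ofs):
    """seg:ofs -> 20-bit linear address."""
    return ((seg << 4) + ofs) & MASK20

def normalize(seg, ofs):
    """Rewrite seg:ofs so that the offset is in 0..15."""
    linear = to_linear(seg, ofs)
    return linear >> 4, linear & 0xF

def add_far(seg, ofs, delta):
    """Add any value to a far pointer; the result is normalized."""
    linear = (to_linear(seg, ofs) + delta) & MASK20
    return linear >> 4, linear & 0xF

def diff_far(seg1, ofs1, seg2, ofs2):
    """Byte distance between two far pointers."""
    return to_linear(seg1, ofs1) - to_linear(seg2, ofs2)
```

Note that 1234:0010 and 1235:0000 are the same address, so `diff_far` gives 0 even though a naive comparison of the two pointers says they differ. This is why every comparison has to go through the linear form.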

It will help if the CPU can manipulate 20-bit pointers (24-bit with our design) directly. It will also help to have an extra segment register so that DS can point to the global data segment all the time.

Addition with CPU assistance:

mov dx:ax, [p]
ptr.p2l dx:ax    ; phy->linear addr

add ax, bx
adc dx, cx

ptr.l2n dx:ax    ; linear->normalized

les di, dx:ax    ; es:di <- dx:ax

Subtraction:

mov dx:ax, [p1]
ptr.p2l dx:ax

mov cx:bx, [p2]
ptr.p2l cx:bx

sub ax, bx
sbb dx, cx

Segmented or linear addressing

Segmentation allows relocatable code and data without load-time fixup. It does not affect code much, but it is difficult to work with data bigger than 64 kB.

If we want linear addressing, we either need to widen the register size or use register-pairs for memory addressing.

Example of linear addressing:

; p1 and p2 are linear pointers
; *p2++ = *p1++

mov ex:si, [p1]
mov ax, [ex:si]  ; paired-reg is automatically linear
add si, 2
adc ex, 0
mov [p1], ex:si

mov ex:si, [p2]
mov [ex:si], ax
add si, 2
adc ex, 0
mov [p2], ex:si

Pointer manipulation is a pain when address space (20 – 24 bits) > register size (16-bit). The problem goes away with 32-bit — both address space and register size match, and address space is big enough for most use-cases.

Memory-mapped or port-mapped I/O

8086 uses a separate 64 kB port-mapped I/O address space. If we expand the memory size to 16 MB, we can just use memory-mapped I/O. With hindsight, everyone uses MMIO nowadays.

A popular pattern with port-mapped I/O is to select the index, then read/write the value. This is to reduce I/O ports used — the IBM PC has only 1,024 I/O addresses as it uses 10-bit I/O address on the 8-bit ISA bus (*). With memory-mapped I/O, we just access the I/O registers directly.

; Port I/O
mov dx, 0x3d4    ; CGA CRTC index reg
mov al, 0        ; 0 = Horizontal Total Register
out dx, al

inc dx           ; CGA CRTC data reg
mov al, 0x38     ; value
out dx, al

; Mem I/O
mov ax, CGA_CRTC_REG_BASE
mov es, ax
mov es:[CGA_CRTC_HORZ_TOTAL_REG], 0x38
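The difference between the two patterns, modelled in Python. The classes are a made-up simulation, not real hardware access; the register count assumes the 6845's 18 registers.

```python
class IndexedDevice:
    """Port-mapped style: select a register via INDEX, then write DATA."""
    INDEX, DATA = 0x3D4, 0x3D5

    def __init__(self):
        self.selected = 0
        self.regs = [0] * 18

    def out(self, port, value):
        if port == self.INDEX:
            self.selected = value          # remember which register is selected
        elif port == self.DATA:
            self.regs[self.selected] = value

class MappedDevice:
    """Memory-mapped style: every register has its own address."""

    def __init__(self):
        self.regs = [0] * 18

    def write(self, offset, value):
        self.regs[offset] = value

ports = IndexedDevice()
ports.out(0x3D4, 0)       # select Horizontal Total
ports.out(0x3D5, 0x38)    # write the value

mmio = MappedDevice()
mmio.write(0, 0x38)       # one access, no hidden state
```

The indexed version carries hidden state (the selected index) between the two writes, so an interrupt handler that touches the same device in between can redirect the second write. The mapped version has no such window.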

(*) IBM expanded the ISA bus to use 16-bit I/O address with IBM AT, but there were many I/O cards doing 10-bit decoding, so it was not safe to use higher address space — unless you reserved the range in the lower address space first. IBM should have put a compatibility jumper beside each slot that disables access if the higher address bits are non-zero.

Also, memory-mapped I/O gives the expectation that I/O can be read back. This was the nightmare of CGA, where many registers are write-only.

My Rambling Thoughts
