douglar wrote on 2023-10-11, 02:08:
Great write up. Can I talk you into giving your version of the classic story about A20, himem, and the keyboard controller?
Yeah, but be prepared for some unpopular opinions mixed in.
When Intel designed the 80286 processor, they didn't specifically target the IBM PC, and they certainly didn't target the IBM AT (which had not even been designed yet while the 286 was being designed). At that time, nobody expected that everyone would be running DOS for the next 10 years. The 286 was designed for two use cases: One is the use as "Turbo 8086", which keeps the complete 8086 architecture and executes real-mode software. The other use case is for "advanced computers" using a multi-tasking operating system with a protected-mode kernel and suitable userspace software. The use case of a "hybrid computer" that is meant for both real-mode DOS software and modern protected-mode multi-tasking software was obviously not envisioned by Intel.
Let's start with a small excursion into arithmetic logic unit design: Most computers use a number representation for negative integers called "2's complement". Instead of showing binary or hexadecimal numbers, I'm going to explain that concept using decimal numbers. Imagine you can only handle numbers with 3 digits. In that case, you can obviously calculate something like 015 + 033 = 048, or 333 + 456 = 789. But it's also possible to get a "carry" into the fourth digit (which doesn't exist), e.g. in 710 + 611 = 321 (+1000 as carry). For normal arithmetic calculations, the processor keeps a carry flag, and if you want to deal with numbers exceeding 16 bits (in computer technology) or 3 digits (in my example), you need to process the carry flag. The processor has an add-with-carry instruction for that. For example, if you add 000'710 and 000'611, you first calculate 710 + 611 to obtain 321, and then 000 + 000 + carry to obtain 001, for a total result of 001'321. The most important observation is that generating a carry is not an error, or something that must be avoided, but normal operation. You can use that to your advantage: Observe how adding 999 has the same effect as subtracting 1! And that's actually how negative numbers are implemented in computers: If we deal with "signed numbers", we consider 000..499 positive and 500..999 negative, representing the numbers -500 to -1. On the other hand, when we deal with "unsigned numbers", we treat 999 as 999. How we treat 999 does not matter for how the addition is performed. The interpretation of 999 as -1 is just as valid as the interpretation as 999, and it depends on context which interpretation is "correct".
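The three-digit arithmetic above can be played through in a few lines of Python. This is only a sketch of the decimal toy model, not of any real ALU; the names `add3` and `as_signed` are made up for illustration.

```python
MOD = 1000  # three decimal digits

def add3(a, b, carry_in=0):
    """Add two 3-digit numbers; return (3-digit result, carry out)."""
    total = a + b + carry_in
    return total % MOD, total // MOD

def as_signed(x):
    """Interpret a 3-digit pattern as signed: 500..999 stand for -500..-1."""
    return x - MOD if x >= MOD // 2 else x

# Multi-word addition: 000'710 + 000'611, low part first, then add-with-carry.
low, carry = add3(710, 611)
high, _ = add3(0, 0, carry)
print(f"{high:03d}'{low:03d}")            # 001'321

# Adding 999 has the same effect as subtracting 1.
print(add3(123, 999)[0], as_signed(999))  # 122 -1
```

Note that `add3` never looks at whether its inputs are meant to be signed or unsigned; only `as_signed` applies an interpretation, after the fact.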
Now, back to the 8086 and its addressing: To make this discussion easier to follow, instead of segments and offsets being binary 16-bit numbers, and the segment being multiplied by 16, let's examine what happens in a similar architecture that uses 3-digit decimal numbers for segments and offsets, where the segment is multiplied by ten, and the full addressable memory range is 10'000 bytes, numbered from 0000 to 9999. Obviously, using a segment value of 000, we can address 0000..0999 using the offsets 000..999. A segment value of 100 provides a base of 1000 (as segment values are multiplied by 10, just by appending a zero to the segment value), so offsets 000..999 now reach addresses 1000..1999. If segment zero only contains 500 elements, the next segment could already start at segment value 050, which provides a base address of 0500. The physical addresses reachable from this base address are 0500..1499. In this kind of addressing, offset 999 is obviously no longer equivalent to -1, because treating the offset as -1 with a base of 0500 would target 0499, not 1499. This happens because the result has 4 digits, but the offset only has 3 digits. The equivalence of -1 and 999 only exists for 3-digit results, not for 4-digit results. But now let's look at the segment base addresses: the highest segment we can choose is segment 999, which starts at 9990. The offsets 0..9 will reach physical memory addresses 9990..9999 (the last 10 bytes of this hypothetical machine that can address 10'000 bytes). On the other hand, offsets 10..999 will yield addresses 0000..0989, because there is no fifth digit that the carry of the addition could go into. So you can interpret segment "999" as a segment near the end of the memory range, which only has 10 elements until the memory ends. At the same time, this is also segment -1, with a base address of -10, and permitted offsets inside this segment starting at 10.
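The decimal toy machine is small enough to model directly. The following is a sketch of that hypothetical architecture only (the function name `physical` is mine); it reproduces the numbers from the paragraph above.

```python
ADDRESS_SPACE = 10_000  # addresses 0000..9999

def physical(segment, offset):
    """Toy model: segment * 10 + offset, with no fifth digit for the carry."""
    return (segment * 10 + offset) % ADDRESS_SPACE

print(physical(100, 0), physical(100, 999))  # 1000 1999
print(physical(50, 0), physical(50, 999))    # 500 1499
print(physical(999, 9), physical(999, 10))   # 9999 0
print(physical(999, 999))                    # 989
```

The last two lines show segment 999 acting as segment -1: from offset 10 onwards, it addresses the start of memory.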
This way of handling "wrap-around", allowing negative segment numbers to exist (if you supply a high enough offset value), is a fundamental property of a segmented four-digit decimal addressing architecture. It is not a bug or a quirk; it just works as designed. In the actual 8086, the highest segment register value is not 999, but 0FFFFh (a hexadecimal number), which is the same thing as 65535 (a decimal number), and this segment does not start 10 bytes, but 16 bytes before the end of the 1MB address space. The idea is exactly the same, though: For the first 16 bytes inside segment 65535, you get the last 16 bytes of address space (an IBM PC has the end of the BIOS ROM there), and for the remaining 65520 bytes in that segment, you get to see the start of the address space, so the segment number is treated as -1 in this context. This happens because the highest address the 8086 can send to the mainboard has the hexadecimal value 0F'FFFFh. The next number would be 10'0000h, but this number is not representable by the 20 address lines (called A0..A19), so it wraps to physical address 0'0000h.
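The same calculation with the real 8086 numbers, as a minimal sketch (the function name `phys_8086` is made up here): the segment is shifted left by 4 bits, the offset is added, and everything above the 20 address lines is dropped.

```python
def phys_8086(segment, offset):
    """20-bit physical address as an 8086 puts it on A0..A19."""
    return ((segment << 4) + offset) & 0xFFFFF

print(f"{phys_8086(0xFFFF, 0x000F):05X}")  # FFFFF, last byte of the 1MB space
print(f"{phys_8086(0xFFFF, 0x0010):05X}")  # 00000, wrapped to the start
print(f"{phys_8086(0xFFFF, 0xFFFF):05X}")  # 0FFEF, end of "segment -1"
```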
As the 80286 can address 16MB, it now has 4 extra address lines called A20..A23, allowing addresses up to 0FF'FFFFh. This means there is no intrinsic overflow at the end of the 1MB address space anymore, and treating segment 0FFFFh as -1 no longer works if the main board actually cares about A20..A23. And that's where the actual use of the 286 processor did not match the design expectations. Intel expected computers using the 80286 as a "turbo 8086" not to care about A20..A23 and to provide the compatible 1MB address space, while modern computers with modern operating systems would work in protected virtual-address mode anyway, in which calculating with physical addresses at the application level makes no sense anymore. Intel did not expect a system that mainly operates in real-address mode, but has more than 1MB of address space, and that is exactly what the IBM AT does! So using a processor outside the use case planned for it requires some creativity by the system designer - and IBM got sufficiently creative here. Their AT can either execute DOS and ignore the A20 address line, providing the address wraparound that is part of the 8086 system architecture, or it can be switched to protected mode using a BIOS call, which switches the computer (not just the processor) into a different operation mode that honors the A20 line and switches the processor into protected virtual-address mode. The idea by Intel was that an 80286-based computer either works in real mode all the time, or it just does some preliminary set-up in real mode, and then switches into protected mode for good.
Intel could have made the choice between real mode and protected mode influenced by a pin on the processor instead of making protected mode a software-enabled option, but as the protected mode requires some data structures to work at all, most prominently the global descriptor table containing the kernel segment descriptors, having the initialization of the protected mode data structures run in real mode and do a controlled switch when this is done seemed the more viable option.
The keyboard interfaces of the XT and the AT are completely different. The keyboard interface of the XT consists of a synchronous serial receiver (no transmitter at all!) that receives a single byte from the keyboard and then triggers an interrupt. The CPU then reads the received byte from a discrete TTL shift register chip through a parallel interface chip, the Intel 8255 (a chip that connects three 8-bit "ports" to a data bus). The XT keyboard interface had no intelligence at all. The most observable consequence is that the CPU can not tell the keyboard to turn the caps lock, num lock or scroll lock LEDs on or off. This is no issue for the standard PC and XT keyboard - because they don't have any LEDs! With the AT, the keyboard port was extended to be a bi-directional communication port. Interfacing that port could no longer be done with discrete TTL components; instead, handling this bi-directional synchronous serial interface was implemented in software in an Intel 8042 microcontroller (which is a variant of the 8048 microcontroller). This controller is called the "keyboard controller". While the primary purpose of this 8042 microcontroller was to handle communication with the keyboard, there were some spare pins on it that could be used for different purposes. And that's how IBM decided to implement the mainboard mode switch between XT-compatible 1MB addressing and new-fangled 16MB addressing: One pin on the keyboard controller output a signal that was fed into an AND gate to mask the address line A20. This AND gate is called the "A20 gate", and sometimes, the control signal from the keyboard controller to this gate is also called "A20 gate". That's how the keyboard controller is involved in addressing memory on the IBM AT.
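What the AND gate does to an address can be sketched as follows. This models only the logic, not the 8042 or any real board; the function names are invented for this illustration.

```python
A20 = 1 << 20

def phys_286_real(segment, offset):
    """Real-mode address on an 80286: no wrap at 1MB, the carry lands in A20."""
    return ((segment << 4) + offset) & 0xFFFFFF

def through_a20_gate(address, gate_enabled):
    """AND gate on A20: pass the bit when enabled, force it to 0 when not."""
    return address if gate_enabled else address & ~A20

addr = phys_286_real(0xFFFF, 0x0010)
print(f"{addr:06X}")                           # 100000
print(f"{through_a20_gate(addr, False):06X}")  # 000000, XT-compatible wrap
print(f"{through_a20_gate(addr, True):06X}")   # 100000, first byte above 1MB
```

The gate touches nothing but A20. It does not turn the machine into a true 20-bit system: with the gate disabled, an address like 30'0000h would come out as 20'0000h. For real-mode code, which can't get past 10'FFEFh anyway, that is equivalent to the 8086 wraparound.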
HIMEM adds a new twist to the game: HIMEM puts the mainboard into "A20 unmasked" mode most of the time, but keeps the processor in real mode most of the time. So suddenly, address 10h in segment 0FFFFh no longer wraps down to address 0, but you can peek into the first 64K of extended memory. "Extended memory" is the name for memory with a physical address of 1MB or higher, which should not be visible to 8086-compatible code. It could have been a quite easy hack - if there hadn't been a notable amount of real-mode software actually using negative segment numbers! This was not a bug, but a sensible design choice: When software proceeds forwards through a memory range that might exceed 64KB, it makes sense to "re-normalize addresses" from time to time by increasing the segment number by some amount to minimize the offset, so that the maximum possible number of bytes forward from the current address is reachable without changing the segment address. This pattern is completely unproblematic. On the other hand, when you proceed backward through memory, the sensible choice is not to minimize the offset value, but to maximize it, by minimizing the segment number. If an algorithm like this happened to re-normalize the segment/offset combination at a physical address of 32KB, it would choose an offset of (nearly) 64KB, the maximum representable offset, and a segment with a base address of roughly negative 32KB (which would be segment number -2048), and voilà, there is your negative segment number that works perfectly on an 8086 IBM PC, but will fail to work on an IBM AT with the A20 gate opened, as offset 64KB in that segment would no longer point to 32KB, but to 1MB + 32KB, so it would point into the start of extended memory. Most notable is the Microsoft Linker used to link all the EXE utilities shipped with MS-DOS.
It had the option to apply a very primitive EXE compression using run-length encoding in release builds (enabled by the command line switch /EXEPACK, thus this scheme got named "EXEPACK"). As expanding the compressed data produces output that is bigger than the input, Microsoft chose the sensible thing to do: The unpacker processes the data backwards, so it can operate in-place. EXE files are not limited to 64KB, so this is a textbook example of an algorithm processing a data block that might exceed 64KB backwards, so it uses pointers normalized for maximal offset. So HIMEM and DOS worked together to smartly control the A20 gate (and on the IBM AT and compatible machines, this means interfacing with the keyboard controller). When you load HIMEM and enable "DOS=HIGH" in CONFIG.SYS, DOS moves parts of the DOS kernel data structures into the first 64KB of extended memory (called the "high memory area"), and keeps the A20 gate open most of the time. But whenever an EXE file is started, the A20 gate gets closed (because that EXE file might be packed using EXEPACK), and it stays closed (so the addressing is PC/XT compatible) until a system call into the DOS kernel is made, which re-opens the A20 gate. It is known that the EXEPACK unpacking code, which is executed first in EXEPACK-packed executables, does not perform any DOS call by itself, so when a system call happens, either the unpacking has failed ("packed file is corrupt"), or it is finished and the actual application has started running. So in fact, when I wrote about "a notable amount of real-mode software actually using negative segment numbers", I wasn't writing about different software vendors writing their own algorithms, but about a notable amount of DOS software linked with the Microsoft Linker with /EXEPACK enabled.
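The two normalization strategies, and the way the backward one depends on the wraparound, can be sketched like this. It is an illustration of the idea only, not the actual EXEPACK code, and all function names are made up. As an offset of exactly 64KB is not representable, the sketch uses the largest paragraph-aligned offset, so for a physical address of 32KB it arrives at segment -2047 (0F801h) with offset 0FFF0h instead of the rounded -2048 and 64KB.

```python
def normalize_forward(phys):
    """Smallest offset: as many bytes as possible reachable going up."""
    return phys >> 4, phys & 0xF

def normalize_backward(phys):
    """Largest offset: as many bytes as possible reachable going down."""
    offset = 0xFFF0 | (phys & 0xF)
    segment = ((phys - offset) >> 4) & 0xFFFF  # may wrap to a "negative" segment
    return segment, offset

def resolve(segment, offset, a20_enabled):
    """Physical address in real mode, with or without the A20 line masked."""
    linear = (segment << 4) + offset
    return linear if a20_enabled else linear & 0xFFFFF

seg, off = normalize_backward(0x8000)       # pointer to physical address 32KB
print(f"{seg:04X}:{off:04X}")               # F801:FFF0
print(f"{resolve(seg, off, False):06X}")    # 008000, what the code expects
print(f"{resolve(seg, off, True):06X}")     # 108000, 1MB + 32KB instead
```

Forward normalization never produces such a pointer: `normalize_forward(0x8000)` gives segment 0800h with offset 0, which resolves identically in both A20 states.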
OK, so HIMEM manages the A20 line, but that's not the only purpose of HIMEM. HIMEM also manages allocation of the extended memory, and handles copying data into and out of the extended memory. HIMEM provides a software interface conforming to the "Extended Memory Specification" (XMS), which keeps track of whether some software uses the HMA (which can not be allocated partially to different XMS API users), and of how much extended memory is lent to real mode by being used as HMA. The remaining extended memory can be allocated and freed as "extended memory blocks" by drivers like RAMDRIVE.SYS or SMARTDRV.SYS, or by application programs. The only common use of the HMA was allocating it to the DOS kernel using DOS=HIGH, but it would have been possible to not use DOS=HIGH, and have some other application make use of the HMA. As most users used DOS=HIGH anyway, I don't know of any applications making use of the HMA on their own. The XMS API also provides function calls to allocate and free UMBs, but this part of the XMS API is not implemented by HIMEM.SYS and thus not in scope of this post.
So I already wrote a very long post, but a lot of readers are waiting for the most well-known elephant in the room to be addressed: As Intel intended the real-address mode to only be an intermediate set-up for software that is going to switch to protected mode, Intel skipped the logic to re-initialize the processor back to normal real-mode operation. There is no way to leave protected mode without completely re-initializing the 80286 processor, by resetting the processor using the processor reset pin. Issuing a processor reset does not mean issuing a system reset, though. On the IBM AT, the 80286 could be reset without any other parts of the system receiving a reset signal. This dedicated CPU reset signal to leave protected mode was again generated by the 8042 keyboard controller, by giving it a special "pulse output pins" command. As there is no publicly documented way to access memory beyond 1024KB + 64KB (-16 bytes) without entering protected mode, the standard IBM BIOS call to "copy data between any addresses in conventional or extended memory" indeed switched the system to protected mode, enabled the A20 gate, performed the copy, and issued a processor reset afterwards. The BIOS initialization code then checks the status bits of the keyboard controller: If they indicate that it has not seen a system reset since the last complete POST, the BIOS knows that no one pushed the reset button (which would cause a system reset), but that this is a dedicated processor-only reset. In that case, the next step of the AT BIOS is to check the CMOS RAM, byte 15. This is called the "shutdown status" byte. This byte indicates how the computer is supposed to resume operation after a reset. Some codes are internal to the IBM BIOS, while others are officially documented. They mostly work by doing a minimal system reinitialization followed by a jump to an address specified in the BIOS data area, so that the application that invoked the reset can then continue in real mode.
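Seen from the software that wants to get back to real mode, the set-up before the reset can be sketched roughly as below. This is a simulation that only records the port writes; it does not touch hardware. Several details are not from this post but are assumptions taken from commonly circulated AT documentation: the CMOS index/data ports 70h/71h, the resume pointer location 0040:0067, the shutdown code 0Ah, and 0FEh to port 64h as the pulse-reset command. The function and variable names are invented.

```python
CMOS_INDEX, CMOS_DATA = 0x70, 0x71  # assumed CMOS RAM index and data ports
KBC_COMMAND = 0x64                  # assumed 8042 command port
SHUTDOWN_STATUS = 0x0F              # CMOS byte 15, the "shutdown status" byte
SHUTDOWN_JMP_VECTOR = 0x0A          # assumed code: resume via far pointer at 0040:0067
PULSE_RESET = 0xFE                  # assumed 8042 command: pulse the CPU reset output
RESUME_VECTOR = 0x467               # linear address of 0040:0067 (assumed location)

def request_return_to_real_mode(out, memory, resume_seg, resume_off):
    """Store the resume address, set the shutdown code, then ask for a CPU reset."""
    memory[RESUME_VECTOR:RESUME_VECTOR + 4] = (
        resume_off.to_bytes(2, "little") + resume_seg.to_bytes(2, "little"))
    out(CMOS_INDEX, SHUTDOWN_STATUS)
    out(CMOS_DATA, SHUTDOWN_JMP_VECTOR)
    out(KBC_COMMAND, PULSE_RESET)

log = []                            # stands in for the I/O bus
ram = bytearray(0x500)              # stands in for the first bytes of RAM
request_return_to_real_mode(lambda port, value: log.append((port, value)),
                            ram, 0x1234, 0x0010)
print([(hex(p), hex(v)) for p, v in log])
print(ram[RESUME_VECTOR:RESUME_VECTOR + 4].hex())  # 10003412
```

After the reset, the BIOS start-up code finds the shutdown code in CMOS and jumps through the stored pointer instead of running a full POST.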
The 286-12 with an AMI BIOS that our family had when I was young required around 1ms to switch from protected mode to real mode (as measured by a performance check utility supplied with a DOS extender), which is quite slow (12'000 clock cycles). It would be nice to be able to access extended memory without paying this cost every time after accessing it. And that's where the black magic comes in: There is an undocumented 286 opcode, known today under the name "286 LOADALL". This opcode was not intended for application or operating system use, but for in-system emulators. An in-system emulator is a hardware debugger plugged between the processor and the processor socket of a system, which is able to interrupt normal execution, investigate processor and system state, and then continue execution. This works by having dedicated emulator ROM and RAM on the in-system emulator, and switching the processor from executing application code to executing debugger code. When the debugger was done and system code execution was meant to continue, the processor needed to be re-initialized to the state it had when it was interrupted. This in-system emulator has to work with real-mode as well as with protected-mode target code. Interrupting and resuming protected-mode code is not completely straightforward, because segment registers in protected mode contain segment numbers, not the actual properties of the segment. When a segment register is loaded, that number is used to look up the segment descriptor, and the contents of the segment descriptor are put into an invisible shadow register of the processor called the "descriptor cache". If the descriptor is changed afterwards, the processor keeps using the old cached copy. For this to work even with breaks into the debugger code of an in-system emulator, the instruction LOADALL had to be able to directly load the cached segment descriptors, independent of the current descriptor table contents.
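The caching behaviour is easy to show in a toy model. This sketch reduces a descriptor to just a base address (a real one also carries a limit and access rights) and uses a plain dictionary as the descriptor table; the class and its methods are invented for illustration.

```python
class SegmentRegister:
    def __init__(self):
        self.selector = 0
        self.cached_base = 0  # the hidden "descriptor cache"

    def load(self, selector, table):
        """Loading the register looks up the table once and caches the result."""
        self.selector = selector
        self.cached_base = table[selector]

    def address(self, offset):
        """Memory accesses use the cached copy, never the table itself."""
        return self.cached_base + offset

table = {1: 0x200000}
ds = SegmentRegister()
ds.load(1, table)
table[1] = 0x300000            # descriptor edited after the load
print(hex(ds.address(0x10)))   # 0x200010, still the old cached base
ds.load(1, table)              # reloading picks up the change
print(hex(ds.address(0x10)))   # 0x300010
```

A debugger that saved only the selector and reloaded it on resume would silently switch the interrupted program from the first state to the second, which is why the emulator needs a way to restore the cache contents themselves.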
As became known around 1988, the LOADALL instruction could not just be used to exit from the debugger code of an in-system emulator, but it could also be called from real-mode code. Now that's a game changer! This allows real-mode code to load any base address into the segment descriptor cache, even past the 1MB barrier, without entering protected mode, and thus also without the need to leave protected mode. 286 LOADALL is quite inconvenient, because it loads the register contents from a fixed memory address (that was supposed to be in emulator RAM, and in this context, the specific address surely makes sense), but this address is somewhere inside the memory allocated for the DOS kernel. MS-DOS 5.0 or newer reserved this space for HIMEM usage, so HIMEM can use the LOADALL approach instead of the switch-to-protected-mode approach to implement the "XMS copy" function.
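Why this matters for real-mode code can be shown with another toy model. It covers a single segment register only, whereas the real instruction loads the entire processor state from a memory image, and it ignores segment limits; the class and method names are hypothetical.

```python
class RealModeSegment:
    def __init__(self):
        self.value = 0
        self.cached_base = 0

    def mov(self, value):
        """Ordinary real-mode load: the cached base is always value * 16."""
        self.value = value
        self.cached_base = value << 4

    def loadall(self, value, base):
        """LOADALL-style load: visible value and cached base set independently."""
        self.value = value
        self.cached_base = base & 0xFFFFFF

    def address(self, offset):
        return self.cached_base + offset

es = RealModeSegment()
es.mov(0xFFFF)
print(f"{es.address(0xFFFF):06X}")  # 10FFEF, the end of what normal loads can reach
es.loadall(0x0000, 0x200000)
print(f"{es.address(0x0000):06X}")  # 200000, at 2MB while still in real mode
```

The cached base stays in effect only until the next ordinary load of that segment register, which recomputes it from the register value.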
So, we can get rid of switching to protected mode to copy data from/to/between extended memory. But we can't get rid of masking A20 every time an executable packed by EXEPACK starts (which involves the keyboard controller on AT systems). And we can't get rid of resetting the 286 to leave protected mode once protected mode was actually entered. While HIMEM could skip entering protected mode, Windows 3.0 (and 3.1) running in standard mode could not, because the point of standard mode is to execute 286 code in protected mode. Yet, Windows 3.x had to switch to real mode a lot: Every time a system service provided by DOS or the BIOS was called, the processor had to be switched back to real mode. But we can get rid of kindly asking the keyboard controller to reset the 286 and then waiting until the 8042 gets around to handling the command, because most 286 mainboards provide an internal shortcut to generate a CPU reset. Sometimes the 286 in protected mode gets so lost that it has no recovery procedure left: something, like an invalid memory access, happens that needs to be reported to the OS (this is a fault); trying to report that invalid memory access to the OS again violates protection rules (now it's a double fault); and reporting this condition to the backup recovery handler also results in a protection violation (a triple fault). At that point the processor just gives up executing code and notifies the main board by issuing a special bus cycle, called the "shutdown cycle". Many main boards detect a shutdown cycle and generate a processor reset as recovery procedure. If this works, you can eliminate the keyboard controller from the path back from protected mode to real mode, yet you can't eliminate the reset itself.