BitWrangler wrote on 2023-03-25, 03:32:
What do I mean? Well when Cyrix brought out a memory mapped and i/o mapped capable version of the x87, which original 83D87 didn't support, in the next iteration of the 83D87, they apparently rolled it in... and some indication they carried it through to all of their x87 class hardware thereafter...
Supposedly there's 33% performance boost just hanging there for using these modes, and they move the FPU out of lock step with the integer unit, no waiting on either end... isn't that part of what makes Quake slow? (Not saying it would close the whole gap since it's programmed very tightly to intel interleaving, but it might close the gap halfway, just by avoiding the forced way it is made to operate otherwise.)
Dear Santa for Christmas I want a pony and a TSR that translates all FPU calls to Cyrix memory mapped mode.
I don't think you can get that TSR. You would first need a hardware modification to get the Cyrix FPU running on memory cycles. The Cyrix coprocessor implements the 387 protocol, which includes an interface consisting of two 32-bit ports on the "local bus": the "command port" and the "data port". The 387 interface also includes some control lines: one to indicate unmasked exceptions (ERROR#), one to indicate readiness of the coprocessor (BUSY#), and a data transfer request line (PEREQ). These are special-purpose FPU pins that are not available to MMIO devices, and thus arguably not part of the 80386 local bus. The Cyrix coprocessor is designed in a way that it can be connected to a 386-type local bus without ERROR#, BUSY# and PEREQ in use.
This causes some loss of features, though:
- If you issue a floating point instruction while the coprocessor is unable to accept a new instruction, the standard 387 protocol allows the processor to detect that state using the BUSY# pin and wait for readiness while the local bus is idle (and available for DMA operations) and the processor is ready to handle interrupts. Omitting the BUSY# pin, as is done in the "memory mapped" mode, causes the 83D87 to see coprocessor bus cycles at any time; if it can't handle a cycle, it stalls the local bus until it can. While the bus is stalled, neither DMA transfers nor interrupt handling is possible.
- Furthermore, the ERROR# signal would allow a "properly designed" system to report FPU exceptions deterministically: the idea is that when the 286 (or any later processor) is about to issue an FPU instruction (or executes WAIT), it polls the ERROR# line and enters the trap handler with the return address pointing to the subsequent FPU instruction. In the IBM AT (and later) series of computers, this "proper design" is not implemented anyway (as it would collide with INT 10h, used for video functions in the IBM BIOS). Instead, ERROR# is managed using IRQ13, which can happen at any time, not just at an FPU instruction, so this loss is likely negligible in AT-compatible computers.
- Not using the 387 protocol also loses the support the 386 provides as part of that protocol. This is mostly the tracking of the address of the last executed FPU instruction and (if applicable) the address of the data operand used by that instruction. If a coprocessor is accessed using the 387 protocol, this data is injected by the 386 into the data stream returned from the 387 during the execution of the FSTENV instruction.
- Lastly, the PEREQ signal is used to signal readiness for data transfer in instructions that need to perform work before transferring data to the host processor. The 387 is supposed to perform this work while the 386 is still executing further instructions. At some point, the 387 can request the data transfer, which then happens completely transparently. In the memory mapped interface, on the other hand, the data transfer can't be performed in the background, as the 387 is not a busmaster device. The host processor needs to read the data register at a moment it sees fit. If the read comes too early, the 83D87 will stall the local bus until the data is available. If it comes too late, the 83D87 will idle until the read occurs.
On the other hand, disregarding the 387 protocol has the advantage that it allows more pipelining. For example, the 83D87 can buffer one command write, so you can submit the subsequent command to the 83D87 while it is still working on the previous one. This complicates precise exception tracking even more, because an exception from the previous command might not yet have occurred when the current command is transferred into the buffer. If code relies on an exception handler finding out what exactly failed and fixing it up, that becomes very difficult. In normal PC/AT operation with standard consumer software/games, though, I don't know of any example that does require precise exception tracking.
Furthermore, transferring data through programmed I/O instead of using the PEREQ DMA channel allows for fewer bus arbitration delays if the data transfer instruction is already pending (i.e. stalling the bus) when the 83D87 gets ready to transfer the data. This improves the performance of the 83D87 instruction execution (and this is one of the reasons the "basic execution time" in the Cyrix Data Sheet beats the "Intel System Time"); OTOH, it does not generally improve system performance. Also, transferring data through a memory mapped interface makes the 80386 execution unit deal with the operand transfer, whereas the 387 protocol allows the execution unit to execute subsequent instructions while the bus interface unit handles the data transfer.
BitWrangler wrote on 2023-03-25, 03:32: they move the FPU out of lock step with the integer unit, no waiting on either end...
Now, that's a common misconception. While the bus interface of the 8087 ran in very tight lockstep with the bus interface of the 8086, and the bus interfaces still run in lock step with the 80387, the execution unit of the 8087 was never coupled to the execution unit of the 8086. Parallel execution of integer and floating point instructions has always been possible, and parallelization works better with the 387 FPU interface (due to the PEREQ signal enabling background DMA-like transfers) than with the MMIO interface. The MMIO interface still requires the CPU to actively send FPU instructions to the FPU, and to actively transfer data between the FPU and the CPU, so I don't see it as looser coupling compared to the 387 interface. Actually, the opposite is true: the explicit data transfer makes the coupling even tighter. A possible advantage of the tight coupling is that it allows sending operands (e.g. 32-bit integers) from processor registers directly to the FPU without going through memory. This is impossible using the 387 protocol (but see footnote 1).
BitWrangler wrote on 2023-03-25, 03:32: and some indication they carried it through to all of their x87 class hardware thereafter...
This makes little sense, though. I would consider such indications mostly fake rumors, but see below. There is no standardized memory mapped interface to the 83D87. In contrast to the Weitek processors, the 387 does not contain a sensible decoder to detect MMIO cycles destined for it (that's why the Weitek processors need an extra row of pins). Instead, the 387 protocol uses special signalling on the local bus that requires only minimal decoding in the 387: Intel (artificially) limited the number of I/O ports available in the 386 bus protocol to 65536. This allows I/O devices to be simpler, because they do not need to care about more than 16 address bits. It also means that every proper I/O cycle of a 386 processor has A16 to A31 at a low level.
For 387 cycles, an I/O cycle with A31 set high is used. Furthermore, A2 is connected to CMD# of the coprocessor to select the command or data register, so it responds to all virtual 32-bit I/O ports from 80000000 (command)/80000004 (data) up to FFFFFFF8 (command)/FFFFFFFC (data), i.e. 2GB of virtual I/O space. To operate like this, the 387 only needs the "I/O cycle indication" and A31 to decide whether a cycle is a 387 cycle. The pins on the 387 that are to be connected to M/IO# and A31 carry the neutral names NPS1# (needs to be low for FPU access) and NPS2 (needs to be high for FPU access). "NPS" is an abbreviation for "numeric processor select".
To access the 83D87 in MMIO mode, external logic, which is not specified by Cyrix, needs to detect MMIO cycles targeting the system-specific MMIO address of the 83D87 command/data ports and provide the NPS1#/NPS2 signals. You don't have that logic on AT-compatible mainboards. Also, as the address is system-specific, Cyrix cannot have carried over the MMIO way of accessing the 83D87 on some exotic 386 systems (or even non-386 computers) to later Cyrix x86 processors.
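To make the select logic above a bit more concrete, here is a small Python sketch of it. It is only a toy model of the decoding as I described it (M/IO# on NPS1#, A31 on NPS2, A2 choosing command vs. data); the function name and the return values are made up for illustration and do not come from any Intel or Cyrix document.

```python
# Toy model of the 387 select decoding described above.
# All names here are illustrative assumptions, not vendor terminology.
A31 = 1 << 31  # address line wired to NPS2
A2 = 1 << 2    # address line wired to the command/data select pin


def decode_387_cycle(address, is_io_cycle):
    """Return "command", "data", or None for one 32-bit bus cycle."""
    nps1_active = is_io_cycle         # NPS1# follows M/IO#: low on I/O cycles
    nps2_active = bool(address & A31)  # NPS2 follows A31: high selects the FPU
    if not (nps1_active and nps2_active):
        return None                   # ordinary memory or I/O cycle, not for the FPU
    # A2 low selects the command port, A2 high selects the data port.
    return "data" if address & A2 else "command"


if __name__ == "__main__":
    for addr in (0x80000000, 0x80000004, 0xFFFFFFF8, 0xFFFFFFFC, 0x000003F8):
        print(f"{addr:08X} ->", decode_387_cycle(addr, is_io_cycle=True))
```

An ordinary port such as 03F8 never selects the coprocessor because A31 is low, and a memory cycle never selects it regardless of the address. For MMIO operation, external glue logic would have to generate the two select inputs from a memory address decode instead.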
On the other hand, with the FPU being integrated on the CPU (486DX onwards), there is no need to implement the Intel 387 communication protocol. In particular, this means that you can build a 486-class CPU/FPU combination that ignores the synchronization required for precise exception reporting, and allows the pipelining that Intel purposefully chose not to do on the 386 (to enable precise exception reporting). Most likely not coincidentally, the Cyrix 5x86 processor does have an "FP_FAST" control bit, and explanations of that bit relate it to exception handling. My intuition is that enabling FP_FAST on a Cyrix 5x86 gives you the best of both worlds: 387-compatible programming interface, 387-compatible background data transfer, but MMIO-comparable pipelining by explicitly relinquishing the requirement for precise exception reporting. You might call this "they carried over the MMIO interface", but I consider this quite a stretch.
1: There actually is one instruction that allows direct data transfer between the FPU and the CPU without going through memory. This instruction was introduced with the 286/287 chip combination. It is FSTSW AX. This instruction covers what is likely the most prominent case of FPU-provided data being required in a CPU register, but is obviously less powerful than being able to transfer any FPU data into/from CPU registers.
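For illustration, this is what code typically wants from AX after a compare followed by FSTSW AX: the condition code bits C0 (bit 8), C2 (bit 10) and C3 (bit 14) of the status word. The Python sketch below just decodes those bits the way the x87 compare instructions define them; the function name is my own invention, and real code would of course do this with SAHF or a TEST on AH rather than in Python.

```python
# Positions of the x87 condition code bits in the 16-bit status word.
C0 = 1 << 8
C2 = 1 << 10
C3 = 1 << 14


def compare_result(status_word):
    """Classify ST(0) versus the operand from the status word after a compare."""
    codes = (
        bool(status_word & C3),
        bool(status_word & C2),
        bool(status_word & C0),
    )
    outcomes = {
        (False, False, False): "greater",   # ST(0) > operand
        (False, False, True): "less",       # ST(0) < operand
        (True, False, False): "equal",      # ST(0) == operand
        (True, True, True): "unordered",    # at least one operand is a NaN
    }
    # Other bit patterns are not produced by a plain compare.
    return outcomes.get(codes, "undefined")


if __name__ == "__main__":
    for sw in (0x0000, 0x0100, 0x4000, 0x4500):
        print(f"{sw:04X} ->", compare_result(sw))
```

All other bits of the status word (stack top, exception flags, busy) are ignored here, which matches what a branch after a compare cares about.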