
8088/ISA bus sniffer device

reenigne · Veteran Member · Joined: Dec 13, 2008 · Messages: 717 · Location: Cornwall, UK
Continued from the 8088 prefetch algorithm thread (which was getting a bit off topic).

Of course any access to a BIOS interrupt or video RAM would disturb the sequence. I understand it's for a very specific use case you have in mind for your own code, so that can be controlled.

Yes, exactly - this is for analyzing the execution of some specific pieces of code for a particular set of experiments, rather than being a general ISA bus analysis device - it's not as multi-purpose as other analyzers.

I just don't understand the effort to achieve the result you want over picking up a $50 OBLS like westveld suggests.

I just noticed that it can do 32 channels (I was thinking it could only do 16, which would make the experiments I wanted to do rather fiddly). With 32 it would be possible to monitor most of the CPU signals, which would probably be enough for what I want to do. Still, I suspect that, having gotten this far, it'd be easier to do it with the board that I've designed. That's a good backup plan, though, if I can't get this one to work.

Keep hacking though. It's what makes all this great.

Yes, it's definitely a fun little project if nothing else!
 
A couple of notes on the OBLS in general and 32-bit capture. There is a fixed 24KB of storage provided by internal block RAM. If you're capturing 16 bits you get 12K samples; at 32 bits, 8K samples. It does continuous capture, with full rule-based trigger generation to stop the capture at a configurable percentage before/after the trigger point. Also, 32-bit capture will likely degrade the maximum sampling frequency to 50 MHz at most.

I have not tried slaving the OBLS to an external clock source like the ISA bus clock, but in theory it should be possible.

On your design, RAM refresh cycles may also cause havoc. You might want to design in a capture inhibit for when the refresh signal is active.
 
On your design, RAM refresh cycles may also cause havoc. You might want to design in a capture inhibit for when the refresh signal is active.

I have a way of checking for that in the microcontroller software (it helps that the refresh DMA addresses are all in the low 64KB, so A19 will never be raised by the refresh DMA). But I might just turn the DRAM refresh off anyway - as long as I'm touching enough memory addresses I can do so safely, and it's much easier to get consistent and simple results that way. I've been turning off the refresh for the software timing experiments I have done so far.
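
Roughly, "turning off the refresh" amounts to masking DMA channel 0, since on the 5150/5160 refresh is just timer 1 triggering that channel. This is a sketch of the general idea rather than the exact code I use:

    mov  al, 0x04     ; 8237 single mask register: set mask bit (bit 2), channel 0
    out  0x0A, al     ; refresh DMA stops here
    ; ... timing-sensitive code goes here; either keep it short or make
    ;     sure it touches enough row addresses to keep the DRAM alive ...
    mov  al, 0x00     ; clear the mask bit for channel 0
    out  0x0A, al     ; refresh DMA resumes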
 
If this is for 8088 timing analysis, why not just put together a "minimum mode" 8088 system with some SRAM? No interrupts to worry about, no wait states, no DMA issues--just you and the CPU.
 
If this is for 8088 timing analysis, why not just put together a "minimum mode" 8088 system with some SRAM? No interrupts to worry about, no wait states, no DMA issues--just you and the CPU.

The point of this exercise is to look at the QS0 and QS1 queue status lines, but they aren't available in minimum mode.

Also, there are no wait states on the 5150/5160 for system memory. And turning off the interrupts and DRAM refresh is much easier than putting together another 8088 system.
 
The point of this exercise is to look at the QS0 and QS1 queue status lines, but they aren't available in minimum mode.

Also, there are no wait states on the 5150/5160 for system memory. And turning off the interrupts and DRAM refresh is much easier than putting together another 8088 system.

A maximum mode system then :) Only 5 ICs are required for a basic system: 8088 CPU, 8284 clock generator, 8288 bus controller, 74LS373 or 74LS573 address latch, and a flash ROM for the code... Add an 8255 PPI for basic I/O (e.g. diagnostic LEDs and some switches).

Or even better, connect an 8088 CPU directly to an Arduino or similar microcontroller, drive the clock/reset/data lines from it, and analyze the control signals and address (the 8 lower address bits should be enough, and they are shared with the data bus). The NMOS 8088 uses some dynamic cells and requires a clock frequency of at least 2 MHz. The CMOS 80C88 doesn't have this limitation (its clock can be stopped), and otherwise, as far as I know, it has exactly the same instruction timing.

If you build a bus monitor for the PC/XT, you can monitor fewer signals... something like CLK, READY, RESET, QS0, QS1, S0-S2, and perhaps AD0-AD7 if you want to look at instruction opcodes and the lower byte of the address. It is possible to determine wait states using the READY signal and ignore them so they won't affect your measurements.
 
A maximum mode system then :)

I suppose it's a matter of perspective - designing my own ISA card seems to me to be the simplest way to do it, because I've just done that. You've designed your own 8088 system, so perhaps that makes it seem simple to you!

There's another reason that I want to use an ISA card as well. While the main problem at the moment is figuring out the 8088 instruction timings, I eventually also want to emulate the rest of my XT with cycle accuracy, so at some point I'll probably also want to do some experiments looking at interrupt, DMA and timer signals.
 
I might be talking hearsay here, but have you considered making an FPGA core for this purpose? If you make the sample FIFOs double- (maybe triple-) buffered, a small microcontroller can dump to storage while the FPGA collects new samples from the bus.
 
I finally picked this project up again after leaving it on the back burner for more than 3 years. Turns out I was pretty close - in just a few days I've managed to get some sensible-looking results out of it. I can capture up to 2048 CPU cycles (~429 microseconds) at once, at a sample rate of 14.318MHz. I can capture 38 of the CPU's pins and 52 of the bus's pins (not all of which are very useful). I've also got 13 spare sampling pins that I can hook up to other parts of the machine if necessary.

The trace linked to above is of a small part (~7 iterations) of the mod player (end credits) code from 8088 MPH (albeit not with the same data). Most of the signal lines look correct to me, though some might be wrong (I haven't looked at them in very much detail yet). Some of the captions of the CPU lines are wrong as they're written with the assumption that it's plugged into the CPU socket, but in the end I found it easier to plug the device into the FPU socket. Most of the lines are the same, but NMI, INTR, LOCK, RQ/GT0 and RD are missing and RQ/GT1 shows up where RQ/GT0 should be. I might make a little board that I can plug into the CPU socket and break out those pins - then I'll be able to do sampling with an 8087 in the machine as well.

This is going to be on the XT server but it's not quite ready for public consumption yet (in particular, the XT server crashes on the second sampling run after a restart - I think I know how to fix this, though).

I also plan to make the output a bit more user-friendly:
  • Making the CPU's 20-bit bus show only what's valid - address or data and/or S3-S6.
  • Showing the state of the prefetch queue.
  • Making it clearer what data is going where on the bus.
  • Making it clearer what type of CPU/bus cycle is happening (T1, T2, T3, T4 or Tw).
  • Showing a disassembly of whatever instruction is actually being executed.

I'm also planning to use the data I gather from this to make my 8088 emulator code cycle-exact.
 
How can you achieve a sample rate of 14.318 MHz? Assuming you unrolled the entire capture sequence and had mux values pre-loaded in registers, wouldn't you need at least 7 cycles to bring in two bytes and update the mux selection? (in port A (1), store ind ram w/ post inc (2), in port B (1), store ind ram w/ post inc (2), out port mux select (1))

At 20 MHz, that's only 2.86 MHz sampling rate for 1/8th of the samples. To get all 104 bits is 357 KHz sampling rate. And every 13 bits will be sampled at least 45 degrees out of phase (with 8 vs 5 bits also phase shifted). I'm very confused. Please enlighten!
 
NM, I didn't even read the 2nd page with my own responses. So you are locked to OSC and every 3 ticks you are doing: in port A (1), store ind w/ inc (2) - at an effective rate of 4.77 MHz and you have to run your code 16 straight times to get a full capture?

Should work. It's very MacGyver-ish. And equally insane.

Since this thread, I have built a real-time continuous bus sniffer for an AT&T 3B2. It works pretty well at a bus speed of 10 MHz. It could be adapted for ISA if there is ever any interest. It consists of 3 quick switches, a CPLD, and a Cypress FX2LP USB peripheral. I imagine with enough development, you could not only snoop on the bus, but reconstruct graphics and sound on the host PC in real time.
 
How can you achieve a sample rate of 14.318 MHz? Assuming you unrolled the entire capture sequence and had mux values pre-loaded in registers, wouldn't you need at least 7 cycles to bring in two bytes and update the mux selection? (in port A (1), store ind ram w/ post inc (2), in port B (1), store ind ram w/ post inc (2), out port mux select (1))

At 20 MHz, that's only 2.86 MHz sampling rate for 1/8th of the samples. To get all 104 bits is 357 KHz sampling rate. And every 13 bits will be sampled at least 45 degrees out of phase (with 8 vs 5 bits also phase shifted). I'm very confused. Please enlighten!

The secret sauce is that I'm not sampling all the bits in any given cycle - I only read 6 or 7 bits at a time, at a rate of 4.77MHz (I have unrolled loops on the microcontroller but they take 3 cycles to read the value of a port and store it to RAM). The microcontroller is using OSC for its clock (14.318MHz). So I'm actually running the same 8088 code 48 times (8 mux values times 2 ports times 3 cycle phases). Each iteration I do a HLT to synchronize with the PIT and reset the PIT counters and DRAM refresh DMA address so that everything is identical on each run. Currently it'll break if I access CGA memory since the wait states will be different on different runs, but I'll fix that at some point.
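
For the curious, the per-iteration synchronisation is conceptually along these lines (a simplified sketch - the actual sequence in trace.asm also resets timer 1 and DMA channel 0, which I've left out here; it assumes IRQ0 is unmasked and a handler is installed):

    mov  al, 0x34     ; 8253 control word: counter 0, LSB then MSB, mode 2
    out  0x43, al
    mov  al, 0x00
    out  0x40, al     ; count low byte
    out  0x40, al     ; count high byte (0x0000 = 65536)
    sti
    hlt               ; wake on the next IRQ0, in lockstep with the PIT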

Admittedly this isn't very useful for watching what happens when an external peripheral (e.g. keyboard, disk, joystick, mouse, serial port, parallel port) is accessed but I'm mostly interested in figuring out the timings for the 8088 itself so I'm not too bothered about that.
 
Since this thread, I have built a real-time continuous bus sniffer for an AT&T 3B2. It works pretty well at a bus speed of 10 MHz. It could be adapted for ISA if there is ever any interest. It consists of 3 quick switches, a CPLD, and a Cypress FX2LP USB peripheral. I imagine with enough development, you could not only snoop on the bus, but reconstruct graphics and sound on the host PC in real time.

Very interesting! I do actually have a Cypress CY7C68013A dev board here - I'm thinking I should be able to use it for capturing RGBI output from the CGA, but I haven't figured out how to do anything with it yet. I also have an AD9708/AD9280 dev board that I'm hoping I'll be able to hook up to the Cypress to do continuous real-time analog signal capture and signal generation on the cheap (off-the-shelf solutions for this seem to be really expensive for 20+MHz). I have been using a composite video capture card and a VGA card respectively for these tasks, but they both miss data during horizontal and vertical sync.
 
The ISA bus sniffer is now generating (at least somewhat) comprehensible output: here is what it looks like now.

I've removed some of the lines that didn't provide useful information, fixed the ones that were wrong and added some more information on the right, showing:
  • CPU bus phase (T1, T2, T3, Twait, T4 or idle)
  • DMA controller bus phase (S0, S1, S2, S3, Swait, S4 or idle)
  • Bus transfers: direction, address, data, segment, type (normal, instruction fetch, DMA, interrupt acknowledge, port IO).
  • Queue operation (E = empty, I = initial byte, S = subsequent byte).
  • Instruction bytes and disassembly.

It was tricky to find the right place to put the instruction disassembly - if we put it at the "I" line for the instruction then we might not have all the bytes, but putting it right before the "I" line for the following instruction would mean that the program needs to "look into the future" (to see if there's an "I" on the following line which we haven't processed yet), which complicates things. In the end I put it on the last queue operation line (final "S", or "I" for one-byte instructions), which is a time when we know for sure that the CPU is executing that instruction. My disassembler knows how long instructions are, so it wasn't hard to find the last byte.

The disassembly isn't "perfect" because we don't have access to all the information. In particular, we only know the physical address of an instruction and not its segment:offset pair, so IP-relative instructions (8 and 16 bit jumps, calls and loops) are shown as (e.g.) "LOOP IP-3B" instead of the offset you'd normally see in a disassembly.

I've switched to sampling at 4.77MHz instead of 14.318MHz. The extra samples were useful for debugging the system but ultimately not particularly enlightening (they either duplicated other samples or displayed inconsistently due to sampling around the time that lines are changing). This does mean I don't see the ALE signal but that doesn't seem to be very useful (a better way to detect T1 seems to be by watching for the S0-2 status lines going out of the "passive" state).

I've already learned some interesting things:
  • DRAM refresh DMA accesses that I had always assumed took 4 cycles actually seem to take 5 or 6 depending on where in the CPU's bus cycle the access begins.
  • Port writes seem to be incurring a cycle of wait state which I did not expect.
  • It looks like sometimes the EU is able to suppress a prefetch cycle from starting even when it seems like there should be space in the queue (though perhaps this just means that the bytes aren't removed from the prefetch queue until sometime after the CPU sends its "queue status" information).
  • I have timed this routine as taking exactly 288 cycles per iteration, but none of these instances seem to be exactly that length. I think it just takes a while for it to settle down so that the DRAM refreshes fall into the optimal timeslots. I may experiment with doing some longer runs in order to see this happening.
  • In between accesses (i.e. in state T3 generally) the data bus usually floats to FF, but sometimes to FD or DD instead.
  • Not all the lines necessary to determine the DMA controller state are available on the FPU socket or ISA bus. I've fudged it a little so that the DMAs look sensible but they might not show up right for memory-to-memory copies. I may tap some lines from the motherboard to fix this, if it becomes a problem.

You can try this out! Grab these files:
https://github.com/reenigne/reenigne/blob/master/8088/defaults_bin.asm
https://github.com/reenigne/reenigne/blob/master/8088/defaults_common.asm
https://github.com/reenigne/reenigne/blob/master/8088/trace/trace.asm
https://github.com/reenigne/reenigne/blob/master/8088/trace/build.bat
http://yasm.tortall.net/Download.html (NASM should work too with minor modifications).

Replace the code under "testRoutine:" with the code that you'd like to get a trace of (it needs to run exactly the same each time it's run, so be sure to initialize all the registers and memory you use - I made that mistake and it had me scratching my head for a bit). Build trace.bin and upload the code to http://www.reenigne.org/xtserver - within seconds you should get a link to a trace like the one I've linked to above.
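
For example, something like this (a made-up routine just to illustrate the point about initializing everything; only the testRoutine: label itself comes from trace.asm):

    testRoutine:
        mov  ax, 0x1234       ; give every register the routine uses a known starting value
        xor  dx, dx
        mov  cx, 8
    .top:
        add  dx, ax           ; the code being timed
        loop .top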

Code accessing the CGA probably won't work right yet - I think it should just be a matter of using my "lockstep" macro instead of waiting for IRQ0, then resetting the refresh DMA address. It may not give consistent results from run to run, but it should at least give consistent results for different iterations of the same run, which is all that's needed to get a good trace. I may have a play about with this tomorrow.
 
The secret sauce is that I'm not sampling all the bits in any given cycle - I only read 6 or 7 bits at a time, at a rate of 4.77MHz (I have unrolled loops on the microcontroller but they take 3 cycles to read the value of a port and store it to RAM). The microcontroller is using OSC for its clock (14.318MHz). So I'm actually running the same 8088 code 48 times (8 mux values times 2 ports times 3 cycle phases). Each iteration I do a HLT to synchronize with the PIT and reset the PIT counters and DRAM refresh DMA address so that everything is identical on each run. Currently it'll break if I access CGA memory since the wait states will be different on different runs, but I'll fix that at some point.
I guess my FPGA idea didn't sit all that well with you :p? Ah well, at least I don't have to build this anymore.

How sure are you that the prefetch queue will be in the same exact state each time?

I'm guessing you could model your XT server program as three different basic blocks:
A- Whatever was executing before custom code was loaded
B- Reset the DMAC, PIT, etc
C- Your custom code, jumps back to B.

Given that C jumps back to B, can you prove that when we get back to the beginning of C, that the prefetch queue state (number of bytes filled and byte contents) and the execution unit state (current opcode/operand of an instruction being parsed, and the current cycle number within the current instruction being executed*) are the same every loop?

If the beginning of C has the same internal processor state every single execution, then yes your scheme works :); same input for the same internal state will produce the same output!

I would personally like to know when the EU decides to start grabbing bytes from the queue. Does it wait for a full instruction and grab one byte each cycle? Does the EU take a byte immediately on the first cycle of the BIU starting to fetch the next byte? Does the EU ever delay grabbing a byte, until, say, the second or third or fourth cycle of the BIU fetching the next byte? I imagine this information can be gleaned from the bus activity alone if enough instruction types are tested.

* An example: MUL takes 40+ cycles. Therefore the state inside the EU might reasonably be; opcode/operand parsing done, cycle 25 out of 40+ of actually doing the MUL.
 
How sure are you that the prefetch queue will be in the same exact state each time?

By using a long-running instruction with no memory access before starting the test, such as a mul or div, you can make sure that the CPU has had ample time to fill the prefetch-buffer.
Alternatively, a jump should always empty the entire buffer.
So you can force it to be in the full or empty state.
 
I guess my FPGA idea didn't sit all that well with you :p?

Not so much that as the fact that I had already built the microcontroller version at that point. I still mean to get into FPGAs at some point, perhaps for eventually doing something similar with a 286.

How sure are you that the prefetch queue will be in the same exact state each time?

I halt the CPU (which stops prefetching) and wait for the timer interrupt (jumps, calls, returns and interrupts all clear out the queue). The trickier part is getting the same refresh DMAs to happen at the same times, which involves completely resetting timer 1 and DMA channel 0. In order to avoid DRAM decay during this process, I need to increase the timer 1 frequency temporarily to refresh all DRAM rows before halting and after restarting.
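
The "refresh all DRAM rows quickly" part is roughly this (a sketch of the idea - the actual counts and sequence in my code may differ; 18 is the normal BIOS value, about one refresh every 15us):

    mov  al, 0x54     ; 8253 control word: counter 1, LSB only, mode 2
    out  0x43, al
    mov  al, 2        ; very short period - refresh rows as fast as possible
    out  0x41, al
    ; ... wait long enough for every DRAM row to be refreshed ...
    mov  al, 0x54
    out  0x43, al
    mov  al, 18       ; restore the normal ~15us refresh period
    out  0x41, al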

I'm currently not doing anything special to get the CPU into a known state with respect to the CGA clock - I think I can do this just by ensuring that timer channel 0 runs for a multiple of 4 timer cycles each iteration (there are 3 CGA clock cycles per 16 CPU cycles).
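
To spell out the arithmetic behind that (taking "CGA clock" to mean the character clock): the CPU clock is OSC/3, the PIT clock is OSC/12 and the CGA clock is OSC/16, so all three realign every lcm(3, 12, 16) = 48 OSC cycles, which is 16 CPU cycles = 4 PIT cycles = 3 CGA clocks.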

I'm guessing you could model your XT server program as three different basic blocks:
A- Whatever was executing before custom code was loaded
B- Reset the DMAC, PIT, etc
C- Your custom code, jumps back to B.

Yes, that's about right.

Given that C jumps back to B, can you prove that when we get back to the beginning of C, that the prefetch queue state (number of bytes filled and byte contents) and the execution unit state (current opcode/operand of an instruction being parsed, and the current cycle number within the current instruction being executed*) are the same every loop?

Yes, I think so (and it seems to be giving sensible results). There are a couple of ways of doing this - one is to empty the prefetch queue (with a control flow instruction) and the other is to fill it (with a long-running instruction like a MUL). I'm using the former method here as it kills two birds with one stone (the other being to get the CPU into a known state with respect to the PIT).
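
In instruction form, the two options are something like this (a sketch):

    ; 1: any transfer of control flushes the prefetch queue
        jmp  short sync
    sync:                     ; queue is now empty

    ; 2: a long-running instruction with no bus accesses gives the
    ;    BIU time to fill the queue completely behind it
        mov  al, 0xFF
        mul  al               ; 8-bit MUL takes 70+ cycles to execute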

I would personally like to know when the EU decides to start grabbing bytes from the queue. Does it wait for a full instruction and grab one byte each cycle?

We can tell without sniffing that it can't wait for the full instruction to be in the prefetch queue, because the queue is only 4 bytes long and some instructions (even discounting prefixes) can be 6 bytes.

Looking at traces, it seems that the EU is capable of grabbing a byte from the queue each cycle, but does not always do so. For example, in the three-byte instruction "add ax,9999", the EU grabs the instruction byte on cycle 0 and the two immediate bytes on cycles 2 and 3. Looks like a mod/rm byte (if there is one) will be grabbed on cycle 1, though.

Does the EU take a byte immediately on the first cycle of the BIU starting to fetch the next byte?

Looking at a JMP instruction, we see a "queue is emptied" signal from the CPU. Let's call the cycle that happens on cycle 0. Immediately afterwards, the bus starts doing the prefetch for the destination address. That takes cycles 1 through 4 inclusive (bus states T1-T4). Then the bus starts prefetching the second code byte (i.e. destination+1), which takes cycles 5-8. The EU doesn't grab a byte from the prefetch queue until cycle 7 (the T3 state of the second prefetch). I'm not sure why it takes so long.

The BIU usually starts a new prefetch 1 cycle after a byte is grabbed from the queue (if the queue is full and a byte is grabbed on cycle 0 then cycle 1 will usually be a T1 state for the next prefetch). Exceptions I've seen so far: "POP rw", "POP segreg", "POPF" and "RET" (start bus cycle for the stack fetch on cycle 3), "IN AL, DX" and "IN AX, DX" (start bus cycle for port IO on cycle 3). In all these cases, the CPU knows from the opcode that it's going to need a fetch, has the address in a register, and knows which register from the opcode.
 
It's been a while since I looked at this thread. Have there been any updates in the past year?

The EU doesn't grab a byte from the prefetch queue until cycle 7 (the T3 state of the second prefetch). I'm not sure why it takes so long.
I confess, it's been a while since I've looked into anything vintage x86 :(. How do you know and/or infer on which cycles the EU is grabbing a byte?
 
It's been a while since I looked at this thread. Have there been any updates in the past year?

Not really. The next step is to use data from this sniffer to figure out how to make an 8088 simulator cycle-exact, but other projects have taken precedence. I've used the device to answer some questions from a couple of people over at Vogons who are trying to do exactly that, but I think they've got a way to go yet. At some point I expect the cycle-exact simulator will draw me in again and I'll do an exhaustive set of experiments (unless someone else beats me to it by running their own exhaustive set of experiments on the XT Server).

I confess, it's been a while since I've looked into anything vintage x86 :(. How do you know and/or infer on which cycles the EU is grabbing a byte?

From the QS0 and QS1 pins of the CPU, which output this information for the benefit of the FPU (QS1:QS0 = 00 means no queue operation that cycle, 01 means the first byte of an instruction was taken from the queue, 10 means the queue was emptied, and 11 means a subsequent byte was taken).
 
Thanks for the update! Sometime soon (hopefully within the month), I'll run some of my own tests if your XT server is up.

Real life has been in the way, but since I initially read this post last year, I've been meaning to test HOLD/HLDA behavior under a variety of conditions. I find the inconsistent number of cycles that the DRAM refresh DMA takes very interesting (it's more "hidden state" in addition to the prefetch queue), and wonder if that applies to other channels as well...
 