
Thread: 8088/ISA bus sniffer device

  1. #11
    Join Date
    Mar 2011
    Location
    Atlanta, GA, USA
    Posts
    1,593

    Default

    NM, I didn't even read the 2nd page with my own responses. So you are locked to OSC and every 3 ticks you are doing: in port A (1), store ind w/ inc (2) - at an effective rate of 4.77 MHz and you have to run your code 16 straight times to get a full capture?

    Should work. It's very MacGyver-ish. And equally insane.

    Since this thread, I have built a real-time continuous bus sniffer for an AT&T 3B2. It works pretty well at a bus speed of 10 MHz. Could be adapted for ISA if there is ever any interest. Consists of 3 quick switches, CPLD, and Cypress FX2LP USB peripheral. I imagine with enough development, you could not only snoop on the bus, but reconstruct graphics and sound on the host PC in real time.
    Last edited by eeguru; October 12th, 2015 at 12:49 AM.
    "Good engineers keep thick authoritative books on their shelf. Not for their own reference, but to throw at people who ask stupid questions; hoping a small fragment of knowledge will osmotically transfer with each cranial impact." - Me

  2. #12

    Default

    Quote Originally Posted by eeguru View Post
    How can you achieve a sample rate of 14.318 MHz? Assuming you unrolled the entire capture sequence and had mux values pre-loaded in registers, wouldn't you need at least 7 cycles to bring in two bytes and update the mux selection? (in port A (1), store ind ram w/ post inc (2), in port B (1), store ind ram w/ post inc (2), out port mux select (1))

    At 20 MHz, that's only 2.86 MHz sampling rate for 1/8th of the samples. To get all 104 bits is 357 kHz sampling rate. And every 13 bits will be sampled at least 45 degrees out of phase (with 8 vs 5 bits also phase shifted). I'm very confused. Please enlighten!
    The secret sauce is that I'm not sampling all the bits in any given cycle - I only read 6 or 7 bits at a time, at a rate of 4.77MHz (I have unrolled loops on the microcontroller but they take 3 cycles to read the value of a port and store it to RAM). The microcontroller is using OSC for its clock (14.318MHz). So I'm actually running the same 8088 code 48 times (8 mux values times 2 ports times 3 cycle phases). Each iteration I do a HLT to synchronize with the PIT and reset the PIT counters and DRAM refresh DMA address so that everything is identical on each run. Currently it'll break if I access CGA memory since the wait states will be different on different runs, but I'll fix that at some point.
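    The arithmetic behind this multi-pass scheme can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constant and function names are made up for illustration, and the figures (14.318 MHz OSC, 3 microcontroller cycles per sample, 8 mux values, 2 ports) are the ones quoted in the post.

```python
OSC_HZ = 14_318_000          # OSC frequency as quoted in the post (14.318 MHz)
CYCLES_PER_SAMPLE = 3        # microcontroller cycles to read a port and store to RAM

MUX_VALUES = 8               # mux selections, per the post
PORTS = 2                    # input ports, per the post
PHASES = CYCLES_PER_SAMPLE   # one run per phase offset within a sample period


def runs_needed(mux_values=MUX_VALUES, ports=PORTS, phases=PHASES):
    """Number of identical runs of the 8088 code needed for a full capture."""
    return mux_values * ports * phases


def per_run_sample_rate(osc_hz=OSC_HZ, cycles_per_sample=CYCLES_PER_SAMPLE):
    """Sample rate achieved within a single run, in Hz."""
    return osc_hz / cycles_per_sample


def effective_sample_rate(osc_hz=OSC_HZ, cycles_per_sample=CYCLES_PER_SAMPLE,
                          phases=PHASES):
    """Sample rate after interleaving the phase-shifted runs, in Hz."""
    return per_run_sample_rate(osc_hz, cycles_per_sample) * phases


print(runs_needed())                                # 48
print(round(per_run_sample_rate() / 1e6, 2))        # 4.77
print(round(effective_sample_rate() / 1e6, 3))      # 14.318
```

    The scheme trades wall-clock time for bandwidth: each run only has to sustain 4.77 MHz, and the full-rate picture only exists after all 48 runs are merged, which is why every run has to be cycle-for-cycle identical.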

    Admittedly this isn't very useful for watching what happens when an external peripheral (e.g. keyboard, disk, joystick, mouse, serial port, parallel port) is accessed but I'm mostly interested in figuring out the timings for the 8088 itself so I'm not too bothered about that.

  3. #13

    Default

    Quote Originally Posted by eeguru View Post
    Since this thread, I have built a real-time continuous bus sniffer for an AT&T 3B2. It works pretty well at a bus speed of 10 MHz. Could be adapted for ISA if there is ever any interest. Consists of 3 quick switches, CPLD, and Cypress FX2LP USB peripheral. I imagine with enough development, you could not only snoop on the bus, but reconstruct graphics and sound on the host PC in real time.
    Very interesting! I do actually have a Cypress CY7C68013A dev board here - I'm thinking I should be able to use it for capturing RGBI output from the CGA, but I haven't figured out how to do anything with it yet. I also have an AD9708/AD9280 dev board that I'm hoping I'll be able to hook up to the Cypress to do continuous real time analog signal capture and signal generation on the cheap (off the shelf solutions for this seem to be really expensive for 20+MHz). I have been using a composite video capture card and a VGA card respectively for this task but they both miss data during horizontal and vertical sync.

  4. #14

    Default

    The ISA bus sniffer is now generating (at least somewhat) comprehensible output: here is what it looks like now.

    I've removed some of the lines that didn't provide useful information, fixed the ones that were wrong and added some more information on the right, showing:
    • CPU bus phase (T1, T2, T3, Twait, T4 or idle)
    • DMA controller bus phase (S0, S1, S2, S3, Swait, S4 or idle)
    • Bus transfers: direction, address, data, segment, type (normal, instruction fetch, DMA, interrupt acknowledge, port IO).
    • Queue operation (E = empty, I = initial byte, S = subsequent byte).
    • Instruction bytes and disassembly.


    It was tricky to find the right place to put the instruction disassembly - if we put it at the "I" line for the instruction then we might not have all the bytes, but putting it right before the "I" line for the following instruction would mean that the program needs to "look into the future" (to see if there's an "I" on the following line which we haven't processed yet) which complicates things. In the end I put it on the last queue operation line (final "S" or "I" for one-byte instructions) which is a time when we know for sure that the CPU is executing that instruction. My disassembler knows how long instructions are so it wasn't hard to find the last byte.

    The disassembly isn't "perfect" because we don't have access to all the information. In particular, we only know the physical address of an instruction and not its segment:offset pair, so IP-relative instructions (8 and 16 bit jumps, calls and loops) are shown as (e.g.) "LOOP IP-3B" instead of the offset you'd normally see in a disassembly.
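    As an illustration of the IP-relative notation, here is a small Python sketch that renders a LOOP instruction from its two raw bytes without knowing CS:IP. It is not the sniffer's actual disassembler; the function names are invented, and the "IP+NN"/"IP-NN" format is only inferred from the single "LOOP IP-3B" example above.

```python
def signed8(byte):
    """Interpret a byte (0..255) as a two's-complement 8-bit value."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a byte in the range 0..255")
    return byte - 0x100 if byte & 0x80 else byte


def format_ip_relative(disp):
    """Render a signed displacement as 'IP+NN' or 'IP-NN' in hex.

    The exact output format is an assumption based on the example
    "LOOP IP-3B" quoted in the post.
    """
    sign = "-" if disp < 0 else "+"
    return f"IP{sign}{abs(disp):02X}"


def disassemble_loop(opcode, disp_byte):
    """Disassemble a two-byte LOOP (opcode 0xE2) without knowing CS:IP."""
    if opcode != 0xE2:
        raise ValueError("only LOOP (0xE2) is handled in this sketch")
    return "LOOP " + format_ip_relative(signed8(disp_byte))


print(disassemble_loop(0xE2, 0xC5))  # LOOP IP-3B
```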

    I've switched to sampling at 4.77MHz instead of 14.318MHz. The extra samples were useful for debugging the system but ultimately not particularly enlightening (they either duplicated other samples or displayed inconsistently due to sampling around the time that lines are changing). This does mean I don't see the ALE signal but that doesn't seem to be very useful (a better way to detect T1 seems to be by watching for the S0-2 status lines going out of the "passive" state).
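    The idea of spotting the start of a bus cycle from the status lines can be sketched in Python as follows. The encoding table is the standard 8088 maximum-mode meaning of S2/S1/S0; everything else (the function name, the sample format, and the assumption that the first non-passive sample lines up with T1) is hypothetical and depends on where in the cycle the real hardware samples.

```python
# 8088 maximum-mode bus status, indexed by (S2 << 2) | (S1 << 1) | S0.
STATUS_NAMES = {
    0b000: "interrupt acknowledge",
    0b001: "port read",
    0b010: "port write",
    0b011: "halt",
    0b100: "instruction fetch",
    0b101: "memory read",
    0b110: "memory write",
    0b111: "passive",
}
PASSIVE = 0b111


def find_bus_cycle_starts(status_samples):
    """Return (sample index, cycle type) for each passive -> active transition.

    status_samples is a sequence of 3-bit values, one per CPU clock.
    Treating the first non-passive sample as T1 is an assumption of this
    sketch, not something the post states in detail.
    """
    starts = []
    previous = PASSIVE
    for index, status in enumerate(status_samples):
        if previous == PASSIVE and status != PASSIVE:
            starts.append((index, STATUS_NAMES[status]))
        previous = status
    return starts


samples = [7, 7, 4, 4, 7, 7, 7, 5, 5, 7]
print(find_bus_cycle_starts(samples))
# [(2, 'instruction fetch'), (7, 'memory read')]
```

    A side benefit of this approach over watching ALE is that the same three lines also give the transfer type, which is what the "type" column in the trace needs anyway.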

    I've already learned some interesting things:
    • DRAM refresh DMA accesses that I had always assumed took 4 cycles actually seem to take 5 or 6 depending on where in the CPU's bus cycle the access begins.
    • Port writes seem to be incurring a cycle of wait state which I did not expect.
    • It looks like sometimes the EU is able to suppress a prefetch cycle from starting even when it seems like there should be space in the queue (though perhaps this just means that the bytes aren't removed from the prefetch queue until sometime after the CPU sends its "queue status" information).
    • I have timed this routine as taking exactly 288 cycles per iteration, but none of these instances seem to be exactly that length. I think it just takes a while for it to settle down so that the DRAM refreshes fall into the optimal timeslots. I may experiment with doing some longer runs in order to see this happening.
    • In between accesses (i.e. in state T3 generally) the data bus usually floats to FF, but sometimes to FD or DD instead.
    • Not all the lines necessary to determine the DMA controller state are available on the FPU socket or ISA bus. I've fudged it a little so that the DMAs look sensible but they might not show up right for memory-to-memory copies. I may tap some lines from the motherboard to fix this, if it becomes a problem.


    You can try this out! Grab these files:
    https://github.com/reenigne/reenigne...faults_bin.asm
    https://github.com/reenigne/reenigne...lts_common.asm
    https://github.com/reenigne/reenigne...race/trace.asm
    https://github.com/reenigne/reenigne...race/build.bat
    http://yasm.tortall.net/Download.html (NASM should work too with minor modifications).

    Replace the code under "testRoutine:" with the code that you'd like to get a trace of (it needs to run exactly the same each time it's run, so be sure to initialize all registers and memory you use - I made that mistake and it had me scratching my head for a bit). Build trace.bin and upload the code to http://www.reenigne.org/xtserver - within seconds you should get a link to a trace like the one I've linked to above.

    Code accessing the CGA probably won't work right yet - I think it should just be a matter of using my "lockstep" macro instead of waiting for IRQ0, then resetting the refresh DMA address. The result may not give consistent results from run to run but it should at least give consistent results for different iterations of the same run, which is all that's needed to get a good trace. I may have a play about with this tomorrow.

  5. #15
    Join Date
    Dec 2011
    Location
    NJ
    Posts
    809
    Blog Entries
    13

    Default

    Quote Originally Posted by reenigne View Post
    The secret sauce is that I'm not sampling all the bits in any given cycle - I only read 6 or 7 bits at a time, at a rate of 4.77MHz (I have unrolled loops on the microcontroller but they take 3 cycles to read the value of a port and store it to RAM). The microcontroller is using OSC for its clock (14.318MHz). So I'm actually running the same 8088 code 48 times (8 mux values times 2 ports times 3 cycle phases). Each iteration I do a HLT to synchronize with the PIT and reset the PIT counters and DRAM refresh DMA address so that everything is identical on each run. Currently it'll break if I access CGA memory since the wait states will be different on different runs, but I'll fix that at some point.
    I guess my FPGA idea didn't sit all that well with you :P? Ah well, at least I don't have to build this anymore.

    How sure are you that the prefetch queue will be in the same exact state each time?

    I'm guessing you could model your XT server program as three different basic blocks:
    A- Whatever was executing before custom code was loaded
    B- Reset the DMAC, PIT, etc
    C- Your custom code, jumps back to B.

    Given that C jumps back to B, can you prove that when we get back to the beginning of C, that the prefetch queue state (number of bytes filled and byte contents) and the execution unit state (current opcode/operand of an instruction being parsed, and the current cycle number within the current instruction being executed*) are the same every loop?

    If the beginning of C has the same internal processor state every single execution, then yes, your scheme works; the same input for the same internal state will produce the same output!

    I would personally like to know when the EU decides to start grabbing bytes from the queue. Does it wait for a full instruction and grab one byte each cycle? Does the EU take a byte immediately on the first cycle of the BIU starting to fetch the next byte? Does the EU ever delay grabbing a byte, until, say, the second or third or fourth cycle of the BIU fetching the next byte? I imagine this information can be gleaned from the bus activity alone if enough instruction types are tested.

    * An example: MUL takes 40+ cycles. Therefore the state inside the EU might reasonably be: opcode/operand parsing done, cycle 25 out of 40+ of actually doing the MUL.
    Looking for: Needham's Electronics PB-10 Microcontroller Adapter (looking for one since early 2012!).

  6. #16
    Join Date
    Dec 2014
    Location
    The Netherlands
    Posts
    2,024

    Default

    Quote Originally Posted by cr1901 View Post
    How sure are you that the prefetch queue will be in the same exact state each time?
    By using a long-running instruction with no memory access before starting the test, such as a mul or div, you can make sure that the CPU has had ample time to fill the prefetch-buffer.
    Alternatively, a jump should always empty the entire buffer.
    So you can force it to be in the full or empty state.

  7. #17

    Default

    Quote Originally Posted by cr1901 View Post
    I guess my FPGA idea didn't sit all that well with you :P?
    Not so much that as the fact that I had already built the microcontroller version at that point. I still mean to get into FPGAs at some point, perhaps for eventually doing something similar with a 286.

    Quote Originally Posted by cr1901 View Post
    How sure are you that the prefetch queue will be in the same exact state each time?
    I halt the CPU (which stops prefetching) and wait for the timer interrupt (jumps, calls, returns and interrupts all clear out the queue). The trickier part is getting the same refresh DMAs to happen at the same times, which involves completely resetting timer 1 and DMA channel 0. In order to avoid DRAM decay during this process, I need to increase the timer 1 frequency temporarily to refresh all DRAM rows before halting and after restarting.

    I'm currently not doing anything special to get the CPU into a known state with respect to the CGA clock - I think I can do this just by ensuring that timer channel 0 runs for a multiple of 4 timer cycles each iteration (there are 3 CGA clock cycles per 16 CPU cycles).
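    The "multiple of 4 timer cycles" condition follows from the clock ratios, which can be checked with a short Python sketch. The divisors are the usual PC/XT ones (CPU clock = OSC/3, PIT input = OSC/12); the ratio of 3 CGA clock cycles per 16 CPU cycles is taken from the post as stated, and the function names are invented for illustration.

```python
from fractions import Fraction

OSC_PER_CPU_CYCLE = 3     # CPU clock = OSC / 3 (4.77 MHz from 14.318 MHz)
OSC_PER_PIT_TICK = 12     # PIT input clock = OSC / 12 on the PC/XT
CPU_PER_PIT_TICK = OSC_PER_PIT_TICK // OSC_PER_CPU_CYCLE   # 4 CPU cycles

# "3 CGA clock cycles per 16 CPU cycles", taken from the post as stated.
CGA_PER_CPU = Fraction(3, 16)


def cga_cycles(pit_ticks):
    """CGA clock cycles elapsed during the given number of PIT ticks."""
    return pit_ticks * CPU_PER_PIT_TICK * CGA_PER_CPU


def keeps_cga_phase(pit_ticks):
    """True if the interval is a whole number of CGA clock cycles."""
    return cga_cycles(pit_ticks).denominator == 1


smallest = next(n for n in range(1, 100) if keeps_cga_phase(n))
print(smallest)            # 4
print(cga_cycles(4))       # 3
print(keeps_cga_phase(6))  # False
```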

    Quote Originally Posted by cr1901 View Post
    I'm guessing you could model your XT server program as three different basic blocks:
    A- Whatever was executing before custom code was loaded
    B- Reset the DMAC, PIT, etc
    C- Your custom code, jumps back to B.
    Yes, that's about right.

    Quote Originally Posted by cr1901 View Post
    Given that C jumps back to B, can you prove that when we get back to the beginning of C, that the prefetch queue state (number of bytes filled and byte contents) and the execution unit state (current opcode/operand of an instruction being parsed, and the current cycle number within the current instruction being executed*) are the same every loop?
    Yes, I think so (and it seems to be giving sensible results). There are a couple of ways of doing this - one is to empty the prefetch queue (with a control flow instruction) and the other is to fill it (with a long-running instruction like a MUL). I'm using the former method here as it kills two birds with one stone (the other being to get the CPU into a known state with respect to the PIT).

    Quote Originally Posted by cr1901 View Post
    I would personally like to know when the EU decides to start grabbing bytes from the queue. Does it wait for a full instruction and grab one byte each cycle?
    We can tell without sniffing that it can't wait for the full instruction to be in the prefetch queue, because the queue is only 4 bytes long and some instructions (even discounting prefixes) can be 6 bytes.

    Looking at traces, it seems that the EU is capable of grabbing a byte from the queue each cycle, but does not always do so. For example, in the three-byte instruction "add ax,9999", the EU grabs the instruction byte on cycle 0 and the two immediate bytes on cycles 2 and 3. Looks like a mod/rm byte (if there is one) will be grabbed on cycle 1, though.

    Quote Originally Posted by cr1901 View Post
    Does the EU take a byte immediately on the first cycle of the BIU starting to fetch the next byte?
    Looking at a JMP instruction, we see a "queue is emptied" signal from the CPU. Let's call the cycle on which that happens cycle 0. Immediately afterwards, the bus starts doing the prefetch for the destination address. That takes cycles 1 through 4 inclusive (bus states T1-T4). Then the bus starts prefetching the second code byte (i.e. destination+1), which takes cycles 5-8. The EU doesn't grab a byte from the prefetch queue until cycle 7 (the T3 state of the second prefetch). I'm not sure why it takes so long.

    The BIU usually starts a new prefetch 1 cycle after a byte is grabbed from the queue (if the queue is full and a byte is grabbed on cycle 0 then cycle 1 will usually be a T1 state for the next prefetch). Exceptions I've seen so far: "POP rw", "POP segreg", "POPF" and "RET" (start bus cycle for the stack fetch on cycle 3), "IN AL, DX" and "IN AX, DX" (start bus cycle for port IO on cycle 3). In all these cases, the CPU knows from the opcode that it's going to need a fetch, has the address in a register, and knows which register from the opcode.
    Last edited by reenigne; March 9th, 2016 at 12:51 AM.

  8. #18

    Default

    It's been a while since I looked at this thread. Have there been any updates in the past year?

    Quote Originally Posted by reenigne View Post
    The EU doesn't grab a byte from the prefetch queue until cycle 7 (the T3 state of the second prefetch). I'm not sure why it takes so long.
    I confess, it's been a while since I've looked into anything of x86 vintage. How do you know and/or infer on which cycles the EU is grabbing a byte?

  9. #19

    Default

    Quote Originally Posted by cr1901 View Post
    It's been a while since I looked at this thread. Have there been any updates in the past year?
    Not really. The next step is to use data from this sniffer to figure out how to make an 8088 simulator cycle-exact, but other projects have taken precedence. I've used the device to answer some questions from a couple of people over at Vogons who are trying to do exactly that, but I think they've got a way to go yet. At some point I expect the cycle-exact simulator will draw me in again and I'll do an exhaustive set of experiments (unless someone else beats me to it by running their own exhaustive set of experiments on the XT Server).

    Quote Originally Posted by cr1901 View Post
    I confess, it's been a while since I've looked into anything of x86 vintage. How do you know and/or infer on which cycles the EU is grabbing a byte?
    From the QS0 and QS1 pins of the CPU, which output this information for the benefit of the FPU.
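    For anyone following along, the queue status encoding can be turned into the E/I/S letters used in the traces with a small lookup. The (QS1, QS0) meanings below are the standard 8088 maximum-mode ones; the "." placeholder for "no operation", the function names, and the sample data (modelled on the "add ax,9999" example earlier in the thread) are assumptions of this sketch.

```python
# Queue status keyed by (QS1, QS0), mapped to the letters used in the traces.
QUEUE_STATUS = {
    (0, 0): ".",   # no queue operation (the "." placeholder is an assumption)
    (0, 1): "I",   # initial (first) byte of an instruction taken from the queue
    (1, 0): "E",   # queue emptied
    (1, 1): "S",   # subsequent byte of an instruction taken from the queue
}


def render_queue_ops(samples):
    """Turn a sequence of (QS1, QS0) samples, one per cycle, into a string."""
    return "".join(QUEUE_STATUS[sample] for sample in samples)


def byte_fetch_cycles(samples):
    """Cycle numbers on which the EU took a byte from the prefetch queue."""
    return [cycle for cycle, sample in enumerate(samples)
            if QUEUE_STATUS[sample] in ("I", "S")]


# Modelled on "add ax,9999": opcode on cycle 0, immediate bytes on cycles 2 and 3.
add_ax_imm = [(0, 1), (0, 0), (1, 1), (1, 1)]
print(render_queue_ops(add_ax_imm))    # I.SS
print(byte_fetch_cycles(add_ax_imm))   # [0, 2, 3]
```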

  10. #20

    Default

    Thanks for the update! Sometime soon (hopefully within the month), I'll run some of my own tests if your XT server is up.

    Real life has been in the way, but since I initially read this post last year, I've been meaning to test HOLD/HLDA behavior under a variety of conditions. I find the inconsistent number of cycles that DMA for DRAM refresh takes very interesting (it's more "hidden state" in addition to the prefetch queue), and wonder if that applies to other channels as well...
