
View Full Version : ISA maximum sustained transfer rate.



mR_Slug
February 26th, 2017, 12:11 PM
I have seen many sources state that the max transfer rate on the ISA bus is something along the lines of "typically 1-2MB/s", but I can't really find any explanation as to why.

I found this: (InfoWorld Jan 25, 1993 - Steve Gibson)

https://books.google.co.uk/books?id=2zsEAAAAMBAJ&pg=PA29&lpg=PA29&dq=InfoWorld+Jan+25,+1993+-+Steve+Gibson:&source=bl&ots=wivjSODOuD&sig=nIlfqNF-RQAhmPLNrm9YWJzWyrw&hl=en&sa=X&redir_esc=y#v=onepage&q=InfoWorld%20Jan%2025%2C%201993%20-%20Steve%20Gibson%3A&f=false

That states the theoretical max transfer rate as 5.3MB/s for 16-bit ISA @ 8MHz. Summarized:

2 bytes are transferred over the bus at a time, but it takes about 3 cycles to send them. At 8MHz:
(2 bytes x 8MHz)/3 = 5.333MB/s.
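
The arithmetic above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `peak_rate_mb_s` is made up here, 1 MB is taken as 10^6 bytes, and every transfer is assumed to take exactly the stated number of bus clocks, with no wait states and no refresh.

```python
def peak_rate_mb_s(bytes_per_transfer, bus_mhz, clocks_per_transfer):
    """Peak bus rate in MB/s (1 MB = 10**6 bytes).

    Assumes every transfer takes exactly clocks_per_transfer bus clocks,
    with no wait states and no refresh cycles stealing the bus.
    """
    return bytes_per_transfer * bus_mhz / clocks_per_transfer


# 16-bit ISA at 8 MHz, 3 clocks per transfer (figures from the summary above)
print(round(peak_rate_mb_s(2, 8.0, 3), 3))  # prints 5.333
```

Changing the clock count or the transfer width in the call is enough to try the other bus variants discussed in this thread.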

But this is a far cry from 1-2MB/s. From what I understand, moving data from, say, a SCSI card to a NIC would halve this again, because you have SCSI>>CPU>>NIC. I think bus mastering can in theory speed things up, but I can't find a good source on ISA bus mastering.

Benchmarks are usually slightly slower than a theoretical maximum sustained transfer rate, but I think I'm missing part of the picture. What are the other variable(s)?

Can anyone point me in the right direction?

Chuck(G)
February 26th, 2017, 12:59 PM
The best I've ever gotten from an 8 bit transfer was just shy of 1MB/sec. Double that for 16 bits.

What I've seen for ISA bus timings looks like this:

https://d3s.mff.cuni.cz/~ceres/sch/osy/text/isa-write-cycle.png

pearce_jj
February 26th, 2017, 01:35 PM
This is the book to answer your questions:

https://www.mindshare.com/Books/Titles/ISA_System_Architecture_(3rd_Edition)

DMA will do the job for 8-bit boards but the controllers never scaled up with bus speed, so quicker to go via CPU with 16-bit 8MHz systems.

mbbrutman
February 26th, 2017, 01:38 PM
From the IBM PC XT Technical Reference:


"Normal memory read and write cycles take four 210ns clocks for a cycle time of 840ns/byte.
Microprocessor-generated I/O read and write cycles require five clocks for a cycle time of 1.05us/byte.
DMA transfers require five clocks for a cycle time of 1.05us/byte."


So the bandwidth for memory reads and writes is 1.13 megabytes per second, which is based on an 8 bit bus at 4.77Mhz.

If you used I/O ports or DMA instead there would be a 20% penalty because of the fifth cycle. And of course this assumes that your devices can keep up with these speeds and not insert extra cycles.


From the IBM PC AT Technical Reference:

On an AT only 3 clock cycles are required for a bus transfer. At 6Mhz that means 2 bytes are transferred every 500ns. So that bandwidth is 3.8 megabytes per second. At 8Mhz that works out to around 5 megabytes per second.

There are complications; 8 bit operations to 8 bit devices take 6 clock cycles, not 3. 16 bit operations to 8 bit devices take 12 clock cycles. And the DMA controller operates at 3Mhz so anything it does takes 5 clock cycles.


And of course all of this depends on the code you are running. The instructions you are executing also take up bus cycles. So this favors using the "REP" prefix if available for an instruction because that allows you to keep the CPU from accessing the bus to read more instructions. And DMA refresh gets in the way too.

Chuck(G)
February 26th, 2017, 02:13 PM
Don't forget that if you're storing words to odd addresses, that eats cycles also.

I've read over and over that 5.33 MB/sec is the maximum, but I've never run into a peripheral that actually does this. As I mentioned, I can get a bit above 2MB/sec on an AT bus asserting 0WS and using REP INSW instructions, but could never get even close to 5.33 MB/sec. I've never tried bus mastering transfers, but I suppose that's another option.

Does anyone have any real-world examples?

eeguru
February 26th, 2017, 05:24 PM
It would be 3 cycles for an independent memory access - 2 for the access and 1 rest. That's in one direction. For example, you could never hit 5.33 MB/s doing rep insw on a true 286 with a direct coupled ISA bus. The result of the inport would still need to be stored back to RAM. The only place where a 5.333 rate would be possible outside of a VLB/cache type setting is during DMA where the I/O and MEM operations could share cycles.

Chuck(G)
February 26th, 2017, 05:33 PM
I assume that by "DMA" you don't mean the 8237 type--that's limited by the 8237 and is generally slower in 16-bit mode than programmed I/O. Are you talking about bus mastered DMA?

nc_mike
February 26th, 2017, 06:32 PM
Would anyone know the maximum sustained transfer rate with an Inboard/386 installed in a PC/XT? I know that the Inboard takes over from the base system BIOS shortly after boot. I've got an Inboard in my PC/XT with a 4MB RAM daughter card (5MB total), with most of it running as extended memory. I've also upgraded the CPU with a 133-pin compatible 486 running at 40MHz (33MHz effective, limited a bit by the standard oscillator).

Mike

mR_Slug
February 26th, 2017, 07:32 PM
Thank you all for your responses. I am reading that ISA book at the moment, so i still have a lot to understand. This is going to be a long post. This is as I understand it (so far):

The XT:
4.77MHz is ~210ns, 210ns = ~4.76Mhz

It takes 4 cycles to read or write 1 byte to/from RAM. (840ns)
It takes 5 cycles to read or write 1 byte to/from I/O. (1050ns)
It takes 5 cycles to read or write 1 byte Via DMA. (1050ns)

(1 byte x 4.77MHz)/4 = 1.1925MB/s, or (1 byte x 1000/210ns)/4 = 1.1905MB/s, so ~1.19MB/s
(1 byte x 4.77MHz)/5 = 0.954MB/s, or (1 byte x 1000/210ns)/5 = 0.952MB/s, so ~0.95MB/s

1.13 megabytes, i assume you mean 1.19? unless i missed something.
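
One possible reading of the difference, which is an assumption and not something confirmed in the thread, is that the 1.13 and 3.8 figures count a megabyte as 2^20 bytes while 1.19 counts it as 10^6 bytes. A quick sketch to compare the two conventions (the function name `rates_from_cycle_ns` is invented for this example):

```python
def rates_from_cycle_ns(bytes_per_transfer, cycle_ns):
    """Return (MB/s, MiB/s) for one transfer of the given size every cycle_ns."""
    bytes_per_second = bytes_per_transfer * 1e9 / cycle_ns
    # Decimal megabytes (10**6) versus binary megabytes (2**20)
    return bytes_per_second / 1e6, bytes_per_second / 2**20


# Cycle times quoted from the XT and AT Technical Reference figures above
cases = [
    ("XT memory cycle, 4 x 210 ns", 1, 840),
    ("AT at 6 MHz, 3 clocks", 2, 500),
    ("AT at 8 MHz, 3 clocks", 2, 375),
]
for label, size, ns in cases:
    mb, mib = rates_from_cycle_ns(size, ns)
    print(f"{label}: {mb:.3f} MB/s = {mib:.3f} MiB/s")
```

If the binary convention is what was meant, both sets of numbers describe the same bus timing.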

XT PIO transfer:
So if we are going to move data from a card (above port 254) to another card, using PIO (is that the correct term?)


Loop: mov dx, 378h ;Point at LPT1: data port (2 cycles*)
      in al, dx    ;Read byte from printer port. (5 cycles)
      mov dx, 278h ;Point at LPT2: data port (2 cycles*)
      out dx, al   ;Write byte in AL to ptr port. (5 cycles)
      jmp Loop     ;Go round again. (15 cycles, what!*)

>* I am assuming "dx, 378h" is equivalent to reg, reg. 2 cycles according to:
http://zsmith.co/intel_m.html#mov
15 cycles!
http://zsmith.co/intel_j.html#jmp
Code example based on:
https://courses.engr.illinois.edu/ece390/books/artofasm/CH06/CH06-4.html#HEADING4-144

So using this method we have 2+5+2+5+15 = 14 + 15 (seriously!). Let's just ignore the jump instruction. If we copy-paste the first four lines enough times, we can effectively optimize it out of the equation. So:
14 cycles, per byte moved.

So (1 byte x 1000/210ns)/14 = ~0.340MB/s Max transfer rate.

XT DMA:
Ok i am having problems with this one, lots of diagrams, no code, sounds complicated.

AT
8MHz is 125ns per cycle. Let's just stick with 8MHz for the time being.

It takes 3 cycles to read or write 2 bytes to/from I/O. (375ns)

(2 Bytes x 8MHz)/3 = 5.33 MB/s = (2 Bytes x 1000/125ns)/3

AT PIO:
So if we are going to move data from a card (above port 254) to another card, using PIO:


mov dx, 378h ;Point at LPT1: data port (2 cycles*)
in ax, dx ;Read word from printer port. (3 cycles)
mov dx, 278h ;Point at LPT2: data port (2 cycles*)
out dx, ax ;Write word in AX to ptr port. (3 cycles)

>* Again I am assuming dx, 378h is equivalent to reg, reg. 2 cycles according to:
http://zsmith.co/intel_m.html#mov

So using this method we have 2+3+2+3 = 10 cycles, per word moved.
So ((2bytes x 1000)/125ns)/10 = 1.600MB/s Max transfer rate.

a 286/386 at 16MHz (bus is just half speed 8MHz)
OK, now we have basically a double-speed CPU, so most instructions would be twice as fast. But any instruction that accesses the 8MHz ISA bus should take the same time in ns, right? So:



                                                ISA:         CPU (2x ISA):
mov dx, 378h ;Point at LPT1: data port          (1 cycle)    (2 cycles)
in ax, dx    ;Read word from printer port.      (3 cycles)   (6 cycles)
mov dx, 278h ;Point at LPT2: data port          (1 cycle)    (2 cycles)
out dx, ax   ;Write word in AX to ptr port.     (3 cycles)   (6 cycles)

I hope that makes sense. The CPU does the mov instruction (2 CPU cycles), within the time one ISA bus cycle has elapsed.

So using this method we have 1+3+1+3 = 8 cycles, per word moved.
So ((2bytes x 1000)/125ns)/8 = 2.000MB/s Max transfer rate.

32MHz CPU:

we have 0.5+3+0.5+3 = 7 cycles, per word moved.
So ((2bytes x 1000)/125ns)/7 = 2.286MB/s Max transfer rate.

64MHz CPU
we have 0.25+3+0.25+3 = 6.5 cycles, per word moved.
So ((2bytes x 1000)/125ns)/6.5 = 2.462MB/s Max transfer rate.
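
The same estimate can be written as one small function. This is a sketch of the model used in this post only, not of measured timings: it assumes each MOV costs 2 CPU clocks, each IN/OUT costs 3 bus clocks regardless of CPU speed, and the loop overhead is ignored. The name `pio_port_to_port_mb_s` is made up for the example.

```python
def pio_port_to_port_mb_s(cpu_to_bus_ratio, bus_mhz=8.0):
    """Estimated port-to-port PIO rate in MB/s for a 16-bit bus.

    cpu_to_bus_ratio is the CPU clock divided by the bus clock
    (1 for an 8 MHz AT, 2 for a 16 MHz CPU on an 8 MHz bus, ...).
    """
    mov_bus_cycles = 2 / cpu_to_bus_ratio  # a 2-clock MOV shrinks in bus-clock terms
    io_bus_cycles = 3                      # IN/OUT stay pinned to the bus clock
    cycles_per_word = 2 * mov_bus_cycles + 2 * io_bus_cycles
    return 2 * bus_mhz / cycles_per_word   # 2 bytes moved per pass


for ratio in (1, 2, 4, 8):
    print(f"{ratio * 8} MHz CPU: {pio_port_to_port_mb_s(ratio):.3f} MB/s")
```

Under these assumptions the rate can never exceed (2 x 8MHz)/6 = 2.667MB/s however fast the CPU gets, because the two I/O accesses alone take 6 bus cycles per word.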


I think I'm on the right track, the figures look right, but that could just be coincidence.

Mike, does the Inboard operate at 16MHz with the bus still at 4.77MHz? If so, I think this is correct:
4.77MHz / 16MHz = ~0.3, so the CPU should be able to perform each mov instruction (previously 2 bus cycles) in 0.6 bus cycles, so:

0.6+5+0.6+5 = 11.2
So (1 byte x 1000/210ns)/11.2 = ~0.425MB/s Max transfer rate.

Of course, if anyone finds any errors in my calculations, please please let me know.

njroadfan
February 26th, 2017, 08:02 PM
Bus mastering on ISA cards wasn't really supported. It was a hackjob at best (multiple bus masters should be avoided), but Adaptec and other SCSI adapter makers figured out how to do it. Using DMA, 2.5MB/sec was observed over an ISA SCSI card (vs. 1MB/sec with PIO IDE). See: http://www.os2museum.com/wp/booting-is-hard/

EISA (and MCA for that matter) fixed all these problems, as the bus explicitly supports multiple bus masters out of the box. Some ISA SCSI cards apparently supported enhanced bus mastering DMA functions in EISA systems as well. Hardly anything outside of the floppy controller and soundcards used the 8237 for DMA; it's just too damned slow.

Steve Gibson's article was grossly misguided though. The ISA bottleneck was quite apparent with video cards by 1993 and plenty of VLB cards readily crushed their ISA equivalent in benchmarks. 5400rpm hard drives became much more common in 1993 as well.

Chuck(G)
February 26th, 2017, 09:21 PM
mR_Slug,

It's very rare to go I/O port to I/O port. I/O is usually I/O space to memory or vice versa. This is where the 80186/286/V20 I/O variants come handy. You can do a REP INSB/INSW and do a whole bunch of accesses with one instruction.

There are other work-arounds. Consider the XTIDE and the "Chuck mod". Since the ATA protocol uses 16-bit transfers, the XTIDE latches one byte and saves it at a different I/O address. If we re-arrange the I/O port mapping, we can use the 8088 BIU to do an operation to 2 I/O ports with one instruction. If you're using a V20, it's even better, because you can issue a single REP INSW to do the whole operation, which should be the upper limit on 8-bit I/O.

All of this implies that you have some sort of buffered I/O, so you don't have to check for data available. If it's a loop-on-data-not-ready, input when ready, all bets are off.

I have a bus-mastering NIC from Ansel. It's 10BaseT, so I don't think that it matters much--it uses the AMD LANCE chip.

pearce_jj
February 27th, 2017, 04:21 AM
In terms of upper limit - we can go twice as fast with DMA as with REP INSW, at least at peak rate, since data is driven directly from the IO device to memory (not copied through the CPU). The overhead of configuring the controller does reduce the real gains, though. Even so, in terms of disk performance with a 4.77MHz V20, we are talking about 400KB/s max with PIO and over 550KB/s with DMA.

Chuck(G)
February 27th, 2017, 08:58 AM
James, yes, 8-bit DMA is faster, but we've got two (at least!) separate discussions going here.

16-bit DMA is slower than 16-bit programmed I/O. There, you're dealing with the issue of what amounts to a 16-bit architecture shoehorned into two 8-bit DMA chips that have their roots in the 8085 era.

Using the OP's programmed I/O example, you can't do programmed 8-bit transfers faster than REP INSW. DMA is a different story. Of course, calling REP INSW over an 8 bit bus is a little bit of a chimera--to the casual observer, it appears to be 16 bit I/O, but is done over an 8 bit bus.

No matter how you cut it, the 8 bit DMA transfer limit does create problems and can be more complex than it would first appear. For example, I have a system here where I run 3 floppy controllers simultaneously (multithreaded) each controller has its own port, IRQ and DMA channel. You can write three 2D floppies at the same time, but not three HD ones, no matter how you program the 8237. At first blush, this would not seem to be a problem, as the HD data rate is 500Kbit/sec., which works out to be 62.5KB/sec., so three controllers would be a very moderate 187.5KB/sec total. But it won't work--you'll get "lost data" errors every time. I'm not sure why this happens, but it isn't a function of the CPU.
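
For reference, the data-rate arithmetic above works out as follows. This is just a sketch of the numbers quoted in this post; the helper names are invented here, and the per-byte interval is included only as a reminder that an average rate says nothing about how quickly each individual DMA request has to be serviced. It is not offered as an explanation for the "lost data" errors.

```python
HD_BIT_RATE = 500_000  # HD floppy data rate in bits per second, as quoted above


def floppy_kb_per_s(bit_rate, controllers=1):
    """Aggregate byte rate in KB/s (1 KB = 1000 bytes) for identical controllers."""
    return bit_rate / 8 / 1000 * controllers


def microseconds_per_byte(bit_rate):
    """Average interval between bytes from a single controller."""
    return 8 * 1_000_000 / bit_rate


print(floppy_kb_per_s(HD_BIT_RATE))      # one controller
print(floppy_kb_per_s(HD_BIT_RATE, 3))   # three controllers running together
print(microseconds_per_byte(HD_BIT_RATE))
```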

reenigne
February 27th, 2017, 12:27 PM
<off-topic>


Bus mastering on ISA cards wasn't really supported. It was a hackjob at best (multiple bus masters should be avoided), but Adaptec and other SCSI adapter makers figured out how to do it.

Interesting - just from looking at what signals are exposed on the ISA bus, I didn't think it was possible at all! Do you have a link to any details about how it was done?

</off-topic>

What follows applies to the 8-bit (PC/XT) variant of the ISA bus rather than the 16-bit (AT) variant.

I haven't played with it much, but my understanding is that the 8237 DMA controller in block transfer mode normally takes 4 cycles per byte rather than 5. This chip also has a "compressed timing" command which reduces most transfers to 2 cycles per byte (3 when there's a change to address bits 8-15). On a 4.77MHz machine, this would put the theoretical transfer limit at 2.38MB/s. There may be practical considerations which reduce that a bit, though.
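
As a rough sketch of those numbers: the snippet below assumes a sequential transfer, so that address bits 8-15 change once every 256 bytes and only that one transfer in 256 takes the extra cycle. That averaging is my assumption for illustration, and the function name is made up.

```python
XT_CLOCK_MHZ = 4.77


def dma_8237_mb_s(cycles_per_byte, clock_mhz=XT_CLOCK_MHZ):
    """Theoretical 8-bit DMA rate in MB/s for a given cycle count per byte."""
    return clock_mhz / cycles_per_byte


# Compressed timing: 2 cycles for most bytes, 3 when address bits 8-15 change.
# Assumed here: a sequential block, so that happens once per 256 bytes.
compressed_avg_cycles = (255 * 2 + 1 * 3) / 256

print(f"5 cycles/byte:     {dma_8237_mb_s(5):.3f} MB/s")
print(f"4 cycles/byte:     {dma_8237_mb_s(4):.3f} MB/s")
print(f"compressed timing: {dma_8237_mb_s(compressed_avg_cycles):.3f} MB/s")
```

The occasional third cycle barely moves the average, which is why the 2.38MB/s figure is a fair summary of the theoretical limit.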

gslick
February 27th, 2017, 12:55 PM
<off-topic>

Interesting - just from looking at what signals are exposed on the ISA bus, I didn't think it was possible at all! Do you have a link to any details about how it was done?

</off-topic>



ISA bus master transfers, as used for example by an Adaptec 1542 SCSI controller, are implemented by programming the DMA controller channel in Cascade Mode. Then the add-in controller can drive the address lines on the bus instead of the DMA controller.

There should be lots of information on the net with details of how Cascade Mode works with the DMA controller.

Here's one link I found with a very quick search:
https://docs.freebsd.org/doc/2.1.7-RELEASE/usr/share/doc/handbook/handbook248.html

Chuck(G)
February 27th, 2017, 02:49 PM
What follows applies to the 8-bit (PC/XT) variant of the ISA bus rather than the 16-bit (AT) variant.

I haven't played with it much, but my understanding is that the 8237 DMA controller in block transfer mode normally takes 4 cycles per byte rather than 5. This chip also has a "compressed timing" command which reduces most transfers to 2 cycles per byte (3 when there's a change to address bits 8-15). On a 4.77MHz machine, this would put the theoretical transfer limit at 2.38MB/s. There may be practical considerations which reduce that a bit, though.

I have never experienced, nor have been able to design a peripheral using 8237 8-bit DMA that does better than about 1MB/sec. Just doesn't exist. Probably because, as James noted, a DMA transfer involves moving between I/O space and memory.

AlexC
February 27th, 2017, 04:06 PM
I don't know if this helps, but I have a 10MHz NEC V20 XT clone with an Orchid EMS card plugged into an 8-bit ISA slot and I've been messing around with memory timings. Best I can get, as reported by QEMM's Manifest, is 1,075KB/sec. The EMS board is set to zero wait state and has 70ns SIMMs installed. There's probably some overhead with the driver, etc.

I don't know if Manifest measures anything useful, but if so this tends to support a real-world limit over 8-bit ISA of around 1MB. For reference, main system board RAM on this machine is clocked at around 1.4MB/sec in Manifest.

Chuck(G)
February 27th, 2017, 05:06 PM
<sideways topic>
Instead of using the old, slow 8237/8257 DMAC intended for the 8085 family, I wish Intel would have come out with a real DMA controller such as that integrated into the 80186. No "64K" boundary issues; go from I/O port to I/O port, or memory-to-memory with no problems. Full 20 bit address capability.

The 8089 wasn't it. It was what amounts to a separate processor with its own instruction set and very expensive at that. Applicability past the 8086 is doubtful.

But then, Intel was very slow in getting a range of 16-bit capable peripheral chips for the x86 platform.

As a side note, the 80186 belongs to a different generation and the DMA speed there (for a 10 MHz clock) is quoted at 1.25M (bytes for 80188 or words for 80186) per second.

</sideways topic>

mR_Slug
February 27th, 2017, 06:18 PM
With regard to the input and output cards, yes, I am looking at it from the perspective of buffered I/O. Specifically, however fast you read data from the input card, it will refill its buffer with another word. The output card can also be written to at any speed. This eliminates the issues that Chuck(G) mentioned with loop-on-data-not-ready, input when ready, etc.

I will have to check out the "Chuck mod" and the REP INSW instructions. I'm not sure I understand what the "compressed timing" command is.

reenigne, the book linked to by pearce_jj has a section on bus mastering, also available here:
https://archive.org/details/ISA_System_Architecture

DMA (as performed by the 8237, NOT bus-mastering DMA)
AT:
@8MHz, the DMA controller operates at 4MHz. The clock-cycle time is 250ns (1000/4MHz). "All DMA data-transfer bus cycles are 5 clock cycles...or 1.25 microseconds" -AT tech ref. i.e. 1.25us = 1250ns, 1250ns/5 = 250ns

One ISA bus cycle at 8MHz is 125ns. So in terms of ISA bus cycles @ 8MHz, one DMA cycle takes the same time as 2 ISA bus cycles. 5 DMA cycles is 1250ns, which is 10 ISA bus cycles, right?

I can't find any assembly for programming the 8237, but AFAIK counting the instructions is irrelevant anyway, as it is all setup. The 8237 does the transfer, and we know this takes 5 DMA cycles, or 10 ISA bus cycles, per word moved.

From the ISA System Architecture book, there are 4 modes: Single Transfer Mode, Block Transfer Mode, Demand Transfer Mode and Cascade Mode. Block Transfer Mode, if I understand correctly, is the fastest (theoretically). Let's say the DMA controller is set up to initiate a block transfer, and it never stops. This will block memory refresh, so it won't actually work on a 286. However, I'm trying to keep this as simple as possible, and it's sufficient to find the upper limit:

So ((2bytes x 1000)/125ns)/10 = 1.600MB/s Max transfer rate with DMA on an 8MHz AT, with no RAM refresh.

XT:
It takes 5 cycles to read or write 1 byte Via DMA. (1050ns)

Using the same setup as for the AT, (1 byte x 1000/210ns)/5 = ~0.952MB/s Max transfer rate.
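
Both DMA estimates follow the same pattern, so here is a minimal sketch of it. It assumes, as above, 5 DMA clocks per transfer, a never-ending block transfer, and no refresh; `dma_rate_mb_s` is a name made up for this example.

```python
def dma_rate_mb_s(bytes_per_transfer, dma_clock_ns, dma_clocks=5):
    """Upper-limit 8237 DMA rate in MB/s, ignoring refresh and setup time."""
    transfer_ns = dma_clocks * dma_clock_ns
    return bytes_per_transfer * 1000 / transfer_ns


# AT: 8 MHz bus, DMA controller at 4 MHz (250 ns clock), 16-bit transfers
at_rate = dma_rate_mb_s(2, 250)
# XT: 4.77 MHz bus (210 ns clock), 8-bit transfers
xt_rate = dma_rate_mb_s(1, 210)

print(f"AT: {at_rate:.3f} MB/s ({5 * 250 // 125} ISA bus cycles per word)")
print(f"XT: {xt_rate:.3f} MB/s")
```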

<side note>
AlexC gets 1,075KB/sec on 10MHz XT system (10MHz bus?)
(1 byte x 10MHz)/5 = ~2.00MB/s Max transfer rate. So it looks like my DMA calculations are either way off, the bus is slower, or some other factor?
</side note>

PIO mode I/O port to memory:
XT:
I can't understand the memory timing of "9+EA" for the 86/88. Can anyone explain it?

AT:
(Note, I am not well versed in assembly)


mov dx, 378h ;Point at LPT1: data port (2 cycles)
in ax, dx ;Read word from printer port. (3 cycles)
mov [1000h], ax ;Store the word at address 1000h. (3 cycles*)
*mem,reg is 3, reg,mem is 5 (so memory to register is slower, didn't know that)
http://zsmith.co/intel_m.html#mov

I had originally added instructions to increase the memory address; however, let's just say that 1000h is a memory-mapped peripheral/card. That is, you write a word and the card transmits it immediately. You can then write to the same address again.

The first instruction is setup, so really the last two are all that's needed, giving 6 cycles, per word moved.

So ((2bytes x 1000)/125ns)/6 = 2.667MB/s Max transfer rate.

If the CPU speed is increased, then as far as I can tell this won't increase the transfer speed, because the memory address 1000h is on the ISA bus. If we are talking ISA to real memory, we have to include an increment (say 4 cycles) of the memory address used in the MOV instruction, which makes it slower than 2.667MB/s on an AT. But if the CPU/RAM were 10 times faster, then those additional 4 cycles and the 3 for the actual memory access could occur in 1/10 of the time, e.g.:

3 cycles (the IN instruction) + (3+4)/10 = 3 + 0.7 = 3.7 AT bus cycles.

So, ((2bytes x 1000)/125ns)/3.7 = 4.324MB/s Max transfer rate.
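
Put as a function, with the same guessed cycle counts (3 for IN, 3 for the store, 4 for the address increment), the estimate looks like this. It is only a sketch of my model above, and `io_to_memory_mb_s` is a made-up name.

```python
def io_to_memory_mb_s(cpu_speedup, bus_mhz=8.0,
                      in_cycles=3, store_cycles=3, increment_cycles=4):
    """Estimated ISA-port-to-RAM rate in MB/s for 16-bit transfers.

    The IN instruction is tied to the ISA bus clock; the store and the
    address increment are assumed to scale with CPU/RAM speed.
    """
    bus_cycles = in_cycles + (store_cycles + increment_cycles) / cpu_speedup
    return 2 * bus_mhz / bus_cycles


print(f"{io_to_memory_mb_s(1):.3f} MB/s")   # plain 8 MHz AT
print(f"{io_to_memory_mb_s(10):.3f} MB/s")  # CPU/RAM 10 times faster
```

With these numbers the result approaches but never reaches the 5.33MB/s bus limit, since the IN instruction alone uses 3 bus cycles per word.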

I think I'm starting to understand this now; cue a post explaining that I haven't :-)
Sorry for the long post.

AlexC
February 27th, 2017, 06:29 PM
<side note>
AlexC gets 1,075KB/sec on 10MHz XT system (10MHz bus?)
(1 byte x 10MHz)/5 = ~2.00MB/s Max transfer rate. So it looks like my DMA calculations are either way off, the bus is slower, or some other factor?
</side note>

I don't know if the bus speed is 10MHz, only that the CPU runs at that speed (or perhaps 9.54 since it's a turbo XT, so 2x 4.77?).

There could be several other factors involved. As noted, I don't know how accurate Manifest is, though I'd be slow to criticize Quarterdeck's coding since they did some very clever stuff with memory. But the EMS card itself may well have limitations.

Since the RAM-to-CPU speed is only measured at 1.4MB/sec, I guess that defines an upper limit for performance on this particular machine.

<yet another side note>
Incidentally, the reasoning in this thread is why I came to the conclusion some time ago that it's not worth using a software disk cache on an XT machine. The extra overhead of transferring the data over the 8-bit bus to an EMS card negates any performance benefit. In all my tests, using different caches including Norton's NCACHE2, I've never seen an overall performance increase greater than about 1% from using a cache, and usually there's actually a decrease. Buffers in main system RAM help, but not a cache using memory on an add-in card.
</yet another side note>

eeguru
February 27th, 2017, 07:22 PM
I'm not sure what the ultimate goal is - beyond just a simple understanding. But if it is to improve any performance for upcoming hardware designs, I can say on a 4.77 PCjr with memory mapped I/O and rep movsw (cx=0100h) each sector transfer, I get about 400 KB/s with BIOS routines optimized a bit beyond Tomi's original implementation. That is about 50% of the theoretical (counting 3 cycles from IDE and 3 cycles back to memory) which includes DOS, BIOS, and unrelated interrupt overhead per sector (eg, systick, dram refresh, etc). I'd say that's pretty good. And as James points out, he gets a bit better than that using DMA (for 8-bit reasons already discussed).

njroadfan
February 27th, 2017, 07:24 PM
The ISA bus runs at the same speed as the CPU in Turbo XTs and ATs. Faster machines decoupled it and standardized on 8.33Mhz (thank EISA for codifying that).

reenigne
February 28th, 2017, 01:08 AM
Probably because, as James noted, a DMA transfer involves moving between I/O space and memory.

I think that's not quite right (or at least confusingly worded). While the 8237 asserts IOW at the same time as MEMR (and correspondingly IOR for MEMW), the motherboard does not translate this into an IO port access. It can't, because the address on the bus is a memory address, not an IO port address. So the device that is doing the DMA just reads/writes the data from/to the data bus directly (and knows to do so because it receives the DACK signal for the DMA channel it's using) - there's no IO port space access because there's no second bus cycle with an address in the IO port space.

Nit-picking, I know, but I have seen this distinction cause some confusion in the past.

eeguru
February 28th, 2017, 09:14 AM
The ISA bus runs at the same speed as the CPU in Turbo XTs and ATs. Faster machines decoupled it and standardized on 8.33Mhz (thank EISA for codifying that).

I'm quite certain 8.33 was never a 'standard' in the ISA world - with the exception of the much later LPC bus. 8.33 came from 25 MHz 386's dividing CLK2 by 6 and 33 MHz CLK2 by 8. So it was common for those systems when they debuted. But among cheap clone boards, the ISA bus ran at whatever speed the board or chipset designers could easily divide from existing clocks up to and including 8.5. Many in the 486 DXn era went back to 8.00. And I have several later boards that let you choose a half dozen frequencies.

8 is the most common that I've seen.
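
To show where 8.33 falls out of those dividers, here is a quick sketch. On the 386, CLK2 runs at twice the CPU frequency; the nominal "33MHz" part is taken here as 33.33MHz, which is my assumption, and the function name is invented.

```python
def isa_clock_mhz(cpu_mhz, clk2_divider):
    """ISA bus clock derived by dividing the 386 CLK2 input (2 x CPU clock)."""
    clk2_mhz = 2 * cpu_mhz
    return clk2_mhz / clk2_divider


print(f"25 MHz 386, CLK2/6: {isa_clock_mhz(25.0, 6):.2f} MHz")
print(f"33 MHz 386, CLK2/8: {isa_clock_mhz(100 / 3, 8):.2f} MHz")
```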

Chuck(G)
February 28th, 2017, 11:07 AM
I think that's not quite right (or at least confusingly worded). While the 8237 asserts IOW at the same time as MEMR (and correspondingly IOR for MEMW), the motherboard does not translate this into an IO port access. It can't, because the address on the bus is a memory address, not an IO port address. So the device that is doing the DMA just reads/writes the data from/to the data bus directly (and knows to do so because it receives the DACK signal for the DMA channel it's using) - there's no IO port space access because there's no second bus cycle with an address in the IO port space.

Nit-picking, I know, but I have seen this distinction cause some confusion in the past.

I know it's confusing, but the data to be transferred has to be placed on the bus when DREQ is asserted, then the DMAC places the destination address on the bus and asserts MEMW in the case of a read-from-device into memory operation. It's a bit of a ballet and I suspect that there's lost motion in that dance.

eeguru
February 28th, 2017, 11:59 AM
I know it's confusing, but the data to be transferred has to be placed on the bus when DREQ is asserted, then the DMAC places the destination address on the bus and asserts MEMW in the case of a read-from-device into memory operation. It's a bit of a ballet and I suspect that there's lost motion in that dance.

Don't you mean DACK?

Chuck(G)
February 28th, 2017, 01:15 PM
Don't you mean DACK?

It depends on your point of view. A device with data asserts DREQ and waits for a DACK, then puts the data on the bus until DACK drops. But it's DREQ that starts the whole chain of events. You don't get to DACK without DREQ. In any case IOR is asserted after DACK, then the data appears on the bus.

pearce_jj
February 28th, 2017, 10:45 PM
Re bus speeds - wait states should be added for 8-bit boards to provide a total cycle time of around 800ns; these can be eliminated by asserting /ZWS (pin B8).

Trixter
March 1st, 2017, 09:04 AM
<yet another side note>
Incidentally, the reasoning in this thread is why I came to the conclusion some time ago that it's not worth using a software disk cache on an XT machine. The extra overhead of transferring the data over the 8-bit bus to an EMS card negates any performance benefit. In all my tests, using different caches including Norton's NCACHE2, I've never seen an overall performance increase greater than about 1% from using a cache, and usually there's actually a decrease. Buffers in main system RAM help, but not a cache using memory on an add-in card.
</yet another side note>

This is 99% correct and mirrors my own observations. The 1% use case that justifies a write-back cache is when you are doing a lot of seek-heavy operations, like deleting 100 files out of a subdirectory; for that, a write-back cache with delayed or background writes definitely helps. On my own 8088 system, I have batch files that load/unload ncache2 that I use when I know I'm going to do a ton of directory operations -- even with the load/unload overhead, it is still a net win.

Chuck(G)
March 1st, 2017, 09:10 AM
Depends on the I/O device, in my experience. Track caching/buffering of floppy drives can substantially increase performance--at the expense of reliability if write buffering is used.

AlexC
March 1st, 2017, 12:38 PM
I have batch files that load/unload ncache2 that I use when I know I'm going to do a ton of directory operations -- even with the load/unload overhead, it is still a net win.

Yes, that makes sense. I do something similar when wiping large directories. But other than that I prefer not to use write caching. As Chuck(G) hints, reliability is an issue when it's enabled. I lost too much data that way back in the day, though fortunately nothing irreplaceable.

pearce_jj
March 1st, 2017, 12:48 PM
One odd thing is quite how bad Windows 10 is with USB connected flash media, including CompactFlash cards. It's actually materially quicker to do file operations on a 486 with CompactFlash, rather than pull the card and stick it in a modern machine. I suspect sometimes even an XT actually, provided of course no ZIP operations are involved.

BUFFERS=99 is basically all you need on an XT, to enable DOS to do its thing with partial sector transfers.

AlexC
March 1st, 2017, 12:55 PM
BUFFERS=99 is basically all you need on an XT, to enable DOS to do its thing with partial sector transfers.

Interesting. I've found that with anything more than about 40 buffers, performance tends to deteriorate, as it does with fewer than about 25. Not sure why, or whether it's a quirk of my machines. 30 seems the sweet-spot for them.

I also noticed a huge disk performance improvement between DOS 3.30 and 5.00 - that extra 8KB or so of system RAM is put to good use.

And... apologies to the OP as this has wandered off-topic a little.

Trixter
March 1st, 2017, 02:02 PM
Interesting. I've found that with anything more than about 40 buffers, performance tends to deteriorate, as it does with fewer than about 25. Not sure why, or whether it's a quirk of my machines. 30 seems the sweet-spot for them.

The sweet spot is one track's worth, which will vary from drive to drive. You're going to do a single read from a disk, you might as well grab the entire track while you're at it. Too many buffers means that, for a small read, you're reading multiple tracks, which is multiple seeks. On my ST-225, I'd use BUFFERS=17 (ST-225 has 17 sectors per track).

I could be wrong; corrections welcome.


And... apologies to the OP as this has wandered off-topic a little.

(To get it back on track) To the OP: Why do you want to know this info? Building a board?

pearce_jj
March 1st, 2017, 10:32 PM
DOS buffers are used only in partial cluster transfers.

mR_Slug
March 3rd, 2017, 09:03 PM
Why do you want to know this info? Building a board?

No, I'm not really intending to build a custom board. I would kinda like to shoe-horn an 8MHz 8086-2 into an XT motherboard, as AFAIK the chip was available in '83. Then I would have the *fastest* XT available in 1983:-) I guess you would need to implement the bus conversion logic externally. It's not something I'm seriously considering though.

My main reason is sheer interest. I've seen many sources simply state "typically 1-2MB/s", but with no explanation. Benchmarks are a good practical way to determine real-world speed, but that gap between 5.33MB/s and 2MB/s has always interested me.

From a theoretical standpoint, if you were to take an XT, with a max transfer rate of ~1MB/s, It is just about possible to pass, for example DVD video through it. However, if the maximum rate were only 0.5MB/s, it just isn't possible for a normal DVD bit-rate.
<side note>
BTW Trixter, your demos are just phenomenal. I don't know how you even plan something like those.
</side note>

A more practical example would be determining if a given configuration of a late '80s 286/386 system is likely to have any bottlenecks. With a fast ESDI hard drive and a 10Mb/s NIC, say both at about 1.25MB/s, the bus starts to become a bottleneck. I read somewhere about disabling wait states on the ISA bus, and there was also the ubiquitous 0ws memory in late '80s systems. In addition there were also systems with 10MHz buses. Put all that together and you can double the transfer rate.

Instead of writing long posts, I have written a page with some examples. It's not finished yet; I've still got to do DMA and understand that Chuck mod, and I think some other bits and pieces:
http://108.59.254.117/~mR_Slug/ISA/ (requires javascript)

Please check it out, if you find any errors let me know. My knowledge of asm is minimal. The most I've done is try to write a character device driver for the tape port on the PC...unsuccessfully, in a response to a thread on here. Learned quite a bit about the request header though and segment addressing.

Chuck(G)
March 3rd, 2017, 10:01 PM
There were 8086-based XT compatibles in 1983. Consider, for example, the Stearns PC, running an 8086 with a full 16-bit bus.

pearce_jj
March 4th, 2017, 12:20 AM
I have written a page with some examples. Not finished yet, still got to do DMA and understand that chuck mod, and I think some other bits and pieces:
http://108.59.254.117/~mR_Slug/ISA/ (requires javascript)

It does DMA already, just set the CPU cycles and IO cycles accordingly, i.e. 0 and 5 respectively (demand mode) gives 0.954MB/s.
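For what it's worth, that figure is just the bus clock divided by the clocks per byte. A minimal sketch, assuming the nominal 4.77MHz XT clock and one byte per 5-clock DMA cycle (the function name is only for illustration):

```python
XT_CLOCK_HZ = 4_770_000  # nominal XT bus clock


def dma_rate(clock_hz, clocks_per_byte):
    """Bytes per second when one byte moves every `clocks_per_byte` clocks."""
    return clock_hz / clocks_per_byte


rate = dma_rate(XT_CLOCK_HZ, 5)
print(f"{rate / 1_000_000:.3f} MB/s")
```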

This might help you: https://www.lo-tech.co.uk/xt-cfv3-dma-transfer-mode/

Re Chuck Mod - this is specific to the original XTIDE. The design used the standard ATA port addressing, i.e. A0, A1 and A2 on the ATA interface were connected directly to their corresponding address lines on the ISA bus. This creates a problem, since a peculiarity of the ISA bus is that it's kind-of three dimensional with the introduction of 16-bit ISA, inasmuch as a 16-bit transfer performed at base+0h will not assert A0 for the high byte. Or put another way, a 16-bit device cannot be interrogated by an 8-bit interface without a MUX. Therefore, the design used A3 to operate the MUX, so the high byte of a 16-bit IO read (triggered by a read via base+0h) was stored on the card and could then be fetched by another read from base+8h.

However this meant that the board ran slow because the coding looked something like:

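mov cx, 256 ; 256 words = one 512-byte sector
.TransferLoop:
in al, dx ; low byte from base+0h (high byte is latched on the card)
stosb
add dx, 8 ; point at base+8h
in al, dx ; fetch the latched high byte
sub dx, 8 ; back to base+0h
stosb
loop .TransferLoop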

By changing the hardware so that the mux is operated by A0, we can instead use a trick in that a 16-bit port IO will be converted to two 8-bit transfers by the bus interface AND the second transfer asserts A0. Therefore the code can be streamlined a lot, on a V20 to this:

mov cx, 256
rep insw

Far less code and the slow jumps are eliminated, so transfer speed increases a lot.
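If it helps to see why moving the mux from A3 to A0 matters, here is a toy model. It is only a sketch under my own simplifications: the class and function names are made up, only the data port and the latch are modelled, and any other offset just raises an error (on the real card it would hit a different register).

```python
class LatchedIdeCard:
    """Toy 8-bit card in front of a 16-bit ATA data register.

    mux_bit is the ISA address bit that selects the latched high byte:
    3 models the original wiring, 0 models the modified wiring.
    """

    def __init__(self, words, mux_bit):
        self.words = list(words)
        self.latch_offset = 1 << mux_bit
        self.high_latch = 0

    def read8(self, offset):
        """One 8-bit I/O read at base+offset."""
        if offset == 0:
            word = self.words.pop(0)      # 16-bit read from the drive
            self.high_latch = word >> 8   # high byte held on the card
            return word & 0xFF            # low byte goes onto the bus
        if offset == self.latch_offset:
            return self.high_latch
        raise ValueError(f"offset {offset:#x} is not modelled")


def bus_read16(card, offset):
    """An 8-bit bus interface splits a word read into reads at offset and offset+1."""
    low = card.read8(offset)
    high = card.read8(offset + 1)         # second access has A0 set
    return low | (high << 8)


data = [0x1234, 0xBEEF]

# Original wiring: software has to fetch the high byte from base+8h itself.
original = LatchedIdeCard(data, mux_bit=3)
by_hand = [original.read8(0) | (original.read8(8) << 8) for _ in data]

# Modified wiring: a plain word read at base+0h picks up both bytes.
modified = LatchedIdeCard(data, mux_bit=0)
by_word = [bus_read16(modified, 0) for _ in data]

print([hex(w) for w in by_hand], [hex(w) for w in by_word])
```

Both paths recover the same words, but only the A0 wiring lets a single word-sized port read do it, which is what makes the string instruction usable.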

Hope that helps.

eeguru
March 4th, 2017, 06:16 AM
Here is another way to increase performance on a stock 8088 using memory transfers so the trick above can be done with 'rep movsw' (since 'rep insw' isn't available).

https://www.retrotronics.org/jride

Scali
March 4th, 2017, 07:02 AM
Here is another way to increase performance on a stock 8088 using memory transfers so the trick above can be done with 'rep movsw' (since 'rep insw' isn't available).

https://www.retrotronics.org/jride

It's an elegant solution, and great for a PCjr (it does not have a DMA controller).
However, it won't be as fast as the DMA approach on an 8088 system.
The rep insw-trick would probably be exactly as fast as rep movsw on any CPU that can actually outperform the DMA controller, which would be a 286 or better.
So I don't think memory-mapping would have any benefit for the regular XT-IDE.

flashedbios2012
March 4th, 2017, 08:14 AM
I don't know for sure but I am going to guess it's 100Mb. After all, they made ethernet cards for ISA, and they wouldn't have done so if the card exceeded the bus speed; they wouldn't have built hardware with a bottleneck limitation like that.

Chuck(G)
March 4th, 2017, 08:39 AM
There are darned few 100BaseT ISA cards, which exist for the sake of compatibility. At best, they deliver about twice the throughput of a 10BaseT ISA card. Nowhere near 100Mbps (that's "bits" not "bytes" by the way) network speed.

mR_Slug
March 8th, 2017, 12:52 PM
http://108.59.254.117/~mR_Slug/ISA/ (requires JavaScript)

Ok, I have added the DMA section. It is in a separate box to the CPU/bus section, as it is not really related to the CPU.

I have been looking at bus-mastering, and from what I can tell, if the 8237A is put into cascade mode, the add-in DMA-controller/Bus-mastering-controller on a card (Say for example an 8MHz 82C37A on an XT-IDE), should be able to operate at any speed. You are of course limited by the system RAM speed, and refresh.

As a thought experiment, consider if part of the system RAM were on an ISA card, and it were several orders of magnitude faster than the typical speed in a PC/XT/AT.

As I understand it, there is nothing to stop you using an 8MHz DMA controller on a bus that normally runs at 4.77MHz (when the CPU has control of it), because the bus speed is now controlled by the DMA controller.

Theory of operation:

As I understand it, The master (motherboard) DMA controller is programmed in cascade mode, such that external DMA controller (on the card) takes over control of the bus:


The CPU is effectively switched off, HLDA (hold acknowledge) asserted.
The master (motherboard) DMA controller, is now just waiting for the cascaded DMA controller (bus master on the card) to assert its DRQ line. That is, it is acting solely as a switch, and that switch only needs to be asserted when the DMA transfer is complete.
The external DMA controller can (in theory) operate at any speed, 2x4.77MHz, 8MHz or 90MHz. There is no requirement/need for it to even be in sync with the bus speed set by the CPU.

Synchronization does make servicing memory refresh much simpler. (a double speed DMA controller just needs to add a wait state between each memory access).

At very high speeds, such as 90MHz, there are likely to be other issues e.g. EMI etc. However there is nothing in the design of the cascaded DMA controller approach, that limits the speed the DMA controller operates at.


For example, the Adaptec AHA-1540B (c1990) *can* operate at 8.0MB/s; see page 10, http://108.59.254.117/~mR_Slug/ISA/aha1540b_um.pdf

The popular C&T 82C206 Integrated Peripheral controller (c1986) has an 8MHz 8237A controller. As far as I can tell, it defaults to 4MHz operation.

Am I correct or have I missed something?


Other notes:

I read somewhere about disabling wait states on the ISA bus...

Ahh, that was Chuck(G)'s post #5. Also that Stearns PC is an interesting design, similar to the AT&T 6300. Would like to get my hands on some of those early compatibles, particularly the Columbia MPC 1600-4 with the Z80-based HDD controller. I just checked, and I suppose you could go one further in '83 and use a 286.

Link to info on early compatibles: https://books.google.com/books?id=e-gI2W-3JwkC&pg=PA121&lpg=PA121&dq=Stearns+PC+pc+mag&source=bl&ots=VZMqpC3q9u&sig=dmIJQpfHGvmDF_U4thBUQK7LmBw&hl=en&sa=X&ved=0ahUKEwjio6bQ5MfSAhVKB8AKHZhZBWMQ6AEIKTAC#v=onepage&q=Stearns%20PC%20pc%20mag&f=false

I am beginning to look at the REP methods etc. I think I understand it. Specifically, the "chuck" mod seems to combine the REP method of not requiring the fetch for an instruction, AND exploits the BIU. Very clever.

I'm not sure I understand the memory timing of "9+EA" as explained here: http://zsmith.co/intel_m.html#mov for the 86/88 MOV instruction. Near the bottom of: http://zsmith.co/intel.html it explains the Effective address. In the context of the code below, is that 9+0?


mov bx, 300h ;Point at 300h (2 cycles) setup, not counted.
in ax, dx ;Read word (3 cycles)
mov [bx], ax ;Write to RAM (3 cycles)

rittwage
March 8th, 2017, 12:53 PM
I don't know for sure but I am going to guess it's 100Mb. After all, they made ethernet cards for ISA, and they wouldn't have done so if the card exceeded the bus speed; they wouldn't have built hardware with a bottleneck limitation like that.

You're serious? Who is this seemingly honorable "they" you speak of? :)