Observing a CPU bug in action



Trixter
March 4th, 2016, 11:47 AM
I stumbled onto the 8086-80286 "only the two most recent prefixes are honored after an interrupt" bug doing something visual and could reproduce it, so I thought I'd make a short video about it:


https://www.youtube.com/watch?v=6FC-tcwMBnU

Chuck(G)
March 4th, 2016, 01:11 PM
Since ES: is assumed for the destination, and DS: for the source, a couple of pushes and pops on the segregs before and after the movsb should handle things quite nicely. The rep/movs combination was a minefield on the early steppings of the 186 also. If executed while a DMA transfer was in progress, SI and DI could get clobbered.

Trixter
March 4th, 2016, 01:51 PM
There were two fixes we used. The size-optimized fix was:


@@again:
seges movsb
loop @@again

The speed-optimized fix gives the user one of two ways to work around the problem: Either disable interrupts around the REP MOVS, or spend time rearranging register contents so it can be a normal DS:SI -> ES:DI REP MOVS copy leaving interrupts enabled. User gets to pick via a define in the assembler source.

Chuck(G)
March 4th, 2016, 03:09 PM
It mostly depends on how much data you're moving as to which triumphs. I suspect a major code overhaul would eliminate the need for any fixup code at all. :)

Trixter
March 4th, 2016, 07:43 PM
When I timed the code with varying input, this:



cli
es: rep movsb
sti


...was faster than this:



mov bp,ds
mov bx,es
mov ds,bx
rep movsb
mov ds,bp


...which was faster than this:



push ds
push es
pop ds
rep movsb
pop ds


The first two are given as a configuration option in the code for the user to choose which one they want to use. The one that disables interrupts never takes a CX higher than 127 due to what the code does, so the maximum number of cycles interrupts could be disabled at any one time is (127*4)+(4*4)=524 cycles. The user is warned about both tradeoffs in the comments around the compile directives they can alter.
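
For a sense of scale, the estimate above can be turned into wall-clock time. This is only a rough sketch in Python: the cycle figures are taken straight from the post, and the clock rate assumes a stock 4.77 MHz IBM PC/XT, which the post does not state explicitly.

```python
# Figures from the post; the constant names are made up for this sketch.
MAX_CX = 127               # largest CX the interrupt-disabling path ever sees
CYCLES_PER_BYTE = 4        # per-byte term of the post's estimate
OVERHEAD_CYCLES = 4 * 4    # fixed term of the post's estimate

# Assumption: stock IBM PC/XT clock (14.31818 MHz crystal divided by 3).
CLOCK_HZ = 14_318_180 / 3

cycles = MAX_CX * CYCLES_PER_BYTE + OVERHEAD_CYCLES
microseconds = cycles / CLOCK_HZ * 1e6
print(cycles, round(microseconds, 1))
```

Under those assumptions the window with interrupts disabled is on the order of a tenth of a millisecond.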

Chuck(G)
March 4th, 2016, 08:14 PM
Does (movsb) rep movsw gain anything in this case?

Trixter
March 4th, 2016, 08:49 PM
Normally I'd say no, since the 8088 is the target. But if the 8086 were the target, then yes, branching to a rep movsw section would help for longer copies (like, 16 bytes or more). However, the code in question is decompression code, and the compression method rarely results in match lengths over 10 bytes for typical inputs, so a check-and-branch to handle it better would take more time than it saves.

An alternative to a branch would be something that handles everything, like this:


shr cx,1
rep movsw
adc cx,cx
rep movsb

However, the code in question makes extensive use of the carry bit, which means I'd have to preserve carry before the above sequence, and restore it afterwards, and that also takes more time than it saves.

Speed optimization, like compression algorithms, is a minefield of trade-offs.
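
To see why the shr/adc pair in the four-instruction sequence above handles both even and odd lengths, here is a small Python model of it. It is only a sketch: the function name and the list standing in for memory are invented for illustration, and it models CX, the carry flag and the copy itself, not timing.

```python
def copy_words_then_byte(src, cx):
    """Model of: shr cx,1 / rep movsw / adc cx,cx / rep movsb.

    src is a list of byte values, cx the number of bytes to copy.
    Returns (copied bytes, final CX, final carry flag).
    """
    dst = []
    si = 0

    # shr cx,1: the low bit of CX falls into the carry flag
    carry = cx & 1
    cx >>= 1

    # rep movsw: copy CX words; leaves CX = 0 and does not touch the flags
    while cx:
        dst.extend(src[si:si + 2])
        si += 2
        cx -= 1

    # adc cx,cx: CX = CX + CX + carry; CX was 0, so it becomes the old low bit
    total = cx + cx + carry
    carry = (total >> 16) & 1   # adc rewrites carry; it ends up 0 here
    cx = total & 0xFFFF

    # rep movsb: copy the leftover byte, if there is one
    while cx:
        dst.append(src[si])
        si += 1
        cx -= 1

    return dst, cx, carry


data = list(range(79))          # hypothetical 79-byte source
copied, cx, carry = copy_words_then_byte(data, 79)
print(len(copied), copied == data, cx, carry)
```

In this model the carry flag always comes out as 0, whatever it was on entry, which is the clobbering that makes the sequence awkward for code that keeps live state in carry.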

Scali
March 5th, 2016, 02:59 AM
> Does (movsb) rep movsw gain anything in this case?

rep movsb and rep movsw certainly perform differently, even on 8088.
I found that my CGA clone was slightly too slow to run Codeblasters' CGA demo properly on a 4.77 MHz system.
You could see that the bottom scanline of the scroller was not updated at the time the CRT hit that part of the screen.
The demo was apparently designed to *just* finish updating the scroller (they probably hand-tuned the size of the rasterbars at the top to be as big as possible). However, this CGA clone apparently inserted a few waitstates more than a real CGA card does.
So I disassembled and studied the code, and found that it did rep movsb for the scroller.
By rewriting it to rep movsw movsb (it was an odd number of bytes to be copied, namely 79 bytes per scanline), I saved just enough cycles to make it run perfectly on the clone CGA card.
See blog and code here: https://scalibq.wordpress.com/2014/11/22/cgademo-by-codeblasters/

pearce_jj
March 5th, 2016, 04:15 AM
rep movsw is the reason that the Lo-tech storage adapters are able to perform better than other types. But not all early hardware correctly implements the byte transfer order (AT&T PC6300 I think was one).

Krille
March 5th, 2016, 08:11 AM
I think this bug only exists on the 8088/8086, not on anything newer.


> The size-optimized fix was:
>
> @@again:
> seges movsb
> loop @@again


This will lead to an off-by-one error for every interrupt(ion). All the workarounds I've seen return to the string instruction with CX unchanged.
This might be a better way:

@@again:
seges movsb
inc cx
loop @@again

Also:


> When I timed the code with varying input, this:
>
> cli
> es: rep movsb
> sti



This will not work because the prefixes are in the wrong order. The CPU only "remembers" the last prefix so the above code will use DS as the source segment when returning from an interrupt.

EDIT: Scratch this last one, I'm stupid.

Scali
March 5th, 2016, 08:18 AM
> This will lead to an off-by-one error for every interrupt(ion). All the workarounds I've seen return to the string instruction with CX unchanged.

Are you sure?
Note that this code does not use the 'rep' prefix at all. It uses loop *instead* of rep, therefore the issue does not exist.

Chuck(G)
March 5th, 2016, 09:02 AM
> rep movsw is the reason that the Lo-tech storage adapters are able to perform better than other types. But not all early hardware correctly implements the byte transfer order (AT&T PC6300 I think was one).

On the 6300, it was the hardware 16-bit-to-8 bit BIU implemented in external hardware that was the problem. It wasn't so much the MOVSW (which works okay), but the IN AX,DX instruction. The Olivetti engineers didn't quite get it right.

Krille
March 5th, 2016, 09:57 AM
> Are you sure?
> Note that this code does not use the 'rep' prefix at all. It uses loop *instead* of rep, therefore the issue does not exist.

Aargh :headslap: You're right of course. I need to work on my speed reading. :)

Scali
March 6th, 2016, 02:44 PM
I decided to see what the original 8088/8086 manual says: http://matthieu.benoit.free.fr/cross/data_sheets/8086_family_Users_Manual.pdf
And it actually documents this behaviour, see page 2-42.
So I don't think we can call this a 'bug' as such. It is a quirk, but a documented one, so the CPU works as advertised.

Trixter
March 6th, 2016, 02:56 PM
I'd call it a "documented bug" :-) especially since they "fixed" it on later processors.

I hadn't realized it was documented; thanks for the reference.

Scali
March 6th, 2016, 10:54 PM
> I'd call it a "documented bug" :-) especially since they "fixed" it on later processors.

Yea, I guess it wasn't a 'bug' until they introduced a CPU that had different behaviour.
In fact, you could even argue that the new behaviour is a bug, since it is not backward-compatible with 8088/8086 :)
But they probably documented that just as nicely. I guess I'd have to check the 80186, 286 and possibly 386 manuals as well, to see where they started reporting different behaviour.

Edit: I'm not entirely sure, but I think I found it in the 286 manual: http://bitsavers.informatik.uni-stuttgart.de/pdf/intel/80286/210498-005_80286_and_80287_Programmers_Reference_Manual_1987.pdf
At the part discussing the rep-instruction, they mention overriding ds:si to es:si, but make no mention of any special case.
Go further to the chapter about interrupts, and on page 5-5 it says:
"(the saved value of CS:IP will include all leading prefixes)"
But this part of the manual seems to deal with exceptions, not specifically rep movsw etc.
In appendix C-1 they list changes from 8088/8086, and they do point out that:
"Any interrupt on the 80286 will always leave the saved CS:IP value pointing at the beginning of the instruction that failed (including prefixes). On the 8086, the CS:IP value saved for a divide exception points at the next instruction."
But that is the only difference in interrupt handling that they specify.
So the manual isn't entirely clear about this, but it sounds like the 286 behaves differently than 8088/8086 in this case.

Did you test and verify whether the 286 has the bug or not?

Trixter
March 8th, 2016, 11:44 AM
> Did you test and verify whether the 286 has the bug or not?

I did not. I don't have easy access to a 286 right now, but it would be pretty easy for someone to test: Set DS:SI = ES:DI, then perform a REP ES MOVS with CX=FFFF while IRQ 0 is set to fire at a high rate and test whether or not CX=0 when done. If the bug exists, the REP will be dropped and CX wouldn't have counted all the way down to 0.
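
The logic of that test can be sketched as a small Python model. It is not real hardware behaviour, just an illustration under the assumption discussed in this thread: that on the 8088/8086 only the prefix immediately before the string instruction survives an interrupt. The function and parameter names are invented for the sketch, and a single interrupt stands in for the fast-firing IRQ 0.

```python
def run_string_copy(prefixes, cx, interrupt_at, keep_all_prefixes):
    """Model one 'prefixes + MOVSB' instruction that is interrupted once.

    prefixes          -- e.g. ["rep", "es"], in encoding order
    cx                -- initial count
    interrupt_at      -- number of bytes copied when the interrupt hits
    keep_all_prefixes -- False: resume with the last prefix only (assumed
                         8088/8086 behaviour); True: resume with all of them
    Returns (final CX, list of source segments used, one entry per byte).
    """
    active = list(prefixes)
    segments_used = []
    interrupted = False
    while True:
        seg = "es" if "es" in active else "ds"
        if "rep" not in active:
            # Plain MOVSB: copies one byte and leaves CX alone
            segments_used.append(seg)
            return cx, segments_used
        if cx == 0:
            return cx, segments_used
        segments_used.append(seg)
        cx -= 1
        # An interrupt is only modelled while the instruction is unfinished
        if not interrupted and cx and len(segments_used) == interrupt_at:
            interrupted = True
            if not keep_all_prefixes:
                active = active[-1:]   # only the last prefix survives


for order in (["rep", "es"], ["es", "rep"]):
    for keep_all in (False, True):
        cx, segs = run_string_copy(order, 0xFFFF, 1000, keep_all)
        print(order, keep_all, hex(cx), len(segs), sorted(set(segs)))
```

With REP first and the single-prefix rule, the model stops early with CX nonzero, which is the signature the test looks for. With the segment override first, CX does reach 0 but the rest of the copy reads from DS, so checking CX alone would not catch that ordering.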

Scali
March 8th, 2016, 10:53 PM
Ah yes, I have a 286 as well of course (a late model Harris 286-20). I could do a little test myself if I get round to it.

vol.litwr
December 23rd, 2018, 04:36 AM
> Ah yes, I have a 286 as well of course (a late model Harris 286-20). I could do a little test myself if I get round to it.
There are still no results. :(

Interestingly, http://tcm.computerhistory.org/ComputerTimeline/Chap37_intel_CS2.pdf (page 631) has the following text about the 8086:
"During the execution of a repeated primitive operation the operand pointer registers (SI and DI) and the operation count register (CX) are updated after each repetition, whereas the instruction pointer will retain the offset address of the repeat prefix byte (assuming it immediately precedes the string operation instruction). Thus, an interrupted repeated operation will be correctly resumed when control returns from the interrupting task."
So the described behavior doesn't guarantee correct execution of a sequence of two prefixes either.