Image Map Image Map
Page 2 of 2 FirstFirst 12
Results 11 to 18 of 18

Thread: 8088 optimization question (eliminating branches)

  1. #11

    Default

    Thanks for the replies - appreciated!

    @commodorejohn: a destructive operation is perfectly good actually; I should've made my example overwrite the background. Re: cost of branching, on the 8088 the major hit may just be the conditional jump itself (16 cycles when taken). Losing the prefetch queue is a lesser concern, but then again fetching instructions from RAM is still 4 cycles/byte.

    I believe I'll unroll major chunks of the loop regardless of implementation, so the bigger worry would be the cost of processing each element. And if we do a destructive copy, the difference doesn't look like much: we can skip one read in the branching version, but the non-branching version would still needs two reads + more arithmetics (NEG, SBB, AND, OR)... and since reads are also expensive (12 for LODSW, more for MOV) the branching version might actually have a slight edge. Guess I'll have to try both and see.
    int10h.org :: :: :: blog

  2. #12
    Join Date
    Aug 2006
    Location
    Chicagoland, Illinois, USA
    Posts
    6,087
    Blog Entries
    1

    Default

    Remember reenigne's rules of optimization:

    1. Speed is roughly measured in number of I/Os, including the instruction opcode bytes.
    2. Always profile the code, as there are non-obvious factors that can affect #1. (This is an Abrash rule as well :-)

    To those, I'll add my own:

    1. Branchless is only faster if you're eliminating more I/Os than the worst-case branch (see reenigne #1).
    2. REP MOVS/STOS (with mov cx,num setup) break-even is 4 output words, so if you're storing/copying 8 bytes or more, do the REP; if less, see if you can get away with individual MOVS/STOS.

    (Don't know how large some opcodes are? Run debug, a)ssemble, type some instructions, hit enter twice, then u)nassemble to check their size.)

    Looking at commodorejohn's implementation, the branchless version is clever, but it only eliminates one jump so the jumping version should be faster. Worst case on the jump is 16 cycles, but the branchless operations take up 10 bytes (40 cycles).

    Writing directly to video memory ;-) ;-) ;-) will minimize the speed gains of one over the other, but I still think the branching one will be faster once you benchmark it.
    Offering a bounty for:
    - A working Sanyo MBC-775, Olivetti M24, or Logabax 1600
    - Music Construction Set, IBM Music Feature edition (has red sticker on front stating IBM Music Feature)

  3. #13

    Default

    Hah, thank you for those. I wasn't prepared for the I/Os sneaking up on me - seems like both of your First Rules are validated by the conclusions in this thread.

    I routinely do the DEBUG -a/-u dance for instruction sizes; however, shameful as it is, my idea of "profiling" still boils down to modifying the border color for each interesting block of code and guesstimating things visually. You could say I'm not yet attuned to the ways of Zen.

    My original problem isn't very relevant any longer, because the premise and scope of what I'm trying to do have changed (and much for the better, in terms of speed). Not that the discussion hasn't helped- eye-openers are always a good thing.

    As for doing operations like these directly to video memory... oh, not by a long shot. That would just wrap around to pure evil.
    int10h.org :: :: :: blog

  4. #14
    Join Date
    Jan 2007
    Location
    Pacific Northwest, USA
    Posts
    32,518
    Blog Entries
    18

    Default

    Are you interested in some profiling routines? I've got some that I wrote years ago.

    My first one was resident in a CDC 6600 PPU--a PP can directly read the CP P-counter, so it was mostly a matter of bookkeeping.
    Last edited by Chuck(G); November 11th, 2019 at 04:57 PM.

  5. #15

    Default

    Nothing shameful about profiling by modifying the border colour - I quite often do that too, and I have all sorts of tools for measuring how long a bit of code takes down to the cycle.

    As for the original question - I haven't generally found branchless techniques to be particularly helpful on 8088. The cost of a taken branch (and refilling the prefetch queue) is high compared to an untaken branch but by the time you've executed a couple of 2-byte instructions to do the same thing in a branchless way, you've lost any advantage. Definitely try to arrange things so that the untaken branch is the common case if you can (but not at the expense of adding an extra unconditional jump).

  6. #16

    Default

    Quote Originally Posted by Chuck(G) View Post
    Are you interested in some profiling routines? I've got some that I wrote years ago.
    My go-to would've been Abrash's Zen Timer, which is quite general-purpose, though if you have your own to share I'm always interested in looking at different approaches.

    Quote Originally Posted by reenigne View Post
    I haven't generally found branchless techniques to be particularly helpful on 8088. The cost of a taken branch (and refilling the prefetch queue) is high compared to an untaken branch but by the time you've executed a couple of 2-byte instructions to do the same thing in a branchless way, you've lost any advantage. Definitely try to arrange things so that the untaken branch is the common case if you can (but not at the expense of adding an extra unconditional jump).
    Yep, that's what I'm beginning to realize. I should really go back to some of my existing code and apply some of these insights - especially in 'precalc' situations where I wasn't too rigorous... and perhaps still had an inertial tendency to optimize for size, rather than speed. The latter is definitely trickier.
    int10h.org :: :: :: blog

  7. #17

    Default

    The advice about rearranging the clauses of a branch is also very good. I've lost count of the number of times I've been trying to get my head around the best way to do something and realized that if I just switch around which of the possibilities I'm testing is the fall-through case, everything makes so much more sense.
    Computers: Amiga 1200, DEC VAXStation 4000/60, DEC MicroPDP-11/73
    Synthesizers: Roland JX-10/SH-09/MT-32/D-50, Yamaha DX7-II/V50/TX7/TG33/FB-01, Korg MS-20 Mini/ARP Odyssey/DW-8000/X5DR, Ensoniq SQ-80, E-mu Proteus/2, Moog Satellite, Oberheim SEM
    "'Legacy code' often differs from its suggested alternative by actually working and scaling." - Bjarne Stroustrup

  8. #18

    Default

    I believe this is the fastest branchless way to do it:

    Code:
    	; DS:SI => foreground
    	; DS:BX => background
    	; ES:DI => output
    	; CX = length
    
    	; make BX relative to SI
    	sub bx, si
    	dec bx
    	dec bx			;BX = bg_array - fg_array - 2
    
    	mov bp, 1		;constant
    	cld
    
    l1	lodsw			;load foreground into AX
    	cmp ax, bp		;set carry if transparent
    	sbb dx, dx		;DX = mask
    	and dx, [bx + si]	;DX = (fg == 0? bg : 0)
    	or ax, dx		;AX = either fg or bg
    	stosw
    	loop l1
    But the most straightforward branching version should be faster in any case:

    Code:
    l1	lodsw
    	test ax,ax
    	jnz l2
    	mov ax, [bx + si]
    l2	stosw
    * 2 bytes shorter: -8 clocks
    * jump taken, mov skipped: +16 -19

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •