Image Map Image Map
Page 1 of 2 12 LastLast
Results 1 to 10 of 18

Thread: 8088 optimization question (eliminating branches)

  1. #1

    Default 8088 optimization question (eliminating branches)

    Consider the following scenario: let "background" and "foreground" be two layers of visual data, composed of 16-bit values. I'd like to merge them as follows - for each offset, if foreground contains zero, it is considered transparent (=use the background value). Otherwise, use the value in foreground.

    Something like this:

    Code:
    for (i=0; i<size; i++)
    {
      merged[i] = (foreground[i] == 0) ? background[i] : foreground[i];
    }
    I'm looking to implement this in 8086 assembly. If I test each word value for zero and use conditional jumps, it's straightforward... but also very, very slow.
    Is there a faster, more optimized way to do this? Hopefully without branching, if at all possible? (The choice of zero to represent transparency is arbitrary, BTW).

    One way is to "compile" the layers, as in the "compiled sprites" technique sometimes used for fast animation... i.e. instead of plain data, generate the layers as code that performs the merging in-place (with the data as immediate values), and simply does nothing in the 'transparent' case. But that wouldn't work so well, because my use-case requires working with arbitrary pieces of each layer (so the code wouldn't have a fixed entry point, and the expense of compensating for this could nullify a lot of the gain).
    int10h.org :: :: :: blog

  2. #2
    Join Date
    Jan 2007
    Location
    Pacific Northwest, USA
    Posts
    32,707
    Blog Entries
    18

    Default

    My first and easiest step would be unrolling the loop, or at least a few iterations of it. It is possible to do the selection without branching, but I'm not sure that it will save you many cycles. Running the loop in reverse and initializing the index to its final value gets rid of a compare at the bottom of the loop (decrement and loop until zero).

    What are the possible values of "foreground"?

  3. #3

    Default

    Quote Originally Posted by Chuck(G) View Post
    My first and easiest step would be unrolling the loop, or at least a few iterations of it. It is possible to do the selection without branching, but I'm not sure that it will save you many cycles. Running the loop in reverse and initializing the index to its final value gets rid of a compare at the bottom of the loop (decrement and loop until zero).
    True, some unrolling will get rid of one major cycle-eater (loop - 18 cycles), but doing a conditional jump on each word is still 16 cycles a pop (if taken; 4 otherwise). I have the nagging feeling that I'm missing some trick that could be a major time-saver... if it's possible to do the selection without branching, as you say, please do share your idea - it might help. I haven't started on the implementation yet; just doing the math first to see if it's feasible at an acceptable speed.

    My "compiled" approach could sort of work, I think, if the do-nothing (transparency) cases constituted single jumps to the nearest do-something (nonzero) case, which would mov the foreground value as an immediate. That would take care of arbitrary entry-points, but arbitrary exit-points would still need self-modification, or more conditional checks... a bit hairy, but possibly still faster than a branch on each value?

    What are the possible values of "foreground"?
    Any 16-bit value. 0000h through FFFFh are all valid - 0000h is sort of an arbitrary choice for transparency, but thinking about it, every value with a high byte of 00h could also be considered transparent (if that makes things easier). These are text-mode character/attribute pairs, if it helps visualize what I'm trying to do.
    Last edited by VileR; November 10th, 2019 at 09:20 AM.
    int10h.org :: :: :: blog

  4. #4
    Join Date
    Jan 2007
    Location
    Pacific Northwest, USA
    Posts
    32,707
    Blog Entries
    18

    Default

    Well, I'm willing to look at the code that you've come up with and suggest speedups.

    Your code would be two instructions on a vector machine.

  5. #5

    Default

    If you have a limited number of items to analyze and they are numerically close, it makes sense to make a jump table. This way you only make one offset calculation and then one or two jumps, depending on coding.
    Another way to look at things is to efficiently encode things as bits. One can then do a series of ANDs, ORs and NOTs to create logical branching without actually branching. Remember, a 32 bit processor has 32 states possible treating each bit as a possible state. It has billions of states of using combinations of bits. Do remember that most compilers ( like C ) think of logical compares as branching statements. The thinking is that if you write your code well, you'll put things that are most often or most important for speed at the beginning of the branching statements. Still, if you have a case statement with more than 8 possible branches, there is usually a better way to handle it.
    I think both Chuck and I are wondering what transparent means to decision making ( a branch ).
    Dwight

  6. #6
    Join Date
    Jan 2007
    Location
    Pacific Northwest, USA
    Posts
    32,707
    Blog Entries
    18

    Default

    FWIW, if you're really looking to do things branch-less, consider the following bit of code (ax has the value to be tested):

    Code:
        xor     bx,bx
        sub    bx,ax
        sbb    bx,bx
    (bx) will be FFFF if ax is nonzero; 0000 otherwise. You can then use boolean operations (AND) to select the value. But I'm not sure that it will be any faster than a short jump.

  7. #7

    Default

    Yeah, I don't think going through major contortions to avoid a short branch is really worth the trouble until you get into heavily pipelined CPUs (Pentium or higher, 486 to a lesser extent,) where the cost for a missed branch becomes heavier. Still, the trick Chuck mentions is a nifty one, especially since x86 is one of the architectures where move operations don't set the flags, so any test requires a separate instruction. Might as well fold the test and select into the same set of operations, in that case. Lemme see, example implementations:
    Code:
    	; branch-less version
    	std			; prepare to decrement!
    	; pre-condition: DS and ES are set to the segments for source & dest.
    	; limitation: foreground & background must be in the same segment
    	mov di, DESTINATION	; destination address (end of destination)
    	mov bx, FOREGROUND	; foreground address (start of foreground - 2)
    	mov bp, BACKGROUND	; background address (start of background - 2)
    	mov si, SIZE		; field size
    loop	mov ax, [bx + si]	; get foreground value
    	mov cx, [bp + si]	; get background value
    	xor dx, dx		; clear DX
    	sub dx, ax		; set DX to -AX
    	sbb dx, dx		; set DX to (AX == 0) ? 0x0000 : 0xFFFF
    	and cx, dx		; pass background only if foreground = 0
    	or ax, cx		; select whichever was not zeroed
    	stosw			; place it in the destination field
    	sub si, 2		; decrement the source index
    	jnz loop		; continue
    	
    	; branching version
    	std			; prepare to decrement!
    	; pre-condition: DS and ES are set to the segments for source & dest.
    	; limitation: foreground & background must be in the same segment
    	mov di, DESTINATION	; destination address (end of destination)
    	mov bx, FOREGROUND	; foreground address (start of foreground - 2)
    	mov bp, BACKGROUND	; background address (start of background - 2)
    	mov si, SIZE		; field size
    loop	mov ax, [bx + si]	; get foreground value
    	test ax, ax		; is foreground zero?
    	jnz .st			; if no, store what we have
    	mov ax, [bp + si]	; if yes, get background value
    .st	stosw			; place it in the destination field
    	sub si, 2		; decrement the source index
    	jnz loop		; continue
    Note that the branching version is actually three instructions shorter. Whether it's faster in practice? I dunno, try it and see.

    It would be easier still to optimize if this were a destructive operation (i.e. background = (foreground == 0) ? background : foreground.) But depending on your application, that might necessitate making a separate copy of the destination layer anyway.
    Computers: Amiga 1200, DEC VAXStation 4000/60, DEC MicroPDP-11/73
    Synthesizers: Roland JX-10/SH-09/MT-32/D-50, Yamaha DX7-II/V50/TX7/TG33/FB-01, Korg MS-20 Mini/ARP Odyssey/DW-8000/X5DR, Ensoniq SQ-80, E-mu Proteus/2, Moog Satellite, Oberheim SEM
    "'Legacy code' often differs from its suggested alternative by actually working and scaling." - Bjarne Stroustrup

  8. #8
    Join Date
    Jan 2007
    Location
    Pacific Northwest, USA
    Posts
    32,707
    Blog Entries
    18

    Default

    I don't know if it would shorten things any, given your problem, but that 3-instruction snippet can be simplified to 2, thus testing the contents of ax->ffff if nonzero, 0 otherwise.

    Code:
        neg     ax
        sbb     ax,ax
    FWIW, "neg" is an oft-overlooked instruction. It can be used to set the carry flag if a register is nonzero. Of course, it destroys the content of the register.

  9. #9

    Default

    Yeah...problem is, you still need the value of AX, unless (as mentioned) you're doing a destructive copy.
    Computers: Amiga 1200, DEC VAXStation 4000/60, DEC MicroPDP-11/73
    Synthesizers: Roland JX-10/SH-09/MT-32/D-50, Yamaha DX7-II/V50/TX7/TG33/FB-01, Korg MS-20 Mini/ARP Odyssey/DW-8000/X5DR, Ensoniq SQ-80, E-mu Proteus/2, Moog Satellite, Oberheim SEM
    "'Legacy code' often differs from its suggested alternative by actually working and scaling." - Bjarne Stroustrup

  10. #10
    Join Date
    Jan 2007
    Location
    Pacific Northwest, USA
    Posts
    32,707
    Blog Entries
    18

    Default

    That's what I thought--hence, the first version.

    The second version is handy when computing FORTRAN logical IFs and other such nonsense. F77, IIRC, didn't allow for "shortcutting" compound IFs--you computed the values of all sub-expressions. It's easy to do a magnitude comparison, just subtract and a sign-extended right shift.

    e.g. IF (I.GT.J.AND.K.NE.0)...

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •