
Thread: My variation on masked blitting in Mode X

  1. #1

    Hey all, I've been experimenting with different ways to implement transparent sprites in Mode X, and I came up with a method I haven't seen before, so I thought I'd share it.

    The image is stored in a standard planar format, with palette index zero representing a transparent pixel, and blitted in the following manner:

    Code:
    outb(SC_INDEX, MAP_MASK);
    for each plane:
    {
    	mov cx, VGA_SEGMENT
    	mov es, cx		;ES = video memory (DI is assumed to already hold the destination offset)
    	lds si, planeData 	;the bitmap data
    	mov ah, planeMask	;the bit mask to enable this plane
    	mov bx, bmpHeight 	;image height
    	mov dx, SC_DATA		;Map Mask data port (index selected above)
    	
    rowLoop:
    	mov cx, bytes_per_line	;load the pixel counter
    
    pixLoop:
    	xor al, al 		;AL = 0
    	cmp al, ds:[si] 	;CF = 1 if the source pixel is non-zero, else CF = 0
    	sbb al, al		;AL = 0 - CF: 0x00 if transparent, 0xFF if opaque
    	and al, ah		;combine with the plane mask
    	out dx, al		;output the new mask setting
    	movsb			;plot the pixel (nothing is written when the mask is 0)
    	loop pixLoop		;continue as long as there are pixels left
    
    	add si, lineDiff	;add the remaining distance to the edge of the bitmap
    	add di, screenDiff	;add the remaining distance to the edge of the screen
    	dec bx			;continue as long as there are rows left
    	jnz rowLoop
    }
    The outer parts are pseudocode because the code I wrote this for is complicated in ways that aren't really relevant here. The interesting idea is generating a mask from each pixel's value and using the VGA's Map Mask register to avoid branching. This could also work with a pre-computed mask, but I didn't want to spend the memory on that in my application.
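
For readers who find C easier to follow than the assembly, the mask computation can be modelled roughly as below. This is only a sketch: `opaque_mask`, `map_mask_for` and `masked_write` are made-up names, and the tiny 4 x 16 `planes` array is a hypothetical stand-in for video memory that imitates how the Map Mask gates a write. It does not touch real hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free equivalent of the XOR/CMP/SBB sequence:
   0x00 for a transparent pixel (index 0), 0xFF for anything else. */
uint8_t opaque_mask(uint8_t pixel)
{
    return (uint8_t)-(pixel != 0);
}

/* Value that would be written to the Map Mask data port for one pixel. */
uint8_t map_mask_for(uint8_t pixel, uint8_t plane_mask)
{
    return opaque_mask(pixel) & plane_mask;
}

/* Software model (an assumption, not real VGA access): a write reaches
   plane p only when bit p of the map mask is set. */
void masked_write(uint8_t planes[4][16], unsigned offset,
                  uint8_t pixel, uint8_t plane_mask)
{
    assert(offset < 16);
    uint8_t mask = map_mask_for(pixel, plane_mask);
    for (int p = 0; p < 4; ++p) {
        if (mask & (1u << p))
            planes[p][offset] = pixel;
    }
}
```

A transparent pixel yields a mask of zero, so the write that follows changes no plane, which is the same effect the `movsb` relies on in the listing.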

    Advantages:
    - No branching required
    - clipping is fairly straightforward
    - can be stored in the same format as opaque sprites

    Disadvantages:
    - Reads each pixel twice. I tried keeping the data in a register, but the register juggling made it slower on my target (286); it might be worth it on an 8088.
    - Requires an OUT instruction for each pixel, which is slow on a 386/486 in protected mode

    Well, that's the gist of it. I'm curious to hear what you guys think, or whether you can think of any improvements.
    Currently working on new DOS game, Chuck Jones: Space Cop of the Future, Check out my Dev Blog. WARNING: contains rocket powered El Caminos

    Vintage Computers:
    Unitron Apple II clone, 2x Commodore Vic-20, Commodore 64, Commodore 128, Amiga 500, Macintosh Plus, Macintosh SE, AST Premium 286, 3 386sx PCs, Atari TT030

  2. #2

    I like the SBB trick too, but the overhead to eliminate a single jump seems not to be worth it.
    Accessing video registers or memory is usually slower than normal RAM, so it's better to avoid this.

    Code:
    	jmp pixLoop
    
    	align 2
    
    skipPixel:
    	inc di		;2
    	dec cx		;2
    	jz endPixL	;3 (assuming CX > 0)
    pixLoop:
    	lodsb		;5
    	test al,al	;2
    	jz skipPixel	;3/8
    	stosb		;3
    	loop pixLoop	;9
    endPixL:
    According to the 286 manual, this should take 22 clock cycles in both paths. Your loop takes 30, and writes to both video RAM and I/O every time, so VGA wait states will have a greater effect. Maybe Trixter or Scali could offer further comments?
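
The 22-clock figure can be checked by adding up the per-instruction counts annotated in the listing above. The sketch below does that sum in C. The enum and function names are made up for illustration, and the numbers are simply the ones from the listing's comments, not an independent measurement.

```c
#include <assert.h>

/* 286 clock counts copied from the comments in the listing above. */
enum {
    CLK_LODSB = 5,
    CLK_TEST = 2,
    CLK_JZ_SKIP_NOT_TAKEN = 3,
    CLK_JZ_SKIP_TAKEN = 8,
    CLK_STOSB = 3,
    CLK_LOOP = 9,
    CLK_INC_DI = 2,
    CLK_DEC_CX = 2,
    CLK_JZ_END_NOT_TAKEN = 3
};

/* Opaque pixel: lodsb, test, jz (falls through), stosb, loop. */
int opaque_clocks(void)
{
    return CLK_LODSB + CLK_TEST + CLK_JZ_SKIP_NOT_TAKEN + CLK_STOSB + CLK_LOOP;
}

/* Transparent pixel: lodsb, test, jz (taken), inc di, dec cx, jz (falls through). */
int transparent_clocks(void)
{
    return CLK_LODSB + CLK_TEST + CLK_JZ_SKIP_TAKEN
         + CLK_INC_DI + CLK_DEC_CX + CLK_JZ_END_NOT_TAKEN;
}

/* Nominal clocks for one row with the given pixel mix. */
long row_clocks(int opaque_pixels, int transparent_pixels)
{
    assert(opaque_pixels >= 0 && transparent_pixels >= 0);
    return (long)opaque_pixels * opaque_clocks()
         + (long)transparent_pixels * transparent_clocks();
}
```

Both paths sum to 22, so by these figures the row cost does not depend on how many pixels are transparent.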

  3. #3

    Wouldn't compiled sprites be faster/easier?
    "Good engineers keep thick authoritative books on their shelf. Not for their own reference, but to throw at people who ask stupid questions; hoping a small fragment of knowledge will osmotically transfer with each cranial impact." - Me

  4. #4

    Quote: Originally posted by eeguru
    Wouldn't compiled sprites be faster/easier?
    Faster yes... easier at mode X resolutions? Comes down to the sprite size.

    A method I use from time to time is to store first a word-sized offset based on the screen size, then how many bytes to write, then the data for a 'section'. This is highly inefficient if you have dithering across planes, but brutally efficient for images with only a few holes in them. To that end, I often have the first word act as an indicator of which encoding I'm using.

    Code:
    	lodsw
    .segmentLoop:
    	add di, ax
    	lodsw
    	mov  cx, ax
    	rep movsb
    	lodsw
    	or   ax, ax
    	jnz  .segmentLoop
    That's the heart of it. I load the first offset on the assumption that no check is needed (which also allows the first offset to be zero), add it to DI, load the count of bytes to output, then rep movsb the data over. Then I load the next offset, and if it's non-zero, keep going. This method also works well in plain old mode 13h.
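
As a rough illustration of the loop just described, here is the same decode step modelled in C. It is a sketch under assumptions: words are little-endian as on x86, `draw_spans` and `read_word` are made-up names, and the bounds check is an addition that the assembly does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 16-bit word, as lodsw would. */
uint16_t read_word(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Decode [offset][count][data...] sections until a zero offset is read.
   Returns the number of source bytes consumed, terminator included. */
size_t draw_spans(const uint8_t *src, uint8_t *dst, size_t dst_size, size_t start)
{
    const uint8_t *s = src;
    size_t di = start;
    uint16_t skip = read_word(s);            /* first offset, may be zero */
    s += 2;
    do {
        di += skip;                          /* add di, ax */
        uint16_t count = read_word(s);       /* lodsw / mov cx, ax */
        s += 2;
        assert(di + count <= dst_size);      /* safety check, not in the asm */
        for (uint16_t i = 0; i < count; ++i) /* rep movsb */
            dst[di++] = *s++;
        skip = read_word(s);                 /* next offset, zero ends */
        s += 2;
    } while (skip != 0);
    return (size_t)(s - src);
}
```

For example, the byte stream `01 00 02 00 AA BB 03 00 01 00 CC 00 00` skips one byte, writes `AA BB`, skips three more, writes `CC`, and stops at the zero word.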

    Again, if every other byte alternates between write and don't write, this can be slow, but if you have more than 2 bytes of non-transparency one after the other, this is WAY faster. Again, a header byte can be used to switch between this and a more conventional "0 as transparent" technique.

    It can also make the sprites much smaller in memory, since any run of more than 4 transparent bytes costs only 4 bytes (the offset word and count word of the next section).
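
To make the size argument concrete, here is a hedged sketch of an encoder for that layout in C. It assumes the input is one plane's bytes already laid out at the destination stride, with 0 meaning transparent, and that a zero offset word terminates the stream. `encode_spans` and `put_word` are made-up names, and the handling of a fully transparent image is my own choice rather than something the post specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a little-endian 16-bit word. */
void put_word(uint8_t *out, size_t v)
{
    out[0] = (uint8_t)(v & 0xFF);
    out[1] = (uint8_t)((v >> 8) & 0xFF);
}

/* Encode pix[0..n) as [offset][count][data...] sections plus a zero
   terminator word. Returns the number of bytes written to out. */
size_t encode_spans(const uint8_t *pix, size_t n, uint8_t *out, size_t out_cap)
{
    size_t o = 0, i = 0, last_end = 0;
    int emitted = 0;
    assert(n <= 0xFFFF);                        /* offsets and counts are words */
    while (i < n) {
        while (i < n && pix[i] == 0) ++i;       /* skip a transparent run */
        if (i == n) break;
        size_t start = i;
        while (i < n && pix[i] != 0) ++i;       /* measure the opaque run */
        size_t count = i - start;
        assert(o + 4 + count + 2 <= out_cap);   /* header + data + terminator */
        put_word(out + o, start - last_end);    /* distance from the last run */
        put_word(out + o + 2, count);
        memcpy(out + o + 4, pix + start, count);
        o += 4 + count;
        last_end = i;
        emitted = 1;
    }
    if (!emitted) {                             /* nothing opaque: empty section */
        assert(o + 6 <= out_cap);
        put_word(out + o, 0);
        put_word(out + o + 2, 0);
        o += 4;
    }
    put_word(out + o, 0);                       /* zero offset ends the stream */
    return o + 2;
}
```

With this sketch, a 20-byte line that has one opaque byte at each end and 18 transparent bytes in between encodes to 12 bytes, because the 18-byte gap is replaced by the 4-byte header of the second section.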
    From time to time the accessibility of a website must be refreshed with the blood of owners and designers. It is its natural manure.
    CUTCODEDOWN.COM

  5. #5

    I did several tests on my 286 and even made a partially unrolled version that uses lodsw and stosb. This brought the average cycle count closer to the version with the branch, but in my testing it was still slower than the branch by a significant margin, probably because of the time needed to fill the prefetch queue given the larger code. Anyway, for now I'm just using the old branch version, but I haven't given up on this method yet; I may still come up with a way to improve it. If you're interested, here's the 2x unrolled version:

    Code:
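    rowLoop:
    	mov cx, bytes_per_line
    	shr cx, 1		;CX = number of pixel pairs
    	jz lastByte		;fewer than two pixels on this line
    pixLoop:
    	lodsw			;AL = first pixel, AH = second pixel
    	mov bl, al		;keep the first pixel while AL holds the mask
    	xor al, al
    	cmp al, bl		;CF = 1 if the first pixel is non-zero
    	sbb al, al		;AL = 0x00 if transparent, 0xFF if opaque
    	and al, bh		;BH = plane mask
    	out dx, al		;set the map mask for the first pixel
    	mov al, bl
    	stosb			;plot the first pixel
    	xor al, al
    	cmp al, ah		;same again for the second pixel
    	sbb al, al
    	and al, bh
    	out dx, al
    	mov al, ah
    	stosb			;plot the second pixel
    	loop pixLoop
    lastByte:
    	test bytes_per_line, 1	;odd width?
    	jz endLine
    	xor al, al		;leftover pixel, handled as in the original loop
    	cmp al, ds:[si]
    	sbb al, al
    	and al, bh
    	out dx, al
    	movsb
    endLine:
    	add si, lineDiff
    	add di, screenDiff
    	dec heightCount
    	jnz rowLoop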

  6. #6

    Quote: Originally posted by PgrAm
    closer to the version with the branch however given my testing it ended up still being slower, by a significant margin, than using a branch
    Well, that's a lot of code... and if it's slower on a 286, it's going to be hell on an 8088. Remember, fetch is the enemy, so if you can do it in less code, it's almost always faster. ESPECIALLY if there are wait states on the memory. There's a reason 0 wait state memory paid such high dividends on 286s.

    Ballparking that code, you're looking at 100 to 150 clocks per pixel inside the loop just from all the operations -- upwards of twice that on an 8088. Given that something more like this "inside the loop":

    Code:
    .loop:
    	lodsb
    	or  al, al
    	jz  .next
    	mov  es:[di], al
    .next:
    	inc  di
    	loop .loop
    ... is going to come in well under 30 clocks when the jz is not taken, and only 10 more clocks when it is? Yeah... that. Even on an 8088, using a jump inside it is 61 clocks accounting for the BIU. That means -- in theory -- a 4.77 MHz 8088 running the jump version would be faster than a 6 MHz AT running the "dick around with the ports" approach.
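
To put that closing comparison into numbers: time per pixel is just clocks divided by clock rate. A minimal sketch in C, using only the figures quoted in this post (61 clocks at 4.77 MHz, and the 100 to 150 clock ballpark at 6 MHz). `us_per_pixel` is a made-up helper name.

```c
#include <assert.h>

/* Microseconds per pixel: clocks divided by the clock rate in MHz. */
double us_per_pixel(double clocks, double mhz)
{
    assert(mhz > 0.0);
    return clocks / mhz;
}
```

Under those figures, `us_per_pixel(61, 4.77)` is a little under 13 microseconds, while `us_per_pixel(100, 6.0)` and `us_per_pixel(150, 6.0)` come to roughly 16.7 and 25 microseconds, which is the basis for the "in theory" claim.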
