• Please review our updated Terms and Rules here

Fast line drawing and RGB in 8 BPP

resman

Veteran Member
Joined
Jan 1, 2014
Messages
603
Location
Lake Tahoe
Nothing new here, in fact this is from projects 30+ years ago that I never had a chance to complete satisfactorily due to time constraints.

Everything is posted on GitHub: https://github.com/dschmenk/Bresen-Span

First up is a span based Bresenham line implementation. Being able to fill horizontal or vertical spans of pixels at a time instead of the simpler pixel at a time was a performance win once upon a time. Calculating the span lengths before iterating through the line always seemed more difficult than necessary. So I spent time simplifying, clarifying and ensuring it hit the exact same pixels as the pixel-at-a-time implementation. There is a C version and an 8086 assembly version using function pointers to write the pixel spans so that the generic line routine can be used with many different backends. As an example, there is a triangle filling demo that uses the line drawer to save the edges of a triangle into a left and right edge array, then use the horizontal span routine to fill the scan lines of the edge array. Or, optionally, just draw the edges of the triangle.

Second is an anti-aliased line drawer using Wu's algorithm (and probably independently discovered by anyone who ever tried to implement an AA line routine). I still may not be completely happy with the precision of the resulting line, but it looks pretty good for near horizontal/vertical lines. Only a C version exists, but a 8086 assembly version would be pretty straight forward.

Lastly, to play with all of this, there is tiny library to use VGA mode 13h (320x200 @ 8BPP) that programs a 3-3-2 RGB color cube and uses dithering to improve the result a little. Lines can be draw with a dithered pattern brush, or the closest solid match to the 3-3-2 RGB palette. The TESTLINE.EXE demo program will time the pixel-at-a-time line drawer and the span-at-a-time line drawer with both the dithered pattern lines or solid line depending on the disable dithering '-d0' flag passed in on the command line. The triangle filling demo is FILLTRI.EXE with two flags: '-d0' disables dithering and '-f0' disables filling.

The best reference to investigate these algorithms and many others would be Michael Abrash's "Black Book" of Graphics Programming: http://www.drdobbs.com/parallel/graphics-programming-black-book/184404919

P.S. I recently acquired a VGA card for my original Compaq Deskpro. This seemed like a good opportunity to revisit that 30 year old itch and scratch it a bit. I have much faster machines with VGAs installed, but like the old saying: "It's more fun to drive a slow car fast than a fast car slow". I hope you enjoy my excursion into graphics history.
 

I don't have anything even remotely intelligent to say about the algorithms but when reading the assembly code I noticed something like this in quite a few places;

Code:
	shl	ax,1
	or	ax,ax
	jge	x_abs
JGE jumps if the sign flag (SF) equals the overflow flag (OF). The OR instruction (as well as the other bitwise logical instructions like XOR, AND and TEST) always clear the OF so the JGE is essentially reduced to a JNS. Knowing that we can optimize this into just two instructions;
Code:
	shl	ax,1
	jns	x_abs
 
Thanks Krille! My 8086 assembly language neurons shriveled up about 20 years ago. I have a lot to re-learn. I'm guessing there is much low-hanging fruit to harvest from my code. One of the nice things about putting the code out there is getting great feedback. And I've already found a couple of stupid errors and want to revisit the alpha blending code.

Dave...
 
Last edited:
I would think -- even using assembly -- that the setup's use of divide and multiply being hundreds of clocks APIECE would take longer than the pixel at a time method for a 240px tall (or shorter) display once you swap from putpixel or hspan handling X and Y to using memory offsets and adding width instead of Y... at least if you're targeting the 8088/8086.

something along these lines:

Code:
void line(int x1, int y1, int x2, int y2, char color) {

	int disX, disY, temp, counter;
	char *p = (char *)0xA0000;	
	
	if (x1 > x2) {
		temp = x1;
		x1 = x2;
		x2 = temp;
	}
	if (y1 > y2) {
		temp = y1;
		y1 = y2;
		y2 = temp;
	}
	
	\\ remember to leverage loose evaluation when possible!
	if (!(disX = x2 - x1)) vline(x1, y1, y2);
	if (!(disY = y2 - y1)) hLine(x1, x2, y1);
	
	p += x1 + y1 << 6 + y1 << 8;
		
	if (disX > disY) {
	
		counter = (temp = disX) + 1;
		do {
			*p = color;
			if ((temp -= disY) < 0) {
				temp += disX;
				p += 320;
			}
			p++;
		} while (x1 < x2);
		
	} else {
	
		counter = (temp = disY) + 1;
		do {
			*p = color;
			if ((temp -= disX) < 0) {
				temp += disY;
				p++;
			}
			p += 320;
		} while (--counter);
		
	}
	
} // line

At least on a 8088 would be 20 or 30 pixels deep before your setup even finishes JUST because of the five multiplies and three or so divides (though yes, shift as divide could help). Remember, once you get below the 286 (though you didn't say your target) you're looking at 180 to 260 clocks (give or take, off the top of my head) for EACH of those operations. You're a thousand or more clocks deep before you even start drawing.

Even so, once you get down to the ASM level where you would/should be leveraging things like stosb I don't expect your more complex setup to actually pay real dividends in speed, particularly as the amount of pixels in a "scan" dwindles. Shorter lines too would throw the balance in favor of the per-pixel approach.

I dunno though, I'd have to play with it live. Just guessing here for now.

-- edit -- even though it only saves a couple clocks, try to stick to do/while as well.
 
I would think -- even using assembly -- that the setup's use of divide and multiply being hundreds of clocks APIECE would take longer than the pixel at a time method for a 240px tall (or shorter) display once you swap from putpixel or hspan handling X and Y to using memory offsets and adding width instead of Y... at least if you're targeting the 8088/8086.

Right, so there are a large number of instances where the span-at-a-time won't result in the fastest line drawing. Even in my contrived testbed, drawing lines around the origin in all directions only produced an overall improvement of less that 20%. In-lining the pixel/span plotting within the line drawing is definitely the way to go for flat out performance. If, however, there was a desire to swap out the pixel/span plotting routine with different versions for ROPs or blending modes it could be beneficial to have a call into a function per pixel/span. But even there, the benefits of a span drawer approach zero (or less) as the slope of the line approaches 1.0 and greater. There is a more sophisticated version of the span drawer that includes diagonals. But now you have to write a horizontal span, vertical span and diagonal span routine. In truth, when I was developing the two C routines, the in-lined per pixel version was faster than my call per span version by a significant margin - even with the overhead of dithering the pixels.

And it gets worse. Once your CPU gets faster than the I/O bus that will quickly become the bottleneck. If the line is long enough and horizontal enough, there could be a span drawer that would combine adjacent pixels into 16 or 32 bit aligned accesses (assuming the bus interface didn't already do that). Once again, lines with a slope approaching 1.0 and beyond wouldn't reap any benefits.

At least on a 8088 would be 20 or 30 pixels deep before your setup even finishes JUST because of the five multiplies and three or so divides (though yes, shift as divide could help). Remember, once you get below the 286 (though you didn't say your target) you're looking at 180 to 260 clocks (give or take, off the top of my head) for EACH of those operations. You're a thousand or more clocks deep before you even start drawing.

Divide and multiply by two are assumed to be shifts, but yes, the div and mul instructions are a hefty penalty on the 8086. Even more so on a 6502 without such instructions so I left the long division code in there with a #define. I should re-visit the 8086 code to see if the assembly version of long division would be any faster for short lines.

Even so, once you get down to the ASM level where you would/should be leveraging things like stosb I don't expect your more complex setup to actually pay real dividends in speed, particularly as the amount of pixels in a "scan" dwindles. Shorter lines too would throw the balance in favor of the per-pixel approach.

I dunno though, I'd have to play with it live. Just guessing here for now.

-- edit -- even though it only saves a couple clocks, try to stick to do/while as well.

Well, I can't really argue against your statements. The only real answer would be to time the actual use case and pick the best approach. This was really an itch I never fully scratched. I think it would be safe to say that the span-at-a-time approach wasn't the game changer for the majority of applications.
 
I was playing with this idea more, and came to the conclusion that if the slope is more than 6:1, span at a time wins on the horizontal, and if it's more than 10:1 it wins on the vertical, assuming 8088 as the target... but only if the length is more than 50 pixels.

The divide to figure out slope is a bit heavy, but since you could do the>< comparison first on the X and Y distances, the larger one is more than 50 do the divide to figure out the slope, then route to the appropriate routine.

That divide does slow down longer runs > 50 pixels, but overall by using "the right tool for the job" in each case you get a general speedup... at least with random line draws.
 
Thanks for spending so much time looking at this. Since I've been timing a V-30, the division didn't impact the speed as much as it would have on an 8088/86.

I'm in the process of bringing the 4 BPP dither code over to draw RGB lines where the span line drawing should help amortize the planar EGA/VGA setup a bit. I will be able to time that on a 4.77 MHz 8088.
 
Here is a WIP update to show what's been developing.The 8BPP RGB library has been extended to support CGA, EGA, along with the VGA. The EGA 4 BPP RGB dither code came over from the other project. I've put the updated binaries on GitHub: https://github.com/dschmenk/Bresen-Span/tree/master/bin

From the readme:
Here is an update to the x86 sample programs using the span-based and anti-alias line drawing routines. This is a work-in-progress for a graphics library I'm writing to support the VGA, EGA, and CGA adapters in a consistent way. I got tired of rewriting tests and demos for all my retro PCs with their different graphics cards. So this new library works by finding commonality between the graphics adapters and writing an RGB API to hide the color depth differences. I brought over the 16 color, 4BPP EGA dithering algorithm to supplement the 256 color, 8BPP VGA dithering algorithm used here previously. The CGA dithereing works similarly, but with even more constraints - I'll get to that shortly.

Finding commonality: The one graphics resolution in common with all three adapters that has the most color depth on each is 320x200. That is the only supported resolution for this library. An attempt is made to map RGB colors to each graphics mode.

CGA: 320x200x2 maps RGB values to monochrome pixels. The reason is that the colors available on the CGA won't map to RGB values, so it is treated as a 4 level grey scale. On monochrome monitors such as the Compaq dual-mode monitor, the colors do map to a greyscale. When connected to an RGB monitor or composite video, the colors are combined to attempt a grey color.

EGA: 320x200x4 maps RGB to 16 IRGB colors. When connected to a monochrome monitor that maps colors to grey values, the driver will map the IRGB colors to the grey levels output by the monitor.

VGA: 320x200x8 maps RGB colors to a 3-3-2 RGB colormap. For monochrome, a 64 level grey ramp is programmed into the color registers.

The library will auto-detect the graphics adapter and select the mode with the most colors. The library has a few options to control the adapter selection and the color mapping mode. By default, the library wil dither the RGB colors to the best colors the adapter can disply. FOr CGA, this is monochrome ony. The EGA and VGA can output to either color or monochrome monitors. The dithering can be disabled to use a best-match color, greatly reducing color fideltity, but with a slight speed improvement for EGA and VGA. The librarby can also be forced to use a lesser color depth than what the hardware may support.

Sample programs: These programs just demonstrate the abilities of the library. There are common command line options to control the libraries features.

-m : monochrome mode
-n : no dithering, use best match
-d2 : use 2 BPP mode
-d4 : use 4 BPP mode
-d8 : use 8 BPP mode (default)

trifill.exe has an additional flag to control filling.

-f0 : disable fill mode

showpbm.exe also takes a parameter, the PBM file to display. There are a couple .PNM images (24BPP RGB PBM images) included.

Screenshot of FILLTRI.EXE
FILLTRI.jpg
 
This ought to be the last update for a while. I went ahead and implemented page flipping in the GFX library for all devices, then broke out the library into a separate directory on GitHub: https://github.com/dschmenk/Bresen-Span/tree/master/src/gfxlib

Page flipping is supported by allowing two pages in graphics memory to be rendered and displayed (except the CGA). The CGA will allocate a main memory buffer and do a copy to video memory during a page flip. Normally this is without issue unless you are expecting the contents of the previous front page to now be the contents of the back page. Look at LINETEST.EXE to see where this breaks down at the end of the program. The sample programs work well on real hardware and DOSBox.

TRIFILL.EXE got an additional flag to control double buffering.

-b : buffer the rendering to back page and flip on VSYNC to display

In order to get two pages of 8BPP in the VGA required going into modeY ( mode X at 320x200x8 ). The routines had to be updated to use the planar registers, but was pretty straight forward. The benefit is that the horizontal span and clear routines can go faster (4 pixels/byte write). So the span-at-a-time line drawing really shines in this mode. I also simplified the anti-alias pixel routines for the EGA and CGA. There was no need to get too excessive on the blending operation. Now it simply ORs the source * alpha with the destination. Much faster, especially on the EGA, and looks no different.
 
Page flipping is supported by allowing two pages in graphics memory to be rendered and displayed (except the CGA). The CGA will allocate a main memory buffer and do a copy to video memory during a page flip. Normally this is without issue unless you are expecting the contents of the previous front page to now be the contents of the back page.

There is a simple fix for that: Allocate two pages in main memory. Whenever there is a pageflip, copy to CGA RAM to display the new visible page, and then swap your "hidden" and "visible" pointers to your main memory pages.
 
There is a simple fix for that: Allocate two pages in main memory. Whenever there is a pageflip, copy to CGA RAM to display the new visible page, and then swap your "hidden" and "visible" pointers to your main memory pages.

Indeed, but that is a lot of memory copying for relatively little benefit. Easier to just be aware of the "feature".
 
There are many places where you have code looking like this;
Code:
        mov     reg,reg        ; Or 'mov     reg,[mem]'
        and     reg,imm

This can often, but not always*, be made shorter by changing it into this;
Code:
        mov     reg,imm
        and     reg,reg        ; Or 'and     reg,[mem]'

For example, this;
Code:
        mov     bl,dl
        and     bl,07h
should be replaced with this;
Code:
        mov     bl,07h
        and     bl,dl
This;
Code:
        mov     bl,dl
        shr     bl,1
        shr     bl,1
        shr     bl,1
        and     bl,07h
with this;
Code:
        mov     bl,38h
        and     bl,dl
        shr     bl,1
        shr     bl,1
        shr     bl,1

Yeah, you get the idea. :)

*When the destination register is the accumulator (AX or AL) then there's no benefit, it's the same size.
 
Thanks Krille! The nuances of 8086 assembly are slowly returning. Your pointers are greatly appreciated.
 
There is a simple fix for that: Allocate two pages in main memory. Whenever there is a pageflip, copy to CGA RAM to display the new visible page, and then swap your "hidden" and "visible" pointers to your main memory pages.

Ah, I just re-read your post (hopefully with comprehension this time). So, yes, I see what you are saying. In the case of rendering to the front buffer, it would just need to copy that back to the off-screen buffer should a page flip occur.
 
Thanks to Trixster and Krille for their valuable insights. I also added a quick and dirty, if not somewhat clever, bitblt routine that also enabled a text output routine. Updates on GitHub. *Now* I should be done for awhile.

Screen Shot 2019-04-25 at 2.10.55 PM.jpg
 
At the risk of wearing everyone out with my updates, I've been busy adding programs and demos of graphics routines I wrote from the late '80s and early '90s. They've been updated to use this GFX library in 16 bit real mode code. They are documented and available on GitHub.

But since this is about the span line drawing routine, there are some interesting things I've used it for that aren't directly tied to drawing lines. By replacing horizontal and vertical span routine pointers with different functions, additional features can be implemented.

The first is to accumulate the edges of lines that would be drawn, then fill between the edges, providing a convex polygon fill routine. Sample implementation in gfxmode.c.

Another is to linearly interpolate between values to fill an array for color conversion in the 8BPP color dither code, in gfx8.c.

But the most fun is to stretch and shrink an image. By replacing the span routines for the vertical stretch/shrink routines, then inside those routines, replacing the span routines for the horizontal stretch/shrink routines. Snippets from showpbm.c:

Code:
/*
 * Image stretch with help of span line routine
 */
void hstretchrgb(int xl, int xr, int y)
{
    unsigned int r, g, b;

    r = gamma[getc(pbmfile)];
    g = gamma[getc(pbmfile)];
    b = gamma[getc(pbmfile)];
    do
    {
        redscan[xl] = r;
        grnscan[xl] = g;
        bluscan[xl] = b;
    } while (++xl <= xr);
}
void hshrinkrgb(int x, int yt, int yb)
{
    unsigned int r, g, b, n;

    r = g = b = 0;
    n = yb - yt + 1;
    do
    {
        r += gamma[getc(pbmfile)];
        g += gamma[getc(pbmfile)];
        b += gamma[getc(pbmfile)];
    } while (++yt <= yb);
    redscan[x] = r / n;
    grnscan[x] = g / n;
    bluscan[x] = b / n;
}
void vstretchrgb(int xl, int xr, int y)
{
    hspan = hstretchrgb;
    vspan = hshrinkrgb;
    line(left, 0, right, pbmwidth-1);
    do
    {
        renderscanrgb(xl);
    } while (++xl <= xr);
    hspan = vstretchrgb;
}
void vshrinkrgb(int x, int yt, int yb)
{
    unsigned int n, p;

    memset(redaccum, 0, 320 * sizeof(unsigned int));
    memset(grnaccum, 0, 320 * sizeof(unsigned int));
    memset(bluaccum, 0, 320 * sizeof(unsigned int));
    hspan = hstretchrgb;
    vspan = hshrinkrgb;
    n = yb - yt + 1;
    do
    {
        line(left, 0, right, pbmwidth-1);
        for (p = 0; p < 320; p++)
        {
            redaccum[p] += redscan[p];
            grnaccum[p] += grnscan[p];
            bluaccum[p] += bluscan[p];
        }
    } while (++yt <= yb);
    for (p = 0; p < 320; p++)
    {
        color(redaccum[p] / n, grnaccum[p] / n, bluaccum[p] / n);
        pixel(p, x);
    }
    vspan = vshrinkrgb;
}

int main(int argc, char **argv)
{
    ...
    
    left   =
    top    = 0;
    right  = 319;
    bottom = 199;
    hspan = vstretchrgb;
    vspan = vshrinkrgb;
    line(top, 0, bottom, pbmheight-1);
    ...
}
 
Back
Top