The trick is to generate a table of squares. To compute 2*x*y, look up (x+y)*(x+y) and subtract x*x+y*y. The table can incorporate the shift needed for the fixed-point correction and the mask needed to ensure only even addresses are generated. The real and imaginary components of c can likewise be pre-masked to have even bit-patterns.
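The identity is easy to check in plain C. Here's a minimal sketch of the squares-table trick (table size and operand ranges are illustrative, not from the real implementation, which also folds in the shift and address masking described above):

```c
#include <stdint.h>

/* 2*x*y = (x+y)^2 - (x^2 + y^2).
   The table is indexed by a value in -256..255 (the range of x+y for
   byte-sized x and y), biased by 256 so indices are non-negative. */

static int32_t squares[512];

static void init_squares(void) {
    for (int s = -256; s <= 255; s++)
        squares[s + 256] = (int32_t)s * s;
}

/* x, y in -128..127; returns 2*x*y exactly, with no multiply instruction */
static int32_t twice_product(int x, int y) {
    return squares[(x + y) + 256] - (squares[x + 256] + squares[y + 256]);
}
```

In the Mandelbrot loop itself x*x and y*y are needed anyway for the bailout test, so the only extra lookup per iteration is (x+y)^2.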
Unfortunately the bailout condition is still tricky - I'm trying to do it with a single conditional jump but it's not quite there yet:
mandel_wip.jpg
(waiting at airport)
Is that quarter-square multiplication? I came across it when working on the 6502, but never used it; there's a nice writeup here: http://www.llx.com/~nparker/a2/mult.html
I actually got as far as realising that, with 9.7 fixed point, the entire upper byte is wasted (the -2..2 range can be represented in the lower byte), so it's effectively 1.7 instead. The extra precision is only needed for multiplication, which we won't be doing any more. This means that an ultra-simple brute-force multiplication lookup table is only 64kB (and technically, as multiplication is commutative, we could get away with 32kB). That'd make a multiplication as simple as mov bl, x; mov bh, y; mov al, es:[bx]!
You wouldn't get overflow testing for free, though. I wonder if it would be feasible to use a NaN value as a marker? 0xff? That would correspond to 1.9921875, which is close enough to 2 that we could probably get away with it...
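For what it's worth, here's a sketch of how such a table might be built. The 0xff overflow marker is the speculative part, and signs are ignored entirely (which, as comes up below, is the catch):

```c
#include <stdint.h>
#include <stdlib.h>

/* Brute-force 64kB product table for 1.7 fixed point, treating both
   operands as unsigned magnitudes. 1.0 == 128; products that don't fit
   in 1.7 saturate to the hypothetical NaN marker 0xFF. */

#define ONE      128     /* 1.0 in 1.7 fixed point */
#define NAN_BYTE 0xFF    /* assumed overflow marker, not confirmed */

static uint8_t *build_product_table(void) {
    uint8_t *t = malloc(65536);
    if (!t) return NULL;
    for (int x = 0; x < 256; x++) {
        for (int y = 0; y < 256; y++) {
            long p = (long)x * y / ONE;       /* (x*y) >> 7 */
            t[(y << 8) | x] = (p >= NAN_BYTE) ? NAN_BYTE : (uint8_t)p;
        }
    }
    return t;
}
```

The (y << 8) | x indexing matches the idea of loading the two operands into the low and high bytes of a register and using the pair directly as the table offset.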
It's related. We don't need to look up (x-y)^2 though because we're already computing x^2+y^2. That means that we don't need to quarter the squares (or halve them, since it's 2*x*y we're trying to compute rather than x*y).
I think you'll find that you need a few more bits than that. One because you need to distinguish between negative and positive numbers. Another because you actually need intermediate results with a magnitude of 4 (to distinguish between points with c close to -2). So I don't think a full multiplication lookup table is practical.
I was trying to do something similar with my squares table, but it didn't give good results. I think that's because we need higher intermediate results when computing (x+y)^2.
(waiting at a different airport)
Oh, curses, I completely forgot about the sign bit...
However, I was inspired to hack up a simple program (in C) to see how hilariously bad a Mandelbrot rendered with 2.6 fixed point would look. The answer is, not that hilariously bad at all.
out.png
The grey box marks the +/- sqrt(2) boundary, because of course with 2.6 you can't square numbers bigger than that without overflow.
And here it is in 3.5, which gives a range of -4..4 and so avoids the box at the expense of a much grainier image:
out.png
I mean, it's terrible, but it's not nearly as terrible as I'd thought it was going to be, and can be done entirely using byte operations.
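The experiment looks roughly like this sketch (my reconstruction, not the original program; names, resolution and output format are made up). Note that plain C ints hide the overflow that byte arithmetic suffers when squaring values above sqrt(2), which is exactly what the grey box marks:

```c
#include <stdio.h>

/* Mandelbrot iteration in 2.6 fixed point: 1.0 == 64, so values
   span -2..2 in a signed byte's worth of range. */

#define ONE   64      /* 1.0 in 2.6 fixed point */
#define MAXIT 32

static int iterate(int cr, int ci) {
    int zr = 0, zi = 0;
    for (int i = 0; i < MAXIT; i++) {
        int zr2 = zr * zr / ONE;     /* z.r^2, back in 2.6 */
        int zi2 = zi * zi / ONE;     /* z.i^2, back in 2.6 */
        if (zr2 + zi2 > 4 * ONE)     /* |z|^2 > 4: escaped */
            return i;
        int t = zr2 - zi2 + cr;
        zi = 2 * zr * zi / ONE + ci;
        zr = t;
    }
    return MAXIT;
}

/* crude ASCII rendering, one character per point */
static void render(void) {
    for (int y = -16; y < 16; y++) {
        for (int x = -40; x < 24; x++)
            putchar(iterate(x * ONE / 16, y * ONE / 16) == MAXIT ? '#' : '.');
        putchar('\n');
    }
}
```

With only 64 steps per unit, neighbouring pixels often quantize to the same c, which is where the graininess comes from.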
Here's what mine looks like now:
This is using a multiplier of 0x600 (so, about 9.6 bits of fractional precision) and a bailout radius of sqrt(0x1c00/0x600) = 2.16. It could probably be tuned to get a bit more precision, but probably not quite 10 bits worth. There are 12 instructions per iteration giving something on the order of 39000 iterations per second, so I think that should work out at around 10 seconds or so for the whole image, without the solid-guessing or main-lake optimizations. With those added in, we could be looking at around 1 second which puts us within sight of Xaos-style real-time zooming.
Next step is to actually write the rest of the code in x86 assembly to see how it performs in the real world.
That's really nice (CGA palette notwithstanding) --- proper contours, too! Although I'll admit that the weird balls produced by the rectangular bounding bailout do have a 1970s disco feel to them... what architecture are you targeting?
I take it the 1024-colour composite CGA version will be coming out any day now. </trollface>
The image above was just generated by a simple Windows program and some CGA rendering routines that I use for prototyping such things. The real thing will run on a 4.77MHz 8088, naturally.
I hadn't thought of that, but it wouldn't be difficult to do now you mention it... The biggest problem is trying to decide which colours to use for each of the 32 iterations.
14.9 seconds - I wasn't too far out, for a back-of-the-envelope estimate!
Fractint 15.1 takes 53.2 seconds for the same coordinates and iteration count using the default guessing, or 117.9 seconds without guessing (to make it fair) or 120.8 seconds with periodicity checking also turned off (to make it fairer still). So the basic algorithm is 8.1 times faster than Fractint's. I would not have believed that was possible last week!
Fractint 16 might be faster but I can't seem to get it to run on my XT. Not sure if it's incompatible with DOS 3.3 or if it's incompatible with my method of tunnelling the disk interrupt over the keyboard port.
Here's the 14.9 second version, modified to use composite with a custom palette. It's 2236 bytes at the moment but much of that is reserved space, the pre-unrolled loop and a table of addresses for each Y coordinate. With a little bit of work it would probably fit in a boot sector. Although it's using composite it continues to treat the screen as 320x200 rather than 160x200 - this gives a slightly nicer image (it's not really correct to treat composite CGA as a normal bitmapped mode with a horizontal resolution of 160 pixels). The images are even nicer if I treat the screen as 640x200 but then it would be slower.
I've also worked out a recursive-subdivision guessing algorithm that doesn't make any mistakes with this "top level" image and should take the calculation time down to 4.3 seconds. That doesn't include the "main lake" optimizations, but perhaps 1 second was a bit on the optimistic side. To whet your appetite, here's a teaser about how it works (albeit a slightly misleading one - we only calculate the corners of these boxes, not the sides):
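For the curious, recursive-subdivision guessing generally works something like this sketch (not the actual implementation; iterate() here is a trivial stand-in for the real per-pixel calculation, and a real version would cache corner values instead of recomputing them at shared edges):

```c
/* Compute iteration counts at the four corners of a box; if they all
   agree, guess that the whole box is that colour and fill it, otherwise
   split into quadrants and recurse. */

#define W 64
#define H 64
static unsigned char image[H][W];

/* stand-in for the per-pixel Mandelbrot routine */
static int iterate(int px, int py) {
    (void)py;
    return px >= W / 2 ? 1 : 0;
}

static void subdivide(int x0, int y0, int x1, int y1) {
    int a = iterate(x0, y0), b = iterate(x1, y0);
    int c = iterate(x0, y1), d = iterate(x1, y1);
    if (a == b && b == c && c == d) {
        for (int y = y0; y <= y1; y++)          /* the guess: fill it all */
            for (int x = x0; x <= x1; x++)
                image[y][x] = (unsigned char)a;
        return;
    }
    if (x1 - x0 <= 1 && y1 - y0 <= 1) {         /* too small to split */
        image[y0][x0] = (unsigned char)a;
        image[y0][x1] = (unsigned char)b;
        image[y1][x0] = (unsigned char)c;
        image[y1][x1] = (unsigned char)d;
        return;
    }
    int xm = (x0 + x1) / 2, ym = (y0 + y1) / 2;
    subdivide(x0, y0, xm, ym);
    subdivide(xm, y0, x1, ym);
    subdivide(x0, ym, xm, y1);
    subdivide(xm, ym, x1, y1);
}
```

The failure mode is the usual one for solid guessing: a box whose four corners happen to agree while its interior doesn't gets filled incorrectly, which is why "doesn't make any mistakes with this top-level image" is a property of the image, not the algorithm.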