Image Map Image Map
Page 4 of 6 FirstFirst 123456 LastLast
Results 31 to 40 of 52

Thread: Wanted: 8086 mandelbrot generator?

  1. #31
    Join Date
    Feb 2017
    Location
    Zürich, Switzerland
    Posts
    81

    Default

    (waiting for a ferry while sipping coffee in a waterside taverna on the south coast of Crete)

    If the subdivison guessing only checks the corners of each rectangle, how does it handle pointed features which extend into a rectangle via the middle of a face without touching the corners? e.g. the big north/south points of Seahorse Valley. I thought it was necessary to trace around the bounds of each rectangle.

    (I'm trying to get the 2.6 fixed point version working on a 6502, targeting a BBC Master with 128kB of RAM; poor wifi and copious quantities of free raki aren't helping...)

  2. #32

    Default

    Quote Originally Posted by hjalfi View Post
    If the subdivison guessing only checks the corners of each rectangle, how does it handle pointed features which extend into a rectangle via the middle of a face without touching the corners? e.g. the big north/south points of Seahorse Valley. I thought it was necessary to trace around the bounds of each rectangle.
    Turns out I had a bug in the code that was checking that the guessing didn't make any mistakes - the images aren't identical after all. Seahorse valley itself isn't a problem (just because of where the block boundaries happen to land for this image) but there are a few pixels missing in the valleys between the main cardioid and the second largest circles, that you can see in the previous image. It's not too bad (and Fractint's default guessing algorithm isn't perfect either) but I would still like to try and fix it.

    I did previously make a fractal plotter (for modern PCs) which used this guessing algorithm without making any (visible) mistakes - the secret sauce there was to forbid a block from appearing adjacent to a block that is less than half of the size of the first block. That worked very nicely for that program but I'm not sure if it would translate well to an 8088 program - it requires a fairly complicated and memory-hungry data structure.

    Another possibility is to sample the mid-points of the sides of each block as well as the corners. I'll look into how much that slows it down soon.

    For now, though, I'm doing a bit of a hack: explicitly checking if each point is in the main cardioid or largest circle (using the equations for those shapes) and treating those points as a different "iteration count" (albeit mapped to the same colour) than points that were actually iterated and didn't bail out. That fixes the problem for all but 6 or 7 scattered points which aren't noticable. These checks don't seem like they're going to add any speed, though: they only save us 3637 iterations over the 12462 pixels that we actually compute, so will only be faster if I can do them in less then about 29% of a single iteration on average (which seems unlikely).

    Quote Originally Posted by hjalfi View Post
    (I'm trying to get the 2.6 fixed point version working on a 6502, targeting a BBC Master with 128kB of RAM; poor wifi and copious quantities of free raki aren't helping...)
    Sounds like a very fun holiday!

  3. #33
    Join Date
    Feb 2017
    Location
    Zürich, Switzerland
    Posts
    81

    Default

    After spending time on mountains thinking about recursion, I got home and knocked out this.

    https://bbc.godbolt.org/?model=Maste...l.ssd&autoboot

    The writeup is here: http://cowlark.com/2018-05-21-bbc-micro-mandelbrot/

    That's running on a 2MHz 65C02 with 128kB RAM. It's about 90x180 (in hindsight, I should have gone for high-res mode and put up with only having three colours). It takes twelve seconds to render so that's 1350 pixels per second or 1500 cycles per pixel.

    I'm tracing all the way around each subdivision, because I can use video memory as a very cheap cache and avoid any kind of complex data structure. (My video memory is exactly the same speed as system memory.)

    I don't think I'll try the 16-bit version --- 16-bit arithmetic on a 6502 is a thoroughly miserable experience...

    Edit: Urgh, 128kB, not 12MB. The latter is much less interesting.
    Last edited by hjalfi; May 22nd, 2018 at 12:43 AM.

  4. #34

    Default

    Very nicely done! I think it would be interesting to try the 16-bit "table of squares" algorithm on the same hardware. It might not be as slow as you imagine: the bottleneck on the 8088 is fetching the (instruction and data) bytes from memory, which it can do at a rate of about 1.13MHz. The 6502 has essentially the same bottleneck, but (in a BBC) can fetch bytes at 2MHz if I understand correctly. So what you lack in registers and 16-bit arithmetic you may be able to make up for with raw speed.

    Here's the 8088 code I'm using for my version. Registers are:
    si = x (real part of z)
    bx = y (imaginary part of z)
    sp = 0x1c00 (bailout radius). We set up SS so that interrupts can still occur and the stack functions as normal, but it's slightly faster to be able to use a register than an immediate constant.
    ds:0 = points to the table of squares divided by 0x600 (32768 2-byte entries, all with the least significant bit 0).
    cx = number of iterations remaining (the iteration loop is unrolled but we still need to know where we got to when we bail out).
    es = a (real part of c)
    dx = b (imaginary part of c)

    Code:
      mov di,[si]  ; x*x
      mov bp,[bx]  ; y*y
      lea ax,[di+bp] ; x*x+y*y
      cmp ax,sp
      jae escaped
      dec cx
      mov bx,[si+bx] ; (x+y)*(x+y)
      sub bx,ax  ; 2*x*y
      add bx,dx  ; 2*x*y+b -> new y
      mov si,es
      add si,di
      sub si,bp  ; x*x-y*y+a -> new x
    So that's seven 16-bit additions/subtractions (including the comparison) and three 16-bit table lookups. This takes about 130.3 cycles per iteration on average. So if you can do it in 54 cycles or less on a 2MHz machine, you'll be beating the 8088 at 4.77MHz. Sound doable?


    I'm still working on my subdivision code (getting there). And keeping track of iteration counts in a quadtree is starting to look more practical, not just for avoiding incorrect guessing but also for real-time zooming. I've also been thinking about doubling the precision (by using a table of squares that is 16384 entries of 4 bytes instead of 32768 entries of 2 bytes, and doing 12 lookups per iteration instead of 3) so that we can zoom in further.

  5. #35
    Join Date
    Feb 2017
    Location
    Zürich, Switzerland
    Posts
    81

    Default

    I actually looked into additions a while back, for Cowgol (my self-hosted Ada-like language for 8-bit machines, plug plug). A simple 3op 16-bit addition of values in zero page is 20 cycles (clc; lda left+0; adc right+0; sta dest+0; lda left+1; adc right+1; sta dest+1), so seven adds makes 140 for just the additions. It should be possible to cheat with the comparison by ignoring the low byte, which makes it a simple lda x2_plus_y2+1; cmp #0x1c; bcs escaped, which is only seven cycles.

    The lookups are going to particularly expensive, sadly. The 6502 only supports 8-bit indexing so I'd need to assemble a pointer and dereference it, plus code to page in the appropriate RAM bank. (Also, how does your above get away without scaling the offsets into the lookup table, if the entries are words?)

    The 6502 hates 16-bit arithmetic, sadly. (I did a writeup. http://cowlark.com/2018-02-27-6502-arithmetic/)

    OTOH, I now feel that by using an 8-bit table of squares I should be able to drastically improve my current version. The table will be only 128 bytes. This would not only avoid the delay as it precalculates the multiplication table but would also avoid the overhead of having to map the appropriate memory bank and calculate the address, which is significant (see https://github.com/davidgiven/bogoma...andel.asm#L601).

    Code:
    ldx zr
    lda table, x
    sta zr2
    ldx zi
    lda table, x
    sta zi2
    clc
    adc zr2 ; A = zr^2 + zi^2
    sta zr_zi_2
    bvs escaped ; using overflow bit
    
    dec iterations ; Y is unused elsewhere, so that'd be a good candidate here --- only saves 3 cycles per iteration though
    
    clc
    lda zr
    adc zi ; A = zi+zr
    tax
    lda table, x ; A = (zi+zr)^2
    sec
    sbc zr_zi_2 ; A = 2*zi*zr
    clc
    adc ci
    sta zi
    
    clc
    lda zr2
    adc cr
    sec
    sbc zi2 ; A = zr^2 - zi^2 + cr
    sta zr
    ...or something.

  6. #36

    Default

    Quote Originally Posted by hjalfi View Post
    Also, how does your above get away without scaling the offsets into the lookup table, if the entries are words?
    By making sure that the values in the lookup table (and the values of x, y, a and b) are in the right format for using as lookup table offsets - i.e. with the least significant bit zeroed. Since all we do with these values is add them, subtract them, and look them up in the table, and all these operations preserve the "LSB 0" property, everything is automatically in the right format.

  7. #37
    Join Date
    Feb 2017
    Location
    Zürich, Switzerland
    Posts
    81

    Default

    Oh, yeah, of course --- the only other place numbers can come from is the initial coordinates, and your scaling factor will ensure the LSB is 0 there, too... that's very neat.

    Converting my code to use your table-of-squares algorithm works fine, and reduces the time from 12 seconds to 4.5. Which is nice. However... this has completely changed the bailout behaviour. I can't represent a 4, so I'm bailing out whenever a square overflows, which leads to this:

    grab.jpg

    ...which isn't very satisfactory. The set itself is about the right shape, although there's something a bit weird at Scepter Valley.

    The most visually pleasing image I've come up with is this:

    grab.jpg

    ...which is badly distorted but still kind of Mandelbrotish. This is done by pick and choosing where to do overflow detection and just letting everything else wrap round, which is totally cheating.

    I bet this is happening because I'm running out of bits when calculating (x+y)*(x+y) before subtracting x*x+y*y...

  8. #38

    Default

    I had a go at writing a 6502 version with some more precision. I guess 54 cycles was a bit too optimistic, but I think 131 is possible by ignoring bank switching and using a 16kB table.

    Code:
      ldy #$01         ; y = 1 so that we can use indirect indexed for high bytes
    loopTop:
      clc                                                                        1 2
      lda (zr)         ; a = low(zr^2)                                           2 5
      tax              ; x = low(zr^2)                                           1 2
      adc (zi)         ; a = low(zr^2) + low(zi^2) = low(zr^2 + zi^2)            2 5
      sta zr_zi_2_0+1  ; (1)                                                     2 3
      lda (zr),y       ; a = high(zr^2)                                          2 5
      adc (zi),y       ; a = high(zr^2) + high(zi^2) + c = high(zr^2 + zi^2)     2 5
      cmp #$07         ; test bailout                                            2 2
      bvs escaped      ;                                                         2 2
      sta zr_zi_2_1+1  ; (2)                                                     2 3
    
      clc              ;                                                         1 2
      lda zr           ; a = low(zr)                                             2 3
      adc zi           ; a = low(zr) + low(zi) = low(zr+zi)                      2 3
      sta zr_zi        ;                                                         2 3
      lda zr+1         ; a = high(zr)                                            2 3
      adc zi+1         ; a = high(zr) + high(zi) + cf = high(zr+zi)              2 3
      and #$3f                                                                   2 2
      ora #$80         ; Fix up high byte of table address                       2 2
      sta zr_zi+1      ;                                                         2 3
    
      sec              ;                                                         1 2
      txa              ; a = low(zr^2)                                           1 2
      sbc (zi)         ; a = low(zr^2) - low(zi^2) = low(zr^2 - zi^2)            2 5
      adc #$99         ; a = low(zr^2 - zi^2) + low(cr) = low(zr^2 - zi^2 + cr)  2 2
      tax              ; not ready to overwrite zr yet                           1 2
    
      lda (zr),y       ; a = high(zr^2)                                          2 5
      sbc (zi),y       ; a = high(zr^2) - high(zi^2) = high(zr^2 - zi^2)         2 5
      adc #$99         ; high(cr) (+1 for sbc correction)                        2 2
      and #$3f                                                                   2 2
      ora #$80         ; Fix up high byte of table address                       2 2
      sta zr+1         ; new high(zr)                                            2 3
      stx zr           ; new low(zr)                                             2 3
    
      sec              ;                                                         1 2
      lda (zr_zi)      ; a = low((zr+zi)^2)                                      2 5
    zr_zi_2_0:
      sbc #$99         ; immediate value overwritten by (1)                      2 2
      adc #$99         ; low(ci)                                                 2 2
      sta zi           ; new low(zi)                                             2 3
    
      lda (zr_zi),y    ; a = high((zr+zi)^2)                                     2 5
    zr_zi_2_1:
      sbc #$99         ; immediate value overwritten by (2)                      2 2
      adc #$99         ; high(ci) (+1 for sbc correction)                        2 2
      and #$3f                                                                   2 2
      ora #$80         ; Fix up high byte of table address                       2 2
      sta zi+1         ; new high(zi)                                            2 3
    
      dec iterations   ;                                                         2 5
      bpl loopTop      ;                                                         2 3
    This is untested, so there are probably bugs. The code itself (83 bytes) and the variables (7 bytes) must all be in the zero page for best speed. This is the first time I've tried to write any 6502 assembly code, so I expect someone more experienced could shave off a few more cycles.

    I think that with the "corners only" guessing algorithm the 8088 version might be able to get down to 4 seconds, which suggests that at the same resolution (320x201) this 6502 version could get to 9.6 seconds, or 4.8 seconds for 160x201.

  9. #39
    Join Date
    Feb 2017
    Location
    Zürich, Switzerland
    Posts
    81

    Default

    So this is using 14-bit 5.9 fixed point, with bits 15 and 16 set to %10 so that numbers are also addresses to their own squares? And the low bit is always 0 as before, so they don't need scaling? That's very neat --- yes, it should totally work. It avoids bank switching *and* avoids lots of expensive adds when doing the table lookups. Also I see self-modifying code, always a good sign!

    (It's interesting to note that in this code the position of the sign bit and the decimal point's completely irrelevant, except for the bailout test.)

    As the number representation can only store values up to 32, then squaring anything bigger than sqrt(32) = ~5.7 will always lead to signed overflow. Most of the middle of the table is wasted, therefore, simply containing Inf (about 31.995)... it'd be nice to avoid that, but it's too late at night for me to see how.

    I have a table generator working; this should be easier to bolt onto my current code. I'll be really interested to see how well it works.

  10. #40

    Default

    Quote Originally Posted by hjalfi View Post
    (It's interesting to note that in this code the position of the sign bit and the decimal point's completely irrelevant, except for the bailout test.)
    Yes, this is why (in the 8088 version at least) I can get away with a non-integer number of fractional bits (i.e. a representation of "1" - 0x600 - which isn't a power of 2). For the 6502 version, it might be desirable to use a power of 2 anyway to speed up populating the table.

    Quote Originally Posted by hjalfi View Post
    As the number representation can only store values up to 32, then squaring anything bigger than sqrt(32) = ~5.7 will always lead to signed overflow. Most of the middle of the table is wasted, therefore, simply containing Inf (about 31.995)...
    For the table positions where the square is too large to represent, just the low bits should be stored. That way, if the result of (zr+zi)^2 - (zr^2 + zi^2) + ci doesn't overflow then it will be calculated correctly, even if some of the intermediate results overflow.

    Since we can calculate larger values of zr^2 + zi^2 than we can look up in the table, we might be able to get one or two bits of extra precision by sacrificing range. I'll have a play about with this.

    Quote Originally Posted by hjalfi View Post
    I have a table generator working; this should be easier to bolt onto my current code. I'll be really interested to see how well it works.
    Feel free to have a go at the bolting if you'd like - I'm not sure when I'll have time to try it myself.

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •