Kulor's Guide to Mode 7 Perspective Planes

none
Posts: 117
Joined: Thu Sep 03, 2020 1:09 am

Re: Kulor's Guide to Mode 7 Perspective Planes

Post by none »

rainwarrior wrote: Sun Aug 14, 2022 8:58 pm That's 1x, 2x and 4x side by side. 1x is 130% CPU load per scanline, 2x is 90%, 4x is 70%.
I've taken a look at your code, and it's funny that you ran into some of the same issues I had.

I also need to bit shift a lot with the reciprocal table, which makes it kind of pointless: per-component division (doing the division in hardware 4 times) is actually faster because it doesn't require all the bit shifting. The only thing I can imagine giving a really big performance boost there is a multiplication LUT with the bit shifting pre-baked. It would also do away with the sign correction issue. But that would require a lot of ROM, of course.

You can avoid the bit shifts for the A and C multiplies, though. Precision is good enough with A and C prescaled to the correct range, but with B and D it's not easy. I'd try saving one or two shifts per component by choosing the radix for cos(a)/sin(a) differently; for example, -128..127 instead of -256..256 should be decent enough and would also halve your sine table size.

Apart from that, there's a number of small things you can do to improve performance.

you can move the sign checks out of your inner loop. basically what i said in my last post, but with your implementation, it is easier (you do not need to split the screen into regions for that because your numerators do not change). you can make your inner loop a macro and make a few different implementations for each combination of signs in your scale 0..3 values, then choose one via jump table.
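To illustrate the idea of choosing a loop variant once instead of testing signs on every scanline, here is a rough JavaScript sketch. It only uses two scale values (so four variants) to keep it short, and the names `variants` and `buildTable` are made up for the example; they are not from rainwarrior's code.

```javascript
// One loop body per sign combination. Each body works on magnitudes only,
// so no sign test happens inside the loop. (Hypothetical two-value version.)
const variants = [
  (z, a, c) => [z * a, z * c],       // a >= 0, c >= 0
  (z, a, c) => [-(z * a), z * c],    // a <  0, c >= 0
  (z, a, c) => [z * a, -(z * c)],    // a >= 0, c <  0
  (z, a, c) => [-(z * a), -(z * c)], // a <  0, c <  0
];

function buildTable(zs, a, c) {
  // The "jump table" lookup: done once per frame, outside the loop.
  const index = (a < 0 ? 1 : 0) | (c < 0 ? 2 : 0);
  const body = variants[index];
  const magA = Math.abs(a);
  const magC = Math.abs(c);
  const out = [];
  for (const z of zs) {
    out.push(body(z, magA, magC));
  }
  return out;
}

console.log(buildTable([1, 2], -3, 4));
```

In assembly the equivalent would be a macro expanded once per sign combination, with the variant picked through a jump table before entering the loop.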

if you do not want to do that for some reason (because of code size perhaps), you can still move your "lsr z:temp+4" line to the place where you have the no ops now

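Code:

	lsr z:temp+4        ; shift the next sign bit out of temp+4 into carry
	bcs :+              ; carry set: take the negating path
	lda f:$004216       ; read the hardware multiply result (RDMPYL/RDMPYH)
	lsr
	lsr
	lsr
	lsr
	bra :++
:
	lda f:$004216       ; same read and shifts, then negate
	lsr
	lsr
	lsr
	lsr
	eor #$ffff          ; two's complement negate
	ina
: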


also maybe

Code:

	lda temp + 4
	and #1

is faster than reloading temp + 4 after each iteration, but I didn't count cycles.

another thing is, and I don't remember the exact reason (maybe a LoROM vs HiROM issue?), but I don't think using far addressing is actually necessary. if you could use absolute addressing, or even move the zero page to somewhere where it can see the hardware registers, that could also save a few cycles.

if you made pv_zr a 24 bit variable and used 24+16 bit addition with carry (bcs + ina) for the interpolation, the bit shifts could be prebaked into pv_zr, maybe saving a few cycles over 4 times lsr. or, a small saving can be gained by not reloading pv_zr for the interpolation: just do sbc pv_zr_inc once just before the loop, then you can do the adc + sta first, and then transfer to x.

another tiny saving, if you build the entire table in reverse (start with the last scanline and work towards the first) you can use y for the loop condition and don't need the additional line counter in temp + 2.

there's one other final thing I've thought of that can maybe work. It might sound ridiculous, but I've actually tried it out and it looks quite promising.

instead of 1/z LUT, you build 2 LUTs, each with ~1024 entries at least for decent accuracy, but more are better. The 2 tables you need are
  • ln(x)
  • e^x
With logarithms, multiplication and division become addition and subtraction.

This means that if you precalculate ln(pv_scale) instead of pv_scale, and look up ln(z) on each scanline, you can find the matrix parameters basically by computing e^(ln(pv_scale) - ln(z)). That requires only a subtraction and a table lookup per matrix component, without a giant multiplication LUT, and it scales nicely so that thousands of bit shifts are not required.

It doesn't work as well for me because I need to interpolate the numerators so I'd need 9 table lookups in total per scanline, but with your implementation, it could work much better because 4 of them can be precalculated.
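As a stripped-down, standalone illustration of the two-table idea (separate from the full demonstration), the sketch below divides two small positive integers with one subtraction and one lookup. The table sizes, the log scale of 128 and the 8.8 output format are assumptions picked for the example, not values from either implementation.

```javascript
// Assumed parameters: inputs are integers 1..1023, logs are stored scaled
// by 128, and the quotient comes back as 8.8 fixed point (256 == 1.0).
const SCALE = 128;

// ln table: LN[x] = round(ln(x) * SCALE). Entry 0 is unused (left at 0).
const LN = new Int16Array(1024);
for (let x = 1; x < 1024; x++) {
  LN[x] = Math.round(Math.log(x) * SCALE);
}
const MAX_LN = LN[1023];

// e^x table, indexed by (difference + MAX_LN) so the index is never negative.
// Results too large for 16 bits saturate at 65535.
const EXP = new Uint16Array(2 * MAX_LN + 1);
for (let d = -MAX_LN; d <= MAX_LN; d++) {
  EXP[d + MAX_LN] = Math.min(65535, Math.round(Math.exp(d / SCALE) * 256));
}

// a / b for a, b in 1..1023, using only a subtraction and a lookup.
function logDivide(a, b) {
  return EXP[LN[a] - LN[b] + MAX_LN];
}

// Signed numerator: the logarithm only exists for positive inputs, so
// strip the sign first and put it back afterwards.
function logDivideSigned(a, b) {
  if (a === 0) return 0;
  return a < 0 ? -logDivide(-a, b) : logDivide(a, b);
}

console.log(logDivide(600, 200) / 256); // close to 3, not exact
```

The result is approximate because both logarithms are rounded before subtracting; a larger `SCALE` (and a correspondingly larger e^x table) reduces the error.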

demonstration:

Code:

var radix = 128;  // for signed 16 bit numerator
var radix2 = 128; // for unsigned 8 bit denominator

function shift(a, b){return(b > 0 ? floor(a >> b) : floor(a << -b))}

function cos(a) {return(Math.cos(a))}
function sin(a) {return(Math.sin(a))}
function floor(a) {return(Math.floor(a))}
function highbyte(a) {return(floor(a / 256))}
function high10bit(a) {return(floor(a / 16))}
function clamp(a) {return floor(a<0?0:a>255?255:a)}
function clamps(a) {return floor(a<-32768 ?-32768:a>32767 ?32767 :a)}
function clampu(a) {return floor(a<0 ?0:a>65535 ?65535 :a)}

function clamp2(a) {return floor(a<0 ?0:a>1023 ?1023 :a)}


function lut_ln(a) {
  return clamp2(Math.log(high10bit(a) * 64) * 64)
}

function lut_ln2(a) {
  return clamp2(Math.log(high10bit(a) / 4) * 64)
}

function lut_epower(a) {
  return clamp2(Math.E ** (a / 64));
}

function lndivide2(a, b) { 
  return lut_epower(lut_ln(a) - lut_ln2(b)); }

function lndivide(a, b) { 
  if(a < 0) return -lndivide2(-a, b); else return lndivide2(a, b); 
}

// setup

var FOV = 90;
var forward = 128 / Math.tan(FOV * (Math.PI * 2 / 360) / 2);

var yaw = (var1 + framecount * 0.1) * Math.PI / 180;
var pitch = var2 * Math.PI / 180;

var camera_x = framecount;
var camera_y = 0;
var camera_z = var3;


// mode 7 stuff

// constant across the frame

var dx = cos(yaw) * camera_z;
var dy = sin(yaw) * camera_z;

var ax = forward * -sin(yaw) * cos(pitch);
var ay = forward * cos(yaw) * cos(pitch);
var az = forward * sin(pitch);

var bx = sin(yaw) * sin(pitch);
var by = -cos(yaw) * sin(pitch);
var bz = cos(pitch);

// scale values for subpixel precision
// and simulate fixed point math

dx = clamps(dx * radix); dy = clamps(dy * radix);
ax = clamps(ax * radix); ay = clamps(ay * radix); az = clampu(az * radix2);
bx = clamps(bx * radix); by = clamps(by * radix); bz = clampu(bz * radix2);

camera_x = floor(camera_x); camera_y = floor(camera_y);

// per scanline

var cx = clamps(ax + (scanline - 112) * bx);
var cy = clamps(ay + (scanline - 112) * by);
var cz = clampu(az + (scanline - 112) * bz);


camera_x = floor(camera_x * 16 / camera_z);
camera_y = floor(camera_y * 16 / camera_z);

var point_center_x = camera_x + floor(lndivide(cx, cz)) * (radix2 / radix);
var point_center_y = camera_y + floor(lndivide(cy, cz)) * (radix2 / radix);

var offset_x = floor(lndivide(dx, cz)) * (radix2 / radix);
var offset_y = floor(lndivide(dy, cz)) * (radix2 / radix);

m7a = offset_x;
m7b = point_center_x;
m7c = offset_y;
m7d = point_center_y;
m7x = 0;
m7y = 0;
m7hofs = -128;
m7vofs = -camera_z - scanline;

return [m7a, m7b, m7c, m7d, m7x, m7y, m7hofs, m7vofs];
rainwarrior
Posts: 8732
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada

Re: Kulor's Guide to Mode 7 Perspective Planes

Post by rainwarrior »

You're commenting on the "first pass" of my code. I was trying to get it working correctly first, optimize later. Some of these things are already changed (current code), but I appreciate the suggestions nonetheless.
none wrote: Mon Aug 15, 2022 6:04 pm You can avoid the bit shifts for the A and C multiplies though. precision is good enough with A and C prescaled to the correct range. but with B and D it's not easy. I'd try saving 1 or two per component by choosing radix for cos(a)/sin(a) differently for example (-128..127 instead of -256..256 should be decent enough and would also save half of your sine table size).
So, to explain: I have Z * A/B/C/D as an 8x8=16 multiply. Z is 2.6, A is 1.7, after multiplying I need to LSR the 3.13 result 5 times to get 3.8 to have the correct scale for M7A.
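The format bookkeeping above can be checked with a few lines of JavaScript. This is only a sketch of the arithmetic: it assumes Z and A are unsigned magnitudes (signs handled elsewhere), and `toFix` and `mulZA` are names invented for the example.

```javascript
// Fixed-point formats are written I.F (integer bits . fraction bits).

// Convert a real number to fixed point with `frac` fraction bits.
function toFix(value, frac) {
  return Math.floor(value * (1 << frac));
}

// 8x8=16 multiply: 2.6 times 1.7 gives 3.13 (6 + 7 = 13 fraction bits).
// Five right shifts leave 8 fraction bits, i.e. 3.8.
function mulZA(z26, a17) {
  const product = (z26 & 0xff) * (a17 & 0xff); // 3.13, fits in 16 bits
  return product >> 5;                          // 3.8
}

const z = toFix(1.5, 6);   // 96 in 2.6
const a = toFix(0.75, 7);  // 96 in 1.7
const m7a = mulZA(z, a);   // 288 in 3.8
console.log(m7a / 256);    // 1.125, i.e. 1.5 * 0.75
```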

I tried A of 2.6, reducing the precision by 1 bit, and eliminating 1 LSR. I looked at the result... for a still frame it's fine, but spinning around in rotation it's noticeably "bumpier" as we spin. I think the accuracy of the sin/cos drops off dramatically after that.

An alternative, instead of 5 x LSR, I could do 2 hardware multiplies as a combined 8x16=16. With 16-bit A of 1.10 (top 5 bits = 0), the result of the multiply would already be 3.8, no shifting needed. This would be slightly slower than 5 x LSR but might permit extra accuracy? Not sure how much that accuracy could really help, though. Final Fantasy VI does 8x16 in a manner like this.
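A sketch of how two 8x8 multiplies can combine into an 8x16 product that needs no shifting beyond dropping the low byte. This shows the arithmetic only; it is not Final Fantasy VI's routine, and `mul8x8` and `mulZA16` are hypothetical names. Z is assumed to be unsigned 2.6 and A an unsigned 1.10 magnitude.

```javascript
// Stand-in for one hardware 8x8=16 multiply.
function mul8x8(a, b) {
  return (a & 0xff) * (b & 0xff);
}

// Z (2.6) times A (1.10) has 16 fraction bits. Keeping bits 8..23 of the
// 24-bit product leaves 8 fraction bits, which is already 3.8.
function mulZA16(z26, a110) {
  const lo = mul8x8(z26, a110 & 0xff);        // Z * low byte of A
  const hi = mul8x8(z26, (a110 >> 8) & 0xff); // Z * high byte of A
  // Full product is hi * 256 + lo, so dropping the low byte is hi + (lo >> 8).
  return (hi + (lo >> 8)) & 0xffff;
}

const z = Math.floor(1.5 * 64);     // 96 in 2.6
const a = Math.floor(0.75 * 1024);  // 768 in 1.10
console.log(mulZA16(z, a) / 256);   // 1.125
```

Because only the discarded low byte of `lo` is dropped, the result matches a full 24-bit multiply followed by an 8-bit right shift.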
none wrote: Mon Aug 15, 2022 6:04 pm you can move the sign checks out of your inner loop
Yes. I'm not going to do this in my example for the sake of code clarity. In a shipped game, I'd definitely consider having 4 copies of the loop for the permutations of negations. (FF6 and F-Zero both do this.) As-is, it's fast enough for my demonstration, but I'll leave a comment about it.

Edit: Hang on... with the 8x16 idea maybe the multiply could also handle the negate! I'm surprised that FF6 doesn't do this...?? I'll have to investigate whether it's a speed improvement with that consideration. Edit 2: no, never mind, that doesn't account for the sign-extend.
none wrote: Mon Aug 15, 2022 6:04 pm you can still move your "lsr z:temp+4" line to the place where you have the no ops now
The nops were just because it was the first pass of the code. I've since filled them in.
none wrote: Mon Aug 15, 2022 6:04 pm another thing is, and I don't remember the exact reason (maybe a LoROM vs HiROM issue?), but I don't think using far addressing is actually necessary and if you could use absolute addressing, or even move the zero page to somewhere where it can see the hardware registers, that could also save a few cycles.
It's sort of related to LoROM vs HiROM. I want this example code to be able to run from either, and I didn't want to have to be picky about the arrays existing in LoRAM either.

In my second pass, what I did was set DB=0 and use long Y indexing for the array stores. The hardware accesses (x9) get faster, the array accesses (x4) get slower, but overall it's a win. The other way this helps: you can't STX/STY to a far address, but you can to an abs address, which is the next point below...

Requiring the tables in LoRAM and making sure the code runs from a LoROM memory area would save a handful of cycles with abs array stores instead of long-Y.
none wrote: Mon Aug 15, 2022 6:04 pm if you build the entire table in reverse (start with the last scanline and work towards the first) you can use y for the loop condition and don't need the additional line counter in temp + 2.
Once I was filling in the nops while waiting for hardware results, using X to write the hardware registers was pretty useful, so I no longer have a free register for countdown.
none wrote: Mon Aug 15, 2022 6:04 pm With logarithms, multiplication and division become addition and subtraction.
I've seen log tables for multiplication in some applications, e.g. Yamaha FM chips... (Though mostly I'm just familiar with it from the log table books and slide rules people used to use before calculators.) It might be an interesting research route, but I don't think I will try it for this project. Would love to hear if you find it useful.

Re: Kulor's Guide to Mode 7 Perspective Planes

Post by rainwarrior »

So my own Mode 7 example demo is finished, now. It shows a few other common styles of mode 7 effect in addition to the perspective plane. Open source, because the point was to make an example people could borrow or build on.

Source and ROM: dizworld
dizworld.png

If you want to make general comments on the dizworld demo, please use the thread about this demo, so that we don't clutter the current thread with unrelated stuff. Though if a comment is still on the topic of perspective planes, it might be fine to continue discussion here.

Re: Kulor's Guide to Mode 7 Perspective Planes

Post by rainwarrior »

I felt like seeing what it might look like with better precision so I modified my demo as an experiment: branch w/ROM here
dizworld_16bit_comparison.png
On the left, my current method, on the right, the experiment.

In my current version the "1/1/Z * A" step is an 8-bit x 8-bit calculation. In the experiment this is 16-bit x 16-bit. There is a noticeable difference, though it isn't huge. The existing "ripple" was tolerable, but it does smooth away significantly. It also allows a significantly steeper angle of tilt, as my 8-bit implementation had to clamp at 2:1 vertical to horizontal scale. (The "X" demo looks a bit different as a result, since it was being clamped.)

The "1/1/Z" table doubles in size to 8k, of course... I haven't tried increasing its lookup precision past 12-bit but maybe with the 16-bit increase adding another bit or 2 might give slightly more improvement?

The new version does not run at 60fps, though. I haven't bothered to optimize it, but even if I did it probably wouldn't be practical above 30fps... which might be a fine concession to make, depending on the game.

For the purposes of comparing them, it helps to turn on overclocking. Mostly I wanted to answer for myself a "what if" about what it could look like if we took the precision limits off. I'm pretty happy with the 8x8 version, but I'm happier to know how much I'm sacrificing (or not) for that efficiency.