Well, I've been trying to find time to actually work out an optimized multi-layer decompressing wall texturing loop, but whenever I get some free time my brain is fried. I might as well make some noise here in the interim...
ehaliewicz wrote: Mon May 02, 2022 8:44 pm
How are you planning on achieving good performance if you need to decompress textures, when RAM is too slow to draw textures from?
Because while writes to RAM are one- or two-cycle fire-and-forget instructions, a single-byte RAM read is a two-byte instruction with (if I understand the documentation correctly) a dead cycle and a mandatory 5-cycle wait state, consuming 8 cycles during which the GSU's processor core can do nothing else. Unless the RAM buffer was already busy with a write, in which case it will take even longer. ROM reads, by contrast, require you to set R14, do something else for at least four cycles to avoid a wait state, and read the byte, for a total compute time cost of as little as one cycle (if you don't count setting up the address, which you also have to do for RAM reads on top of the cost of the actual instruction).
This is all assuming you're running in the I-cache. If you're loading code from ROM or RAM, that complicates matters, but mostly it slows down code (but not memory accesses) by a factor of 5. All numbers assume 21.4 MHz mode with no clock trickery (a single-byte RAM read is 6 cycles in 10.7 MHz mode, only one cycle faster than a word read).
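To put those figures side by side, here is a minimal Python sketch of the cost model. The constant and function names are made up for illustration, and treating any shortfall in filler work as a one-for-one stall is a simplifying assumption, not something taken from the GSU documentation.

```python
# Cycle figures as quoted above: 21.4 MHz mode, code running from the I-cache.
RAM_BYTE_READ_CYCLES = 8   # ldb: the core can do nothing else for the whole access
ROM_BUFFER_LATENCY = 4     # filler cycles needed between setting R14 and the read
ROM_READ_CYCLES = 1        # getb once the ROM buffer is ready


def rom_read_cost(filler_cycles):
    """Cycles charged to one ROM read.

    filler_cycles is the amount of useful work placed between setting R14
    and the getb. Assumption: a shortfall stalls the core cycle for cycle.
    """
    stall = max(0, ROM_BUFFER_LATENCY - filler_cycles)
    return ROM_READ_CYCLES + stall


def ram_read_cost():
    """Cycles charged to one single-byte RAM read (no pending write)."""
    return RAM_BYTE_READ_CYCLES
```

With four or more cycles of other work in the gap, `rom_read_cost` returns 1 against 8 for the RAM read, which is the gap the texturing loops below are built around.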
Also, the bigger issue (I think) is that RAM is too small to draw textures from. Remember, the standard texture size in Doom is 128x128. You'd be constantly decompressing new textures at full resolution, and I doubt it would be a net win even if the actual drawing loop ended up radically faster.
...
The Super FX can render colormapped opaque wall pixels to a column-major linear bitmap from an uncompressed texture in ROM at 12 cycles per pixel:
Code:
1 with R8
1 add Rtexstep ; update texcoord
1 getb ; obtain pixel colour
1 to R14
1 merge ; request raw texel from ROM
1 stw Rpixptr ; draw pixel to framebuffer
1 from Rcmapoff
1 to R14
2 getbl ; combine colormap offset and light level with raw texel to look up pixel colour
1 loop ; decrement pixel counter and branch
1 inc Rpixptr ; increment framebuffer pointer
Obviously this requires an appropriate colormap to be stored in-bank. Also, since we've saved a cycle by using stw instead of stb for the pixel write, this loop can only be used with walls at least two pixels high, since there needs to be a once-through tail after the loop that uses stb so as to not overwrite the pixel underneath the one it's supposed to render... I suppose there could be a branch before the loop that skips the loop body if the pixel counter is 1...
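For anyone who finds the pipelined assembly hard to follow, here is a plain Python reference model of what one wall column works out to: step a texture coordinate, fetch the raw texel, run it through the colormap for the current light level, and write the result to the next byte of a column-major bitmap. The function name, the 8.8 fixed-point coordinate and the list-of-tables colormap layout are assumptions for illustration. It models the result only, not the register use or the stw/stb tail.

```python
def draw_wall_column(framebuffer, pix_ptr, count, texture_column,
                     colormap, light_level, texcoord, tex_step):
    """Draw `count` colormapped texels down one framebuffer column.

    framebuffer    -- bytearray, column-major, so the pixel below is at +1
    texture_column -- bytes holding one column of raw texels
    colormap       -- colormap[light_level][texel] gives the final colour
    texcoord       -- 8.8 fixed point (assumed format); high byte is the texel row
    tex_step       -- 8.8 fixed-point increment per screen pixel
    Returns the framebuffer pointer after the last pixel written.
    """
    for _ in range(count):
        texel = texture_column[(texcoord >> 8) % len(texture_column)]
        framebuffer[pix_ptr] = colormap[light_level][texel]
        pix_ptr += 1                               # next pixel down the column
        texcoord = (texcoord + tex_step) & 0xFFFF  # wrap like a 16-bit register
    return pix_ptr
```

A quick check with a 128-texel column and a step of two texels per pixel:

```python
fb = bytearray(8)
tex = bytes(range(128))
cmap = [bytes((i + 1) & 0xFF for i in range(256))]  # one dummy light level
end = draw_wall_column(fb, 2, 3, tex, cmap, 0, 0, 0x0200)
```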
Reading the texel from RAM instead causes the same routine to take 19 cycles (and use an additional register):
Code:
1 with R8
1 add Rtexstep ; update texcoord
1 getb ; obtain pixel colour
1 to Rtexptr
1 merge ; compose texel RAM address
1 to R14
8 ldb Rtexptr ; obtain raw texel from RAM
1 stw Rpixptr ; draw pixel to framebuffer
1 with R14
1 add Rcmapoff ; combine colormap offset and light level with raw texel to look up pixel colour
1 loop ; decrement pixel counter and branch
1 inc Rpixptr ; increment framebuffer pointer
If the cached texture and the framebuffer are in different RAM banks, the loop goes up to 25 cycles and uses two more registers, because ramb is two cycles and needs to be set up with from Rn for a total of three cycles and a register each way (I counted this wrong in my previous post). It might be possible to optimize this a bit by grouping accesses, but the loop would get considerably more complicated, and as I said my brain is fried.
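The three totals can be checked by adding up the per-instruction counts from the listings. This is only a tally sketch: the variable names are invented, and 21.4 MHz is taken as a round 21,400,000 Hz for the throughput figure.

```python
GSU_CLOCK_HZ = 21_400_000  # nominal 21.4 MHz mode (rounded, assumption)

# Per-instruction cycle counts, copied from the two listings in order.
ROM_LOOP = [1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1]
RAM_LOOP = [1, 1, 1, 1, 1, 1, 8, 1, 1, 1, 1, 1]

# Switching RAM bank: "from Rn" (1 cycle) followed by "ramb" (2 cycles).
BANK_SWITCH = 1 + 2

rom_cycles = sum(ROM_LOOP)                         # 12
ram_cycles = sum(RAM_LOOP)                         # 19
cross_bank_cycles = ram_cycles + 2 * BANK_SWITCH   # 25: switch there and back


def pixels_per_second(cycles_per_pixel):
    """Upper bound on pixel rate if the core did nothing but run the loop."""
    return GSU_CLOCK_HZ // cycles_per_pixel
```

These are ceilings for the inner loop alone; per-column setup and everything else the renderer does come out of the same cycle budget.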
...
Note that I am not using plot for this. The Super FX and its RAM setup were designed for horizontal rasterization. Unless I've been labouring under a grave misapprehension, drawing vertical columns with plot would bottleneck at 80 cycles per pixel in 21.4 MHz mode. This is because SNES CHR format forces the plotting circuitry to read and then write an entire 8-pixel sliver, at 5 cycles per byte both ways, for every incompletely-plotted sliver. DOOM-FX appears to suffer from this.
If you're doing a lot of column drawing, I figure it's a lot more efficient to just draw to a bytemap, and use plot to convert it to CHR in bulk later. This should add about 11 cycles per copied pixel if you manage the RAM buffer well, which is why it may be advantageous to use Mode 7 for part or all of the display so as to skip the conversion step for at least part of the framebuffer.
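The arithmetic behind the two approaches, as a small sketch. The 8 bytes per sliver figure assumes an 8 bpp mode (one byte per bitplane for each 8-pixel row), which is what makes the 80-cycle number come out; the names are invented for the example.

```python
SLIVER_BYTES = 8          # assumption: 8 bpp, so 8 bitplane bytes per 8-pixel sliver
RAM_CYCLES_PER_BYTE = 5   # per-byte RAM access cost quoted above, 21.4 MHz mode

# Column drawing touches one pixel per sliver, so every plot forces a full
# read of the sliver followed by a full write-back.
plot_column_cycles = SLIVER_BYTES * RAM_CYCLES_PER_BYTE * 2   # 80

# Bytemap route: the 12-cycle ROM texturing loop, plus roughly 11 cycles per
# pixel for the later bulk conversion to CHR with plot.
TEXTURE_LOOP_CYCLES = 12
CONVERT_CYCLES = 11
bytemap_cycles = TEXTURE_LOOP_CYCLES + CONVERT_CYCLES         # 23
```

On these numbers the bytemap route costs less than a third of the direct plot bottleneck per pixel, and anything shown through Mode 7 would skip the 11-cycle conversion entirely.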
Also, it seems hard to achieve per-run costs on textures that aren't just runs of single color pixels?
I'm not specifically talking about RLE, although I think it should be possible to set up a decoder that handles literals fairly quickly. Lots of compression techniques (including all three that I'm considering for walls) work in groups with header bytes. There's a bit of extra cost decoding each texel if it's not a solid-colour run, but typically processing the header bytes constitutes the bulk of the work.
Take none's dictionary method. It's fairly costly to read a pointer and a run length and pop over to the position denoted by the pointer to start loading texels. But once you've done that, you're just reading texels (and tracking the run length), which shouldn't be massively more expensive than an uncompressed texturing loop.
What worries me is distant walls. If the header processing has to happen for every pixel or anything close to it, the method is very slow. Mipmapping, with a small uncompressed texture accompanying the large compressed one, could help, but that's a bit silly and hurts the compression ratio...
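A toy cost model shows why distant walls are the worrying case. It assumes a sequential decoder that has to parse every run header in the column to find where the next run starts, even when no texel from that run ends up on screen. The function name and the cycle figures in the usage example are placeholders, not measurements of any real decoder.

```python
def cycles_per_drawn_pixel(run_lengths, step, header_cycles, texel_cycles):
    """Average cost per screen pixel for one compressed texture column.

    run_lengths   -- texel count of each run/group in the column
    step          -- texels advanced per screen pixel (1 = wall at full size)
    header_cycles -- hypothetical cost of decoding one group header
    texel_cycles  -- hypothetical cost of fetching and drawing one texel
    """
    total_texels = sum(run_lengths)
    drawn = len(range(0, total_texels, step))   # texels actually sampled
    # Every header is parsed regardless of how many of its texels are used.
    total = len(run_lengths) * header_cycles + drawn * texel_cycles
    return total / drawn
```

With sixteen 8-texel runs, a 40-cycle header and 14 cycles per texel (all made-up numbers), `cycles_per_drawn_pixel([8] * 16, 1, 40, 14)` gives 19.0, while the same column sampled at every eighth texel gives 54.0. The header cost stops being amortized once the step approaches the run length.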
To be honest, any cartridge made nowadays will be "inauthentic" in several ways, and plenty of games used bankswitching back then, so I don't think it would have been impossible at the time.
To me, authenticity isn't about using battery-backed SRAM instead of F-RAM. It's about the capabilities of the cartridge as seen by the programmer. If it's 512 KB of ROM and an 8 KB save RAM, I don't care whether it uses voltage translation to avoid frying 3.3V parts; it's sufficiently authentic for my purposes.
If pin 21 is real and an easy bankswitching method could be implemented, I wonder if it wouldn't be reasonable to just not compress the textures, except for possibly doing live compositing of switch strips. I doubt I'm going to get 12 cycles per pixel on average using any sort of compression method, although without a full-capability prototype of the decompressing texture mapper (and at least some attempt at compressing data to get some statistics) it's hard to say how close I could get.
Still, using that much ROM on a game with a heavy-duty coprocessor would harm the authenticity factor even if it's technically feasible with period hardware. Early Nintendo 64 games had 8 MB of ROM with no special chips, and no commercial SNES game was even that large. I guess it's a question that should wait until I have a better idea of how much performance the compression would cost me...