SNES Doom Source Released! Now What?

ehaliewicz
Posts: 18
Joined: Thu Oct 10, 2013 3:30 pm

Re: SNES Doom Source Released! Now What?

Post by ehaliewicz »

93143 wrote: Sun Apr 24, 2022 3:16 pm Is there a good way to do coarse frustum culling without having to transform every vertex?

I guess if you want very coarse culling you could just check if either vertex of a linedef is in either of the quadrants the view frustum is in. This would only require a single coordinate comparison per vertex, and no multiplication.


Yes, I have a similar coarse frustum culling scheme based on which quadrant the normal of a wall is in. Depending on the orientation of the wall, it takes either one or two comparisons to discard a wall.

However, I noticed this doesn't make much of a difference in performance once I started moving to a PVS-based solution; perhaps most walls that would be coarsely rejected are already rejected by the PVS. Your mileage may vary.
Attachments
frustum_cull.jpg
none
Posts: 117
Joined: Thu Sep 03, 2020 1:09 am

Re: SNES Doom Source Released! Now What?

Post by none »

93143 wrote: Wed Apr 13, 2022 3:14 pm I don't think that's a particularly elegant solution...
I agree, it was just the best I could put together ad hoc to get a rough idea of how much compression you can get this way - i.e. if you ignore the table size, you get a rough upper bound on what is possible.
Did you use the full run-time texture set or the WAD patch set?
I assembled the wall textures defined in the TEXTURE1 lump (Doom 2 has no TEXTURE2 lump) from the patches.
I've also tried Doom 1 (with both lumps); the compression ratio is similar.
I doubt Randy would have settled on RLE if it was that bad. Just looking at some of these wall textures, it seems impossible that RLE could be so ineffective. Are you encoding columns and allowing runs of literals?
Yes, I encoded columns. RLE stopped working nicely once I first removed the duplicate 8/16/32-byte runs. But I'd like to double-check this (I'm not sure whether I made a mistake in the encoder).

For the uncompressed data (without removing duplicates first), RLE is much more efficient. The best RLE scheme I tried was using two 4-bit run lengths (repeat count / literal count encoded in a single byte). Then I get compression ratios similar to the dictionary approach.

I think the main problem is that while there are a lot of runs, most of them are very short, and that's why the flag based method still keeps working after making the dictionary (because it can encode a few short runs in one command, and the longer runs have all gone away).
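To make the byte format concrete, here is a minimal decoder sketch in C for the scheme described above (field order and names are illustrative, not my actual encoder; malformed-input handling is omitted):

Code: Select all

#include <stdint.h>
#include <stddef.h>

/* Each control byte packs a repeat count (high nibble) and a literal count
   (low nibble), so a short run and a short literal stretch share one byte. */
size_t rle44_decode(const uint8_t *src, uint8_t *dst, size_t dst_len)
{
    size_t out = 0;
    while (out < dst_len) {
        uint8_t ctrl = *src++;
        unsigned repeats  = ctrl >> 4;     /* 0..15 copies of the next byte */
        unsigned literals = ctrl & 0x0F;   /* 0..15 bytes copied verbatim   */

        if (repeats) {
            uint8_t value = *src++;
            while (repeats-- && out < dst_len)
                dst[out++] = value;
        }
        while (literals-- && out < dst_len)
            dst[out++] = *src++;
    }
    return out;
}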
That's unfortunately not good enough.
Yes, and it does not even account for the switches.

Independent of the algorithm used, lossy compression is also an option (e.g. removing some single stray pixels or allowing some pixels to have a slightly wrong color).

Also, using palettes could work - most textures use fewer than 32 colors. If you define a few palettes there would be nearly no overhead in the renderer: since you need to do the colormap lookup for the lighting anyway, you can just store additional colormaps per palette. You could then shave 2 or 3 bits off the color bytes for use with the compression scheme.
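As a rough illustration of the renderer side of that idea (a sketch only - the table names, the 8-palette count and the 5-bit texel width are made-up numbers):

Code: Select all

#include <stdint.h>

#define NUM_PALETTES     8    /* assumption: a handful of shared texture palettes   */
#define NUM_LIGHT_LEVELS 32
#define PAL_SIZE         32   /* assumption: each palette holds at most 32 colours  */

/* one colormap per (palette, light level): maps a 5-bit texel to a screen colour */
extern const uint8_t pal_colormaps[NUM_PALETTES][NUM_LIGHT_LEVELS][PAL_SIZE];

static inline uint8_t shade_texel(uint8_t texel5, unsigned palette, unsigned light)
{
    /* same cost as the usual colormap lookup, but the stored texel is only
       5 bits wide, leaving the top bits of each byte free for the compressor */
    return pal_colormaps[palette][light][texel5];
}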

Also, I think there are a few textures which are particularly bad cases (the one with the distorted faces for example).
But with sorted linedefs, the "complicated test" is literally just projecting one vertex in one axis and comparing it with one portal edge (at least once you get started - with a triangular subsector or a linear search from the portal edge it's true all the time).
I suppose you're right, now that I think about it. However, the perspective division is trickier than you might expect because you need to special case vertices at the camera plane (which project to positive / negative infinity) and behind the camera (which lie on the opposite side on the projective plane), which results in a lot of branches.
Besides, I think most subsectors in Doom are either triangles or quadrilaterals, so there isn't a ton of wasted time. Particularly if you can select a likely candidate a priori.
With triangular sectors, the additional "sector internal" lines can be mostly skipped over because they do not change anything with the clipping window itself (they need little more than one additional table lookup per line). Only the vertices can split the clipping window, and there are no "internal" vertices - all the vertices have an "outer" wall attached to them. I.e., if you think about it, it's just a data structure for doing something quite like that (selecting a likely candidate vertex from an "imaginary" bigger subsector), but in the next steps it does not loop over the vertices but instead it selects a likely (better) candidate again. A given clipping window, in the end, gets split into the same number of pieces by the same vertices.

I also have a partially working map converter now (it still screws up maps that have "broken" sectors that are not completely closed).

Triangle count is usually about 2 ~ 2.5 times the original subsector count - for example, entryway's 194 subsectors are replaced by 431 triangles.

I have changed the logic in my renderer, now it compares the vertices to the portal boundaries in world space (this just requires the translation part of the transform and 1 cross product per portal edge) and then only does the rotation and projection for those vertices which actually are inside a portal. This is mainly for stability however.
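In code, the per-edge test boils down to something like this (a simplified C sketch of what I mean, not my actual renderer code; the edge is taken to run from the camera through a portal endpoint):

Code: Select all

#include <stdint.h>

typedef struct { int32_t x, y; } vec2;

/* Which side of the edge camera->portal_end does vertex v lie on?
   Only the translation by the camera position is needed - no rotation -
   and the sign of one 2D cross product answers the question. */
static inline int32_t portal_side(vec2 portal_end, vec2 v, vec2 cam)
{
    int32_t ex = portal_end.x - cam.x, ey = portal_end.y - cam.y;
    int32_t vx = v.x - cam.x,          vy = v.y - cam.y;
    return ex * vy - ey * vx;   /* > 0 one side, < 0 the other, 0 on the edge */
}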
ehaliewicz wrote: Fri Apr 22, 2022 6:21 pm Anyway, the sorted linedef idea is to calculate all potentially visible walls for each sector, ala quake style PVS, and then sort those linedefs such that from any position in the sector, any wall B drawn after wall A cannot occlude wall A. There should always be a valid sorting order. As far as I can tell, this is probably the fastest way to handle visibility in "2.5D" worlds.
I think I will try to implement your method in my renderer to see how it compares.

Do you organize the PVS in such a way that you can discard some of the lines quickly? I imagine, in a lot of cases there is not much dependency between the lines, so you should have some degree of freedom in what order exactly you choose.

What kind of data structure do you use for clipping the back walls against the front (solid) walls? And, how would you go about merging together adjacent floor segments for textured floors quickly - for example if you have a wall that is partly behind another wall, you wouldn't want to compute the uv coordinates for each scanline twice?

I've found some ways to reduce clipping costs that exploit the fact that with convex subsectors, it is easy to order rendering strictly from left to right. I.e. I can stash and reuse rendering parameters that have previously already been calculated. Since they need to be kept in RAM anyways (because I'm out of register space for the inner rendering loop), this doesn't cause much overhead (I attach a pointer to the set of parameters I have used to every line that I render so when I encounter this line again through another clipping window, it is easy to reload the parameters). So basically the only thing that adds real overhead is that the graph needs to be traversed more often.
ehaliewicz
Posts: 18
Joined: Thu Oct 10, 2013 3:30 pm

Re: SNES Doom Source Released! Now What?

Post by ehaliewicz »

none wrote: Mon Apr 25, 2022 4:45 pm
ehaliewicz wrote: Fri Apr 22, 2022 6:21 pm Anyway, the sorted linedef idea is to calculate all potentially visible walls for each sector, ala quake style PVS, and then sort those linedefs such that from any position in the sector, any wall B drawn after wall A cannot occlude wall A. There should always be a valid sorting order. As far as I can tell, this is probably the fastest way to handle visibility in "2.5D" worlds.
I think I will try to implement your method in my renderer to see how it compares.

Do you organize the PVS in such a way that you can discard some of the lines quickly? I imagine, in a lot of cases there is not much dependency between the lines, so you should have some degree of freedom in what order exactly you choose.

What kind of data structure do you use for clipping the back walls against the front (solid) walls? And, how would you go about merging together adjacent floor segments for textured floors quickly - for example if you have a wall that is partly behind another wall, you wouldn't want to compute the uv coordinates for each scanline twice?
I only have a simple list of wall indexes per sector; the only sorting I use is for visibility. I haven't tried doing any other kind of sorting on top, as the visibility sorting is tricky enough to get correct.
No textured floors here; it's running on a 7 MHz 68000. You can make a floor-mapping engine, as some people in the Genesis community have done, but it takes about all of the CPU, so you either get textured walls or textured floors. I'm lucky enough to have gotten per-pixel distance light falloff working on solid-color floors without too much of a performance hit.

For clipping: before I used the sorted PVS, I used standard portal clipping, which didn't scale to complex maps. With the sorted PVS, I use a simple flag per column to clip walls in screen space. It's not as efficient as something like a span or coverage buffer, but those techniques have overhead (like recalculating UV coords, which also affected the portal clipper) and are not nearly as simple to implement.
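Roughly, the per-column flags work like this (a simplified C sketch, not my actual 68000 code; the column count and names are illustrative):

Code: Select all

#include <stdint.h>
#include <string.h>

#define SCREEN_COLS 160                  /* illustrative viewport width */

static uint8_t col_solid[SCREEN_COLS];   /* 1 = column already fully occluded */

void begin_frame(void) { memset(col_solid, 0, sizeof col_solid); }

/* Draw wall columns x0..x1 inclusive, skipping columns that are already solid.
   Opaque walls mark their columns so later (farther) walls get clipped. */
void draw_wall_span(int x0, int x1, int opaque)
{
    for (int x = x0; x <= x1; x++) {
        if (col_solid[x]) continue;      /* clipped in screen space */
        /* draw_column(x, ...); */       /* actual texturing elided */
        if (opaque) col_solid[x] = 1;
    }
}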

Another trick I got from KK is a simple camera-space frustum culler, which uses a fixed 90 degree fov, so only two comparisons have to be performed for each left or right side of the screen. Basically, if the x coordinate of a transformed vertex is greater than the y (i.e. depth), it is off the screen to the right, and if the -x of a vertex is greater than the y, it is off the screen to the left.
This allows me to get away with 16-bit UV coords, rather than 32-bit which I needed to use previously because wall coords got huge off to the left and right side of the screen after perspective division.
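In C the test is just this (a sketch with illustrative names; camera space here is x = right, y = depth):

Code: Select all

#include <stdint.h>

typedef struct { int16_t x, y; } cam_vert;

/* With a fixed 90 degree FOV the screen edges are the lines x = y and
   x = -y, so each side is a single comparison. */
static inline int off_right(cam_vert v) { return  v.x > v.y; }
static inline int off_left (cam_vert v) { return -v.x > v.y; }

/* A wall can be rejected outright if both endpoints are off the same side. */
static inline int cull_wall(cam_vert a, cam_vert b)
{
    return (off_right(a) && off_right(b)) || (off_left(a) && off_left(b));
}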
none wrote: Mon Apr 25, 2022 4:45 pm I've found some ways to reduce clipping costs that exploit the fact that with convex subsectors, it is easy to order rendering strictly from left to right. I.e. I can stash and reuse rendering parameters that have previously already been calculated. Since they need to be kept in RAM anyways (because I'm out of register space for the inner rendering loop), this doesn't cause much overhead (I attach a pointer to the set of parameters I have used to every line that I render so when I encounter this line again through another clipping window, it is easy to reload the parameters). So basically the only thing that adds real overhead is that the graph needs to be traversed more often.
Yeah, I have a similar scheme for when I fall back to using the raw portal graph or a non-sorted PVS, which reuses vertexes and bails out early, but the performance benefit of that was much, much less than the sorted PVS once I started testing against more complex maps.

I do, however, bail out when the screen is entirely full, which helps cap how many walls are rendered. I also keep track of when an opaque wall is rendered at either edge of the visible viewport and shrink the viewable bounds in those cases, which further reduces the number of processed-but-clipped walls, in a similar way to the standard portal clipper.
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: SNES Doom Source Released! Now What?

Post by 93143 »

ehaliewicz wrote: Sun Apr 24, 2022 4:32 pm Yes, I have a similar coarse frustum culling scheme based on which quadrant the normal of a wall is in. Depending on the orientation of the wall, it takes either one or two comparisons to discard a wall.
Is that in unrotated coordinates with axes snapped to the wall ends? Is (P) the camera? What am I looking at here?
However, I noticed this doesn't make much of a difference in performance once I started moving to a PVS-based solution; perhaps most walls that would be coarsely rejected are already rejected by the PVS. Your mileage may vary.
Well, if this ever ends up getting done, and if the S-CPU ends up having to do geometry preprocessing to figure out what to send to the Super FX, I suspect performance will be more important than ROM. If the S-DD1 is electrically compatible with the Super FX, there can be up to 18 MB of FastROM available, and if it isn't there are still 6 MB (plus the MSU1 - fingers crossed it won't be necessary). Note that these are CPU ROM numbers; they do not include the Super FX ROM, meaning they shouldn't have to include any textures.

One of the items on my wish list was local two-player mode, co-op if possible. This could substantially change the balance between the two processors, since every render task except for the actual pixel plotting would need to be done twice. Even if the S-CPU reliably finishes ahead of the GSU in single-player mode, doubling its workload for two-player mode could result in substantial slowdown, and loading more of the job onto the faster chip might be better in that scenario.

On the other hand, a particularly ROM-heavy level format is unlikely to fit alongside the textures in the Super FX's ROM, which tops out at 4 MB if pin 21 is what we think it is, or 2 MB if it isn't. The GSU might well have to execute a significantly more expensive algorithm to make up for this...
ehaliewicz wrote: Tue Apr 26, 2022 10:10 am Another trick I got from KK is a simple camera-space frustum culler, which uses a fixed 90 degree fov, so only two comparisons have to be performed for each left or right side of the screen. Basically, if the x coordinate of a transformed vertex is greater than the y (i.e. depth), it is off the screen to the right, and if the -x of a vertex is greater than the y, it is off the screen to the left.
This assumes the vertices have been transformed into camera space, though, which is expensive on the S-CPU.

Doom doesn't do a camera-space transform - it calculates the relative angle to each vertex and uses a lookup table to get its screen coordinate. Comparing angles rather than coordinates might be faster on the S-CPU, even though it involves a division. It also provides nearly free backface culling, and works just as well with arbitrary FOV angles...

And if I'm not actually rendering on the S-CPU, I shouldn't need Z (which is basically free with the camera-space transform but not with Carmack's "polar coordinates" shenanigans). I'd just need to watch the precision to avoid culling stuff that should be onscreen - the S-CPU is also not the best at division...

none wrote: Mon Apr 25, 2022 4:45 pm For the uncompressed data (without removing duplicates first), RLE is much more efficient.
Okay, that makes more sense. I wonder if a less granular dictionary method would leave enough runs on the table to make RLE worthwhile...

I've really got to test that idea of mine and see if it works.
Also, I think there are a few textures which are particularly bad cases (the one with the distorted faces for example).
Apparently SP_FACE1 is 76 colours. It doesn't really look like it to someone who's used to working with mostly 4bpp palettes...
However, the perspective division is trickier than you might expect because you need to special case vertices at the camera plane (which project to positive / negative infinity) and behind the camera (which lie on the opposite side on the projective plane), which results in a lot of branches.
...this is a Super NES. Both the S-CPU and the GSU2 have far more trouble with the actual math than with branching.
With triangular sectors, the additional "sector internal" lines can be mostly skipped over because they do not change anything with the clipping window itself (they need little more than one additional table lookup per line). Only the vertices can split the clipping window, and there are no "internal" vertices - all the vertices have an "outer" wall attached to them. I.e., if you think about it, it's just a data structure for doing something quite like that (selecting a likely candidate vertex from an "imaginary" bigger subsector), but in the next steps it does not loop over the vertices but instead it selects a likely (better) candidate again. A given clipping window, in the end, gets split into the same number of pieces by the same vertices.

I also have a partially working map converter now (it still screws up maps that have "broken" sectors that are not completely closed).

Triangle count is usually about 2 ~ 2.5 times the original subsector count - for example, entryway's 194 subsectors are replaced by 431 triangles.

I have changed the logic in my renderer, now it compares the vertices to the portal boundaries in world space (this just requires the translation part of the transform and 1 cross product per portal edge) and then only does the rotation and projection for those vertices which actually are inside a portal.
I might have to take a look at Wolfenstein 3D or something just to make sure I'm not completely out to lunch. I'm picking up some great-sounding ideas from this conversation, but if I were to jump in and try to implement either of your methods right now, I'm sure I would trip on something basic and get nowhere.

Also, regarding Doom, I think I'm starting to get to the point where the solution space branches in a way that can't be resolved without testing. I've been poking at the problem of turning a PVS into a screen graph, and the optimal way to do it may depend on how far the S-CPU has to go before handing off the rest of the task to the Super FX.

Wall collision and enemy line-of-sight are going to be nasty if the AI has to run on the S-CPU. The actual logic is pretty trivial compared to the math...
ehaliewicz
Posts: 18
Joined: Thu Oct 10, 2013 3:30 pm

Re: SNES Doom Source Released! Now What?

Post by ehaliewicz »

93143 wrote: Mon May 02, 2022 1:19 am
ehaliewicz wrote: Sun Apr 24, 2022 4:32 pm Yes, I have a similar coarse frustum culling scheme based on which quadrant the normal of a wall is in. Depending on the orientation of the wall, it takes either one or two comparisons to discard a wall.
Is that in unrotated coordinates with axes snapped to the wall ends? Is (P) the camera? What am I looking at here?
This is in world coordinates. The map editor calculates the angle quadrant of the normal of each wall, and the engine does a raw comparison against the player/camera position before any transformation has been done. In the case where a wall is perfectly aligned with either axis, it only takes one comparison to decide whether or not to skip the wall; in other cases it requires two comparisons.
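A rough C sketch of that test (simplified from what I actually do; the quadrant encoding is made up and only a couple of the diagonal cases are shown). The diagonal cases are conservative - they only discard walls that are definitely back-facing:

Code: Select all

#include <stdint.h>

typedef struct {
    int16_t x1, y1, x2, y2;   /* wall endpoints in world space */
    uint8_t normal_quad;      /* precomputed by the map editor */
} wall_t;

/* Nonzero means the wall is back-facing and can be skipped. Comparing against
   one endpoint is enough because both endpoints lie on the wall's line. */
static int quadrant_cull(const wall_t *w, int16_t cx, int16_t cy)
{
    switch (w->normal_quad) {
    case 0: return cx <= w->x1;                    /* normal +x: one comparison */
    case 1: return cy <= w->y1;                    /* normal +y                 */
    case 2: return cx >= w->x1;                    /* normal -x                 */
    case 3: return cy >= w->y1;                    /* normal -y                 */
    case 4: return cx <= w->x1 && cy <= w->y1;     /* normal in +x/+y: two      */
    case 5: return cx >= w->x1 && cy <= w->y1;     /* normal in -x/+y           */
    /* remaining diagonal quadrants are analogous */
    default: return 0;
    }
}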
93143 wrote: Mon May 02, 2022 1:19 am
ehaliewicz wrote: Tue Apr 26, 2022 10:10 am Another trick I got from KK is a simple camera-space frustum culler, which uses a fixed 90 degree fov, so only two comparisons have to be performed for each left or right side of the screen. Basically, if the x coordinate of a transformed vertex is greater than the y (i.e. depth), it is off the screen to the right, and if the -x of a vertex is greater than the y, it is off the screen to the left.
This assumes the vertices have been transformed into camera space, though, which is expensive on the S-CPU.

Doom doesn't do a camera-space transform - it calculates the relative angle to each vertex and uses a lookup table to get its screen coordinate. Comparing angles rather than coordinates might be faster on the S-CPU, even though it involves a division. It also provides nearly free backface culling, and works just as well with arbitrary FOV angles...
Yes, I tried to get a similar scheme working in my engine, but I never figured out how to efficiently get the angle to the wall. You need some kind of atan/atan2 function, so I went with standard transformation and projection via a reciprocal table - only a few multiplies are needed, no division.
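For what it's worth, the reciprocal-table divide looks roughly like this (a sketch, not my actual code; the fixed-point width, table size and clamp are assumptions, and it assumes the numerator is small enough that the product fits in 32 bits):

Code: Select all

#include <stdint.h>

#define RECIP_SHIFT 14
#define MAX_DEPTH   4096

/* recip_tab[d] = (1 << RECIP_SHIFT) / d, built offline
   (entry 0 unused - near clipping keeps depth away from it) */
extern const uint16_t recip_tab[MAX_DEPTH];

/* num / depth without a divide: one table lookup plus one multiply */
static inline int32_t div_by_depth(int32_t num, uint16_t depth)
{
    if (depth >= MAX_DEPTH) depth = MAX_DEPTH - 1;   /* coarse far clamp */
    return (num * (int32_t)recip_tab[depth]) >> RECIP_SHIFT;
}

/* e.g. screen_x = CENTRE_X + div_by_depth(x_cam * FOCAL_LEN, y_depth); */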

SNES has a hardware multiplier right? So this could probably be pretty quick.

93143 wrote: Mon May 02, 2022 1:19 am Also, regarding Doom, I think I'm starting to get to the point where the solution space branches in a way that can't be resolved without testing. I've been poking at the problem of turning a PVS into a screen graph, and the optimal way to do it may depend on how far the S-CPU has to go before handing off the rest of the task to the Super FX.
Can you elaborate a little more on what you mean by "turning a PVS into a screen graph"?
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: SNES Doom Source Released! Now What?

Post by 93143 »

ehaliewicz wrote: Mon May 02, 2022 10:54 am
Doom doesn't do a camera-space transform - it calculates the relative angle to each vertex and uses a lookup table to get its screen coordinate. Comparing angles rather than coordinates might be faster on the S-CPU, even though it involves a division. It also provides nearly free backface culling, and works just as well with arbitrary FOV angles...
Yes, I tried to get a similar scheme working in my engine, but I never figured out how to efficiently get the angle to the wall. You need some kind of atan/atan2 function, so I went with standard transformation and projection via a reciprocal table - only a few multiplies are needed, no division.
Couldn't the trigonometry be tabulated too? I'm pretty sure that's how Doom does it.
SNES has a hardware multiplier right? So this could probably be pretty quick.
Yeah, there's an MMIO multiplier, but it's 8x8 unsigned. Any real precision requires at least two and generally four runthroughs.

The S-PPU has a 16x8 signed multiplier with no calculation delay, and the S-CPU can use it... but Mode 7 uses it too, and in fact the CPU-facing input registers are also Mode 7 matrix parameters, so if you're using Mode 7 it's hard to use the PPU multiplier safely. And even with a Super FX, the ability to freely use Mode 7 would significantly streamline efforts to circumvent the pixel buffer stall that plagues any attempt at drawing vertical columns directly in SNES CHR format...

The hardware divider is 16/8 unsigned, and there's no superior PPU version so you're stuck with that.

The Super FX has two 16x16 signed multiplication instructions (one for if you want a full 32-bit result, and a slightly faster one that just returns the upper word and puts the top bit of the lower word in the carry flag), but it has no divide instruction. To divide on Super FX, you need a reciprocal table.

I'm starting to get jealous of the SA-1. It has an MMIO multiplier and divider, just like the S-CPU, but they're 16x16 signed and 16/16 signed/unsigned, with 32-bit results (or 40-bit for cumulative multiplication). And instead of 8 and 16 cycles like the 5A22 versions, they both take 5 cycles (or 6 for cumulative multiplication). And of course this happens at up to 10.74 MHz instead of up to 3.58 MHz... I'm pretty sure you can't have both a Super FX and an SA-1 on the same cartridge...
Can you elaborate a little more on what you mean by "turning a PVS into a screen graph"?
Basically just what it sounds like: going from a potentially visible set (which could be the entire map if you aren't using a PVS method) to a list of stuff to draw (however preprocessed that ends up being).

My plan was to relieve some of the memory pressure on the relatively small Super FX ROM by moving the level data into the larger CPU ROM area. But the Super FX can't see CPU ROM, so this would require the S-CPU to figure out how much of the map the Super FX actually needs to know about, and DMA that data into Super FX RAM. (It would also require the S-CPU to run the AI, since oftentimes most thinkers aren't potentially visible, and in fact they can be audible from well outside the PVS.)

At one end of the complexity spectrum, we have a list of potentially visible sectors, and the S-CPU just concatenates them and sends them up. At the other end, the S-CPU generates a complete screen graph including row and column spans, scale factors, texture numbers and starting coordinates, lighting, sprite ordering and masking information, and everything else required for the Super FX to just render the screen at full throttle without any excess logic. I'm thinking that the optimal amount of work for the S-CPU might fall in between those extremes - unless the AI ends up being the long pole, in which case the Super FX might as well handle as much as possible (subject to RAM and DMA limits, of course; just sending the whole level is probably impractical).

I'm starting to lean towards a particular solution, which is probably bad since I haven't yet tested anything...
ehaliewicz
Posts: 18
Joined: Thu Oct 10, 2013 3:30 pm

Re: SNES Doom Source Released! Now What?

Post by ehaliewicz »

93143 wrote: Mon May 02, 2022 3:25 pm
ehaliewicz wrote: Mon May 02, 2022 10:54 am
Doom doesn't do a camera-space transform - it calculates the relative angle to each vertex and uses a lookup table to get its screen coordinate. Comparing angles rather than coordinates might be faster on the S-CPU, even though it involves a division. It also provides nearly free backface culling, and works just as well with arbitrary FOV angles...
Yes, I tried to get a similar scheme working in my engine, but I never figured out how to efficiently get the angle to the wall. You need some kind of atan/atan2 function, so I went with standard transformation and projection via a reciprocal table - only a few multiplies are needed, no division.
Couldn't the trigonometry be tabulated too? I'm pretty sure that's how Doom does it.
Doom uses https://github.com/id-Software/DOOM/blo ... ain.c#L292
and https://github.com/id-Software/DOOM/blo ... bles.c#L50

So it requires a division plus a lookup per vertex in the general case. I think you could probably do a similar reciprocal table plus multiply, but at least in my case I'm not sure it's worth the effort. I would avoid a couple of multiplies in the rotation/translation, but I have native instructions for that and it's (probably) not a bottleneck for the engine. Plus, I already worked on that part of the engine quite a lot. :)
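For reference, the core of that lookup boils down to something like this (simplified to the first octant and paraphrased from the linked Doom source, so treat it as a sketch rather than a drop-in):

Code: Select all

#include <stdint.h>

#define SLOPEBITS  11
#define SLOPERANGE (1 << SLOPEBITS)                 /* 2048, as in the Doom source */

extern const uint32_t tantoangle[SLOPERANGE + 1];   /* arctan table in BAM angles */

/* one division per vertex, clamped so it always indexes the table safely */
static uint32_t slope_div(uint32_t num, uint32_t den)
{
    if (den < 512) return SLOPERANGE;
    uint32_t ans = (num << 3) / (den >> 8);
    return ans <= SLOPERANGE ? ans : SLOPERANGE;
}

/* angle from the camera (cx,cy) to a vertex; only octant 0 shown for brevity */
uint32_t point_to_angle(int32_t cx, int32_t cy, int32_t x, int32_t y)
{
    int32_t dx = x - cx, dy = y - cy;
    if (dx >= 0 && dy >= 0 && dx >= dy)             /* octant 0: 0..45 degrees */
        return tantoangle[slope_div((uint32_t)dy, (uint32_t)dx)];
    /* the other seven octants mirror/rotate this case */
    return 0;
}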
93143 wrote: Mon May 02, 2022 3:25 pm
Can you elaborate a little more on what you mean by "turning a PVS into a screen graph"?
Basically just what it sounds like: going from a potentially visible set (which could be the entire map if you aren't using a PVS method) to a list of stuff to draw (however preprocessed that ends up being).

My plan was to relieve some of the memory pressure on the relatively small Super FX ROM by moving the level data into the larger CPU ROM area. But the Super FX can't see CPU ROM, so this would require the S-CPU to figure out how much of the map the Super FX actually needs to know about, and DMA that data into Super FX RAM. (It would also require the S-CPU to run the AI, since oftentimes most thinkers aren't potentially visible, and in fact they can be audible from well outside the PVS.)

At one end of the complexity spectrum, we have a list of potentially visible sectors, and the S-CPU just concatenates them and sends them up. At the other end, the S-CPU generates a complete screen graph including row and column spans, scale factors, texture numbers and starting coordinates, lighting, sprite ordering and masking information, and everything else required for the Super FX to just render the screen at full throttle without any excess logic. I'm thinking that the optimal amount of work for the S-CPU might fall in between those extremes - unless the AI ends up being the long pole, in which case the Super FX might as well handle as much as possible (subject to RAM and DMA limits, of course; just sending the whole level is probably impractical).

I'm starting to lean towards a particular solution, which is probably bad since I haven't yet tested anything...
Couldn't you store the PVS lists in the standard ROM and transfer them to the Super FX RAM? The individual lists per sector won't be very large.
93143 wrote: Mon May 02, 2022 3:25 pm the S-CPU generates a complete screen graph including row and column spans, scale factors, texture numbers and starting coordinates, lighting, sprite ordering and masking information, and everything else required for the Super FX to just render the screen at full throttle without any excess logic
This is almost definitely not the way I'd do it. Rasterization is pretty expensive on these slow CPUs :)
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: SNES Doom Source Released! Now What?

Post by 93143 »

ehaliewicz wrote: Mon May 02, 2022 4:03 pm Couldn't you store the PVS lists in the standard ROM and transfer them to the Super FX RAM? The individual lists per sector won't be very large.
But for those lists to be useful, the Super FX would still need to reference level data in its own ROM.

I suppose it remains to be seen how small the texture data can be crunched while maintaining decent performance, and how big the full game's level data would be without the PVS stuff. If pin 21 is a second chip select like nocash thinks it is, there might even be enough room...

I'm kinda pining after my updatable ROM idea (use RAM for part of the Super FX ROM and allow the S-CPU to DMA to it). Not only could you just dump the whole level into U-ROM during the loading screen, but reading from ROM is almost always more efficient than reading from RAM, because ROM reads buffer in the background but RAM reads stall the core (and potentially clash with writes and pixel plotting, causing further core stall). It's ridiculously inauthentic, though; there's no way Nintendo would have given a proposal like that a second look...
93143 wrote: Mon May 02, 2022 3:25 pm the S-CPU generates a complete screen graph including row and column spans, scale factors, texture numbers and starting coordinates, lighting, sprite ordering and masking information, and everything else required for the Super FX to just render the screen at full throttle without any excess logic
This is almost definitely not the way I'd do it. Rasterization is pretty expensive on these slow CPUs :)
Yeah, that represents the extreme point beyond which there's no use for the Super FX at all. The optimal point (especially in a hypothetical split-screen two-player mode) is probably somewhere short of that. It's unlikely I'd go that far unless I were attempting a no-chip version...
ehaliewicz
Posts: 18
Joined: Thu Oct 10, 2013 3:30 pm

Re: SNES Doom Source Released! Now What?

Post by ehaliewicz »

93143 wrote: Mon May 02, 2022 4:21 pm
ehaliewicz wrote: Mon May 02, 2022 4:03 pm Couldn't you store the PVS lists in the standard ROM and transfer them to the Super FX RAM? The individual lists per sector won't be very large.
But for those lists to be useful, the Super FX would still need to reference level data in its own ROM.
It looks like you have 4MB (or more) of ROM on super fx? That should be plenty for any one level, I think, which means you should be ok with bank switching?

My guess is, having to de-compress textures in real time will be very difficult to do fast enough to not affect performance, but perhaps you have enough RAM for a texture cache to store de-compressed textures. Most of the time, if a texture is de-compressed it will be re-used for the next frame, likely a large number of frames.
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: SNES Doom Source Released! Now What?

Post by 93143 »

ehaliewicz wrote: Mon May 02, 2022 5:34 pm It looks like you have 4MB (or more) of ROM on super fx? That should be plenty for any one level, I think, which means you should be ok with bank switching?
One level? How much of a ROM hog is this sorted-linedef method?

Also, no, the Super FX itself is only known to be able to address 2 MB. No game exceeded that, and the idea that it can address 4 MB is speculative based on the presence of an unconnected pin on the GSU2 and GSU2-SP1 that might be an extra ROM enable pin.

Officially, there can be up to 6 MB of extra ROM in parallel with the Super FX, so the S-CPU isn't stuck in WRAM while the Super FX is using its ROM (and so data the Super FX doesn't need - music, sound effects, incidental graphics - doesn't take up valuable Super FX ROM). But no game used this feature. The 18 MB number I mentioned is just what would happen if you bankswitched the upper 4 MB of extra CPU ROM with an S-DD1 (resulting in a maximum of 16 MB based on the pinout of that chip). All of this extra CPU ROM is inaccessible to the Super FX.

I do like the bankswitching idea for the Super FX ROM. I had earlier dismissed it without thinking about it much, because there was apparently nowhere to put an MMIO interface in the Super FX architecture without stomping on part of the RAM. But it occurs to me that it doesn't need to be the Super FX switching banks - the S-CPU has plenty of spare address space in the system area. The only issue is that to my knowledge no such facility ever existed, so it would be custom hardware and thus inauthentic...
My guess is, having to de-compress textures in real time will be very difficult to do fast enough to not affect performance, but perhaps you have enough RAM for a texture cache to store de-compressed textures. Most of the time, if a texture is de-compressed it will be re-used for the next frame, likely a large number of frames.
With the right algorithm, decompression should have mostly per-run costs rather than per-pixel costs, and with the right format those costs shouldn't be extreme.

There is not enough RAM for a texture cache. The Super FX can address 128 KB of RAM, which is enough for eight 128x128 textures even if there's nothing else in RAM - no framebuffer, no game data, not even basic operational state. Worse, it's in two 64 KB banks, and switching banks is two cycles if the desired bank number is in a register, or four cycles if it's not. (EDIT: That's not strictly correct; I counted the cycles wrong. It's three and five.)

Cached textures would not be quick to draw. Reading a byte from RAM (assuming it's in the same bank as the framebuffer) takes longer than the inner loop of my most recent attempt at a decompressing renderer, and unlike reading from ROM it cannot be parallelized.
Last edited by 93143 on Thu May 19, 2022 11:36 pm, edited 1 time in total.
ehaliewicz
Posts: 18
Joined: Thu Oct 10, 2013 3:30 pm

Re: SNES Doom Source Released! Now What?

Post by ehaliewicz »

93143 wrote: Mon May 02, 2022 6:25 pm
ehaliewicz wrote: Mon May 02, 2022 5:34 pm It looks like you have 4MB (or more) of ROM on super fx? That should be plenty for any one level, I think, which means you should be ok with bank switching?
One level? How much of a ROM hog is this sorted-linedef method?
No, no, I'm just saying as long as it's good enough for one level then it should be solvable :) Most likely, one level won't be that large.
I don't actually know how much space it takes up for a full-size level, though; I've only made smaller test levels so far.
93143 wrote: Mon May 02, 2022 6:25 pm I do like the bankswitching idea for the Super FX ROM. I had earlier dismissed it without thinking about it much, because there was apparently nowhere to put an MMIO interface in the Super FX architecture without stomping on part of the RAM. But it occurs to me that it doesn't need to be the Super FX switching banks - the S-CPU has plenty of spare address space in the system area. The only issue is that to my knowledge no such facility ever existed, so it would be custom hardware and thus inauthentic...
To be honest, any cartridge made nowadays will be "inauthentic" in several ways, and plenty of games used bankswitching back then, so I don't think it would have been impossible at the time.

93143 wrote: Mon May 02, 2022 6:25 pm With the right algorithm, decompression should have mostly per-run costs rather than per-pixel costs, and with the right format those costs shouldn't be extreme.

Cached textures would not be quick to draw. Reading a byte from RAM (assuming it's in the same bank as the framebuffer) takes longer than the inner loop of my most recent attempt at a decompressing renderer, and unlike reading from ROM it cannot be parallelized.
How are you planning on achieving good performance if you need to decompress textures, when RAM is too slow to draw textures from?

Also, it seems hard to achieve per-run costs on textures that aren't just runs of single color pixels?
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: SNES Doom Source Released! Now What?

Post by 93143 »

Well, I've been trying to find time to actually work out an optimized multi-layer decompressing wall texturing loop, but whenever I get some free time my brain is fried. I might as well make some noise here in the interim...
ehaliewicz wrote: Mon May 02, 2022 8:44 pm How are you planning on achieving good performance if you need to decompress textures, when RAM is too slow to draw textures from?
Because while writes to RAM are one- or two-cycle fire-and-forget instructions, a single-byte RAM read is a two-byte instruction with (if I understand the documentation correctly) a dead cycle and a mandatory 5-cycle wait state, consuming 8 cycles during which the GSU's processor core can do nothing else. Unless the RAM buffer was already busy with a write, in which case it will take even longer. ROM reads, by contrast, require you to set R14, do something else for at least four cycles to avoid a wait state, and read the byte, for a total compute time cost of as little as one cycle (if you don't count setting up the address, which you also have to do for RAM reads on top of the cost of the actual instruction).

This is all assuming you're running in the I-cache. If you're loading code from ROM or RAM, that complicates matters, but mostly it slows down code (but not memory accesses) by a factor of 5. All numbers assume 21.4 MHz mode with no clock trickery (a single-byte RAM read is 6 cycles in 10.7 MHz mode, only one cycle faster than a word read).

Also, the bigger issue (I think) is that RAM is too small to draw textures from. Remember, the standard texture size in Doom is 128x128. You'd be constantly decompressing new textures at full resolution, and I doubt it would be a net win even if the actual drawing loop ended up radically faster.

...

The Super FX can render colormapped opaque wall pixels to a column-major linear bitmap from an uncompressed texture in ROM at 12 cycles per pixel:

Code: Select all

1	with R8
1	add Rtexstep		; update texcoord
1	getb			; obtain pixel colour
1	to R14
1	merge			; request raw texel from ROM
1	stw Rpixptr		; draw pixel to framebuffer
1	from Rcmapoff
1	to R14
2	getbl			; combine colormap offset and light level with raw texel to look up pixel colour
1	loop			; decrement pixel counter and branch
1	inc Rpixptr		; increment framebuffer pointer
Obviously this requires an appropriate colormap to be stored in-bank. Also, since we've saved a cycle by using stw instead of stb for the pixel write, this loop can only be used with walls at least two pixels high, since there needs to be a once-through tail after the loop that uses stb so as to not overwrite the pixel underneath the one it's supposed to render... I suppose there could be a branch before the loop that skips the loop body if the pixel counter is 1...

Reading the texel from RAM instead causes the same routine to take 19 cycles (and use an additional register):

Code: Select all

1	with R8
1	add Rtexstep		; update texcoord
1	getb			; obtain pixel colour
1	to Rtexptr
1	merge			; compose texel RAM address
1	to R14
8	ldb Rtexptr		; obtain raw texel from RAM
1	stw Rpixptr		; draw pixel to framebuffer
1	with R14
1	add Rcmapoff		; combine colormap offset and light level with raw texel to look up pixel colour
1	loop			; decrement pixel counter and branch
1	inc Rpixptr		; increment framebuffer pointer
If the cached texture and the framebuffer are in different RAM banks, the loop goes up to 25 cycles and uses two more additional registers, because ramb is two cycles and needs to be set up with from Rn for a total of three cycles and a register each way (I counted this wrong in my previous post). It might be possible to optimize this a bit by grouping accesses, but the loop would get considerably more complicated, and as I said my brain is fried.

...

Note that I am not using plot for this. The Super FX and its RAM setup were designed for horizontal rasterization. Unless I've been labouring under a grave misapprehension, drawing vertical columns with plot would bottleneck at 80 cycles per pixel in 21.4 MHz mode. This is because SNES CHR format forces the plotting circuitry to read and then write an entire 8-pixel sliver, at 5 cycles per byte both ways, for every incompletely-plotted sliver. DOOM-FX appears to suffer from this.

If you're doing a lot of column drawing, I figure it's a lot more efficient to just draw to a bytemap, and use plot to convert it to CHR in bulk later. This should add about 11 cycles per copied pixel if you manage the RAM buffer well, which is why it may be advantageous to use Mode 7 for part or all of the display so as to skip the conversion step for at least part of the framebuffer.
Also, it seems hard to achieve per-run costs on textures that aren't just runs of single color pixels?
I'm not specifically talking about RLE, although I think it should be possible to set up a decoder that handles literals fairly quickly. Lots of compression techniques (including all three that I'm considering for walls) work in groups with header bytes. There's a bit of extra cost decoding each texel if it's not a solid-colour run, but typically processing the header bytes constitutes the bulk of the work.

Take none's dictionary method. It's fairly costly to read a pointer and a run length and pop over to the position denoted by the pointer to start loading texels. But once you've done that, you're just reading texels (and tracking the run length), which shouldn't be massively more expensive than an uncompressed texturing loop.
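To illustrate the cost split, a decode loop for that kind of format might look like the following (a C sketch under assumed format details - this isn't none's actual encoding, and the real thing would of course be Super FX assembly). The header work is paid once per run; inside a run it's just a texel fetch and a colormap lookup per pixel:

Code: Select all

#include <stdint.h>

typedef struct {
    uint16_t offset;   /* into a shared dictionary of raw texel bytes */
    uint8_t  length;   /* number of texels in this run */
} run_t;

void draw_column_runs(const run_t *runs, int nruns,
                      const uint8_t *dict, const uint8_t *colormap,
                      uint8_t *dst, int dst_stride)
{
    for (int r = 0; r < nruns; r++) {                 /* per-run header cost */
        const uint8_t *src = dict + runs[r].offset;
        for (int i = 0; i < runs[r].length; i++) {    /* per-texel cost only */
            *dst = colormap[src[i]];
            dst += dst_stride;
        }
    }
}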

What worries me is distant walls. If the header processing has to happen for every pixel or anything close to it, the method is very slow. Mipmapping, with a small uncompressed texture accompanying the large compressed one, could help, but that's a bit silly and hurts the compression ratio...
To be honest, any cartridge made nowadays will be "inauthentic" in several ways, and plenty of games used bankswitching back then, so I don't think it would have been impossible at the time.
To me, authenticity isn't about using battery-backed SRAM instead of F-RAM. It's about the capabilities of the cartridge as seen by the programmer. If it's 512 KB of ROM and an 8 KB save RAM, I don't care whether it uses voltage translation to avoid frying 3.3V parts; it's sufficiently authentic for my purposes.

If pin 21 is real and an easy bankswitching method could be implemented, I wonder if it wouldn't be reasonable to just not compress the textures, except for possibly doing live compositing of switch strips. I doubt I'm going to get 12 cycles per pixel on average using any sort of compression method, although without a full-capability prototype of the decompressing texture mapper (and at least some attempt at compressing data to get some statistics) it's hard to say how close I could get.

Still, using that much ROM on a game with a heavy-duty coprocessor would harm the authenticity factor even if it's technically feasible with period hardware. Early Nintendo 64 games had 8 MB of ROM with no special chips, and no commercial SNES game was even that large. I guess it's a question that should wait until I have a better idea of how much performance the compression would cost me...
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: SNES Doom Source Released! Now What?

Post by 93143 »

This has been bugging me for some time, but I never got around to fixing it.

I've edited this post and this post to reflect the fact that Mode 7 increases the VRAM usage of a framebuffer. This is due to the fact that whether you're using pixel-as-texel or tile-as-texel format, you can't really use the other half of each word for anything else, meaning that in this application Mode 7 basically takes up 16 bits per pixel in VRAM. I appear to have done the calculations without taking this into account.

This unfortunately means that the VRAM-hungry 224x160 viewport would be bumped out of feasibility by the use of essentially any Mode 7, let alone 40% of its surface area; the whole thing would have to be finalized as SNES CHR. Substantially smaller viewports like 128x192 or 216x144 should still be able to use some Mode 7 to reduce format conversion overhead, and 108x144 should fit entirely in Mode 7 with the use of VRAM HDMA to round out the bandwidth.
Señor Ventura
Posts: 233
Joined: Sat Aug 20, 2016 3:58 am

Re: SNES Doom Source Released! Now What?

Post by Señor Ventura »

Well, it depends on the frame rate, right?
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: SNES Doom Source Released! Now What?

Post by 93143 »

No, the problem is that the whole viewport can't be updated in one shot because the bandwidth is too low. There's only so much data that can be updated between the last time the old frame is drawn to the TV and the first time the new frame is drawn, and any framebuffer data that doesn't fit in that single update has to be double buffered in VRAM.

A 224x160 CHR framebuffer (which corresponds to a 224x192 display when the 32-line status bar is included) takes 35,840 bytes in VRAM. Assuming the status bar is clear of sprites to allow VRAM HDMA at 20 bytes per HBlank, and the entire 70-line extended VBlank can be used entirely for framebuffer data, a single update can cover (70x165.5)+(32x20) = 12,225 bytes. Thus the total storage required by the fractional buffering is (2x35,840-12,225) = 59,455 bytes (probably a little more because that isn't a multiple of the tile size). This just barely fits in VRAM with less than 6 KB left for the tilemaps, status bar graphics and incidentals.

Now, any use of Mode 7 will increase the effective size of the framebuffer in VRAM, or reduce the effective size of VRAM, whichever way you prefer to look at it. This is because Mode 7 uses an interleaved format where the bottom byte of each word is a tilemap byte and the top byte is a pixel. This means that half of each word is unavailable to the framebuffering scheme, effectively reducing the available space in VRAM. Using all of the Mode 7 area for Mode 7 graphics would thus reduce the effective size of VRAM to about 48 KB (a little more, really, since some of the Mode 7 tilemap is actually used, and you can save space in the non-Mode 7 tilemap this way).

It's clear that with the amount of room 224x160x8bpp takes up already, there's no room to shrink VRAM around it and still have a viable screen mode. In fact it's awfully tight as it is; one might wish to draw the gun on the Super FX, not just because it won't fit in VRAM, but simply to make sure more lines are clear for VRAM HDMA, which glitches sprites if any are present on those lines.

All of this is out the window if you're willing to put up with screen tearing (displaying a partially updated framebuffer). But screen tearing is horribly ugly, and I'm not willing to put up with it.