SNES Doom Source Released! Now What?

Discussion of hardware and software development for Super NES and Super Famicom. See the SNESdev wiki for more information.

none
Posts: 117
Joined: Thu Sep 03, 2020 1:09 am

Re: SNES Doom Source Released! Now What?

Post by none »

What about this idea:

Only render the walls into the intermediate buffer.

Render sprites and floors directly, horizontally, using PLOT, ordered front to back. While doing so, update a span tree (or similar data structure) for each scanline containing the horizontal spans that have not yet been rendered to. This can be used to avoid overdraw in case of sprite in front of floor, or sprite in front of another sprite.

When copying in the intermediate buffer, also use the span tree to speed it up.

Drawback is no spectres.

This, of course, would only work if floors and sprites could be clipped against the walls that do not yet exist. I don't know if the clipping info can be generated in the BSP step; if not, you could still convert the walls' vertical spans to horizontal spans (I mean only the coverage information for the span tree, not the actual pixels). The original DOS Doom did something similar for the floors, I think (it generated them vertically, then converted them to horizontal for rendering).

Edit: Because of the 8-byte RMW thing, to avoid overhead around the span start and end if they do not fall on an 8-pixel boundary, drawing of the horizontal spans could be delayed (info about what is to be rendered could be attached to the span trees); then whole scanlines could be rendered in one go. This would probably be faster anyways, because of better register usage.
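[Editor's note: the per-scanline span tracking described above might look something like this minimal C sketch. It keeps a sorted list of not-yet-drawn spans per scanline and only draws into (and removes) the overlapping parts; all names are illustrative, and a real Super FX implementation would use a fixed pool rather than malloc/free.]

```c
#include <stdlib.h>

/* One horizontal run of not-yet-drawn pixels on a scanline. */
typedef struct Span {
    int x0, x1;             /* inclusive start, exclusive end */
    struct Span *next;      /* next free span, sorted by x0   */
} Span;

/* Clip the request [x0,x1) against the free list for one scanline,
 * calling draw() only on the still-empty parts and removing them
 * from the list, so later (farther) objects can't overdraw. */
static void fill_span(Span **head, int x0, int x1,
                      void (*draw)(int x0, int x1))
{
    Span **p = head;
    while (*p && x1 > x0) {
        Span *s = *p;
        if (s->x1 <= x0) { p = &s->next; continue; } /* left of request */
        if (s->x0 >= x1) break;                      /* right of request */
        int a = x0 > s->x0 ? x0 : s->x0;             /* overlap start */
        int b = x1 < s->x1 ? x1 : s->x1;             /* overlap end   */
        draw(a, b);
        if (a > s->x0 && b < s->x1) {     /* hole in the middle: split */
            Span *r = malloc(sizeof *r);
            r->x0 = b; r->x1 = s->x1; r->next = s->next;
            s->x1 = a; s->next = r;
            p = &r->next;
        } else if (a > s->x0) {           /* covers the right end */
            s->x1 = a; p = &s->next;
        } else if (b < s->x1) {           /* covers the left end  */
            s->x0 = b; break;
        } else {                          /* fully covered: remove */
            *p = s->next; free(s);
        }
        x0 = b;
    }
}
```

For the deferred variant, the draw callback would record the texture/offset info on the span instead of plotting pixels immediately.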
Oziphantom
Posts: 1565
Joined: Tue Feb 07, 2017 2:03 am

Re: SNES Doom Source Released! Now What?

Post by Oziphantom »

Intense compression is great, but it also needs intense CPU, which is in very short supply when trying to run Doom.

Vertical strips make more sense, as it's easier to do the height compression on them and to reuse the same code as the walls.
Oziphantom
Posts: 1565
Joined: Tue Feb 07, 2017 2:03 am

Re: SNES Doom Source Released! Now What?

Post by Oziphantom »

93143 wrote: Sun Mar 27, 2022 10:01 pm
Oziphantom wrote: Sun Mar 27, 2022 5:31 am Doom needs 8MB of RAM to run so it can't load the whole 11MB WAD into RAM.
It only needs 4 MB to run. It will use up to 8 MB if it's there.
So much for trusting dedicated Doom sites :D
Nikku4211
Posts: 569
Joined: Sun Dec 15, 2019 1:28 pm
Location: Florida

Re: SNES Doom Source Released! Now What?

Post by Nikku4211 »

none wrote: Tue Mar 29, 2022 1:20 am What about this idea:

Only render the walls into the intermediate buffer.

Render sprites and floors directly, horizontally, using PLOT, ordered front to back. While doing so, update a span tree (or similar data structure) for each scanline containing the horizontal spans that have not yet been rendered to. This can be used to avoid overdraw in case of sprite in front of floor, or sprite in front of another sprite.

When copying in the intermediate buffer, also use the span tree to speed it up.

Drawback is no spectres.
Is it possible to render Spectres on another layer, and then use the SNES' native subtractive transparency to blend the 2 layers together?

I know it's not going to be accurate to DOS Doom, but it'll at least differentiate them from regular Pinkies.
And also remind people of the awesome-as-heck PS1 Doom port. :3c
I have an ASD, so empathy is not natural for me. If I hurt you, I apologise.
none
Posts: 117
Joined: Thu Sep 03, 2020 1:09 am

Re: SNES Doom Source Released! Now What?

Post by none »

Nikku4211 wrote: Tue Mar 29, 2022 2:54 pm Is it possible to render Spectres on another layer, and then use the SNES' native subtractive transparency to blend the 2 layers together?
It'd probably work, but it would be very costly (needs an additional framebuffer and DMA to VRAM).

I suppose you could build a second span tree for transparent objects and merge both together while actually rendering the spans, but that also seems expensive.

Rendering the spectres in black and just rendering every second scanline might work well enough...
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: SNES Doom Source Released! Now What?

Post by 93143 »

none wrote: Tue Mar 29, 2022 1:20 am Only render the walls into the intermediate buffer.

Render sprites and floors directly, horizontally, using PLOT,
I had initially thought of only using small intermediate buffers for 8-pixel-wide segments of walls, and copying them immediately. The sprites would have had to be re-encoded in rows rather than columns, probably increasing their size and ideal (non-bottlenecked) rendering time somewhat, but the floors would have worked perfectly (when not using the mosaic trick, which I hadn't confirmed as working at that point).

The reason I went to a full-screen buffer is that I realized that Spectres and transparent walls would require everything behind them to be drawn in the intermediate buffer, and that doing so would eliminate RAM access penalties from partially filled slivers. I thought of using a halfway house where the screen would be divided into chunks, and only chunks with transparent walls or Spectres in them would be fully buffered, but the complexity scared me off. I also figured (more as self-justification than based on any rigorous analysis) that the associated performance increase would be inconsistent at best because of the sliver RMW penalty showing up at every non-tile-aligned edge in the direct-to-CHR parts of the scene.

Also, if using the mosaic trick, the flat and sprite renderers would have to render into mosaic trick format, which is a minor issue for flats but prevents sprites from gaining a substantial advantage from repeating the same texel, as long horizontal runs now bottleneck at 20 cycles per pixel instead of 5.

It might be worth reconsidering, perhaps with a smarter way to defer and combine wall recopying. If it works, it works - any performance increase that fits in memory and doesn't trade away features is a good thing when you're not on a deadline...
ordered front to back. While doing so, update a span tree (or similar data structure) for each scanline containing the horizontal spans that have not yet been rendered to. This can be used to avoid overdraw in case of sprite in front of floor, or sprite in front of another sprite.
I'm not sold on span-buffering mask objects. Masking compressed graphics of arbitrary shape with other compressed graphics of arbitrary shape strikes me as a lot of work, with potential for high RAM usage and bad edge cases. And sprites don't usually result in massive overdraw in basic Doom.

Note that partially transparent walls exist in Doom. They have holes in them that you can see through. Some of these are quite convoluted in shape, and can be repeated for a considerable length that can be viewed at any angle, potentially through another partially transparent wall:

E2M5_corner_and_clip.png

The original DOOM-FX didn't have transparent walls, and it didn't have Command Center. I'd be aiming to include both.
Drawback is no spectres.
Yes, that's a problem too. Even discounting the lack of easy framebuffer reads, the reverse painter's algorithm doesn't play well with transparency without a blend buffer, and a blend buffer is way too expensive.
This, of course only would work if floors and sprites could be clipped against the walls that do not yet exist.
You can't find out where the floors are without finding out where the walls are, and the geometry is traversed front-to-back, convex subsector by convex subsector. If a wall doesn't exist yet, it can't clip what you're looking at. Doom is designed to only draw each pixel once (aside from sprites and partly transparent walls), and any attempt at simplifying or optimizing the engine would do well to at least keep this feature.

I was considering a hybrid portal/BSP engine, because allegedly portal is faster in most cases, but determining the subsector something is in seems a bit ad hoc. But obviously I need to get more familiar with the practical realities of these techniques before making any solid decisions...
Edit: Because of the 8-byte RMW thing, to avoid overhead around the span start and end, if they do not fall on an 8 pixel boundary, drawing of the horizontal spans could be delayed (info about what is to be rendered could be attached to the span trees), then, the whole scanlines could be rendered in one go. This would probably be faster anyways, because of better register usage.
What do you mean by better register usage?

I find that the Super FX CPU core's 15 usable registers (the 16th is the program counter) are not as many as they sound, particularly when drawing compressed textures with a layer of indirection. If I've counted correctly, my compressed flat renderer uses 14 registers just for one colourmapped affine raster line, and I think my enhanced-compression wall renderer uses all 15. I could save a few registers, but it would cost cycles.

Also, the PC version uses over 80 KB on the visplane structure; I suspect it might be better to draw everything as quickly as possible to avoid wasting RAM. There can be a lot of texture fragments and lighting conditions scattered across a single scanline, as can be seen in the above screenshot... Also, doesn't this scheme basically require you to decode all the sprites twice?

Oziphantom wrote: Tue Mar 29, 2022 1:59 am Intense compression is great, but it also needs intense CPU, which is in very short supply when trying to run Doom.
If my method works as well as I hope, it shouldn't take much more time than regular RLE, and possibly less.
Oziphantom wrote: Tue Mar 29, 2022 2:02 am
93143 wrote: Sun Mar 27, 2022 10:01 pm
Oziphantom wrote: Sun Mar 27, 2022 5:31 am Doom needs 8MB of RAM to run so it can't load the whole 11MB WAD into RAM.
It only needs 4 MB to run. It will use up to 8 MB if it's there.
So much for trusting dedicated Doom sites :D
The Doom Black Book says this:
Fabien Sanglard wrote:The engine runs on a clearly-established memory budget. Upon starting up on NeXT the memory manager allocates 4 MiB of RAM and not a byte more. This is done in order to make sure the advertised minimum 4 MiB configuration is sufficient. On DOS the memory manager checks that the machine has at least 4 MiB but will use up to 8 MiB if available in order to improve its cache retention.
And I've seen the 4 MB number in other places as well.

But I just checked the Ultimate Doom manual, and it says 8 MB.

I think the implication might be that Episode 4: Thy Flesh Consumed requires 8 MB, but the first three episodes (i.e. regular Doom) do not.

Nikku4211 wrote: Tue Mar 29, 2022 2:54 pm Is it possible to render Spectres on another layer, and then use the SNES' native subtractive transparency to blend the 2 layers together?
Yes, but there are a couple of drawbacks.

1) If you're not using a span buffer on sprites, you have to draw every sprite in front of the Spectre twice.

2) It takes up more memory, meaning GPRAM and VRAM are stressed, so large screen sizes might not work. Also, DMA takes longer, which further restricts screen size in addition to the compute time penalty because fractional double-buffering only saves the amount of space you can fill in one frame. If one were to attempt this, it would probably be wise to go with 2bpp if possible.

Solving the VRAM and DMA issues by combining the layers on the Super FX is possible, but then the Spectre buffer has to be a full-size bytemap instead of a 2bpp or 4bpp CHR buffer, which worsens the GPRAM issue.
I know it's not going to be accurate to DOS Doom
It might not be far off, at least in well-lit areas, depending on how it's drawn. A 2bpp buffer with a few shades of dark gray might allow something very close to the original effect.

none wrote: Tue Mar 29, 2022 4:56 pm I suppose you could build a second span tree for transparent objects and merge both together while actually rendering the spans, but that also seems expensive.
Why would you need two span buffers? Spectres don't mask anything.
Rendering the spectres in black and just rendering every second scanline might work well enough...
Maybe. Alternately, just drawing black pixels where the original distortion pattern would have put the darkest pixels might look decent (though it would hit a span buffer pretty hard, especially up close). Using solid black could look a bit obvious under certain lighting conditions, though.
none
Posts: 117
Joined: Thu Sep 03, 2020 1:09 am

Re: SNES Doom Source Released! Now What?

Post by none »

93143 wrote: Wed Mar 30, 2022 2:33 am Masking compressed graphics of arbitrary shape with other compressed graphics of arbitrary shape strikes me as a lot of work, with potential for high RAM usage and bad edge cases.
It's maybe not as bad as you think, if you use an RLE scheme where you do not encode the length, but the end x-position of the span. I would use such a scheme anyways (also for the walls), because it speeds up searching into the texture if you need to clip left (or top in the case of walls) - you can then use a binary search instead of a linear search.
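[Editor's note: because the end positions are stored in sorted order, locating the run that covers a given clipped coordinate is a standard binary search. A minimal C sketch of the idea; the record layout and names are hypothetical, not any actual texture format.]

```c
/* Hypothetical column format: each run stores its *end* position
 * rather than its length, so the run array is sorted by 'end' and
 * a left/top clip becomes a binary search, not a linear walk. */
typedef struct {
    unsigned char end;    /* first coordinate AFTER this run */
    unsigned char color;  /* palette index for the run       */
} Run;

/* Return the index of the run covering coordinate y.
 * Finds the first run whose 'end' is greater than y. */
static int find_run(const Run *runs, int n, int y)
{
    int lo = 0, hi = n - 1;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (runs[mid].end <= y)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

Clipping then just means starting the span loop at `find_run(runs, n, clip_start)` instead of index 0.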
93143 wrote: Wed Mar 30, 2022 2:33 am Note that partially transparent walls exist in Doom. They have holes in them that you can see through. Some of these are quite convoluted in shape, and can be repeated for a considerable length that can be viewed at any angle, including through another partially transparent wall:
You're right, I didn't think of those.
93143 wrote: Wed Mar 30, 2022 2:33 am You can't find out where the floors are without finding out where the walls are, and the geometry is traversed front-to-back, convex subsector by convex subsector. If a wall doesn't exist yet, it can't clip what you're looking at.
Per subsector, you only need to know the left and right clipping coordinates. I think you can get those while traversing the BSP even if you didn't yet render the walls (but I'm not too sure about that). It would work with a portal engine in any case.
Edit: I think I misunderstood what you meant. You were basically saying it IS possible, right?
93143 wrote: Wed Mar 30, 2022 2:33 am What do you mean by better register usage?
I mean not in the pixel renderer itself, but when the actual rendering is deferred, the surrounding code (generating the spans, stride and offset, fetching texture coordinates, etc.) does not need to page out its state when calling the pixel renderer.
93143 wrote: Wed Mar 30, 2022 2:33 am Also, doesn't this scheme basically require you to decode all the sprites twice?
Only if you do RLE over the non-transparent pixels, which for the sprites, wouldn't work well anyways because you don't get very long same colored runs.
93143 wrote: Wed Mar 30, 2022 2:33 am The Doom Black Book says this:
Fabien Sanglard wrote:The engine runs on a clearly-established memory budget. Upon starting up on NeXT the memory manager allocates 4 MiB of RAM and not a byte more. This is done in order to make sure the advertised minimum 4 MiB configuration is sufficient. On DOS the memory manager checks that the machine has at least 4 MiB but will use up to 8 MiB if available in order to improve its cache retention.
And I've seen the 4 MB number in other places as well.

But I just checked the Ultimate Doom manual, and it says 8 MB.

I think the implication might be that Episode 4: Thy Flesh Consumed requires 8 MB, but the first three episodes (i.e. regular Doom) do not.
I've played Doom 2 (v1.666) back in the day on a 386 with 4 MB of RAM. It didn't even require a customized boot disk for reducing the DOS memory footprint; lots of games required that because they only barely fit into 4 MB.

Frame rate was so-so (I used to reduce the window size a little), but I don't remember whether there were noticeable spikes in the frame rate from paging stuff in from the HD.
93143 wrote: Wed Mar 30, 2022 2:33 am Why would you need two span buffers? Spectres don't mask anything.
Because it simplifies merging in the spectres if rendering is deferred - if reading back from the frame buffer is to be avoided, you need to know where the spectres are while you are drawing the spans and do the transparency on the fly.

....

What about the other way around? Render the walls and sprites directly, and the floors into the intermediate buffer (still using the span buffer / deferred rendering scheme). Then rotate everything 90 degrees using Mode 7. This would make the transparent walls possible, and it allows encoding sprites vertically (although I'm not sold on this really being significantly better). With Mode 7 you can also have the double-wide pixels, but they pack better than with the scroll trick, because you can just stretch them via matrix manipulation.

When using Mode 7 to rotate the walls, maybe it is possible to render the floors into (hardware) sprites, and composite the floor and wall layers in hardware? Then the intermediate buffer wouldn't be needed at all. This would need a scheme to detect empty tiles, though, to reduce the required additional DMA bandwidth. Also, double-wide pixels for floors wouldn't be possible in hardware.

About using a portal engine: I've been working on a custom portal engine myself (for DOS, though). I'm only using triangular sectors. When traversing from front to back, I know which side I'm coming from and can infer the opposing point from that. I calculate the point's on-screen x coordinate and divide the clipping area into two halves based on that. Then I render (front to back) the sector's sprites, then the floors and ceilings, and then the walls, and finally I queue rendering of the two adjacent sectors (with the divided clipping regions). This scheme works quite well, I think (compared to using convex sectors with an arbitrary number of sides). I currently use a simple 1-bit coverage buffer for masking the walls and floors against the sprites (this is not ideal...). I'm now planning to implement a span buffer (similar to what's described above), which would conveniently allow merging together all the floor and sprite spans that were cut up by all the clipping. I guess a similar scheme would work for SNES Doom too. You would need to keep track of which sector the monsters are in, though.
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: SNES Doom Source Released! Now What?

Post by 93143 »

none wrote: Wed Mar 30, 2022 4:53 am It's maybe not as bad as you think, if you use an RLE scheme where you do not encode the length, but the end x-position of the span.
But the biggest problem is the transparent walls, which can't be drawn in horizontal spans. Anyway, the enemies aren't encoded with full RLE, only gap and post RLE, and I don't think they tend to have enough holes in them to make it worthwhile to optimize the search algorithm, not unless there's additional compression going on.

Which there is, in my scheme. Hmm...

I had thought of including a table of line starts with each wall texture, since the lines are not constant-length in ROM, and clipping a wall on the left might otherwise be a lot of work. The sprite format might benefit from something similar.
93143 wrote: Wed Mar 30, 2022 2:33 am If a wall doesn't exist yet, it can't clip what you're looking at.
Edit: I think misunderstood what you meant. You were basically saying it IS possible, right?
Yes. My understanding of the Doom engine is shaky, but I'm pretty sure the way the geometry is traversed generates walls and floors/ceilings in the same front-to-back process, and the floors and ceilings are specifically defined (in screenspace) as being where the walls aren't.
93143 wrote: Wed Mar 30, 2022 2:33 am What do you mean by better register usage?
I mean not in the pixel renderer itself, but when the actual rendering is deferred, the surrounding code (generating the spans, stride and offset, fetching texture coordinates, etc.) does not need to page out its state when calling the pixel renderer.
The problem there is that unless I've missed a lot of optimization, you can't actually do that without slowing down the renderer, because it uses all or almost all of the registers by itself. Using immediates is more expensive than using register values, especially if they're more than four bits.

I wonder if it's worth it to have slower renderers for very short runs that don't require so much higher-level information to be stashed in RAM...
93143 wrote: Wed Mar 30, 2022 2:33 am Also, doesn't this scheme basically require you to decode all the sprites twice?
Only if you do RLE over the non-transparent pixels, which for the sprites, wouldn't work well anyways because you don't get very long same colored runs.
Well, my super sekrit scheme for additional compression does mean that a non-transparent post needs to be partly decoded before you know how tall it is. Perhaps I could fiddle with it a bit and see if I can't change that - it would probably cost a bit of compression ratio, but it might not be terrible...
93143 wrote: Wed Mar 30, 2022 2:33 am Why would you need two span buffers? Spectres don't mask anything.
Because it simplifies merging in the spectres if rendering is deferred - if reading back from the frame buffer is to be avoided, you need to know where the spectres are while you are drawing the spans and do the transparency on the fly.
I was still stuck in the context of just transferring the Spectre buffer to VRAM separately, so I didn't see what you meant.

This is basically a better version of the idea of blending the Spectre buffer and the main buffer on the Super FX. It solves the problem of needing a whole second bytemap, and doesn't require the framebuffer to be readable to work.

But you still can't get perfect accuracy this way, since the Spectre effect doesn't sample the same pixel it's going to write to.
mode 7
The thought had occurred to me.

If you're using a bytemap buffer, flats only suffer a speed penalty of one cycle per pixel when rendering on the minor axis (that's the difference between inc Rn and with Rn; add Rm). So if you're not going to recopy everything to CHR format, all you really need is a column-major bytemap buffer that everything gets rendered to, and then you're done. Using tepples' scheme, you can display such a framebuffer directly in Mode 7 with an appropriate combination of matrix and tilemap, and unlike the other two ways to do this it isn't limited to 128 pixels in the vertical axis.

However, it is still limited to no more than 16,384 pixels total, so the screen size achievable with this method is not particularly impressive. And bandwidth is a problem since Mode 7 doesn't allow multiple tile pools, which severely limits double-buffering. With VRAM HDMA it's possible to force 108x144 to work, matching DOOM-FX, but anything much bigger than that requires a CHR buffer. High-detail mode is even worse, with 156x104 being about the limit.
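[Editor's note: the pixel budget above can be sanity-checked. Mode 7 can name at most 256 unique 8x8 tiles, so a framebuffer displayed directly through it tops out at 256 × 64 = 16,384 pixels. A trivial check against the screen sizes mentioned in this post, assuming the tile count is the binding constraint:]

```c
/* Mode 7 addresses at most 256 unique 8x8 tiles, so a framebuffer
 * displayed directly through it holds at most 256 * 64 pixels. */
enum { MODE7_MAX_PIXELS = 256 * 64 };   /* = 16384 */

static int fits_in_mode7(int w, int h)
{
    return w * h <= MODE7_MAX_PIXELS;
}
```

108x144 (15,552 pixels) and 156x104 (16,224) squeak in under the limit; 128x192 (24,576) and 224x160 (35,840) do not, which is why those sizes need the CHR-buffer recopy path.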

I'm liking the idea of using Mode 7 for the display settings with the lowest pixel counts, so as to skip the recopy step. This could save as much as 8 ms for screen sizes near the Mode 7 limit, extending the performance advantage of these settings.

I do think it's possible to manage a 256x224 screen in low-detail mode (256x192 viewport, 128x192 framebuffer), or a 224x192 screen in full resolution (224x160 viewport and framebuffer), with judicious use of VRAM HDMA (required for decent frame rate with 112x192, and to fit 224x160 into VRAM at all). These framebuffers are far too large for Mode 7, and the full-resolution one in particular is likely to chug as badly as Randy's port or worse unless I can speed up the rest of the engine.

I suppose I could use Mode 7 for part of a large framebuffer, so as to save the amount of time it would have taken to recopy it. If my calculations are correct, this might allow nearly half of the 128x192 framebuffer to go unconverted, or 40% of the 224x160 buffer, saving about 5.5 and 7 ms respectively.

EDIT: I don't think the 224x160 buffer would work with that much Mode 7 in it. I think I forgot that Mode 7 is effectively 16bpp in VRAM, so there outright isn't room to double-buffer it. The 128x192 buffer might still work, since it's a lot smaller...
And allows encoding sprites vertically (although I'm not sold on this really being significantly better).
I don't know how significant it is, but if I don't need to encode in rows I'd rather not. Every little bit helps.

Then again, with my enhanced compression scheme, it's possible that the best compression axis might not be the same one as for Doom's picture format...
(still using the span buffer / deferred rendering scheme)
You still think that's going to work after seeing that screenshot?

I doubt span-buffering maskables is practical, although I'm open to changing my mind. Those partly transparent walls in particular can get quite nasty, and they do need to be drawn vertically. Also remember that this is low-resolution on primitive hardware; saving a few pixels of overdraw at the cost of having to set up an additional raster line might not be a net win even if the data structure fits in RAM. And reading data from RAM needs to be minimized because it cannot be parallelized; the core stalls until the data is ready.

It would be great to be able to not bother storing all the irregular portal shapes generated when traversing the solid geometry. Perhaps a vertical span buffer could work - I don't think there are any partly transparent floors in Doom... but deferred rendering would require far too much data to be switched out between segments, I think. If you draw one thing at a time, you can keep the same raster parameters and just advance them without rendering in order to cross occluded spans.

I'm still not convinced that the RAM usage would remain respectably bounded (just one of those vine walls can occlude a dozen spans in a single vertical line all by itself, and as you can see there's at least one point in the game where multiple vine walls overlap on screen), and Spectres are still a problem. A separate Spectre span buffer (non-merging, so Spectres can operate on each other) with masking from the main span buffer could allow Spectres to be drawn last, as long as the framebuffer format is readable...

...wait, no. The partial invisibility effect samples adjacent pixels, meaning that Spectres can end up containing copies of pixels that are later occluded. The scheme I just described would have Spectres containing pixels copied from the occluders instead. Which I guess can happen with solid geometry in the real game, but I'd consider that a bug. This might be an acceptable break from authenticity...
About using a portal engine, I've been working on a custom portal engine myself (for DOS though).
Interesting! Is there any public information about it? (I guess there is now...)
I'm only using triangular sectors. When traversing from front to back, I know which side I'm coming from and can infer the opposing point from that. I calculate the point's on-screen x coordinate and divide the clipping area into two halves based on that. Then I render (front to back) the sector's sprites, then the floors and ceilings, and then the walls, and finally I queue rendering of the two adjacent sectors (with the divided clipping regions). This scheme works quite well, I think (compared to using convex sectors with an arbitrary number of sides).
Clever. Do you suppose going to triangular subsectors in Doom would simplify things enough that the increased number of subsectors wouldn't bog things down and/or eat too much ROM for level data?

There's a renderer for the Mega Drive called PortalView that uses "sorted linedefs". I'm imagining something where the linedefs associated with a subsector are ordered clockwise or something, so you can stop searching when you run off the edge of the portal. With subsectors of arbitrary convex geometry, you might want to be able to merge multiple portals opening into the same subsector so you don't have to process that subsector multiple times, but if you could come up with a visible linedef quickly and just search left and right until you get an occluded vertex, it might not be so bad. And a split portal is really two portals anyway, so maybe it's good to handle a split sector twice. It's hard to see how you could easily pick a linedef that can be seen through the portal, though - perhaps the portal data could suggest a likely candidate, resulting in a shorter search...
I currently use a simple 1-bit coverage buffer for masking the walls and floors against the sprites (this is not ideal...). I'm now planning to implement a span buffer (similar to what's described above), which would conveniently allow merging together all the floor and sprite spans that were cut up by all the clipping. I guess a similar scheme would work for SNES Doom too.
I wonder how Randy did clipping? In the secret room in E2M1, the Cacodemon is very clearly clipped by the pillar edges even above the visible extent of the pillar. (In the PC version, the Cacodemon doesn't seem to be able to exceed the height of the roof, so I can't tell whether the SNES version's clipping behaviour is accurate or not.) But normal clipping by horizontal edges works fine at any angle.

I suppose I could try to figure it out from the source code...
You would need to keep track of which sector the monsters are in though.
I was thinking about perhaps using the blockmap to narrow down how much BSP traversal you have to do to decide which subsector a set of coordinates is in. Does that make any sense?

How does a normal portal engine do it? Just check for crossed portals, using roughly the same math as a BSP branch (or a wall collision)?

...

Since GPROM is so limited, I was hoping to be able to put level data in CPUROM, which means the S-CPU would have to run a lot of these initial geometry routines to figure out which subsectors to send up to the Super FX, not to mention running monster AI (at least in areas the Super FX doesn't know about). The S-CPU has an MMIO multiplier, but it's 8x8 unsigned, and the PPU's 16x8 signed multiplier can't be used while Mode 7 is active, so minimizing the number of multiplications this part of the engine needs to do is rather important.

If the SNES side turns out to be the long pole, it may be necessary to use exclusively Mode 3 for all screen sizes just to allow the S-CPU to access the PPU multiplier...

Lookup tables do work really well on the S-CPU, much better than on the Super FX. The S-CPU also has a 16/8 unsigned divider, which is something the Super FX completely lacks (it has to use a reciprocal table), and I suppose that might be useful.
Last edited by 93143 on Wed Jan 25, 2023 1:15 am, edited 1 time in total.
none
Posts: 117
Joined: Thu Sep 03, 2020 1:09 am

Re: SNES Doom Source Released! Now What?

Post by none »

93143 wrote: Thu Mar 31, 2022 3:24 am But the biggest problem is the transparent walls, which can't be drawn in horizontal spans.
I should have clarified: for the walls it's of course not the x-position but the y-position, but it's the same general idea.

93143 wrote: Thu Mar 31, 2022 3:24 am I had thought of including a table of line starts with each wall texture, since the lines are not constant-length in ROM, and clipping a wall on the left might otherwise be a lot of work.
That table is going to be quite big... maybe it's possible to build a compression format that has equally spaced line entries.
For example, if you want to achieve a compression ratio of 1:4, always reserve 32 bytes (or 16 bytes in the case of 4-bit color) for each of the 128-high columns.
Then, if you encounter a segment that needs more space, you encode a pointer or an offset to somewhere else in the data buffer. (A scheme like this would be good for compression anyways, since now you can reuse data from other lines, and possibly also from other textures if the pointer is long enough: for example, the stone wall texture has about 20 variations - with demon head, without demon head, with blood, without - and it would be great to be able to reuse the data for those.) Maybe it makes sense to create texture sets that, with compression, each fit into 64K...
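[Editor's note: the fixed-slot idea above might decode like this C sketch. Slot size, escape byte, and offset width are all made-up parameters; the point is that column K starts at a fixed stride, so no start table is needed, and an escape byte redirects overflowing or shared columns elsewhere in the bank.]

```c
/* Hypothetical fixed-stride texture layout: every column gets a
 * SLOT_SIZE-byte slot, so column K begins at base + K*SLOT_SIZE.
 * A slot whose first byte is ESCAPE instead holds a 16-bit offset
 * to out-of-line data (which several columns, or several texture
 * variants, can share). */
#define SLOT_SIZE 32
#define ESCAPE    0xFF

static const unsigned char *column_data(const unsigned char *base, int col)
{
    const unsigned char *slot = base + col * SLOT_SIZE;
    if (slot[0] == ESCAPE)                       /* out-of-line column */
        return base + (slot[1] | (slot[2] << 8));
    return slot;                                 /* data stored in place */
}
```

The cost is one compare per column fetch; the win is constant-time column lookup without a per-texture table of line starts.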

Also, clipping walls at the top (or bottom, depending on rendering direction) is equally bad, I think. Horizontal clipping only has to be done once per wall segment; vertical clipping needs to be done on every column in certain situations (when standing close to a wall, for example).

93143 wrote: Thu Mar 31, 2022 3:24 am You still think that's going to work after seeing that screenshot?

I doubt span-buffering maskables is practical, although I'm open to changing my mind. Those partly transparent walls in particular can get quite nasty, and they do need to be drawn vertically. Also remember that this is low-resolution on primitive hardware; saving a few pixels of overdraw at the cost of having to set up an additional raster line might not be a net win even if the data structure fits in RAM. And reading data from RAM needs to be minimized because it cannot be parallelized; the core stalls until the data is ready.

It would be great to be able to not bother storing all the irregular portal shapes generated when traversing the solid geometry. Perhaps a vertical span buffer could work - I don't think there are any partly transparent floors in Doom... but deferred rendering would require far too much data to be switched out between segments, I think. If you draw one thing at a time, you can keep the same raster parameters and just advance them without rendering in order to cross occluded spans.


I'm still not convinced that the RAM usage would remain respectably bounded (just one of those vine walls can occlude a dozen spans in a single vertical line all by itself, and as you can see there's at least one point in the game where multiple vine walls overlap on screen),
Yeah, the vine walls are really horrible. You've got a lot of good points there.
93143 wrote: Thu Mar 31, 2022 3:24 am and Spectres are still a problem. A separate Spectre span buffer (non-merging, so Spectres can operate on each other) with masking from the main span buffer could allow Spectres to be drawn last, as long as the framebuffer format is readable...
I'd still merge the spectre spans, incrementing a "spectre count" per span.

93143 wrote: Thu Mar 31, 2022 3:24 am Is there any public information about it?
No, it's still in a very experimental stage.

93143 wrote: Thu Mar 31, 2022 3:24 am Do you suppose going to triangular subsectors in Doom would simplify things enough that the increased number of subsectors wouldn't bog things down and/or eat too much ROM for level data?
I've not really been that greedy about the size, but I suppose you can make the format quite small if you put everything in one large array and exploit the spatial distribution of the sectors (i.e. you need to be careful about how the sectors are numbered when the map is compiled). I'm aiming for something like this:

Code: Select all

union {
  struct {
    unsigned char flags; // 2 bit sector flags, 3x2 bits target side flags (00 = side 1, 01 = side 2, 10 = side 3, 11 = wall)
    char sector_offset;
    union {
      char portal_offset;
      char wall_offset;
    } side[3];
    char vertex_offset[3];
  } compressed_subsector;
  struct {
    unsigned char flags; 
    union {
      unsigned short portal_index;
      unsigned short wall_index;
    } side[3];
    unsigned short vertex_index[3];
  } uncompressed_subsector;
  struct {
    short x;
    short y;
  } vertex[2]; // two vertices share one 8-byte "cell"
  struct {
    char floor_z, ceiling_z;
    unsigned char floor_flat_index, ceiling_flat_index;
    // doom would also need the sector tag, lighting, and special attributes
  } sector;
  struct {
    unsigned char texture[3];
    char portal;   // if not zero, this is a sector boundary. these entries can not be shared over multiple subsectors.
    // doom would also require some additional attributes from the linedef here
  } wall;
} map[];
This is just pseudocode - I'm aware a real union would pad all structs to the same size. The idea is to interleave all the data, but have mostly 8-byte and some 16-byte entries which use two "cells", the exception being vertices, where two vertices go into one 8-byte "cell". If the data were in RAM, I'd place the projected vertex there, but since it needs to live in ROM...

This model has, space-wise, some advantages over variable-length portals. Variable portal counts mean that the subsectors vary in size, so if you want to pack them tightly, you cannot use indexes to point to them; you need to store byte or word pointers instead. This means that in the compressed form, you can only address a range of 256-512 bytes instead of 2048 bytes. In large, open maps, or maps with long-winded loops, the map geometry would allow using the compressed format less often, because you cannot number sectors and vertices so that everything that is next to each other in 2D space also has similar sector numbers. I can't tell if Doom's levels are big enough for that to matter much, though. The 2-bit side number field would not work anymore and would need to be expanded to at least 3 bits, or better 4 bits. You could do without the side number, but this would hit performance in the traversal mechanism (it requires, for example, a backfacing/frontfacing check for every side, depending on implementation).

In total, say you have a map with only square subsectors: while you would need double the number of triangular subsectors, each subsector can shrink from something like 12-24 bytes down to 8 bytes. I should probably write a recompiler for Doom maps to get some actual numbers.
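To make the relative addressing concrete, here's a minimal sketch (the names and the exact cell representation are mine): the map is one flat array of 8-byte cells, and a compressed subsector's signed byte offsets address cells relative to itself, reaching +/-128 cells = +/-1024 bytes:

```c
#include <stdint.h>

/* One 8-byte "cell" in the interleaved map array; subsectors, walls,
   sectors, and vertex pairs all share this pool (layout illustrative). */
typedef struct { uint8_t raw[8]; } map_cell;

/* In the compressed form a signed 8-bit offset is relative to the
   current cell; the uncompressed form would use absolute 16-bit
   indexes instead, at the cost of a larger entry. */
static const map_cell *resolve(const map_cell *self, int8_t offset)
{
    return self + offset;   /* offsets count cells, not bytes */
}
```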

If you do the traversal on the S-CPU, in comparison to x86, I imagine there might be some additional performance gains, because there are really few registers available:
- no inner loop for iterating through the sides; in addition to saving the loop setup cost, this means fewer registers get clobbered during the vertex projection step.
- for the three sides you can be looking into the sector from, you can make three copies of the code, one for each case, meaning you can use lots of immediates (I guess for n-sided sectors it would still be reasonable with a few unrolled loops and a jump table), and you can also make better-optimized specialized code paths for when a vertex is not in the clipping region and/or is behind the camera plane.
- the 65816 saves 1 cycle every time you use a byte-sized field instead of a word (this is strongly simplified, but still...)

Another point of note is that you effectively have to project fewer vertices, because you can cull them earlier, and the projection will take a big chunk of CPU time: for each axis you need two 16-by-8-bit multiplies and two 16-bit additions for the matrix transform, plus the perspective division. So every vertex that is saved counts. Also, fewer sprites per sector means depth-sorting the sprites becomes faster.
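As a rough illustration of that per-vertex cost - the fixed-point formats here (1.7 sin/cos, a 128-unit screen scale) are assumptions for the sketch, not engine values:

```c
#include <stdint.h>

typedef struct { int16_t x, y; } vec2;

/* Camera-space rotation with 8-bit sin/cos (as the S-CPU's 8-bit
   multipliers would force): per axis, two 16x8 multiplies and an
   add/subtract, matching the cost estimate above. */
static vec2 rotate(vec2 v, int8_t cos8, int8_t sin8)   /* 1.7 fixed point */
{
    vec2 r;
    r.x = (int16_t)((v.x * cos8 - v.y * sin8) >> 7);
    r.y = (int16_t)((v.x * sin8 + v.y * cos8) >> 7);
    return r;
}

/* The perspective division; cam.y is depth here. */
static int16_t project_x(vec2 cam)
{
    return (int16_t)((cam.x * 128) / cam.y);
}
```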

The main drawback is not the traversal itself, but the additional clipping that needs to be done: sectors that are further away from the camera get clipped by the sectors in front of them into thinner stripes than would really be needed. Avoiding that is very tricky; detecting and combining adjacent stripes beforehand is not trivial because of depth ordering issues - at least I didn't find a good way to do it. I thought about depth-sorting sectors, keeping lists of sectors to be drawn, keeping lists of adjacent lines with the vertices - it introduces the kind of complexity the Build engine's portal system has, and at that point you could probably just as well do proper concave sectors.

It's not a big problem in my case: I save the state the wall renderer was in (the screen y coordinates and strides) with the wall data, and for floors, the last texture coordinate and offset for each scanline. If I encounter a wall or a floor that has already been partially drawn, since the stripes I draw are strongly ordered from left to right, I can always load back the old values. So "real" clipping only needs to be done where it's actually needed, and only to the left, and it can be done in screen space.

On the S-CPU that could mean bigger overhead, however. Since the data is in ROM, the values from the last stripe cannot be stored with the data. It would need an additional RAM buffer, which does not play well at all with the "interleaved" indexing, since indices are sparse (the same goes for saving already-projected vertices, by the way). One "solution" would be to store the entire level data in RAM (which at least would allow strong compression of the map data).

Another possibility is just living with sparse entries in the buffer. This would basically waste a lot of RAM. But maybe it can be "recycled": The indices of the unneeded entries luckily never change, and could be used for example for implementing a memory pool where all kinds of linked lists and similar small dynamically allocated objects could be stored (say for example the per-sector actor list entries). On the SNES this can work very well, because you don't have to think about cache efficiency issues.

Alternatively, to solve the problem with the walls, a small stack of wall renderer states encountered in the last stripe might work - since this is Doom, it doesn't need to support room-over-room. Storing and reusing projected vertices, on the other hand, is tricky without wasting RAM; I don't know a good solution. A hashmap-like cache table could work - say 1.5K, for 256 entries containing vertex id, x, and z or 1/z - but it would require recalculating some vertices in case of a cache miss. I don't know how often that would happen - probably not all too often, because accesses to the same vertex tend to cluster together.
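Such a cache table might look like this (a direct-mapped sketch; 6 bytes x 256 entries is the 1.5K figure above; field choices are mine, and it ignores the ambiguity of a zero-initialized entry matching vertex id 0):

```c
#include <stdint.h>

typedef struct {
    uint16_t id;      /* vertex id, doubles as the tag */
    int16_t  x;       /* projected screen x */
    uint16_t inv_z;   /* 1/z for depth ordering */
} vcache_entry;       /* 6 bytes x 256 entries ~= 1.5K */

static vcache_entry cache[256];

/* Low byte of the id is the hash; a mismatched tag is a miss,
   meaning the vertex must be reprojected. */
static int cache_lookup(uint16_t id, vcache_entry *out)
{
    vcache_entry *e = &cache[id & 0xFF];
    if (e->id == id) { *out = *e; return 1; }
    return 0;
}

static void cache_store(uint16_t id, int16_t x, uint16_t inv_z)
{
    vcache_entry e = { id, x, inv_z };
    cache[id & 0xFF] = e;   /* evicts whatever hashed here before */
}
```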

For the floors, you only know the values after the spans have actually been rendered, which would be part of the code living on the Super FX - but a span buffer would solve that, even a simple one that just keeps an array of floor spans strongly ordered from left to right, which you can append to if needed (adding in a partially clipped span would amount to changing the end offset of the last span). Or just a single entry per scanline that is flushed when a span has finished.
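A minimal version of that append-only span line might look like this (field sizes and the fixed 16-span cap are assumptions for the sketch):

```c
#include <stdint.h>

typedef struct { uint8_t x0, x1; uint8_t flat; } floor_span;

typedef struct { floor_span spans[16]; int count; } span_line;

/* Because spans arrive strongly ordered left to right, appending a
   partially clipped span reduces to extending the end offset of the
   last span - the O(1) case described above. */
static void add_span(span_line *line, uint8_t x0, uint8_t x1, uint8_t flat)
{
    if (line->count > 0) {
        floor_span *last = &line->spans[line->count - 1];
        if (last->flat == flat && x0 <= last->x1 + 1) {
            if (x1 > last->x1) last->x1 = x1;   /* extend in place */
            return;
        }
    }
    floor_span s = { x0, x1, flat };
    line->spans[line->count++] = s;             /* append a new span */
}
```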
93143 wrote: Thu Mar 31, 2022 3:24 am It's hard to see how you could easily pick a linedef that can be seen through the portal, though - perhaps the portal data could suggest a likely candidate, resulting in a shorter search...
You don't really always need a visible linedef, you just need a good guess to reduce the amount of searching. I'd just pick the "center" point (I mean not in geometric terms, but for example if you have a 6-sided sector, you always start at the third point and hope it's good enough). That of course assumes you have the "look from" info available. The Build engine renders the longest line first.
93143 wrote: Thu Mar 31, 2022 3:24 am I was thinking about perhaps using the blockmap to narrow down how much BSP traversal you have to do to decide which subsector a set of coordinates is in. Does that make any sense?

How does a normal portal engine do it? Just check for crossed portals, using roughly the same math as a BSP branch (or a wall collision)?
I think Build keeps a linked list of actors per sector and just checks whether the actors have crossed any portal(s) when they move. If you have a working portal engine, you can do away with the blockmap entirely - area damage checks and large (spidertron) collision boxes would be a little more expensive though. The initial placement can be hardcoded (and teleporter destination sectors - but iirc those already are).
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: SNES Doom Source Released! Now What?

Post by 93143 »

none wrote: Thu Mar 31, 2022 4:36 pm
93143 wrote: Thu Mar 31, 2022 3:24 amI had thought of including a table of line starts with each wall texture, since the lines are not constant-length in ROM, and clipping a wall on the left might otherwise be a lot of work.
That table is going to be quite big...
It's just two bytes per column, or 1.5% of the uncompressed size of the texture.
For example, if you want to achieve a compression ratio of 1:4
That's pretty ambitious - I'd be reasonably happy with 3:8. There are plenty of textures that don't get anywhere near 1:4 even in PNG format.

...

I'm a little extra busy for a few days. I'll respond further when I have time.
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: SNES Doom Source Released! Now What?

Post by 93143 »

none wrote: Thu Mar 31, 2022 4:36 pm
93143 wrote: Thu Mar 31, 2022 3:24 amBut the biggest problem is the transparent walls, which can't be drawn in horizontal spans.
I should have clarified: for the walls it's of course not the x-position but the y-position, but the same general idea.
What I mean is, you can't efficiently combine span buffers in two axes, so you can't span-buffer the vine walls and row-major sprites in the same frame.

This problem is solved pretty easily by reducing it to a single column-major linear framebuffer, with a maximum penalty of about 18 ms for a full recopy* of the largest practical buffer (224x160), 6 ms for a partial recopy of full-screen low-detail (128x192), and essentially zero for anything that fits completely in Mode 7. A column-major buffer allows span-buffering of solid walls, transparent walls, column-major sprites, flats, and even the gun.

* There isn't really enough VRAM to use Mode 7 for any part of a 224x160 framebuffer, since any way you slice it, Mode 7 is effectively 16bpp in VRAM.

The vine walls are still an issue. It's possible that the worst case would still fit in RAM and be adequately performant, and this would have to be tested against a more authentic portal-buffering approach (that uses normal painter's algorithm for partly transparent graphics and just eats the overdraw) to see if span buffering is overall a good idea for Doom.

Also, texturing with a span buffer requires cumbersome indirect RAM access to read the span buffer every time it's needed unless there's at least one free register in the loop nest, and in the case of my wall scheme I'm pretty sure there isn't.

Generally when you have a system with a 128 KB hard cap on RAM, no data cache, and a 512-byte instruction cache with a 5x speed penalty for cache misses, complexity is your enemy. Then again, there are different types of complexity. If the complexity can be jammed into the 512-byte I-cache and the 15 general-purpose registers, and scheduled so as to avoid bus-induced core stalls, the Super FX is actually a fairly fast chip and it can do a lot of logical and arithmetical maneuvering. On the flip side, the S-CPU can do some surprising stuff with unrolled loops and lookup tables (the Wolfenstein 3D scaler code is a thing of simple beauty), but anything requiring heavy processing is going to be slow.
encode a pointer or an offset to somewhere else in the data buffer (a scheme quite like this would be good for compression anyway, since now you can reuse data from other lines, and possibly also from other textures
This may be a brilliant idea. I'll see if I can't free a register or two in the wall loop. Dictionary compression (with the "dictionary" being a texture pool) seems like a natural solution to the problem of there being a lot of nearly identical textures, if you can make it work without a ton of extra compute time... and I see no reason why it couldn't be combined with my enhanced compression scheme, in principle. The dictionary aspect would be the outer loop, so if it took a few extra cycles it wouldn't be the end of the world...

Or would it be reasonable to allow remote lookups further down the decompression process if the parameter envelopes match? Hmm... Maybe not. The inner loops are performance-critical and don't need extra features to worry about...
if the pointer is long enough
An 8-bit pointer is too short even for a single texture. A 21-bit pointer might be impractical. I think the answer is that all textures in a common dictionary pool will have to share a 64 KB bank.

A dictionary lookup run would have to be reasonably large for this to work. You need at minimum two bytes for the pointer and a byte for the data length in texels (not bytes; I don't keep track of that). In order to avoid needing this three-byte sequence for every local run (with the pointer just being the address of the byte following it), we would need an extra byte per lookup run to tell the decoder it's a lookup run. Since we're allowing lookup in other textures, we can't just encode the run length in the target data because it would make no sense when decoding that texture. Four bytes for a lookup means you'd better have at least five bytes at the other end or it's completely worthless.

I suppose I could try to fit it into three bytes - I'd have to limit the length of a lookup run to perhaps 64 texels, and this would reduce the flexibility of my own scheme somewhat because if I use (say) a two-bit pattern to mean "remote lookup", that reduces the number of valid command bytes for my scheme by a quarter. It could still work; the question is if it results in better compression...
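To illustrate, here's a hedged sketch of a decoder for such a three-byte lookup token (the 0b11 tag pattern, the length bias, and the bank-relative 16-bit offset are all assumptions, and it ignores the problem of literal texels colliding with the tag range):

```c
#include <stdint.h>
#include <string.h>

/* Command byte whose top two bits are 0b11 flags a remote lookup:
   low 6 bits = run length 1..64 texels, followed by a 16-bit offset
   into the shared 64 KB texture bank.  Anything else is a literal. */
enum { LOOKUP_TAG = 0xC0 };

static size_t decode(const uint8_t *bank, const uint8_t *src, size_t n,
                     uint8_t *dst)
{
    size_t out = 0, i = 0;
    while (i < n) {
        uint8_t cmd = src[i];
        if ((cmd & 0xC0) == LOOKUP_TAG) {
            size_t len = (size_t)(cmd & 0x3F) + 1;
            uint16_t off = (uint16_t)(src[i + 1] | (src[i + 2] << 8));
            memcpy(dst + out, bank + off, len);   /* copy from the pool */
            out += len;
            i += 3;
        } else {
            dst[out++] = cmd;                     /* literal texel */
            i += 1;
        }
    }
    return out;
}
```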
for example, the stone wall texture has something like 20 variations - with demon head, without demon head, with blood, without - and it would be great to be able to reuse the data for those. Maybe it makes sense to create texture sets that, with compression, each fit into 64K...
You know, it may actually be possible to jam every texture into ROM. I was kinda sorta close with my extra compression scheme, and I was thinking of trying to disassemble some of the more obvious patched textures and draw them separately, but initiating multiple raster lines per column of wall texture to deal with patches could be more expensive than just handling it in the wall texturing loop nest...
Horizontal clipping only has to be done once per wall segment. Vertical clipping needs to be done on every column in certain situations (when standing close to a wall for example).
But you have to do that anyway, because other than just reducing texel bit depth uniformly, there's no way to compress a wall texel column in such a way as to allow blind lookup of any y-position. The length of the column in bytes is known - just subtract the column offset from the next column's offset - but that doesn't help at all because with a compression scheme of any complexity you don't have any idea what type of byte you're looking at, never mind how many texels have preceded it, unless you started at the beginning. So you have to do a linear search. Even with RLE, you won't get decent compression without allowing runs of literals, which means you can't assume anything about the data distribution.

Perhaps some sort of caching system could be used to speed up lookup in adjacent pixel columns that use the same texel column... but that could get complicated unless there's another register free, and there really isn't... Also, you'd have to make sure to draw the texture in order of decreasing highest visible texel to avoid having to backtrack (which you can't really do).

It strikes me that being able to look up the start of a column, rather than doing a linear search from the top left corner of the texture, would help immensely not only for left-clipped walls, but also for distant and/or high-aspect walls with less than one pixel column per texel column. It means you can just skip columns without having to decode them. This is an extra big deal in low-detail mode.

However, I do think that using constant-data-length wall columns is likely to be far too cumbersome and ruin the compression ratio. I think it's probably better to let the compressor adapt to the data, and use exactly as much space as needed and no more. If this approach ends up requiring a 256-byte table of line starts for a texture that was 16 KB before compression, so be it.
backfacing/frontfacing check for every side
I don't understand. You should never have to do a backface cull in a portal engine. If you can see it through a portal, it's not a backface.
In total, say you have a map with only square subsectors: while you would need double the number of triangular subsectors, each subsector can shrink from something like 12-24 bytes down to 8 bytes. I should probably write a recompiler for Doom maps to get some actual numbers.

If you do the traversal on the S-CPU, in comparison to x86, I imagine there might be some additional performance gains, because there are really few registers available:
- no inner loop for iterating through the sides; in addition to saving the loop setup cost, this means fewer registers get clobbered during the vertex projection step.
- for the three sides you can be looking into the sector from, you can make three copies of the code, one for each case, meaning you can use lots of immediates (I guess for n-sided sectors it would still be reasonable with a few unrolled loops and a jump table), and you can also make better-optimized specialized code paths for when a vertex is not in the clipping region and/or is behind the camera plane.
- the 65816 saves 1 cycle every time you use a byte-sized field instead of a word (this is strongly simplified, but still...)

Another point of note is that you effectively have to project fewer vertices, because you can cull them earlier, and the projection will take a big chunk of CPU time: for each axis you need two 16-by-8-bit multiplies and two 16-bit additions for the matrix transform, plus the perspective division. So every vertex that is saved counts. Also, fewer sprites per sector means depth-sorting the sprites becomes faster.

The main drawback is not the traversal itself, but the additional clipping that needs to be done: sectors that are further away from the camera get clipped by the sectors in front of them into thinner stripes than would really be needed. Avoiding that is very tricky
I guess I'd have to get deeper into this to really have a basis for comparison.

It occurs to me that Doom's use of subsectors with arbitrary side counts allows extensive use of axis-aligned rectangles, which can be special-cased when checking which side of a line you're on (and I believe Doom takes advantage of this). This may be more important in a BSP engine than a portal engine, but it's still a saving.
strong compression of map data
Hmm. We do have 128 KB of WRAM. That could be enough. And a brief cooldown period between levels shouldn't be too objectionable. It might even make it easier to load the data into the Super FX, since you wouldn't have to DMA from ROM to WRAM first (on the other hand, the desired subset of the map data might be in more pieces, which would cost a bit of Super FX time).
If you have a working portal engine, you can do away with the blockmap entirely - area damage checks and large (spidertron) collision boxes would be a little more expensive though. The initial placement can be hardcoded (and teleporter destination sectors - but iirc those already are).
I was under the impression that the blockmap wasn't actually optimal for collision, and that the BSP could have been used instead. But what I was thinking of here was not to use a blockmap for collision, but to enable the selection of a starting node on the BSP tree to use when locating an object, since a grid lookup is so trivial compared to a bunch of side checks against lines at arbitrary angles. This could provide a more systematic and computationally bounded way to deal with the location problem compared with the Build approach.

But I might be talking nonsense. Like I said, I'll have to get into it more before I'll be able to speak with confidence about the algorithmic details...
Last edited by 93143 on Wed Jan 25, 2023 1:24 am, edited 1 time in total.
none
Posts: 117
Joined: Thu Sep 03, 2020 1:09 am

Re: SNES Doom Source Released! Now What?

Post by none »

I've been trying out some different compression schemes for wall texture data, expanding on the texture atlas idea.

I've written a basic encoder that puts all the textures from a WAD into one big data buffer, and then creates groups of 8, 16, 32, ... pixels, and stores a 24-bit pointer to each group (e.g. 16, 8, or 4 pointers per vertical column, respectively). Each group is then further compressed separately using either RLE, or one of two flag based methods. Duplicate pixel groups are then removed and pointers are accordingly updated. Finally, the compressed data is decompressed and written to a file for validation purposes. I've tested with the Doom 2 texture data.

An approach like this would allow (nearly) random access into the texture data (limiting the amount of pixels that need to be skipped to the pixel group size).

I've tried a few different RLE schemes (8-bit run lengths, 4-bit run lengths, LZ4-like variable run lengths), after removing duplicate pixel groups, none of them seems to improve compression significantly (reducing size only by 1% - 2%) - even if it is applied to the raw data (not restricted to pixel groups), which makes me doubt a mixed dictionary / RLE approach will work at all (unless I screwed something up in the RLE implementation).

I've also attempted to reduce the table size by storing some of the pointers directly with the data instead of inside the table (dividing pixel groups into subgroups and pointing from one subgroup to the next), but this does not work at all (it results in a net loss).

However, the two flag-based methods work a lot better, and both perform about equally well. Both store a flag word with one flag per pixel in the group. The first method is a "common byte" method (storing the most often used color first, then flagging every pixel that differs from this color and storing pixel data only for those pixels); the second is a variation that does not store a common byte, but instead flags each pixel that differs from the previous one - this may be a little faster to decode.
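For reference, decoding one group under the second method could look like this (the 16-pixel group size matches the best-performing variant below; the little-endian flag word is an assumption):

```c
#include <stdint.h>

/* Decode one 16-pixel group: a 16-bit flag word, one bit per pixel.
   A set bit means a new color byte follows in the stream; a clear bit
   repeats the previous pixel.  Returns the number of source bytes
   consumed (2 for the flag word, plus one per set bit). */
static int decode_group(const uint8_t *src, uint8_t prev, uint8_t out[16])
{
    uint16_t flags = (uint16_t)(src[0] | (src[1] << 8));
    int consumed = 2;
    for (int i = 0; i < 16; i++) {
        if (flags & (1u << i))
            prev = src[consumed++];   /* pixel differs: read a new byte */
        out[i] = prev;                /* otherwise repeat the last one */
    }
    return consumed;
}
```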

I've also tried a variation of the second method that can store some of the pixel groups that would produce overhead bytes in uncompressed form, but the advantage is small (about 1%) and it's probably not worth the overhead when decoding.

Out of all the different variations I've tried, the second flag based method, using 16-pixel groups and 8 pointers per 128-high column nets the best overall compression - shrinking the 5MB uncompressed texture data to about 3MB.

Code: Select all

------------------------------------------
total pixel subgroups: 310656
unique pixel subgroups: 163013
uncompressed pixel groups: 0
pixel groups with overhead: 17420
------------------------------------------
uncompressed size: 428 textures in 4970496 bytes
compressed data size: 2086992 bytes
table size: 931968 bytes
total size: 3018960 bytes
------------------------------------------
compression factor (data only): 41%
compression factor (including tables): 60%
------------------------------------------
I've not yet experimented much with splitting texture data into separate banks. However, I did some early tests with compressing smaller subsets of the data, and my first guess is that compression will suffer a lot from that; I doubt the smaller table sizes are enough to counter it.
93143 wrote: Tue Apr 05, 2022 11:18 pm I don't understand. You should never have to do a backface cull in a portal engine. If you can see it through a portal, it's not a backface.
The thing is that you don't know what you can see when you first peek into a sector, and you need to figure that out. So the backface culling is done to reduce the number of faces you have to do a more complicated test for.
93143
Posts: 1717
Joined: Fri Jul 04, 2014 9:31 pm

Re: SNES Doom Source Released! Now What?

Post by 93143 »

none wrote: Thu Apr 07, 2022 3:45 pmI've been trying out some different compression schemes for wall texture data, expanding on the texture atlas idea.
I really need to actually try out these ideas myself. After all the hype I've built up about my extra compression idea, I don't really want to say what it is without testing it. The trouble is that my ideas (both the old one and my take on your dictionary idea) require a smart encoder, so they're nontrivial to test, and I'm still pretty busy.
I've written a basic encoder that puts all the textures from a WAD into one big data buffer, and then creates groups of 8, 16, 32, ... pixels, and stores a 24-bit pointer to each group (e.g. 16, 8, or 4 pointers per vertical column, respectively).
But the patches don't always line up with those group definitions. Using 8 pixels is likely to give the best results (most switches are 24 texels off the ground, for instance), but it's still not perfect (23 is not unheard of, and not all of them are rectangles), and you need an unnecessarily large pointer table because most shared texel runs are a lot longer than that. I was thinking of more of a dynamic encoder that would check for and reference identical runs regardless of size.

The remote referencing would only be used where actually necessary; unique textures and prototypes (textures referenced elsewhere) would just be stored as local literals using whatever compression scheme underlies the dictionary method, unless they contain substantial runs of duplicate data themselves in which case they can reference their own data. It strikes me that a variable-run-length local compression method might play better with a variable-group-length dictionary method, since you could align runs with groups rather than needing unique groups every time a patch boundary falls in the middle of a run.

Also, 24-bit pointers are a lot less practical than 16-bit pointers. Not only do you have to use ROMB twice every time, but you need a spare register to store the old bank so you can go back. This is less of an issue if the pointer tables take up a small enough number of banks that you can just duplicate the code and use immediates, but that's even slower, and with a megabyte worth of pointer tables (or, in my notional approach, pointers that can be anywhere in the texture data pool) I don't think that's a particularly elegant solution...
I've tested with the Doom 2 texture data.
Did you use the full run-time texture set or the WAD patch set?
An approach like this would allow (nearly) random access into the texture data (limiting the amount of pixels that need to be skipped to the pixel group size).
I can see what you're trying to do here. Having a constant group size reduces the workload when trimming the top off a texture. But if using arbitrary-sized groups provides a significant improvement in compression, I think it's likely to be worth it. I doubt this was the major reason DOOM-FX was slow.
I've tried a few different RLE schemes (8-bit run lengths, 4-bit run lengths, LZ4-like variable run lengths), after removing duplicate pixel groups, none of them seems to improve compression significantly (reducing size only by 1% - 2%) - even if it is applied to the raw data (not restricted to pixel groups), which makes me doubt a mixed dictionary / RLE approach will work at all (unless I screwed something up in the RLE implementation).
Perhaps you have. I doubt Randy would have settled on RLE if it was that bad. Just looking at some of these wall textures, it seems impossible that RLE could be so ineffective. Are you encoding columns and allowing runs of literals?
However, the two flag based methods work a lot better, both perform about equally well. Both methods store a flag word with a flag for each pixel in the group. The first method is a "common byte" method (storing the most often used color first, and then flagging every pixel that differs from this color and storing only the pixel data for those pixels), the second method is a variation of this that does not store a common byte but flags each pixel that differs from the last one instead - this may be a little faster to decode.
Interesting. A sort of more-forgiving RLE that uses constant run lengths (but is still VBR and requires per-texel decoding to get the data length, so kinda the worst of both worlds there)... I think I could rig this to allow partial runs in the context of my extra compression scheme, but it wouldn't be free...
Out of all the different variations I've tried, the second flag based method, using 16-pixel groups and 8 pointers per 128-high column nets the best overall compression - shrinking the 5MB uncompressed texture data to about 3MB.
That's unfortunately not good enough. I'm pretty sure the GSU2 can't address more than 2 MB on the ROM bus. Trimming and probably outright cutting textures would still be necessary. I'll see if I can't make time to prototype some encoders for my ideas and see if they work.
I haven't yet experimented much with splitting the texture data into separate banks. However, I did some early tests with compressing smaller subsets of the data, and my first guess is that compression will suffer a lot from it; I doubt the smaller table sizes are enough to counter that.
Well, the dictionary lookup is the outer loop, so perhaps the extra overhead for 24-bit pointers is forgivable. I don't like losing that register, though; I already have to make space just to use the dictionary method at all...
93143 wrote: Tue Apr 05, 2022 11:18 pm I don't understand. You should never have to do a backface cull in a portal engine. If you can see it through a portal, it's not a backface.
The thing is that you don't know what you can see when you first peek into a sector and you need to figure that out. So the backface culling is done to reduce the number of faces you have to do a more complicated test for.
But with sorted linedefs, the "complicated test" is literally just projecting one vertex in one axis and comparing it with one portal edge (at least once you get started - with a triangular subsector or a linear search from the portal edge it's true all the time). Can an accurate backface cull be significantly cheaper than that? I guess the perspective division is something... And a failed backface cull still leaves the question of whether the face is actually visible through the portal, whereas checking that directly solves both problems at once.

Besides, I think most subsectors in Doom are either triangles or quadrilaterals, so there isn't a ton of wasted time. Particularly if you can select a likely candidate a priori.
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: SNES Doom Source Released! Now What?

Post by tepples »

"flags each pixel that differs from the last one instead" is pretty much how the PB8 RLE codec works. Each 8-byte block starts with a flags byte (0: literal; 1: repeat), with 1 bit for each byte that the block represents, and then the literal values for those bytes that aren't repeats. So to skip 8 pixels, you read one byte and add 9 - popcount(value) to the pointer to skip that byte's pixels. Three popcounts and you've read up to the height of most switches. Can Super FX calculate a byte's popcount (Hamming weight) efficiently?
creaothceann
Posts: 611
Joined: Mon Jan 23, 2006 7:47 am
Location: Germany
Contact:

Re: SNES Doom Source Released! Now What?

Post by creaothceann »

I didn't see a popcount instruction in the Super FX instruction set, so it'd have to be done with a 256-byte LUT. (Or 16 bytes, doing it 4 bits at a time.)
My current setup:
Super Famicom ("2/1/3" SNS-CPU-GPM-02) → SCART → OSSC → StarTech USB3HDCAP → AmaRecTV 3.10