Sprite rendering research

Fiskbit
Site Admin
Posts: 1383
Joined: Sat Nov 18, 2017 9:15 pm

Post by Fiskbit »

I've been researching sprite rendering off and on for the last year or so using the amazing circuit diagrams in Breaks, and I recently did a lot more to help out kitrinx with a new PPU for the MiSTer NES core. I'm planning on putting my findings together into a new wiki page hopefully sometime soon. I wrote up a bunch of text on Discord about the most recent findings and figured I'd bring it over to the forums so it's more accessible for the time being.

Note that all dot numbers I'll be discussing here are from the PPU FSM's perspective, while dots on the wiki are generally 1 more than this because nearly all signals have at least a 1 dot delay. Going forward, I'll also be referring to primary OAM and secondary OAM (which live in the same block of OAM, just with different addresses) as OAM1 and OAM2 for brevity.

Sprites on scanline 0

I've been curious for some time what it is that actually prevents sprites from rendering on scanline 0. Accurate emulators that I'm aware of haven't really had a good explanation for this and generally make it work with some kind of hack. The reality of how it works, however, is actually pretty horrible. The relevant pages in Breaks for this are the HV decoder (indicates which scanline and dot ranges impact which control signals), Sprite Eval (the bulk of this material), and OAM (controls the OAM write signal).

Basically, a lot of the sprite handling depends on something Breaks calls /VIS or n_VIS, which is a signal that is asserted during dots 0-255 of scanlines 0-239 while rendering is enabled. (It's limited by what Breaks calls VBLNK (240-261) and BLNK (240-260 or rendering disabled).) Sprite handling is made up of 4 phases during the scanline, which behave as follows:
  • 0-63: OAM2 init. The PPU alternates between reading OAM1 on even dots and writing OAM2 and incrementing OAM2ADDR on odd dots. The OAM buffer is forced to $FF, so this value gets written across OAM2.
  • 64-255: Sprite evaluation. The PPU alternates between reading OAM1 and incrementing OAM1ADDR on even dots and writing OAM2 on odd dots. OAM1 increments are +4 (zeroing out the bottom 2 bits) unless the byte is found to be in-range, in which case the next 4 increments are +1 to both OAM1ADDR and OAM2ADDR (although the last OAM1ADDR increment can be +1 or +4 due to a bug). Once 8 sprites have been found or OAM1ADDR overflows, OAM2 dots become reads, instead.
  • 256-319: Sprite fetch. OAM2 is selected the whole time. For each of the 8 sprites, the PPU reads 4 bytes, reading the last one 4 extra times before incrementing again and repeating this process. It fetches memory from VRAM and loads all of this data into the shifter, but clears the pattern data if the Y coordinate wasn't in range.
  • 320-340: Idle. OAM2 is selected and read the whole time. Nothing else is done.
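As a rough sketch, the dot ranges above and the /VIS condition can be written down like this in Python. The function names are hypothetical (they don't come from Breaks), and this only captures the ranges listed here, using the FSM's dot numbering.

```python
def sprite_phase(dot: int) -> str:
    """Return the sprite-handling phase for a dot (FSM numbering, 0-340)."""
    if not 0 <= dot <= 340:
        raise ValueError("dot must be in 0..340")
    if dot < 64:
        return "oam2_init"
    if dot < 256:
        return "evaluation"
    if dot < 320:
        return "fetch"
    return "idle"


def vis_asserted(scanline: int, dot: int, rendering_enabled: bool) -> bool:
    """/VIS as described above: dots 0-255 of scanlines 0-239 while rendering."""
    return rendering_enabled and 0 <= scanline <= 239 and 0 <= dot <= 255
```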
So normally, you have /VIS asserted during dots 0-255 and deasserted for the rest of the scanline. On the pre-render scanline, though, /VIS is deasserted the entire scanline. Sprite fetch is not affected at all by this because /VIS is never asserted when it runs, but it causes some significant changes to the operation of the first two phases:
  • The PPU is not allowed to increment OAM1ADDR ($2003) as part of sprite evaluation. (Writing to $2004 increments as normal.)
  • OAM2ADDR is selected during rendering whenever /VIS is not asserted. So, the PPU doesn't touch OAM1 at all on pre-render.
  • The PPU's automatic OAM2 writes are forced to be reads, instead. That means that OAM2 init cannot clear OAM2, and OAM2 is not written during evaluation at all.
  • The in_range result is not allowed to enter the latch chain that makes it increment by +1 for several dots, so it stays in +4 mode. (This doesn't matter much, but this chain runs whether rendering is enabled or not and should affect $2004 write increments. Increments are only forced to +1 during BLNK, which does not include pre-render.)
This means the PPU basically does nothing in these phases. OAM2 init doesn't work and sprite evaluation doesn't examine or copy anything, so OAM2 maintains its previous values. Whatever those were.
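To make the OAM2 init case concrete, here's a minimal Python sketch of dots 0-63 with and without /VIS. It is a simplification, not a model of the circuit: `run_oam2_init` is a hypothetical name, and it assumes OAM2ADDR still advances on odd dots when the write is turned into a read.

```python
def run_oam2_init(oam2: list[int], vis: bool) -> list[int]:
    """Simulate dots 0-63 on a copy of the 32-byte OAM2 and return the result."""
    result = list(oam2)
    addr = 0
    for dot in range(64):
        if dot % 2 == 1:               # odd dots target OAM2
            if vis:
                result[addr] = 0xFF    # the OAM buffer is forced to $FF
            # Without /VIS the write becomes a read, so the data is untouched.
            addr = (addr + 1) % 32     # assumption: the address sweeps either way
    return result
```

On a normal scanline this returns 32 bytes of $FF; on pre-render it returns whatever was already there.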

Then we get to sprite fetch. This phase should work as normal, so it will load the 8 sprites from OAM2, checking to see if each one is in range and only loading the pattern data if it is. In practice, what happens is that rendering concludes at the end of scanline 239, which did its normal work for getting sprites ready for the next scanline, scanline 240. Then we have 21 scanlines of vblank, which is short enough that OAM persists across this, and pre-render happens. The failed OAM2 init should at least sweep across OAM so that all of the values are refreshed and healthy, even if they're not initialized. Finally, the sprite shifters go through OAM2 checking if the sprites are in range. Those sprites are in range for scanline 240, but we're checking scanline 261, which actually looks like scanline 5 because the in-range check only uses the bottom 8 bits of the vertical counter. They aren't in range for scanline 5, so all the sprite shifters load with transparent pixels and no sprites are drawn on scanline 0.
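The arithmetic behind "261 looks like 5" can be sketched like this. It's a simplified in-range check: `in_range` is a hypothetical helper, it takes the raw vertical counter value, and it assumes the comparison doesn't wrap (a Y larger than the masked counter is never in range).

```python
def in_range(v_counter: int, sprite_y: int, height: int = 8) -> bool:
    """Simplified in-range check using only the low 8 bits of the counter."""
    row = (v_counter & 0xFF) - sprite_y
    return 0 <= row < height


# 261 = $105, so the check sees 5.
masked = 261 & 0xFF

# A sprite found by evaluation on line 239 (for display on 240) ...
found_on_239 = in_range(239, 235)
# ... is not in range when the same OAM2 entry is checked on pre-render.
found_on_prerender = in_range(261, 235)
```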

And that is why we don't see any sprites on scanline 0.

However, this has various surprising implications. The pre-render scanline is actually fetching sprites for scanline 5, and so if OAM2 has sprites in it that are in range on scanline 5, they will be drawn. It can't put them into OAM2 itself, but if they're there, it will render them. And there are multiple reasons why they may be there.

For one, it happens just naturally from sprite evaluation on scanline 239. The Y coordinate of the current sprite in OAM1 is always copied to OAM2, even if it isn't in range. This means that if you don't fill OAM2 and OAM1 sprite 63 is out of range, you'll end up with one entry in OAM2 that has sprite 63's Y coordinate followed by three $FF's from OAM2 init. While this Y coordinate isn't in range for scanline 239, it could be for scanline 261 (aka scanline 5). If its coordinate is 0-5, you'll end up with a sprite drawn at the very right edge of scanline 0. There are various reasons why it may not be visible, though: tile $FF may not have an opaque pixel in that spot, or it may be behind the background because the priority bit is set. We did find that this happens in commercial games, though. Namco Pac-Man clears OAM by writing $00's, so unused sprites are in range of scanline 5, producing an orange dot in the top right corner (see attached image; note that it's surprisingly hard to find real hardware footage that doesn't crop this scanline). It should also happen in any game that has sprite shuffling that can put sprites in this narrow Y range into OAM1 slot 63 (assuming they meet the other requirements for the resulting dot to even be visible).
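Here's a very simplified Python sketch of how that leftover Y coordinate ends up in OAM2. `evaluate` is a hypothetical name, the model ignores the overflow search and the increment bug, and the all-$00 OAM1 is just a stand-in for a game that clears OAM with zeros (a real frame would also have live sprites in it).

```python
def evaluate(oam1: list[int], v_counter: int, height: int = 8) -> list[int]:
    """Simplified sprite evaluation; returns the resulting 32-byte OAM2."""
    oam2 = [0xFF] * 32                      # left behind by OAM2 init
    slot = 0
    for n in range(64):
        if slot == 8:
            break                           # OAM2 full; overflow search not modeled
        y = oam1[4 * n]
        oam2[4 * slot] = y                  # Y is copied even when out of range
        if 0 <= (v_counter & 0xFF) - y < height:
            # In range: copy tile, attributes and X, then move to the next slot.
            oam2[4 * slot + 1 : 4 * slot + 4] = oam1[4 * n + 1 : 4 * n + 4]
            slot += 1
    return oam2


oam1 = [0x00] * 256                         # OAM cleared with $00
oam2 = evaluate(oam1, 239)                  # evaluation on scanline 239
y, tile, attr, x = oam2[:4]
# Sprite fetch on pre-render sees the counter as 5, so Y=0 is in range.
drawn_on_scanline_0 = 0 <= (261 & 0xFF) - y < 8
```

The surviving entry is Y=$00 followed by three $FF's, so it uses tile $FF at X=$FF, which is the right edge.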

It could also happen if you turn off rendering on scanlines 0-5 to preserve OAM2 until pre-render. It's a long wait from here to scanline 261, though, so OAM decay is a real concern, but you can prevent decay by sweeping $2003 across the address space (even just by writing $2004 or doing OAM DMA constantly while you wait).

Most surprising to me, though, is that you can get sprites on this scanline simply from OAM decay. Because the PPU doesn't init OAM2, most of it will be decayed on this scanline the first frame after a long blanking period, regardless of whether you've set up OAM1 with OAM DMA. If the values permit it, you'll just get random sprite slivers here. The saving grace is that it doesn't trigger sprite overflow or sprite 0 hit (at least, not unless you toggle rendering in a way that preserves sprite evaluation's sprite 0 flag). Unfortunately, this also seems particularly problematic if you're using MMC3 IRQs with 8x16 sprites, because this could in theory cause unexpected A12 edges in sprite fetch that clock the counter one or more extra times. This should be testable on the Everdrive even with its emulated MMC3.

Of course, you don't even need the help of OAM2; you can turn rendering on late in the scanline to preserve the contents of the sprite shifters from an earlier evaluation and they should happily draw on scanline 0 for you. So there's that option if you actually want sprites there.

Sprite shifters

Sprite shifters are pretty confusing at first. The main relevant page for them in Breaks is FIFO.

The shifters are loaded with data during sprite fetch, and then they're told to start counting if rendering is enabled on dot 339 of scanline 261 (pre-render) and scanlines 0-239 (the 0/HPOS signal in Breaks' HV decoder). Once they have started counting, they do not stop until they have expired. When they expire, they start clocking the shift registers. However, they can only clock the shift registers when /VIS is asserted (i.e. while rendering is enabled in dots 0-255 of scanlines 0-239).

This means that if you disable rendering while the sprite shifters are running, they will continue to count. However, if they were outputting and rendering is disabled, or if they expire while rendering is disabled, they wait until rendering is enabled again before continuing to advance the shift registers. Because of this, rendering toggles don't cause any sprite pixels to be lost, and if you turn rendering on mid-scanline, you're likely to have a bunch of sprite shifters all start outputting at once for the next 8 pixels, with the highest priority sprites winning.
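A minimal Python sketch of one shifter under this description, assuming a single 1-bit pattern plane and exact dot alignment that may be off by a dot compared to hardware. `SpriteShifter` and `opaque_dots` are hypothetical names.

```python
class SpriteShifter:
    def __init__(self, x: int, pattern: int) -> None:
        self.counter = x         # X position loaded during sprite fetch
        self.pattern = pattern   # 8 one-bit pixels, leftmost pixel in bit 7
        self.counting = True     # set by the dot-339 "go" signal

    def clock(self, vis: bool) -> int:
        """Advance one dot and return the pixel output on this dot (0 or 1)."""
        if self.counting:
            if self.counter == 0:
                self.counting = False    # expired; try to output on this dot
            else:
                self.counter -= 1        # counts even with rendering off
                return 0
        if not vis:
            return 0                     # shifting is gated by /VIS
        pixel = (self.pattern >> 7) & 1
        self.pattern = (self.pattern << 1) & 0xFF
        return pixel


def opaque_dots(x: int, pattern: int, rendering_off: range = range(0)) -> list[int]:
    """Dots 0-255 on which the shifter outputs an opaque pixel."""
    shifter = SpriteShifter(x, pattern)
    return [dot for dot in range(256) if shifter.clock(vis=dot not in rendering_off)]
```

With a solid sprite at X=10, `opaque_dots(10, 0xFF)` gives dots 10-17. Turning rendering off for dots 8-19 moves all 8 pixels to dots 20-27, and turning it off for dots 12-15 splits them into 10-11 and 16-21, so no pixels are lost either way.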

When the shifters are told to go on dot 339, there is a delay before they actually see this signal and start going. When dot 340 is skipped, this produces an interesting glitch. All of the shifters will start the scanline in the enabled state, outputting their first pixel at X=0 and shifting. They then begin counting as normal, so the count is offset by 1. When it expires, they will output the remaining 7 pixels. The first pixel shifting cancels out with the delayed count, so the remaining 7 pixels are drawn in the same place they normally are. This glitch just results in the first pixel of every sprite on scanline 0 being drawn at X=0 instead of the correct position. This means that in Pac-Man, the dot in the top right corner is actually drawn in the top left corner when the preceding dot 340 is skipped, so it should alternate back and forth.
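The net effect on scanline 0 can be written as a small worked example in Python. This only restates the result described above rather than simulating the circuit, and `sprite_pixel_columns` is a hypothetical name.

```python
def sprite_pixel_columns(x: int, dot_340_skipped: bool) -> list[int]:
    """Screen columns of a sprite's 8 pixels on scanline 0, clipped to 0-255."""
    columns = [x + i for i in range(8)]
    if dot_340_skipped:
        columns[0] = 0           # only the first pixel is displaced to X=0
    return [c for c in columns if c <= 255]
```

For the Pac-Man dot at X=$FF, only the first pixel is on screen, so it lands at column 255 normally and at column 0 on frames where dot 340 was skipped.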

If rendering is disabled when the sprite shifters would be told to go, then they are basically left in the same state as the counter being expired (the counter expiring simply clears the bit that makes them go). Whenever they're not counting and they are in the visible region of the screen with rendering enabled, they output and clock the shift registers, which is why you get that dot 0 behavior on scanline 0 when skipping dot 340 as described above. If you skip the dot 339 signal that makes them go and you turn rendering back on in the visible region, all 8 shifters will immediately start outputting, just like shifters that expired or were suspended by disabling rendering mid-screen.

$2004 writes during rendering

This one's a lot more speculative right now, but I figured I'd include it anyway. The relevant page for this is OAM.

Current wisdom is that writing to $2004 during rendering does not modify OAM, but $2003 is incremented (in +4 mode). Based on the circuit, I think there's actually a lot more going on. I don't see anything that suppresses the write signal from $2004 writes, so I think it will actually write to the current OAM address, which may even point at OAM2. What it doesn't do, though, is allow the register data bus into OAM. Rather, it looks like it will use the current value in the OAM buffer. I think this is essentially the same scenario as the PPU's copies from OAM1 to OAM2 during sprite evaluation, where it reads OAM1, has that in the OAM buffer, and then writes the OAM buffer to OAM2. So, I think $2004 writes during rendering will write the OAM value from the previous dot into the current OAM address.
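To spell the hypothesis out, here's a tiny Python sketch. This models the speculation above, not verified hardware behavior: the function name is hypothetical, `oam` is just a flat list standing in for whatever the current OAM address can reach, and the address increment is left out.

```python
def write_2004_during_rendering(
    oam: list[int], oam_addr: int, oam_buffer: int, cpu_value: int
) -> None:
    """Hypothesized effect of a $2004 write while rendering is active."""
    # The register data bus (cpu_value) is assumed to be blocked from OAM,
    # so the byte already in the OAM buffer is what gets written.
    oam[oam_addr] = oam_buffer


oam = [0x00] * 8
write_2004_during_rendering(oam, oam_addr=3, oam_buffer=0x5A, cpu_value=0xC3)
```

After the call, address 3 holds $5A (the buffered value from the previous dot), not the $C3 the CPU tried to write.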

It will also trigger an increment. If OAM1 is already being incremented anyway, then it's just one normal increment; the two enables are combined. If OAM1 is not being incremented, I suspect it chooses +1 or +4 based on the value being written to OAM. On the other hand, if this occurs on a cycle adjacent to an OAM1 increment, it may produce special behavior akin to the +5 increment during the sprite overflow search. There, it does +4 on OAM1 dots and +1 on OAM2 dots, but this appears to produce some kind of glitchy behavior because +4 increments are supposed to clear the bottom 2 bits of $2003, which doesn't happen here. I suspect that same condition may occur here for adjacent-dot increments. However, if you have two +4's in a row, I don't really have a guess as to what will happen. Unfortunately, I just don't understand yet how the counter behaves during the sprite overflow search and the behavior may rely on some kind of consistent analog weirdness.
Dwedit
Posts: 5259
Joined: Fri Nov 19, 2004 7:35 pm

Re: Sprite rendering research

Post by Dwedit »

Scanline 0 is a flickery mess on an actual TV (the skipped pixel is actually making the scanline twitch left and right 1 pixel), so it's good that it's never actually seen.

Re: Sprite rendering research

Post by Fiskbit »

This behavior likely also affects RGB systems (I say likely because those are almost all earlier PPU revisions than what I studied). Aside from the Titler's 2C05-99, these don't have a skipped dot, and they're more likely to display what would normally be overscan because of arcade monitor adjustments.

I find it interesting that PAL crops scanline 0; I had previously assumed it was because sprites simply can't be drawn there, but perhaps these sprite glitches factored into that decision. (Alternatively, they could have instead fixed the bug, which I don't think would be that hard. You need to allow writes during OAM2 init on the pre-render scanline, so it would be a change to the NOR that handles the automated /WE input. They do seem to have fixed a lot of bugs in the PAL PPU.)

Probably the most important finding about scanline 0 here is that decayed OAM2 could mess with the MMC3 scanline counter when using 8x16 sprites. I don't know just how likely this is, but it should be possible. Any of sprites 1-6 being in range and having an even tile ID would do it.

Re: Sprite rendering research

Post by Fiskbit »

I've added another finding about sprite shifters to the first post, as the current 4th paragraph in that section. It turns out that skipping the dot at the end of pre-render causes the shifters to output their first pixel at X=0 instead of the correct location later in the scanline. The other 7 pixels of each sprite are drawn as normal; it's just that the first pixel should alternate between X=0 and the correct location.

Edit: An interesting side-effect of this is that with careful rendering toggles and OAM address manipulation, you can get this X=0 glitch to trigger a sprite 0 hit, which can be used to distinguish between 2C02 and RGB PPUs because RGB PPUs don't skip this dot (except the 2C05-99, which quacks like a 2C02; sorry, Titler fans). 100th Coin and I have designed a test for this that is included in the latest AccuracyCoin test suite. This is much faster than attempting to detect RGB PPUs using $2002 reads, which takes around 30 frames compared to the 3 or so here.
Alyosha_TAS
Posts: 225
Joined: Wed Jun 15, 2016 11:49 am

Re: Sprite rendering research

Post by Alyosha_TAS »

That's a lot of cool findings, nice work!

Re: Sprite rendering research

Post by Fiskbit »

I'm making progress (slowly) on getting my sprite knowledge up on the wiki. I had to take a lengthy detour to make a PPU signals page that documents the signals produced across a scanline, what they're used for in the PPU, and how much delay there is on each signal at each place they're used. This is pretty critical information for nailing down exact behavior such as the increment timing for OAM1 and OAM2 addresses during sprite rendering. There's still more to do on that page, but it's good enough for now for me to progress on a new sprite rendering page.

I also wrote up a page explaining OAM internals. This is hopefully high level enough for programmers to understand the core concepts and low level enough to fully explain why decay and corruption work like they do. The specific cases where OAM can corrupt will probably be discussed in the sprite rendering article.