Efficient technique for improving DMC timing accuracy



supercat
Posts: 161
Joined: Thu Apr 18, 2019 9:13 am

Re: Efficient technique for improving DMC timing accuracy

Post by supercat »

lidnariq wrote:
supercat wrote:the only arrangements that can't be resolved easily are those that would have less than 1552 [cycles] at the end after everything else is accounted for
Er, that's my point. The threshold is 1428 cycles, not 1554. Because 1552 is achievable. Admittedly it's comparatively expensive, because that bit period of 214 cycles means busy-waiting for almost two scanlines in the subsequent IRQ, but it's still achievable.

I suppose, given that you have this level of precision, sometimes one might prefer to use IRQs where p1=p2 to skip busy-waiting to save on CPU time, and only use the more precise ones near the end of the frame to achieve PPU synchronization.
The minimum duration for a typical ISR is going to be 54 or 72 cycles because of the need to perform the second rate write after the first reload. A typical raster interrupt would need some time padding to meet that, and could thus allow quite a bit of room for adjustment as to when time-critical stores are taken without having to increase its total execution time. The only thing that needs to be a specific number is the total time for a frame; everything else can be compensated at little or no extra cost.
lidnariq
Posts: 11432
Joined: Sun Apr 13, 2008 11:12 am

Re: Efficient technique for improving DMC timing accuracy

Post by lidnariq »

supercat wrote:The minimum duration for a typical ISR is going to be 54 or 72 cycles because of the need to perform the second rate write after the first reload.
The point I'm trying to get at is that usually one would want to avoid using a large value for "p2" (the second value written in the ISR) because it would impose a problematically huge overhead cost to the following ISR. But, if that next ISR has "p1" (the first value written in the ISR) the same as "p2", the wait for the second write can be skipped, and the second ISR has minimal overhead.

I'm only talking about the "middle" ISRs, the ones that work together to get the desired IRQ at the right time. The ones that actually are used for raster effects have an entirely different cost calculation.

I'm just not certain how often this would come up. So far all I've done is write a naïve depth-first search, which is distinctly the wrong approach for finding an optimal set of values to write to $4010.
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Efficient technique for improving DMC timing accuracy

Post by tepples »

The loss of one CPU cycle for each 6 frames (one dot per two frames) while rendering is turned off when decompressing a new scene into the pattern tables and nametables might prove problematic. How do you solve that while keeping NMI off?
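The "one CPU cycle for each 6 frames" figure can be checked with a little arithmetic. The sketch below assumes the usual NTSC numbers (341 dots per scanline, 262 scanlines per frame, 3 dots per CPU cycle, and one dot dropped on every other frame while rendering is enabled); the variable names are only illustrative.

```python
from fractions import Fraction

DOTS_PER_SCANLINE = 341
SCANLINES_PER_FRAME = 262
DOTS_PER_CPU_CYCLE = 3

# With rendering disabled, every frame is the full 341 * 262 dots.
dots_blanked = Fraction(DOTS_PER_SCANLINE * SCANLINES_PER_FRAME)
# With rendering enabled, every other frame drops one dot, so the
# average frame is half a dot shorter.
dots_rendering = dots_blanked - Fraction(1, 2)

cycles_blanked = dots_blanked / DOTS_PER_CPU_CYCLE      # 29780 2/3 cycles
cycles_rendering = dots_rendering / DOTS_PER_CPU_CYCLE  # 29780 1/2 cycles

# A frame sequence tuned for the rendering-enabled average falls behind
# by this many CPU cycles on every blanked frame.
drift_per_frame = cycles_blanked - cycles_rendering
frames_per_lost_cycle = 1 / drift_per_frame

print(cycles_rendering, cycles_blanked, drift_per_frame, frames_per_lost_cycle)
```

Half a dot per frame is the "one dot per two frames" in the question, and at three dots per CPU cycle that is one sixth of a cycle per frame.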
supercat
Posts: 161
Joined: Thu Apr 18, 2019 9:13 am

Re: Efficient technique for improving DMC timing accuracy

Post by supercat »

tepples wrote:The loss of one CPU cycle for each 6 frames (one dot per two frames) while rendering is turned off when decompressing a new scene into the pattern tables and nametables might prove problematic. How do you solve that while keeping NMI off?
The code has two types of frames--long and short. Presently it simply uses a counter to generate one long frame for each short frames, but it could just as well add 3 to a counter for every short frame which had video enabled, subtract 9 for every long frame which has video enabled, show long frames when the counter is positive and short frames when it is negative. If it did that, then code could compensate for blanked frames by ensuring that rendering was always disabled for a multiple of two frames, and add an extra count for each pair of frames.
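A minimal sketch of the counter scheme described above, covering only frames with video enabled. It assumes that a counter of exactly zero selects a short frame (the description only covers the positive and negative cases), and the function names are made up for illustration.

```python
def next_frame(counter):
    """Pick the next frame kind and return (kind, updated_counter).

    Assumption: a counter of exactly zero gives a short frame.
    """
    if counter > 0:
        return "long", counter - 9   # long frame: subtract 9
    return "short", counter + 3      # short frame: add 3


def simulate(frames, counter=0):
    """Run the counter for a number of video-enabled frames."""
    kinds = []
    for _ in range(frames):
        kind, counter = next_frame(counter)
        kinds.append(kind)
    return kinds


kinds = simulate(400)
print(kinds.count("long"), kinds.count("short"))
```

With +3 and -9 the counter settles into one long frame for every three short ones. Blanked frames would be handled separately by adjusting the counter, as described above; that part is not modelled here.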

Alternatively, one could use raster splits to blank most of the screen without blanking the part that controls that weird half-dot drop. This may require subdividing the task of putting data into CHRAM into chunks that can fit within the various blanking intervals, but on the flip side would allow the game to show the player something while it was preparing the next level.

I think the simplest approach is probably to have the IRQ use the sequence of ten IRQs of length 428*8, eight of length 380*8, one of length 54*8, and one of length 54*7+190 (59560 cycles) and add 7 to the counter for every time it repeats the sequence. Provided the screen isn't blanked for more than about a quarter second, the code would generate enough long frames to regain sync. If blanking times could be long, one could make the IRQ use a sequence of length 59562 and subtract 5 from the counter any time the counter is positive, but I think that wouldn't be necessary if the screen is only blanked for a few frames.
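The 59560-cycle total can be reproduced as follows, assuming the NTSC DMC bit periods of 428, 380, 54 and 190 CPU cycles (428 is the slowest rate-table entry, and it is the value for which the sum comes out to the stated total).

```python
# (number of IRQs, CPU cycles per IRQ) for each group in the sequence.
sequence = [
    (10, 428 * 8),       # ten IRQs, all eight bits at period 428
    (8, 380 * 8),        # eight IRQs, all eight bits at period 380
    (1, 54 * 8),         # one IRQ, all eight bits at period 54
    (1, 54 * 7 + 190),   # one IRQ, seven bits at 54 and the last at 190
]

total_cycles = sum(count * length for count, length in sequence)
irq_count = sum(count for count, _ in sequence)

print(total_cycles, irq_count)
```

That is twenty IRQs spanning two frames of 29780 cycles each, one cycle short of the 59561 that two rendered frames take on average; the 59562 variant is one cycle long instead.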
supercat
Posts: 161
Joined: Thu Apr 18, 2019 9:13 am

Re: Efficient technique for improving DMC timing accuracy

Post by supercat »

lidnariq wrote:I'm just not certain how often this would come up. So far all I've done is write a naïve depth-first search, which is distinctly the wrong approach for finding an optimal set of values to write to $4010.
My approach was simply to keep a list of how best to compute each value, and then go through the array and for each value, fill in all of the higher values that can be computed from it in a way that's shorter than any way yet found for them.
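A rough sketch of that fill-forward table, under a few assumptions that are not from the post: "shorter" is taken to mean fewer IRQs, each IRQ uses a one-byte sample so its length is 7*p1 + p2, and the periods are the NTSC rate table. All function names are hypothetical.

```python
NTSC_PERIODS = (428, 380, 340, 320, 286, 254, 226, 214,
                190, 160, 142, 128, 106, 84, 72, 54)


def single_irq_lengths(periods=NTSC_PERIODS):
    """Map each one-IRQ length 7*p1 + p2 to one (p1, p2) pair that gives it."""
    lengths = {}
    for p1 in periods:
        for p2 in periods:
            lengths.setdefault(7 * p1 + p2, (p1, p2))
    return lengths


def fill_best(limit, lengths):
    """best[t] = (IRQ count, length of last IRQ), or None if t is unreachable."""
    best = [None] * (limit + 1)
    best[0] = (0, 0)
    for t in range(limit + 1):
        if best[t] is None:
            continue
        count = best[t][0] + 1
        # Fill in every higher value reachable from t with one more IRQ.
        for length in lengths:
            u = t + length
            if u <= limit and (best[u] is None or count < best[u][0]):
                best[u] = (count, length)
    return best


def recover(best, lengths, target):
    """Walk back from a reachable target to 0; return the (p1, p2) pairs in order."""
    pairs = []
    while target:
        length = best[target][1]
        pairs.append(lengths[length])
        target -= length
    return pairs[::-1]


lengths = single_irq_lengths()
best = fill_best(4000, lengths)
print(best[864], recover(best, lengths, 864))
```

Because every step is positive, each entry is final by the time the loop reaches it, so a single pass over the array is enough.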

To support the cases where p1 and p2 are both large, but the next step has p1 and p2 equal, one should keep a list of the best way to compute each value with the last p2 being small, and the best way if the last p2 doesn't need to be small. The one with p2 small could add values where p1==p2 and where p1!=p2, while those where p2 is big could only add in those where p1==p2.

In any case, unless one cares about IRQ durations during blanking, I don't think any of the "tricky" values are needed.
supercat
Posts: 161
Joined: Thu Apr 18, 2019 9:13 am

Re: Efficient technique for improving DMC timing accuracy

Post by supercat »

tepples wrote:The loss of one CPU cycle for each 6 frames (one dot per two frames) while rendering is turned off when decompressing a new scene into the pattern tables and nametables might prove problematic. How do you solve that while keeping NMI off?
If you want to try to actively maintain sync, that can be done by using the sequence "LDA $2002 / BIT $2002" timed so that vsync should hit just after the load and nudge the frame timing if it doesn't. The problem I see with that is that unless one can control what instructions the main line is running, the 7 cycle uncertainty of when the IRQ or NMI will get processed would lead to considerable timing uncertainty. Although I haven't figured out the best way to establish initial sync, it should be possible to synchronize things within a few frames so that the timing of DMC events would be known precisely even though the timing of individual interrupts would have 7 cycles of jitter. If one tries to take measurements of whether one is ahead or behind within an interrupt whose timing isn't certain, that timing uncertainty would be added to one's measurements.

Perhaps the right approach would be to actively synchronize the DMC and display whenever the game is paused, but otherwise just run with the DMC. It's possible to align the CPU with the DMC, but it's expensive. The cost wouldn't matter during pause, however, and if the raster gets out of sync a player could fix the problem by pausing and unpausing the game.
blackbird
Posts: 4
Joined: Thu Jun 02, 2022 9:04 pm

Re: Efficient technique for improving DMC timing accuracy

Post by blackbird »

Hey NESDev. I built a demo trying to generalize the approach suggested in this thread. Please check it out: "stableframe" source code

I wanted to work on my first game for the NESDev Compo this year, but I realized my game couldn't meet the mapper requirements for the game category. I need 1) mid-frame scanline interrupts with four-row granularity and 2) to use as little CPU computation as possible to maintain sync. After checking out earlier threads I stumbled on this discussion and approach and thought, if it worked, this would be perfect for mid-frame raster effects without a mapper.

Essentially the demo works in two phases:
  1. When you start the demo you are "desynced" and no interrupts are running, similar to a start screen. When you hit a button you start "dmc sync".
  2. First the game syncs the PPU and CPU, then it kicks off a DMC sample at DMCFREQ=$f (54 cycles). It then measures the length of time it takes to fire an IRQ with a page of nops, with the lower byte of the program counter acting as our measurement.
  3. Once the IRQ fires, we precisely modify DMCFREQ four times (with values from a lookup table) to get DMC to be aligned, no matter what value we measured. (From what I can tell four updates to DMCFREQ here is enough for any initial offset.) Finally, we kick off another 54-cycle DMC sample and rti to the main game loop.
The next phase is post-sync, which is the approach described in this thread: each interrupt triggers the next DMC sample fetch.
  1. Each IRQ is an indirect jump into an IRQ routines table, which we walk down to do raster effects and change timing.
  2. The first IRQ that fires is our "VBLANK IRQ" at the start of scanline 240, which replaces the need for NMI.
  3. In this demo, the middle of the frame has a series of interrupts every four rows to change the color emphasis bits and write a coarse PPU scroll in HBLANK.
  4. The end of the frame accommodates for the fractional number of CPU cycles in a frame by jumping to one of four timing sequences at the end of the table. Then it loops.
I hope this is useful and can generalize for other projects. You can customize the interrupts using gen_irq_routines_table.py, where you describe your interrupt setup and it generates an IRQ table for you, so you don't have to calculate frequency changes yourself. For example:


frame = Frame()
frame.advance_to(cpu_cycle) # use a pre-computed frequency combination to try and align to a given cycle

frame.one_step(r54) # next IRQ just sets DMCFREQ to $f and returns
frame.two_step(r128, r54) # do a two-step change to DMCFREQ where P1=128 and P2-P7=54

frame.one_step(r54, routine="irq_routine_custom") # specify a custom IRQ routine to run next

print(frame.remaining_in_frame()) # how many cycles left in a frame, for example 8875.5
I tested this a bunch in Mesen and FCEUX, and left it running on my NTSC front-loader for a few days, and it looks pretty stable (fingers crossed). I also included a Mesen script to stress test DMC sync to see that it always syncs to the same PPU cycle. I'm still new to NES development so let me know if I'm missing anything. I owe a lot to supercat's demo and other posters in this thread for getting this working, so thank you!

Post-sync:
screenshot.png
Attachments
stableframe.nes
(40.02 KiB) Downloaded 42 times
Last edited by blackbird on Mon Jul 04, 2022 11:56 am, edited 1 time in total.
Fiskbit
Posts: 891
Joined: Sat Nov 18, 2017 9:15 pm

Re: Efficient technique for improving DMC timing accuracy

Post by Fiskbit »

This is really incredible work! I had the same issue in last year's compo where my game couldn't work without scanline interrupts, so I had to use the DMC IRQ, but nothing this complex. I'm considering doing one this year that may need 8+ interrupts per frame, so this looks potentially very useful for me.

I'll have to spend some time digging into this, but I've got some initial questions:

1. Are the IRQ positions fixed or can they be at different vertical positions on different frames, such as if the screen is scrolling vertically? It sounds like they may need to be fixed, though the left/right behavior in the demo makes me think maybe the whole set of interrupts can be shifted up or down. It looks like that also shifts the vblank IRQ, though, so maybe not.

2. How close can the interrupts be to the bottom of the screen? Since you're using an IRQ to signal vblank, I'm guessing there's a 4 scanline gap there? Perhaps for interrupts too close, the solution is to just burn cycles from an earlier interrupt. I'm guessing there's no restriction on how close they can be to the top.

3. Does it matter for the measurement which of the two alignments the CPU and APU boot into? I'd guess not because your demo allows syncing on any frame, which presumably is like starting with a random alignment.


(Regarding your demo ROM, you may want to initialize PPU memory; I think your attributes are uninitialized and can cause the text to be unreadable.)
lidnariq
Posts: 11432
Joined: Sun Apr 13, 2008 11:12 am

Re: Efficient technique for improving DMC timing accuracy

Post by lidnariq »

Fiskbit wrote: Mon Jul 04, 2022 10:59 pm 1. Are the IRQ positions fixed or can they be at different vertical positions on different frames, such as if the screen is scrolling vertically? It sounds like they may need to be fixed, though the left/right behavior in the demo makes me think maybe the whole set of interrupts can be shifted up or down. It looks like that also shifts the vblank IRQ, though, so maybe not.
They're fixed. It's a set of IRQs that are configured to take the exactly correct number of total M2 cycles.

Since each IRQ can do some calculation as well, you can switch between multiple different tables.
2. How close can the interrupts be to the bottom of the screen? Since you're using an IRQ to signal vblank, I'm guessing there's a 4 scanline gap there? Perhaps for interrupts too close, the solution is to just burn cycles from an earlier interrupt. I'm guessing there's no restriction on how close they can be to the top.
Supercat's system cannot use NMIs at all, so you can choose any vertical alignment you want. The only real constraint is that each IRQ has to be (7+128·n)·x+y M2 cycles, where x and y are from the DPCM bit period table and n is an integer, and that the total sum of all IRQ periods must work out to the required 29780½·4 or 29780⅔·3 cycles.
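The constraint can be written down directly. This is only a sketch: it assumes the NTSC bit period table, and `irq_period` is a made-up helper name.

```python
from fractions import Fraction

NTSC_PERIODS = (428, 380, 340, 320, 286, 254, 226, 214,
                190, 160, 142, 128, 106, 84, 72, 54)


def irq_period(x, y, n=0):
    """M2 cycles for one IRQ: (7 + 128*n) bits at period x, then one bit at y."""
    if x not in NTSC_PERIODS or y not in NTSC_PERIODS:
        raise ValueError("x and y must come from the bit period table")
    if n < 0:
        raise ValueError("n must be a non-negative integer")
    return (7 + 128 * n) * x + y


# The two whole-cycle totals the IRQ periods have to add up to.
FOUR_FRAME_TOTAL = Fraction(59561, 2) * 4    # 29780 1/2 * 4
THREE_FRAME_TOTAL = Fraction(89342, 3) * 3   # 29780 2/3 * 3

print(irq_period(54, 54), irq_period(54, 190), irq_period(428, 428, n=1))
print(FOUR_FRAME_TOTAL, THREE_FRAME_TOTAL)
```

Both totals are integers, which is why the loop has to span four (or three) frames rather than one.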
blackbird
Posts: 4
Joined: Thu Jun 02, 2022 9:04 pm

Re: Efficient technique for improving DMC timing accuracy

Post by blackbird »

Fiskbit wrote: Mon Jul 04, 2022 10:59 pm This is really incredible work! I had the same issue in last year's compo where my game couldn't work without scanline interrupts, so I had to use the DMC IRQ, but nothing this complex. I'm considering doing one this year that may need 8+ interrupts per frame, so this looks potentially very useful for me.
Sweet, that's what I was hoping for. :) Thanks for pointing out that nametable attributes are unset, I'll fix it on GitHub.
Fiskbit wrote: Mon Jul 04, 2022 10:59 pm 1. Are the IRQ positions fixed or can they be at different vertical positions on different frames, such as if the screen is scrolling vertically? It sounds like they may need to be fixed, though the left/right behavior in the demo makes me think maybe the whole set of interrupts can be shifted up or down. It looks like that also shifts the vblank IRQ, though, so maybe not.
As lidnariq mentions, since we're using indirect jumps into a table for each IRQ you can do some dynamic behaviors if you want. After the first sync my demo uses a known sequence of interrupts and DMC frequencies since that works best for my game, but if I wanted to try converting this to something capable of vertical scrolling:
  • You could have second and third end-of-frame alignment sequences that end the frame one scanline early or late. You could update the IRQ trampoline to those when scrolling instead of the (+1.5, -0.5, -0.5, -0.5) sequence I use in my demo.
  • Since we can't use NMI, I chose in my demo to have VBLANK happen at a known interrupt, but you could have any IRQ likely to occur during VBLANK check for PPUSTATUS bit 7 and jump to your logic if set. You'd just need to ensure those interrupts are long enough to accommodate VBLANK setup.
There are some assumptions I make in my demo, such as compensating for the 0.5 additional frame cycles at the end of the table, though you could do this at the beginning or middle of frame as well. The key to any frame alignment is just having two runs of DMC frequency changes that end two cycles apart. This is easy to scan for in `cycle_map.py` which has some basic cycle times and the frequency changes needed to achieve them. But you can also perform much more dynamic cycle waiting behavior if you know or can compute what frequency values to use.
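To see why a (+1.5, -0.5, -0.5, -0.5) pattern needs two runs that end two cycles apart, here is a small check. It assumes the 29780.5-cycle average frame mentioned elsewhere in the thread; the variable names are illustrative.

```python
from fractions import Fraction

AVERAGE_FRAME = Fraction(59561, 2)   # 29780.5 CPU cycles per rendered frame
OFFSETS = [Fraction(3, 2), Fraction(-1, 2), Fraction(-1, 2), Fraction(-1, 2)]

# Each IRQ-driven frame must be a whole number of CPU cycles.
frame_lengths = [AVERAGE_FRAME + offset for offset in OFFSETS]

# Running error against the PPU after each frame, in CPU cycles.
drifts = []
drift = Fraction(0)
for offset in OFFSETS:
    drift += offset
    drifts.append(drift)

print([int(length) for length in frame_lengths])
print([str(d) for d in drifts])
```

The four frames come out as one of 29782 cycles and three of 29780, so the long and short endings differ by exactly two cycles and the error returns to zero every fourth frame.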
Fiskbit wrote: Mon Jul 04, 2022 10:59 pm 2. How close can the interrupts be to the bottom of the screen? Since you're using an IRQ to signal vblank, I'm guessing there's a 4 scanline gap there? Perhaps for interrupts too close, the solution is to just burn cycles from an earlier interrupt. I'm guessing there's no restriction on how close they can be to the top.
Since I want an interrupt right at scanline 241, the floor is 3.8 scanlines (54*8=432 cycles) before end of frame, before needing to resort to delay loops. But if you shift the VBLANK IRQ you can get as close as you want. FWIW the game I'm working on currently has interrupts midway through scanline 216 and 224.
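For reference, the 3.8-scanline figure follows from the NTSC scanline length (341 dots at 3 dots per CPU cycle). A quick conversion sketch, with a hypothetical helper name:

```python
from fractions import Fraction

CYCLES_PER_SCANLINE = Fraction(341, 3)   # 341 dots, 3 dots per CPU cycle


def cycles_to_scanlines(cycles):
    """Convert a CPU cycle count to (fractional) NTSC scanlines."""
    return Fraction(cycles) / CYCLES_PER_SCANLINE


shortest_irq = 54 * 8   # eight bits at the fastest DMC rate
print(float(cycles_to_scanlines(shortest_irq)))
```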
Fiskbit wrote: Mon Jul 04, 2022 10:59 pm 3. Does it matter for the measurement which of the two alignments the CPU and APU boot into? I'd guess not because your demo allows syncing on any frame, which presumably is like starting with a random alignment.
Good question, I believe I compensate for random CPU and PPU alignment as described on the wiki but I'm not sure I handle CPU and APU misalignment.
lidnariq wrote: The only real constraint is that each IRQ has to be (7+128·n)·x+y M2 cycles, where x and y are from the DPCM bit period table and n is an integer, and that the total sum of all IRQ periods must work out to the required 29780½·4 or 29780⅔·3 cycles
This sums it up. You can also change DMC periods more than twice in an interrupt if you need to, though you'll burn a lot of CPU cycles doing so. This is how I do the initial sync: by changing DMCFREQ four times in one interrupt, we can compensate for any DMC delay between 0-432 and land the timer on a common cycle (usually around 500-600 cycles later). The downside is that this precomputed table takes up 432 bytes (each nybble being one DMCFREQ value) and can only be used for one specific, precomputed alignment, and you're running CPU delay code for multiple scanlines at a time.