CPU<>PPU order of operations

Discuss emulation of the Nintendo Entertainment System and Famicom.

Moderator: Moderators

Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

CPU<>PPU order of operations

Post by Near »

I can find no documentation on this anywhere, so I'll ask here.

So, the CPU and PPU run in parallel. Of course that is impossible to do with our emulation, which runs each processor a little at a time.

So say that the CPU and PPU are at exactly the same position in time ... how do they interleave?

A CPU cycle will consume 12 clock cycles, whether it is a read or a write. So let's say the CPU reads from $2002. Does it read the PPU state, and then have the PPU tick three times? What about a write to $2000? Does it write the PPU state, and then tick the PPU three times? (Note: ticking the PPU would be when NMIs were tested.)

Under this model, I can see no possible way to pass ppu_vbl_nmi tests 03 and 07 at the same time. I'm not saying there is not a way! I just can't find it.

For 07, I get the same 00-05=N, 06-08=- that others here have gotten. Unfortunately both people who solved it never bothered to answer how they did it for the benefit of others.

Also under this model, the cache point for F=1,Y=261 BG&&SPR disable PPU cycle skip is 337, which is completely nonsensical as the rest of the PPU works in two cycle pairs, meaning it should be 338.

03 reads from $2002, and 07 writes to $2000. 03 can only pass if we clear the NMI line at 260,340. 07 can only pass if we clear it at 261,0.

The only way I can see to stagger this is if CPU writes affect the PPU four clocks later than CPU reads. This would match SNES bus hold behavior, which is documented in the W65C816S reference manual, yet I can find no documentation on this anywhere for the NES.

The idea is that reads are requests from other chips, so they see them and start acting on them faster. But writes need to stay on the bus for a given amount of time after the other chip sees it to be acknowledged / copied.

So in this instance: CPU reads would add 8 cycles, then run the PPU for two ticks (8 cycles), then perform the read, then run the PPU for one more tick. CPU writes would add 12 cycles, then run the PPU for three ticks, then perform the write.

With this, it is now possible to pass 03 and 07 at the same time. It also allows us to place the extra cycle skip test at X=338. And yet, now 05 has "4444433333" instead of "4443333322", so it fails. And absolutely nothing I try can change that pattern.

I can try and debug problems, and I can special case behavior when the PPU and CPU are at the same exact time (eg a 'conflict'), but I need to know the proper order of operations first.

-----

If it helps any, this is my current setup, which uses the former interleave pattern and fails only 07 (and requires 10 to be at 337):

Timing: (PPU executes one cycle (4 CPU cycles), and then performs the following:

Code: Select all

  if(status.ly == 240 && status.lx == 340) status.nmi_hold = 1;
  if(status.ly == 241 && status.lx ==   0) status.nmi_flag = status.nmi_hold;
  if(status.ly == 241 && status.lx ==   2) cpu.set_nmi_line(status.nmi_enable && status.nmi_flag);
  if(status.ly == 261 && status.lx ==   0) cpu.set_nmi_line(status.nmi_flag = 0);  //260,340 will pass 03, but fail 07
  status.lx++;
$2002 read:

Code: Select all

    result |= status.nmi_flag << 7;
    result |= status.sprite_zero_hit << 6;
    result |= status.sprite_overflow << 5;
    result |= status.mdr & 0x1f;
    status.address_latch = 0;
    status.nmi_hold = 0;
    cpu.set_nmi_line(status.nmi_flag = 0);
$2000 write:

Code: Select all

    status.nmi_enable = data & 0x80;
    status.master_select = data & 0x40;
    status.sprite_size = data & 0x20;
    status.bg_addr = (data & 0x10) ? 0x1000 : 0x0000;
    status.sprite_addr = (data & 0x08) ? 0x1000 : 0x0000;
    status.vram_increment = (data & 0x04) ? 32 : 1;
    status.taddr = (status.taddr & 0x73ff) | ((data & 0x03) << 10);
    cpu.set_nmi_line(status.nmi_enable && status.nmi_flag);
CPU triggers an actual NMI whenever the line transitions from 0->1.
User avatar
Zepper
Formerly Fx3
Posts: 3264
Joined: Fri Nov 12, 2004 4:59 pm
Location: Brazil
Contact:

Post by Zepper »

It might sound empiric, but I assume a PPU clock before a memory access. For NTSC, 3 PPU clocks before a memory access, subject to change if a DMC DMA occurs.
ReaperSMS
Posts: 174
Joined: Sun Sep 19, 2004 11:07 pm

Post by ReaperSMS »

About to head out for a job interview, I'll fill in more detail when I get back.

Running the CPU, then running the PPU 3 times isn't how things happen on the hardware.

Hardware wise, the CPU takes the input clock, and generates two non-overlapping clocks phi1 and phi2 from it. Each one is a pulse wave with a phase offset and duty cycle such that they're high for a little less than a half period of the input clock, and about half an input clock phase apart.

phi2 gets sent out a pin to the cart edge, and is the clock devices should generally use to know when the CPU's providing a valid address and data. Write data and the address go out a bit before the rising edge of phi2, and the CPU latches read data on the falling edge.

I imagine the PPU's bus is similar, other than having seperate rd/wr lines for vram.

Depending on exactly how the internals are wired up, the PPU probably latches the internal reg to the bus drivers somewhere around the rising edge of phi2, and latches the external bus to it's own regs around the falling edge. For regs it only reads, this is pretty trivial. For regs it can write back to, there's some questions regarding how the clocking works.

I don't know how square the post-divider input clock is. It could be very lopsided, and I suspect on the SNES the bulk of the extra cycles for the longer access periods go into phi2.

The two-cycle pairs thing on the PPU is entirely on the VRAM side of it, as it needs to burn a cycle to latch the address (multiplexed A/D bus).
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Post by Near »

Zepper wrote:It might sound empiric, but I assume a PPU clock before a memory access. For NTSC, 3 PPU clocks before a memory access, subject to change if a DMC DMA occurs.
That's pretty much how I assumed existing emulators pulled this off. If it's wrong, it will create a lot of weird "workaround" code that will pass tests, but not be 100% correct for all registers.

As far as whether it's { PPU, PPU, PPU, CPU } or { CPU, PPU, PPU, PPU } is a fairly minor (yet still important) question. Either way it ends up behaving the same after the first cycle. But it may "offset" all timing events by one cycle to compensate.

But let's say reads are 8:read:4, and writes are 12:write. Then the pattern becomes non-deterministic from the PPU alone. For RWWR: { PPCP PPPC PPPC PPCP }

PPU cycles between CPU cycles can vary from 2-4 in this case.

Zepper, I realize it was a while ago, but may I ask how you fixed test 07, if you still recall?
phi2 gets sent out a pin to the cart edge, and is the clock devices should generally use to know when the CPU's providing a valid address and data. Write data and the address go out a bit before the rising edge of phi2, and the CPU latches read data on the falling edge.
Awesome! Sounds like you really know your stuff in this regard :D
Look forward to hearing more. But if possible, can you please dumb it down for someone who isn't well versed with digital circuits? Eg, I'm not sure of the phi1/2 clocks/phases; so more like the PPCP example above would be fantastic.

I suspect it's possible, with lots of hacks, to make an NES emulator pass all tests regardless of how you treat the interleaving, it's just that the proper way will result in a minimum of "special case" checks.
User avatar
thefox
Posts: 3139
Joined: Mon Jan 03, 2005 10:36 am
Location: Tampere, Finland
Contact:

Post by thefox »

Maybe this goes without saying, but there are more than one possible power-up PPU-CPU alignments, which in turn affects some things like how OAM data readback (through $2003 and $2004) (mis)behaves. I doubt it affects any of blargg's test ROMs though, unless he explicitly states so. Hopefully we'll some day see an emulator that actually emulates all the different alignment possibilities...
tepples
Posts: 22345
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Post by tepples »

Are there any solid, easy to code tests for which alignment you're on? It might be useful as part of the seed for a PRNG.
Zelex
Posts: 268
Joined: Fri Apr 29, 2011 9:44 pm

Post by Zelex »

If no matter what its startup state dependent, which you can't easily test for, then most games should not rely on the PPU/CPU sync behavior for correct operation. Therefore, it should be pretty irrelevant, no? Hack in correct behavior for the one or two games that require it, and you make your life easier in the process.
ReaperSMS
Posts: 174
Joined: Sun Sep 19, 2004 11:07 pm

Post by ReaperSMS »

Let's see if some diagrams can help a bit. Head to the bottom if you want the too long, didn't read summary. This is gonna be a bit heavy on clock details.

First case, generic 6502. You have an input clock, nominally square, that looks something like the following. Call it mclk for lack of a better term.

Code: Select all

mclk: 111111000000111111000000....
Internally, the CPU needs a few more events than that can provide to be able to process things, and also needs something suitable to use as a signal to external hardware when the address and data are valid. These are phi1 and phi2 (phi2 tends to be listed as M2 in pinouts on the NES). Skipping the hardware that generates it (it's a mess I can barely follow) the outputs look like so:

Code: Select all

mclk: 111111000000111111000000....
phi1: 011110000000011110000000....
phi2: 000000011110000000011110....
phi1 updates the CPU internals, phi2 signals external hardware that it should provide data for reads, and expect data for writes. In reality, they're closer to the full width of the pulses than the above diagram, so for the rest of this, we'll treat it as if phi2 is 1 whenever mclk is 0.

On the NES and SNES, the CPU is getting fed a 21MHz clock signal, that is divided internally to form the mclk signal for the actual 6502 or 65816 core. The unknown bit is whether that divider generates a 50/50 square wave, or some lopsided one. The SNES in particular could be quite likely to generate a lopsided one, as that would give the external hardware the most slack at the lower clock rates. Below, NES50 is a 50/50 NES freq wave, NES33 is a 33/66, and the SNS5 F/S/X and SNES F/S/X waves are the 3 SNES freqs at 50/50, and at a non-square rate I was planning on using in my fpga. F = /6, S = /8, X = /12

Code: Select all

21MHz: 101010101010101010101010101010101010101010101010
NES50: 111111000000111111000000111111000000111111000000
NES33: 111100000000111100000000111100000000111100000000
SNS5F: 111000111000111000111000111000111000111000111000
SNS5S: 111100001111000011110000111100001111000011110000
SNS5X: 111111000000111111000000111111000000111111000000
SNESF: 111000111000111000111000111000111000111000111000
SNESS: 111000001110000011100000111000001110000011100000
SNESX: 111000000000111000000000111000000000111000000000
Note how the 0 periods on the non-square ones are longer. This gives external devices (rom, etc) more time to come up with data and get it stabilized.

As for what exactly phi2 represents, it being high signifies the following:
The address bus will stay stable while it is high
If it's a write, the data bus will stay stable while it is high
If it's a read, the external device should drive the data, and hold it stable while m2 stays high

Now let's take a closer look at the PPU's place in all this. Here are the 21MHz clock, NES33 clock, NES50 clock, and a 50/50 guess at the PPU's post-divider clock

Code: Select all

21MHz: 101010101010101010101010101010101010101010101010
NES33: 111100000000111100000000111100000000111100000000
NES50: 111111000000111111000000111111000000111111000000
PPU:   110011001100110011001100110011001100110011001100
You can see there the x3 relationship, this is for the more trivial case of powerup synchronization. Noting that the M2 phase covers two full PPU cycles, we can surmise that the CPU-facing read hardware does not run directly at the full PPU clock rate, where the PPU would just treat M2 being high as meaning to do a read or write from the CPU that cycle. If it did, most of the accesses would see two back to back ones, rather than the single access it is. Also, it means the value would possibly jitter on the data bus while M2 is high.

There is a much more likely situation for the PPU, and that's where the CPU-facing logic is controlled almost entirely by M2 and R/W.

At the rising edge of M2, for a read, the PPU latches the appropriate data (status flags, $2007 buf, palette data, or OAM[$2003]) onto it's external data lines. This value will stay steady, regardless of what happens underneath. For $2002 and $2007, it also sets a flag indicating that the NMI flag should be cleared, or that the buffer should be refilled. Presumably, the PPU would act on this flag either at one of the following PPU clock edges (for the clear, could be either rising or falling), or at the start of the next memory cycle (2 PPU cycles long). Alternatively, it could treat a $2002 read as being an asynchronous clear of the NMI latch, though that might have some race issues with latching it for returning to the CPU. For the cases that could hit either edge,

For a write, the PPU could either look for the rising edge, and apply the write at one of the next PPU edges, or could wait until M2 falls before applying. Writes to $2005 would fill the latch used to reset the VRAM address, writes to $2006 would fill the temp latch that gets copied into aforementioned scroll regs, and set a flag to flush it to the VRAM address. Writes to $2007 probably set the buffer and a flag indicating a pending memory request. Writes to $2000 or $2001 probably go straight to the register, since from the PPU's perspective, they're read-only. Writes to $2004 probably write the value, and probably set an increment flag that the PPU will act on at it's convenience. One possibility for $2007 writes that comes to mind, is that since they're suppressed during rendering anyways, they might bypass the buffer, and drive the next two PPU cycles directly, since the data will stay steady across the entire access period.

For the SNES, the waves look something like:

Code: Select all

21MHz: 101010101010101010101010101010101010101010101010
SNS5F: 111100001111000011110000111100001111000011110000
SNESF: 111000001110000011100000111000001110000011100000
PPU:   110011001100110011001100110011001100110011001100
Either with a 50/50 CPU clock or a lopsided one, there's still only one PPU cycle covered by M2.

Determining what case actually covers reality could be done trivially with an oscilloscope that can show you M2 against the 21MHz input. Determining it programmatically is a little trickier, given that whether something happens on the positive or negative edge generally just delays it a cycle or not, and the CPU can't read from the PPU fast enough to see the difference. Inspection of the die shots that are floating around might be able to shed some light on the particulars as well, but I haven't gotten the hang of reading them yet, and I think the 2C02 one is still missing a number of layers.

What this means for emulation:

Your approach of doing CPU reads 4 cycles earlier than writes working suggests the PPU latches the output on reads at the rising edge of M2, and latches incoming data on writes at the falling edge. It may not be 100% correct to the hardware, but it's probably within what can be determined without an oscilloscope and/or a logic analyzer with a rather high oversampling rate.

As you can see above, it's not as simple as just interleaving the CPU and PPU, though for a lot of things, there's no visible difference.
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Post by Near »

So then the consensus is that nobody knows perhaps the most fundamental aspect of a parallel-operation system. Wonderful =(

In that case, short of adding a hack to force-block an NMI, I can see no way to pass blargg's test 07, without subsequently failing test 03. And yet others seem to solve it. So is everyone else special casing the last point that NMI can be enabled and blocking it; or is there some other trick? If the latter, please explain, would be most helpful :/
ReaperSMS
Posts: 174
Joined: Sun Sep 19, 2004 11:07 pm

Post by ReaperSMS »

It's hardly the most fundamental aspect.

If I'm reading the expected output correctly for the tests, assuming 0 is based the same for each, what we want is something that will produce this:

Code: Select all

cyc  0 1 2 3 4 5 6 7 8
stat V V V V V V - - -
ctrl N N N N N - - - -
Where stat is whether $2002.d7 read back as true or not, ctrl is whether an NMI occurs with a write to $2000 on that dot.

This looks perfectly reasonable, given the assumption that the PPU regs update at or near the end of it's cycle. The internal vblank flag (NMI hold in your code I believe) gets cleared during cycle 5. A read that hits on that cycle will return the state of that flag early in the cycle, whereas a write to $2000 won't hit the register (and the logic that does VBLANK & ENABLE) until the end of the cycle, after the vblank flag drops.

Here's a rough timeline:

Code: Select all

4R 4F 5R 5F 6R
      R
         C
            W
R is the read, C is when the vblank flag is cleared, W is when the written data hits the reg for $2000.

One other thing that comes to mind, though I don't think it applies here, is that if they decided to synchronize the CPU signals to the PPU clock for whatever reason, that would add a 1-2 cycle latency. It's exceedingly unlikely, since it would mean stuffing another bit or two worth of flipflops on all the data lines, M2, and R/W, and would be somewhat unneccesary.

One last thing is that IIRC, NMI coming into the CPU might be synchronized, though with the divider, there's some question as to which clock it's synchronized to. If so, NMI pulses shorter than a CPU cycle might be invisible.
Last edited by ReaperSMS on Mon Oct 17, 2011 7:26 pm, edited 1 time in total.
User avatar
Zepper
Formerly Fx3
Posts: 3264
Joined: Fri Nov 12, 2004 4:59 pm
Location: Brazil
Contact:

Post by Zepper »

byuu wrote:Zepper, I realize it was a while ago, but may I ask how you fixed test 07, if you still recall?
What test are you talking about?
ReaperSMS
Posts: 174
Joined: Sun Sep 19, 2004 11:07 pm

Post by ReaperSMS »

He means test 7 out of ppu_vbl_nmi,

Code: Select all


07-nmi_on_timing
----------------
Tests NMI occurrence when enabled near time
VBL flag is cleared.

Enables NMI one PPU clock later on each line.
Prints whether NMI occurred.

00 N
01 N
02 N
03 N
04 N
05 -
06 -
07 -
08 -
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Post by Near »

ReaperSMS wrote:It's hardly the most fundamental aspect.
The reason I disagree is because changing this alignment will require changing all timing, everywhere. And without having absolute values, you can't give definitive "this behavior happens at this cycle." Take the part where it caches the BG&&OAM enable for skipping the extra PPU dot. Nobody says "it happens at X=338", because where it happens depends on how you choose to guess how the interleaving works.

To me, this is major. This is the core foundation of your emulator. You don't want to build your castle on top of a poor foundation. And nobody can document the completely accurate process of how the hardware works.

Instead, everyone plays games of fiddling with numbers and hold latches and stuff until they can pass a given set of test ROMs that cover maybe 5% of the total possible combinations. You may nail blargg's $2000/$2002 timing tests, but are there also timing tests for $2003, $2004, $2006? Or what about with the more complex 64 PPU registers on the SNES?

Minor flaws in the lowest level tend to multiply as you get to higher and higher levels of operation, requiring increasingly complicated hacks to match behaviors. If you want to match hardware in all your tests, you need to get everything exactly right, always.
A read that hits on that cycle will return the state of that flag early in the cycle, whereas a write to $2000 won't hit the register (and the logic that does VBLANK & ENABLE) until the end of the cycle, after the vblank flag drops.
Well, the issue is that three entire PPU cycles complete between each CPU cycle. And with our stock "don't know/care" approach of CPPP CPPP CPPP for both reads and writes, that means the CPU $2002 read or $2000 write will always occur on the very edge of every third PPU cycle.
ReaperSMS
Posts: 174
Joined: Sun Sep 19, 2004 11:07 pm

Post by ReaperSMS »

What I meant was that at some point or another, you will hit a level where you technically don't (or can't) know the specific timing, and that the current knowledge is as good as the set of tests we have so far. Somewhere in there, emulation has to adopt some sort of compromise between simulating every last detail, and abstracting slightly in a way that maintains software compatability. Otherwise you end up with scope creep, and suddenly you're trying to simulate process variations, or speed vs temperature variations, and have to find a way to characterize some particular chip. If you aren't trying to hit that point (and that point is rather extreme) the usual compromise is to emulate a system that, to every observing program known, is indistinguishable from the Real Thing within reason. Power-up synchronization is something that is fairly reasonable for an emulator to force to a case that cleans up some edge issues.

Even at the HDL level, compromises have to be made. Partly because we can't know certain things without doing a full RE of a die shot (and as awesome as visual 6502 is, their simulator is simulating ideal transistors in a lot of spots, because reality is messy) or because certain constructs don't work with modern logic. The NMOS 6502 is full of pass gates doubling as dynamic latches, which is something you just can't really do in an FPGA. In particular, dynamic level sensitive latches are not (reasonably) synthesizable, so they have to be 'emulated' as some form of edge sensitive flip flop. That's going to be a somewhat lossy process, but most of the time it doesn't matter in practice, due to guarantees made by other parts of the hardware (such as ensuring none of the input signals will change across the latch enable)

Other compromises come in, to deal with hardware details such as having to share one ram chip among everything, and ways to deal with that.

As for the specifics...

What we care about is observable behavior. In this case, we have test roms, that run on the actual NES, and show repeatable behavior. As anomie's timing doc says, when you give a proclamation of "X happens on dot Y", there's at least half a dot of uncertainty as to exactly when that happens. Most of the time, that doesn't interact with some other behavior, but when it does, those uncertainties come into play.

Most of the PPU regs are not readable, so if you have some event you can get a PPU cycle accurate test for, they should have repeatable behavior.

Also, cycles are not atomic, monolithic things. Nothing happens instantly, and the chips we're dealing with are more complex than those that can be handled by a single level of logic. Ergo, to "properly" emulate the CPU, sans abstractions and "hacks", you'd need to drop down to at least a half-cycle, and emulate the sub-cycle signalling.

All of these details fall out of what the CPU expects and provides, regarding M2, R/W, and the A & D buses. It expects peripherals to start driving the data for a particular address around the time M2 goes high, expects that data to stabilize before M2 goes low, and latches the value on the pins when it does. Likewise, when it writes, peripherals know that M2 means the CPU's trying to write, but they don't know how long it will take the value to settle. When M2 drops, they know the CPU thinks it's had enough time.

The suppression behavior (test 06) could be reproduced either with some trickery inside the PPU, delaying some signals more than others, etc, by having the CPU ignore NMI pulses shorter than a cycle, or by having the NMI output of the PPU be registered in a fashion that lets the clear and set logic settle such that there are no externally visible glitches on the line. (glitch in this case would be the hardware definition, where you see a spurious edge during switching) Any of these three could be what the Real Thing does, but they're otherwise indistinguishable in that test, so you usually implement whichever one fits your language best.
User avatar
thefox
Posts: 3139
Joined: Mon Jan 03, 2005 10:36 am
Location: Tampere, Finland
Contact:

Post by thefox »

Zelex wrote:If no matter what its startup state dependent, which you can't easily test for, then most games should not rely on the PPU/CPU sync behavior for correct operation. Therefore, it should be pretty irrelevant, no? Hack in correct behavior for the one or two games that require it, and you make your life easier in the process.
I'm pretty sure you could detect the PPU syncs by seeing how the OAM readback behaves (which bytes get corrupted), but of course no games rely on that. But some emulators have goals besides just getting games to run, which in fact doesn't require that accurate low level emulation at all (just look at FCEUX, it's scanline based).
Post Reply