Well, I've hit my limit. Unless someone can shed more light, this is about as good as it's going to get.
There's a fantastically evil test ROM here:
https://forum.freeplaytech.com/showthread.php?tid=85
The first test runs a loop that is "daa" repeated ten times (40 states), plus another ten states to increment a counter and loop.
I confirm that my emulation core performs exactly 50 states for each iteration of this loop.
It sets up timer 0 into "T16" mode (mode bits = %10), with a compare value of 0x32 (or 50.)
mic_ states that the interrupt fires approximately every 12800 states.
Given an input frequency of 6144000hz ... the TMP95C061 diagram shows the prescaler gets clock rate / 4.
SysPro.pdf is misleading here: it tells you the input clock to the prescaler is always 384KHz, but that's wrong.
The value into the prescaler is based on the clock gear setting. So it's 6144000/(4,8,16,32,64).
Everything on this system pretty much runs at the fastest clock rate, so the prescaler input is 1536000hz.
In order for the timer to fire every 12800 states, it has to increment every 256 states. 256 * 50 = 12800.
In order for T16 to fire every 256 states, T16 has to run at a frequency of 24000hz.
But if we go off of the dev manual, T16 is 128/fc. 6144000/128 = 48000hz.
And indeed, if I don't use the dev manual values for T1 (8/fc), T4 (32/fc), T16 (128/fc), T256 (2048/fc), then the audio in everything runs at half speed. I've checked and triple-checked things. It affects both the BIOS intro music (which uses FF3 invert on T3 compare match to fire Z80 IRQs) and Sonic Pocket's PSG playback (which runs on a regular T1 clock off of timer 0.)
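The timer arithmetic above can be checked with a short sketch. It treats one state as one cycle of the 6144000hz clock, as the numbers in this post do; the function and variable names are made up for illustration.

```python
FC = 6_144_000            # input clock in Hz, from the post
COMPARE = 0x32            # timer 0 compare value (50)
OBSERVED_PERIOD = 12_800  # states between interrupts on hardware, per mic_

# Dev manual divisors: the timer ticks once every N cycles of fc.
MANUAL_DIVISORS = {"T1": 8, "T4": 32, "T16": 128, "T256": 2048}


def tick_rate(divisor, fc=FC):
    """Timer tick frequency in Hz for a given divisor."""
    return fc // divisor


def interrupt_period(divisor, compare=COMPARE):
    """States between compare-match interrupts (assumes 1 state = 1/fc)."""
    return divisor * compare


# What the dev manual predicts for T16 with compare = 50.
manual_period = interrupt_period(MANUAL_DIVISORS["T16"])

# What the hardware measurement implies instead.
implied_divisor = OBSERVED_PERIOD // COMPARE

print("manual T16 rate:", tick_rate(MANUAL_DIVISORS["T16"]), "Hz")
print("manual interrupt period:", manual_period, "states")
print("implied divisor:", implied_divisor)
print("implied T16 rate:", tick_rate(implied_divisor), "Hz")
```

The dev manual divisor gives an interrupt every 6400 states, while the measured 12800 states implies a divisor of 256 (24000hz), which is the factor-of-two disagreement described here.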
I've checked and confirmed my Z80 is running at the proper 3072000hz.
I run my PSG at 192000hz instead of 3072000hz because the counters are stepped every 16 clocks, and nothing runs faster than that. It would be wasteful to synchronize when the internal state cannot change.
To be sure, I tried adjusting the Z80 and PSG frequencies anyway, and it does not produce the correct sound.
However, circling back to ngp_bus ... it's clear that higan is emulating the loops twice as fast.
My only conclusion is that the main loop is not in fact taking 50 states, and the cycle timings mic_ lists are not accurate.
This is quite possibly the case. Prefetch plays a heavy role here. If the loop truly took 50 states, then the output value would be 0xfe. And yet the value is 0xea for mic_ on real hardware.
I can match a value of 0xea if I force ROM accesses to always incur an extra 2-state penalty on every read. The problem is, the BIOS very clearly sets up the ROM to incur no extra state penalties on accesses by writing B0CS, B1CS = 0x17.
But now we look at the RAM test, and we see 0xfc with an aligned program counter, 0xf7 with an unaligned program counter. RAM is 16-bit instead of 8-bit. Specifically, this test is running out of the APU RAM, not the CPU RAM, for whatever that's worth, but the Z80 is not active during the second test.
I get 0x1fe with no extra waits on RAM. If I add in a wait of 1, it drops to 0x144. A wait of 2 drops to 0x0ea, which is too slow.
In order to get the second value to not be the same as the first value, I have to make my prefetch reload fetch 16-bit words, so that the timing is different between aligned and unaligned loop starting points.
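A minimal sketch of why word-sized prefetch reloads make aligned and unaligned starting points differ. The 4-byte reload size and the 2 states per bus access are assumptions for illustration, not measured values, and the function names are hypothetical.

```python
def bus_accesses(address, length, bus_bytes):
    """Number of bus accesses needed to read `length` bytes at `address`
    when every access transfers one naturally aligned unit of `bus_bytes`."""
    first_unit = address // bus_bytes
    last_unit = (address + length - 1) // bus_bytes
    return last_unit - first_unit + 1


def reload_states(address, length=4, bus_bytes=2, states_per_access=2, wait=0):
    """States to refill `length` bytes of prefetch after a jump.

    `length`, `states_per_access` and `wait` are placeholders: the real
    values are exactly what the hardware tests would have to determine.
    """
    return bus_accesses(address, length, bus_bytes) * (states_per_access + wait)


# 16-bit bus: an unaligned target needs one extra access.
print(reload_states(0x7000), reload_states(0x7001))
# 8-bit bus: alignment makes no difference.
print(reload_states(0x7000, bus_bytes=1), reload_states(0x7001, bus_bytes=1))
```

With byte-sized fetches the two starting points cost the same, so under this model the aligned and unaligned results can only differ when the reload fetches words.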
...
So my final takeaway is: there are too many variables at play.
The first test I need performed on hardware is to determine the true wait state penalties of each memory area:
I/O registers
BIOS ROM
CPU RAM
APU RAM (Z80)
VPU RAM (K1GE,K2GE)
Cartridge Flash 0 (CS0 -- CS1 should be the same)
for:
8-bit reads
16-bit aligned reads
16-bit unaligned reads
32-bit aligned reads
32-bit unaligned reads
And for the cartridge flash, I need these values for every B0CS value:
%000 - %111
If d2 really is the 8-bit select mode, then I need the values for %00-%11. %10 will likely hang the system waiting on the /WAIT pin then.
The test should basically be a loop that sets one of the requested chip select timings (for flash ROM/CS0 only), performs one of the read types, and loops for a few seconds. The test should also run "dry" without a read, and that value should be subtracted from all of the other tests. The results will be biased by the amount of additional states it takes to execute the read instruction we choose, but I don't know any way around that.
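As a rough sketch, the test matrix and the dry-run subtraction could be laid out like this. It assumes each run reports elapsed states and an iteration count, which is only one possible way to record results; the names are hypothetical.

```python
from itertools import product

AREAS = ["I/O registers", "BIOS ROM", "CPU RAM", "APU RAM", "VPU RAM",
         "Cartridge Flash 0"]
READS = ["8-bit", "16-bit aligned", "16-bit unaligned",
         "32-bit aligned", "32-bit unaligned"]
B0CS_SETTINGS = range(0b000, 0b111 + 1)  # %000 - %111, flash only


def test_cases():
    """Every (area, read type, B0CS setting) combination requested above.
    The setting is None for areas whose timing is not configurable here."""
    cases = []
    for area, read in product(AREAS, READS):
        if area == "Cartridge Flash 0":
            cases.extend((area, read, setting) for setting in B0CS_SETTINGS)
        else:
            cases.append((area, read, None))
    return cases


def read_cost(states, iterations, dry_states, dry_iterations):
    """States per iteration attributable to the read, after subtracting
    the dry loop. Still includes the read instruction's own overhead."""
    return states / iterations - dry_states / dry_iterations


print(len(test_cases()), "test cases")
# Made-up numbers: 60 states per iteration with the read, 50 without.
print(read_cost(6000, 100, 5000, 100))
```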
Once I have these numbers, I can deduce the core timings for each memory access. Once I have those numbers, the next step will be a huge amount of investigation into the prefetch behavior. I am particularly stuck at understanding if and how the prefetcher runs in parallel to the rest of the CPU. Emulating that will be monstrously expensive. My best guess is that it'll suck up idle cycles where it can, but the second it needs the result of a prefetch, it'll block the instruction for additional states until the prefetcher is done. It also seems pretty clear the prefetcher favors word accesses, but if you jump to an unaligned memory address, a penalty is incurred. My guess is this only applies to the very first prefetch. I don't know if it will try to pull in just one byte here, or try and fetch a word and discard the low byte. The byte mode would be faster for 8-bit buses, so that's more likely the case.
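To make that guess concrete, here is a toy accounting model of a prefetcher that refills during execute states and stalls the CPU only when an instruction's bytes are missing. Everything in it is an assumption for illustration (byte-sized fetches, the 4-byte queue, the fetch cost, no jumps), not confirmed TLCS900 behavior.

```python
def total_states(instructions, fetch_states=2, queue_bytes=4):
    """Total states for a list of (length_bytes, execute_states) pairs.

    Hypothetical model: the prefetcher fetches one byte per `fetch_states`
    states, runs while the CPU executes, and stalls the CPU only when the
    queue holds fewer bytes than the next instruction needs.
    """
    queued = 0    # bytes currently in the prefetch queue
    progress = 0  # states already spent on the fetch in flight
    total = 0
    for length, execute in instructions:
        # Stall: finish fetches until the whole instruction is available.
        while queued < length:
            total += fetch_states - progress
            progress = 0
            queued += 1
        queued -= length
        # Overlap: the prefetcher uses the execute states to refill the queue.
        total += execute
        budget = execute
        while budget > 0 and queued < queue_bytes:
            step = min(budget, fetch_states - progress)
            progress += step
            budget -= step
            if progress == fetch_states:
                queued += 1
                progress = 0
    return total


# Ten 1-byte, 4-state instructions in a row (like the "daa" run, no jump).
daa_run = [(1, 4)] * 10
print(total_states(daa_run, fetch_states=2))  # fetch keeps up with execution
print(total_states(daa_run, fetch_states=6))  # fetch slower than execution
```

In this model the run costs 42 states when fetches are cheap (only the initial fill is visible) and 64 states when each fetch takes 6 states, so slow memory shows up as stalls rather than as a fixed per-read penalty.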
Once we have the bus access timings and the prefetcher timed correctly, then we will basically have to write test loops for every instruction possible to suss out the amount of idle states that each instruction actually takes.
Once that is done, then we get to take on the true behemoth ... when the TLCS900 and Z80 share the APU RAM (0x7000-0x7fff), the chips stall each other out. That will be a world of pain as well, but ngp_bus is a good start once we can pass the non-Z80 tests.
...
So, if anyone's up for writing test ROMs, please let me know. I don't have the hardware for this, so I'm stuck.
If no one's interested then well, this is the end of my NGPC emulation progress. Hopefully someone will take this up in the future.
I spent today creating abstracted bus handlers for all memory types (I/O registers, BIOS ROM, CPU RAM, APU RAM, VPU RAM, flash 0, flash 1, and the unused CS2, CS3, CSX.) The bus width and wait penalties of each can be modified freely. The flexibility is there to adjust this as new discoveries are found. All CPU instructions have their state counts easily modifiable. There's a rudimentary prefetcher that always tries to stay full and will reload completely on jumps, which is definitely wrong. So I've provided all the tools and support to implement correct timing, should the correct values become known later.