Neo Geo Pocket emulation

JamesHall7
Posts: 4
Joined: Sat Mar 09, 2019 1:54 am

Re: Neo Geo Pocket emulation

Post by JamesHall7 »

byuu wrote:I couldn't find any games that use the "D" button (from the SNK manual), so I unmapped binding it as an input for now, since having it would mean that "Option" would have to be named either "C" or "C/Option", both of which are super ugly.
At least Card Fighters 2 makes use of it (it brings up a debug menu).
[There was an SNK board that provided TV output and took an AES control, so perhaps that's where it could be used? A video of it can be seen here.]
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Re: Neo Geo Pocket emulation

Post by Near »

It turns out there are three separate checks to get the BIOS to skip the setup on subsequent runs, and it's a huge mess. I have to hack around all three for now, so I'll mark this as lower priority.

The much bigger problem is this nightmare cycle timing.

The CPU itself can perform 8-bit, 16-bit, and 32-bit memory reads. These reads go through the bus handler logic to choose between chip selects 0, 1, 2, 3, external, or an internal access to I/O or RAM. Each selected chip can be configured as an 8-bit or as a 16-bit bus, and can have four different wait state modes (1, 2, 1+/WAIT, 0). The flash ROMs are 8-bit. I have no idea what all the internal memory is. If the CPU tries to read a 16-bit or 32-bit value with A0 set (i.e. an odd memory address), it splits things up. 16-bit reads become two 8-bit reads. 32-bit reads become two 8-bit reads plus one 16-bit read.
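To make the splitting rule concrete, here is a minimal Python sketch. The function name `split_access` is made up for illustration, and treating an 8-bit bus as pure byte accesses is an assumption of the sketch rather than something confirmed above.

```python
def split_access(address, size_bits, bus_bits):
    """Return the list of (address, width_bits) bus cycles for one CPU read.

    Follows the split described in the post for a 16-bit bus; treating an
    8-bit bus as pure byte accesses is an assumption of this sketch.
    """
    if size_bits not in (8, 16, 32) or bus_bits not in (8, 16):
        raise ValueError("unsupported access or bus width")
    cycles = []
    remaining = size_bits // 8
    while remaining:
        # A 16-bit cycle needs a 16-bit bus, an even address (A0 clear),
        # and at least two bytes still to transfer.
        if bus_bits == 16 and address % 2 == 0 and remaining >= 2:
            width = 16
        else:
            width = 8
        cycles.append((address, width))
        address += width // 8
        remaining -= width // 8
    return cycles


# Unaligned 32-bit read on a 16-bit bus: byte, word, byte.
print(split_access(0x4001, 32, 16))
```

An unaligned 16-bit read comes out as two byte cycles and an unaligned 32-bit read as byte, word, byte, which matches the description above.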

And now we have the 4-byte CPU instruction prefetch. I have no idea how this one works, but my guess is that it tries to do 16-bit accesses wherever possible, and will also try to run in parallel with program execution, e.g. performing fetches during bus idle cycles.

If I simply fetch 8-bits at a time to the prefetch, and honor the 0-states setting on flash ROM accesses the BIOS sets up, then executing code out of ROM is running twice as fast as it's supposed to. To get ROM to operate at the correct speed, I have to insert two additional wait states per 8-bit ROM access, beyond what the manual says is needed. This feels like nonsense... yet B0CS and B1CS are each set to 0x17, which is 8-bit mode, 0-wait states per the TMP95C061 manual. I also looked at several other TLCS900 SoCs, and although many shuffle the bits around or add extra state modes ... it's always the case that 0b00 is 0-wait states.

There is the possibility that B0CS is different for the NGPC, and 0x17 ends up setting a two-wait state mode. But after this, now code executing out of RAM is still running twice as fast as it should. Inserting the same penalties as ROM makes it slow down to ROM speeds, which is about 5-10% faster than it's supposed to be. But RAM is internal, so why in the world would it need additional wait states at all?

Even if we end up with additional wait states, then there's the question of how those factor into prefetch and DMA timings: does it just add to the amount of cycles needed per access? Then how would the instruction prefetch work if it tried to nab something during a two-state idle period, but it turns out that you need four states to load ROM? Would the idle cycle just get extended two more cycles? Sigh ...

So, yeah ... I don't know. There are far too many possible implementations and variables to try and brute force this timing. It's the same limitation I hit in trying to emulate the GBA ROM prefetch, or the SA-1 bus conflict timing. I'll keep trying to guess values until I get something as close as I can get to the real timings, but it's very unlikely I'll be able to get these timings correct through guessing.

> At least Card Fighters 2 makes use of it (it brings up a debug menu).

Fun, it prints gibberish tiles for me for the debug menu, even when 0x20001f is set to 0xff.
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Re: Neo Geo Pocket emulation

Post by Near »

Well, I've hit my limit. Unless someone can shed more light, this is about as good as it's going to get.

There's a fantastically evil test ROM here: https://forum.freeplaytech.com/showthread.php?tid=85

The first test runs a loop that is "daa" repeated ten times (40 states), plus another ten states to increment a counter and loop.
I confirm that my emulation core performs exactly 50 states for each iteration of this loop.
It sets up timer 0 into "T16" mode (mode bits = %10), with a compare value of 0x32 (or 50.)
mic_ states that the interrupt fires every approximately 12800 states.
Given an input frequency of 6144000hz ... the TMP95C061 diagram shows the prescaler gets clock rate / 4.
SysPro.pdf is vague here in telling you the input clock to the prescaler is always 384KHz, but that's wrong.
The value into the prescaler is based on the clock gear setting. So it's 6144000/(4,8,16,32,64).
Everything on this system pretty much runs at the fastest clock rate, so the prescaler input is 1536000hz.

For the timer to fire every 12800 states, it has to increment every 256 states. 256 * 50 = 12800.
In order for T16 to fire every 256 states, T16 has to run at a frequency of 24000hz.

But if we go off of the dev manual, T16 is 128/fc. 6144000/128 = 48000hz.
And indeed, if I don't use the dev manual values for T1 (8/fc), T4 (32/fc), T16 (128/fc), T256 (2048/fc), then the audio in everything runs twice as slow as it's supposed to. I've checked and triple-checked things. It affects both the BIOS intro music (which uses FF3 invert on T3 compare match to fire Z80 IRQs) and Sonic Pocket's PSG playback (which runs on a regular T1 clock off of timer 0.)
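The mismatch described above is just a handful of divisions, so here it is written out as a small Python sketch. It assumes, as this post does, that one CPU state equals one clock at 6.144 MHz; the constant names are made up for illustration.

```python
FC_HZ = 6_144_000            # oscillator frequency quoted in the post
COMPARE = 0x32               # timer 0 compare value (50)
STATES_PER_IRQ = 12_800      # figure reported by mic_

# Assumption under test: one CPU state equals one clock at 6.144 MHz.
states_per_tick = STATES_PER_IRQ // COMPARE      # 256
implied_t16_hz = FC_HZ // states_per_tick        # 24000
manual_t16_hz = FC_HZ // 128                     # 48000, T16 = 128/fc

print(states_per_tick, implied_t16_hz, manual_t16_hz)
print(manual_t16_hz / implied_t16_hz)            # 2.0, the factor-of-two gap
```

Under that assumption the dev manual rate is exactly twice what the 12800-state figure implies.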

I've checked and confirmed my Z80 is running at the proper 3072000hz.
I run my PSG at 192000hz instead of 3072000hz because the counters are stepped every 16 clocks, and nothing runs faster than that. It would be wasteful to synchronize when the internal state cannot change.
To be sure, I tried adjusting the Z80 and PSG frequencies anyway, and it does not produce the correct sound.

However, circling back to ngp_bus ... it's clear that higan is emulating the loops twice as fast.
My only conclusion is that the main loop is not in fact taking 50 steps, and the cycle timings mic_ lists are not accurate.

This may very well be the case. Prefetch plays a heavy role here. If the loop truly took 50 states, then the output value would be 0xfe. And yet the value is 0xea for mic_ on real hardware.

I can match a value of 0xea if I force ROM accesses to always incur an extra 2-state penalty on every read. The problem is, the BIOS very clearly sets up the ROM to incur no extra state penalties on accesses by writing B0CS, B1CS = 0x17.

But now we look at the RAM test, and we see 0xfc with an aligned program counter, 0xf7 with an unaligned program counter. RAM is 16-bit instead of 8-bit. Specifically, this test is running out of the APU RAM, not the CPU RAM, for whatever that's worth, but the Z80 is not active during the second test.

I get 0x1fe with no extra waits on RAM. If I add in a wait of 1, it drops to 0x144. A wait of 2 drops to 0x0ea, which is too slow.

In order to get the second value to not be the same as the first value, I have to make my prefetch reload fetch 16-bit words, so that the timing is different between aligned and unaligned loop starting points.

...

So my final takeaway is: there are too many variables at play.

The first test I need performed on hardware is to determine the true wait state penalties of each memory area:
* I/O registers
* BIOS ROM
* CPU RAM
* APU RAM (Z80)
* VPU RAM (K1GE, K2GE)
* Cartridge Flash 0 (CS0 -- CS1 should be the same)
for:
* 8-bit reads
* 16-bit aligned reads
* 16-bit unaligned reads
* 32-bit aligned reads
* 32-bit unaligned reads
And for the cartridge flash, I need these values for every B0CS value:
* %000 - %111
If d2 really is the 8-bit select mode, then I need the values for %00-%11. %10 will likely hang the system waiting on the /WAIT pin then.
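For reference, this is a small Python sketch of how the low three bits of a BnCS value would be pulled apart under the layout assumed in this thread (d2 as the bus width select, d1-d0 as the wait mode). The function name `decode_bncs` is hypothetical and the layout should be checked against the TMP95C061 manual before relying on it.

```python
def decode_bncs(value):
    """Split a BnCS value into (bus_bits, wait_field); layout assumed from the thread."""
    bus_bits = 8 if value & 0b100 else 16   # d2 set is taken to mean an 8-bit bus
    wait_field = value & 0b011              # d1-d0 select the wait mode
    return bus_bits, wait_field


# The value the BIOS writes to B0CS and B1CS.
print(decode_bncs(0x17))                    # (8, 3)

# Every low-three-bit pattern the requested hardware test would cover.
for pattern in range(0b000, 0b111 + 1):
    bus_bits, wait_field = decode_bncs(pattern)
    print(f"%{pattern:03b}: {bus_bits}-bit bus, wait field %{wait_field:02b}")
```

The sketch deliberately leaves the wait field as a raw number instead of naming each mode, since the mapping of modes to bit patterns is exactly what the hardware test is meant to confirm.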

The test should basically be a loop that sets one of the requested chip select timings (for flash ROM/CS0 only), performs one of the read types, and loops for a few seconds. The test should also run "dry" without a read, and that value should be subtracted from all of the other tests. The results will be biased by the amount of additional states it takes to execute the read instruction we choose, but I don't know any way around that.
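The dry-run subtraction could be post-processed along these lines. This is only a sketch: it assumes the test reports how many loop passes completed in a fixed window of states, and the numbers in the example are invented purely to show the arithmetic, not measurements.

```python
def states_per_read(window_states, dry_iterations, read_iterations):
    """Estimate the extra states one read adds to a timed loop."""
    dry_cost = window_states / dry_iterations      # states per empty loop pass
    read_cost = window_states / read_iterations    # states per pass with the read
    return read_cost - dry_cost


# Hypothetical counts, not measurements: a 1,000,000-state window in which
# the dry loop completes 100,000 passes and the read loop 62,500 passes.
print(states_per_read(1_000_000, 100_000, 62_500))   # 6.0
```

The result still includes the execution cost of the read instruction itself, which is the bias mentioned above.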

Once I have these numbers, I can deduce the core timings for each memory access. Once I have those numbers, the next step will be a huge amount of investigation into the prefetch behavior. I am particularly stuck at understanding if and how the prefetcher runs in parallel to the rest of the CPU. Emulating that will be monstrously expensive. My best guess is that it'll suck up idle cycles where it can, but the second it needs the result of a prefetch, it'll block the instruction for additional states until the prefetcher is done. It also seems pretty clear the prefetcher favors word accesses, but if you jump to an unaligned memory address, a penalty is incurred. My guess is this only applies to the very first prefetch. I don't know if it will try to pull in just one byte here, or try and fetch a word and discard the low byte. The byte mode would be faster for 8-bit buses, so that's more likely the case.

Once we have the bus access timings and the prefetcher timed correctly, then we will basically have to write test loops for every instruction possible to suss out the amount of idle states that each instruction actually takes.

Once that is done, then we get to take on the true behemoth ... when the TLCS900 and Z80 share the APU RAM (0x7000-0x7fff), the chips stall each other out. That will be a world of pain as well, but ngp_bus is a good start once we can pass the non-Z80 tests.

...

So, if anyone's up for writing test ROMs, please let me know. I don't have the hardware for this, so I'm stuck.
If no one's interested then well, this is the end of my NGPC emulation progress. Hopefully someone will take this up in the future.

I spent today creating abstracted bus handlers for all memory types (I/O registers, BIOS ROM, CPU RAM, APU RAM, VPU RAM, flash 0, flash 1, and the unused CS2, CS3, CSX.) The bus width and wait penalties of each can be modified freely. The flexibility is there to adjust this as new discoveries are found. All CPU instructions have their state counts easily modifiable. There's a rudimentary prefetcher that always tries to stay full and will reload completely on jumps, which is definitely wrong. So I've provided all the tools and support to implement correct timing, should the correct values become known later.
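As an illustration of what "always tries to stay full and reloads completely on jumps" means in terms of state counts, here is a toy Python model. It is not higan's code; the class name, the byte-at-a-time refill, and the per-byte cost parameter are all assumptions made for the sketch.

```python
class Prefetcher:
    """Toy 4-byte instruction queue that refills eagerly and flushes on jumps."""

    CAPACITY = 4

    def __init__(self, states_per_byte):
        self.states_per_byte = states_per_byte   # cost of one 8-bit fetch
        self.level = 0                           # bytes currently queued
        self.states = 0                          # total states spent fetching
        self.refill()

    def refill(self):
        # Always top the queue up to capacity, one byte per bus access.
        missing = self.CAPACITY - self.level
        self.states += missing * self.states_per_byte
        self.level = self.CAPACITY

    def consume(self, count):
        # The CPU takes `count` opcode bytes, then the queue refills at once.
        if not 0 < count <= self.CAPACITY:
            raise ValueError("count must be between 1 and the queue capacity")
        self.level -= count
        self.refill()

    def jump(self):
        # A taken branch discards everything queued and reloads completely.
        self.level = 0
        self.refill()


queue = Prefetcher(states_per_byte=2)
queue.consume(1)     # one-byte opcode
queue.jump()         # taken branch throws away the whole queue
print(queue.states)  # 18
```

Because every consumed byte is refetched immediately, this model never saves any states over fetching on demand, and every taken jump pays for a full four-byte reload.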
ccovell
Posts: 1045
Joined: Sun Mar 19, 2006 9:44 pm
Location: Japan

Re: Neo Geo Pocket emulation

Post by ccovell »

Your posts are all interesting, but they just add to my horror at the CPU thing inside of the NGP. WHY did they use this thing, in lieu of the more sensible ARMs or low-power MCUs/CPUs coming onto the market???
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Re: Neo Geo Pocket emulation

Post by Near »

The best theory I've heard is they had a boatload of these sitting around from some other project of theirs. But they went on to sell two million of these. It's ... a really good question.

I may have managed to find a bit more information by chance.

So right now, my prefetch doesn't really save any cycles at all. It always keeps the buffer filled, so we can essentially call it a wash (actually a bit slower than not having one if there are BnCS wait states, heh.) And I execute four states for DAA. Right now, I treat one state as one cycle @ 6.144MHz. This results in the ROM timing test running twice as fast as it should.

If we go off of the diagram here:
https://www.dataman.com/media/datasheet ... C061BF.pdf

Each state is shown as X1 transitioning low,high,low,high. Considering that the prescaler is said to take a 1/4 clock input, perhaps the 6MHz is a lie, and it's really four clocks per state. That will make the ROM timing test run twice as slow as it should.

But now what if we consider that DAA being four states, two of them are filling the prefetch buffer? Since ROM is an 8-bit bus, it'll take two states per 8-bit read. Thus if we skip the penalty for the next DAA instruction fetch, it ends up taking two states. And now we have roughly the correct timing.

Even more interesting is the micro DMA timing breakdown ... the first three states perform one prefetch operation. But does it really need three states to do a single prefetch? Maybe ... the read cycle of the DMA is two states plus one dummy cycle but with the read address still on the bus A0-23 pins. Curious. You'd think if they could, they'd have it be:
* first two cycles is a prefetch
* next two cycles is a read
* next two cycles is a write
Instead of:
* first three cycles is a prefetch
* next two cycles is a read
* next cycle is a dummy cycle with the read address still on the bus
* next two cycles is a write
That would take 25% fewer cycles. There has to be a good reason.
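Counting the two sequences above in Python, with the phase names used only as labels:

```python
hoped_for = {"prefetch": 2, "read": 2, "write": 2}
documented = {"prefetch": 3, "read": 2, "dummy": 1, "write": 2}

hoped_total = sum(hoped_for.values())        # 6
documented_total = sum(documented.values())  # 8
saving = 1 - hoped_total / documented_total  # 0.25
print(hoped_total, documented_total, saving)
```

So the shorter sequence would need six cycles instead of eight per transfer.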

Next, it says micro DMA does not occur if the instruction queue has three or more (four) bytes available. You'd think it would try and fill in that last byte if it were an 8-bit design. I'm still not entirely sure about this one. Performing 16-bit accesses to an 8-bit bus is going to be wasteful as a lot of times you'll branch and not need the prefetched byte. But then again, if it doesn't fetch when there's three bytes, it's probably intending to fetch 16-bits, which would put you at five bytes, so obviously it wouldn't want to do that. Hmm.

Finally, it talks about how if the source and target are 8-bit buses, then an extra two states are incurred for each. Combine this with the statement that the timing diagrams assume zero added wait states from BnCS.

So now we're getting somewhere ... I'm going to guess the TLCS900 prefetch is just really simple like with the 68K and the opportunity points to perform prefetches are hard-coded into each instruction's idle timings. It looks like, on the whole, bus accesses incur two states plus whatever additional wait states BnCS asks for. The entire chip likely stalls when prefetch happens and there's extra wait states, but generally we don't have to worry about that as the NGPC never uses additional BnCS wait states.
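That guess can be written down as a one-line cost model. This is a sketch of the hypothesis only: the function name is made up, it ignores odd-address splitting, and applying the BnCS waits once per bus cycle is an assumption.

```python
def access_states(transfer_bits, bus_bits, extra_waits=0):
    """States for one transfer, assuming 2 states plus BnCS waits per bus cycle."""
    # Ceiling division; for simplicity this ignores odd-address splitting.
    cycles = -(-transfer_bits // bus_bits)
    return cycles * (2 + extra_waits)


print(access_states(16, 16))   # 2
print(access_states(16, 8))    # 4
```

A 16-bit transfer over an 8-bit bus costs two states more than over a 16-bit bus in this model, which lines up with the extra two states the micro DMA notes mention for 8-bit buses.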

Figuring out what's actually happening during each cycle of each instruction of the TLCS900/H will require a logic analyzer to single-step and watch the pins. That's way beyond my abilities, I'm afraid. Best I can do is guess, and mic_'s test ROM only has four instructions per loop so ... it's a good starting point, I suppose.
FitzRoy
Posts: 144
Joined: Wed Oct 22, 2008 9:27 pm

Re: Neo Geo Pocket emulation

Post by FitzRoy »

byuu wrote:So, if anyone's up for writing test ROMs, please let me know. I don't have the hardware for this, so I'm stuck.
If no one's interested then well, this is the end of my NGPC emulation progress. Hopefully someone will take this up in the future.
Well, you had to know you were going to hit a wall without test roms. Seems like each system has already been emulated up to that wall, and that the only thing left to do in the scene is exactly this kind of nailing down of details. It's probably also true that if you were going to do it, it would be easier now while things are fresh in your mind. Flashmasta sells NGP and WSC flash carts. If you really needed a 21fx style device over these, why not throw some money at qm to make one?

p.s. is your forum coming back dude?
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Re: Neo Geo Pocket emulation

Post by Near »

NES could be better, I just really don't want to rewrite all the mappers first. Game Boy only got worse every time I emulated a new test ROM, so I'm very averse to touching that anymore. Genesis is held back because I'm not able to understand a single VDP FIFO explanation out there -- my fault, others have figured it out. FM chips are nightmare fuel, don't expect much progress from me on those. Everything else other than the SNES is held back by a lack of software testing on real hardware. The SNES is held back by a lack of PPU logic analyzer traces. The WonderSwan is held back by the undumped IPLROMs.

I have no plans to open another general public forum, sorry. A development only forum may happen later.
Last edited by Near on Fri Mar 22, 2019 2:15 pm, edited 1 time in total.
lidnariq
Posts: 11432
Joined: Sun Apr 13, 2008 11:12 am

Re: Neo Geo Pocket emulation

Post by lidnariq »

byuu wrote:The SNES is held back by a lack of PPU logic analyzer traces.
What do you need there?
Last edited by lidnariq on Fri Mar 22, 2019 2:56 pm, edited 1 time in total.
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Re: Neo Geo Pocket emulation

Post by Near »

I need detailed information on what's on the two PPU buses during each clock cycle of a scanline (at 10.74MHz frequency) in all video modes. All BG layers and features that can be enabled, should be. It would also help to have individual layers enabled one by one to see if it changes timings. Sprites will be toughest because there's a lot of sprite settings that will affect fetch patterns, so I don't have a clear answer for what tests are needed. But we could start with just filling a line with sequential 8x8 sprites and go from there.

From this, we need to work out at which cycle the PPU performs memory reads. It's critical for OAM, a bit less important but still useful for CGRAM, and for VRAM it only matters for edge cases like toggling the PPU display disable bit or changing the video mode mid-scanline.

Once we have the external fetch pattern timings correct, then I can start trying to determine internal latch points for the various I/O registers, which I can do visually with carefully constructed software tests and observing the screen output.
lidnariq
Posts: 11432
Joined: Sun Apr 13, 2008 11:12 am

Re: Neo Geo Pocket emulation

Post by lidnariq »

By "two PPU buses" do you mean the [edit: PPU] RAM bus and the inter-PPU bus?
Last edited by lidnariq on Fri Mar 22, 2019 3:40 pm, edited 1 time in total.
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Re: Neo Geo Pocket emulation

Post by Near »

No, there are two PPU chips. PPU2 only talks to PPU1, but I still need to know what gets sent between them and when.
Rahsennor
Posts: 479
Joined: Thu Aug 20, 2015 3:09 am

Re: Neo Geo Pocket emulation

Post by Rahsennor »

byuu wrote:NES could be better, I just really don't want to rewrite all the mappers first. Game Boy only got worse every time I emulated a new test ROM, so I'm very averse to touching that anymore. Genesis is held back because I'm not able to understand a single VDP FIFO explanation out there -- my fault, others have figured it out. FM chips are nightmare fuel, don't expect much progress from me on those. Everything else other than the SNES is held back by a lack of software testing on real hardware. The SNES is held back by a lack of PPU logic analyzer traces. The WonderSwan is held back by the undumped IPLROMs.
I feel like a kid interrupting the adults at work here, but I'd just like to chime in and thank you for all the hard work you've poured into emulating so many crazy systems. The way people treat old hardware these days it's going to be all we have left a lot sooner than we think. I wish there was something I could do to contribute, but I wouldn't know a logic analyzer from a hole in the ground, so all I can do is wish you luck.
mic_
Posts: 922
Joined: Thu Oct 05, 2006 6:29 am

Re: Neo Geo Pocket emulation

Post by mic_ »

I have a feeling that it's not synchronizing the DAC value generation and the NGPC audio playback, and is instead cycle timing things perfectly. If that's the case, then being off at all would result in the two desyncing and the CPU code overwriting DAC samples before they are played.
My YM emulation code writes the sample data in 640-byte chunks into a 4kB circular buffer, and a uDMA channel is set up to transfer 2 bytes (left & right sample) from this buffer to the audio DAC @ 16000 Hz.
Initially, 6 chunks (3840 bytes) of sample data are generated before the uDMA channel is started.
After that there's an infinite loop which first waits until DMAS0 is at least half a chunk (320 bytes) ahead of the next write position, and then generates another chunk of sample data. I don't think it should be possible to get any buffer overruns, but I can't swear on it since I don't remember what the best-case cycle count for my YM emulation is.
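For a sense of scale, the buffer figures above translate into playback time as follows. This is a Python sketch, not the player's code: "4kB" is read as 4096 bytes, and the `read_lead` helper is one possible interpretation of "DMAS0 is ahead of the next write position".

```python
SAMPLE_RATE_HZ = 16_000     # uDMA transfers per second
BYTES_PER_FRAME = 2         # one left and one right sample
CHUNK_BYTES = 640
BUFFER_BYTES = 4096         # "4kB" read as 4096 bytes (assumption)
PREFILL_CHUNKS = 6


def playback_ms(byte_count):
    """Playback time in milliseconds covered by byte_count bytes of sample data."""
    return byte_count * 1000 / (BYTES_PER_FRAME * SAMPLE_RATE_HZ)


def read_lead(read_pos, write_pos):
    """Bytes from the next write position forward to the DMA read position."""
    return (read_pos - write_pos) % BUFFER_BYTES


print(playback_ms(CHUNK_BYTES))                     # 20.0
print(playback_ms(PREFILL_CHUNKS * CHUNK_BYTES))    # 120.0
print(read_lead(0, PREFILL_CHUNKS * CHUNK_BYTES))   # 256
```

Each chunk covers 20 ms of audio and the initial fill covers 120 ms, so under these assumptions the generator has to produce a chunk in well under 20 ms on average to stay ahead of the DMA.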

> It sets up timer 0 into "T16" mode (mode bits = %10), with a compare value of 0x32 (or 50.)
> mic_ states that the interrupt fires every approximately 12800 states.
I don't have my documents handy right now, but in the terminology I'm used to I'm using the T4 clock source. That is, 48000/4 == 12000 Hz.
With the TLCS-900/H being clocked at 6,144,000 Hz and each state being two clock cycles, you've got 3,072,000 states/second.
The timer interrupt is set to trigger an interrupt after 50 ticks, i.e. 12000/50 == 240 times per second. So there should be approximately 3072000/240 == 12800 CPU states until the interrupt fires.
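The same arithmetic as a few lines of Python, using only the figures given in this post (the constant names are just labels):

```python
CLOCK_HZ = 6_144_000
CLOCKS_PER_STATE = 2          # each state is two clock cycles
TIMER_HZ = 48_000 // 4        # 12000 Hz timer clock source
COMPARE = 50                  # ticks per interrupt

states_per_second = CLOCK_HZ // CLOCKS_PER_STATE        # 3072000
irqs_per_second = TIMER_HZ // COMPARE                   # 240
states_per_irq = states_per_second // irqs_per_second   # 12800
print(states_per_second, irqs_per_second, states_per_irq)
```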
Supposedly it's a 15-bit LFSR with taps on d0 and d2, although that's apparently not 100% right either. Doesn't seem anyone knows exactly how it works. I'm not even sure how anyone was able to analyze it since the registers are write-only, and there's no digital audio output.
I'm going off vague memories now since this was many years ago. But what I think I did was write a program that ran on the NGPC and played some white noise at a relatively low frequency. I then recorded the output from the headphone jack (so yeah, analogue audio) and wrote a program on my PC which tried all possible two-bit tap combinations and compared the result against the recording. In this experiment I did not consider the possibility of more than two bits being tapped.
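A rough Python sketch of that kind of brute-force search is below. It is not the original program: it assumes a right-shifting Fibonacci LFSR, a known start state, and a clean bit sequence, whereas a real analogue recording would first need to be turned into bits and aligned. The "recording" here is simply generated from taps d0 and d2 as a stand-in.

```python
from itertools import combinations

WIDTH = 15
SEED = 1 << (WIDTH - 1)   # assumed start state: only the top bit set


def lfsr_bits(taps, count, state=SEED):
    """Yield `count` output bits from a right-shifting Fibonacci LFSR."""
    for _ in range(count):
        yield state & 1
        feedback = 0
        for tap in taps:
            feedback ^= (state >> tap) & 1
        state = (state >> 1) | (feedback << (WIDTH - 1))


def matching_taps(reference):
    """Return every two-bit tap pair whose output reproduces `reference`."""
    return [
        pair
        for pair in combinations(range(WIDTH), 2)
        if list(lfsr_bits(pair, len(reference))) == reference
    ]


# Stand-in for a recording: bits generated from taps d0 and d2.
reference = list(lfsr_bits((0, 2), 256))
print(matching_taps(reference))
```

With 15 bits there are only 105 two-tap pairs to try, so the search itself is trivial; the hard part is getting a trustworthy reference sequence out of the headphone jack.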
> WHY did they use this thing
While it may have been obsolete even in 1998, it's actually a pretty nice CPU to write code for.
StinkerB06
Posts: 1
Joined: Wed Jun 02, 2021 6:01 pm

Re: Neo Geo Pocket emulation

Post by StinkerB06 »

mic_ wrote: Thu May 16, 2019 7:19 am
Supposedly it's a 15-bit LFSR with taps on d0 and d2, although that's apparently not 100% right either. Doesn't seem anyone knows exactly how it works. I'm not even sure how anyone was able to analyze it since the registers are write-only, and there's no digital audio output.
I'm going off vague memories now since this was many years ago. But what I think I did was write a program that ran on the NGPC and played some white noise at a relatively low frequency. I then recorded the output from the headphone jack (so yeah, analogue audio) and wrote a program on my PC which tried all possible two-bit tap combinations and compared the result against the recording. In this experiment I did not consider the possibility of more than two bits being tapped.
  1. Did you try writing a 15-bit Galois LFSR implementation on your PC to compare with the output of the hardware? I think the NGP/C PSG could be using something other than a Fibonacci LFSR for its noise channel. Or maybe it's similar to the Virtual Boy's noise implementation, which is basically the output of the XOR gate of the Fibonacci LFSR.
  2. What exactly is the width of the two PCM DACs? The NGPC Wikipedia page says they're 6-bit, but last time I checked the sources of MAME and higan, I thought they were 8-bit.
  3. Can the DACs be output at the same time as the PSG? Or can only one of the two sources be output at a time?
Also, I wonder how different the Z80 core in the NGP/C (which is intended to drive the PSG independently of the program activity on the main CPU) is from an actual Z80. Are illegal opcodes supported? Also, I think the 256-byte I/O area is completely unused on the NGP/C (unless I misread the source code), so I'm wondering what happens if you attempt to use the Z80's I/O instructions. Are the IM instructions ignored? I think only mode 1 makes sense, but once again, I'm not entirely sure about that.

I've been kind of obsessed with the NGP/C CPU architecture (it's pretty feature-rich compared to the Z80) and sound hardware (it's basically a better SN76489), so I decided to sign up on this forum to ask these questions... about a system that's not an NES.