I started another thread to discuss the decompression. It's my opinion at this point that our evolution tables are okay.
I think the problem is that the chip is 'glitching' out, just like the S-DD1 did. Feed it a certain input pattern and it can't handle it and goes screwy for the rest of that decompression.
Usually it would end up repeating the same byte forever (even after 8MB of reads) until the next decompression.
My bet is that Epson provided a compressor that could detect when the input would cause a problem and force a 'renormalize' to avoid it.
For whatever reason, $4807 -appears- to greatly exacerbate this behavior, as my output logs show errors much faster this way.
I believe this error is what throws off my algorithms below, because their first mismatch appears to be the first mismatch read from an offset that would fail on $480b=0.
So ... we really need to emulate this 'edge case' that breaks the SPC7110, but I don't have a clue how to do that. I put up a $100 bounty if anyone can figure it out.
That said, here's the final word on $4807. Consider it a "length" parameter, and $480b.d0 is the enable setting.
length = $480b.d0 & 1 ? $4807 : 1; //$4807 can be zero
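As a minimal sketch, the length rule above can be written as two small functions. The names `dcu_length` and `dcu_tiles_to_load` and their parameters are hypothetical, made up for illustration; only the expression itself comes from the line above, plus the max(1, length) rule for how many tiles get loaded.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (name and signature are illustrative only).
// r480b and r4807 are the raw values last written to $480b and $4807.
unsigned dcu_length(std::uint8_t r480b, std::uint8_t r4807) {
  // $480b.d0 enables $4807 as the length; otherwise the length is fixed at 1.
  unsigned length = (r480b & 1) ? r4807 : 1;
  assert(length <= 255);  // $4807 is an 8-bit register
  return length;          // may be zero when $4807 == 0
}

// Hypothetical helper: number of input tiles to decompress per output tile.
// A length of zero still loads one tile, i.e. max(1, length).
unsigned dcu_tiles_to_load(std::uint8_t r480b, std::uint8_t r4807) {
  unsigned length = dcu_length(r480b, r4807);
  return length > 0 ? length : 1;
}
```

Note that the deinterleave step still wants the raw length (possibly zero), not the clamped tile count.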
The decompressor has to be modified to handle this mode.
First, consider that the modes work like this:
Mode 0 = 1bpp 8x8 (8 bytes)
Mode 1 = 2bpp 8x8 (16 bytes)
Mode 2 = 4bpp 8x8 (32 bytes)
Mode 3 = invalid (doesn't start the decompression so you always get 0x00)
Modes 4-255 = mirrors of 0-3 (decompressor only looks at low two bits.)
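The mode table above boils down to a four-entry lookup on the low two bits. A rough sketch, where `dcu_tile_size` is a hypothetical name and returning 0 for mode 3 is just this sketch's convention for "decompression never starts":

```cpp
#include <cassert>

// Hypothetical helper: output tile size in bytes for a given mode value.
// Only the low two bits are examined, so modes 4-255 mirror modes 0-3.
// Returns 0 for the invalid mode 3 (a convention of this sketch only).
unsigned dcu_tile_size(unsigned mode) {
  static const unsigned sizes[4] = {
    8,   // mode 0: 1bpp 8x8
    16,  // mode 1: 2bpp 8x8
    32,  // mode 2: 4bpp 8x8
    0,   // mode 3: invalid
  };
  return sizes[mode & 3];
}
```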
Now every time the buffer is empty (which it obviously will be at the start of a new decompression), what you want to do is load one tile worth of -output- data from the appropriate mode. To do this, you need to generate a number of input tiles first.
So next, load max(1, length) tiles into dcu_tiledata[256 * 32];
Yes, you load one tile even if length is zero. What ends up happening in that case is that the first byte of decompressed output repeats forever, and that byte may not be 0x00; it's up to the data.
Now that you have your tiles, call the deinterleave function below for the current mode to generate one output tile from your input tile pool into dcu_output[32];
From here, $4800 can be read until the buffer is empty (its size is based on the mode: 8, 16 or 32 bytes of data will be in here.)
The nice thing about doing it this way is you get rid of bitplanebuffer entirely. You can write to direct offsets during decompression now, so it's unnecessary.
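The refill-on-empty flow described above can be sketched roughly as follows. This is restricted to mode 0 (1bpp) to keep it short, and everything here is an assumption made for illustration: the `DcuSketch` struct, its member names, and especially `next_byte`, which is a hypothetical stand-in for the real decompressor core that produces one decompressed byte per call.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Illustrative only: a stripped-down model of the buffer handling for mode 0.
struct DcuSketch {
  std::uint8_t tiledata[256 * 32] = {};  // input tile pool
  std::uint8_t output[32] = {};          // one deinterleaved output tile
  unsigned length = 1;                   // effective length from $480b.d0 / $4807
  unsigned size = 0;                     // number of valid bytes in output
  unsigned pos = 0;                      // next output byte to hand back
  std::function<std::uint8_t()> next_byte;  // hypothetical decompressor core

  void refill() {
    const unsigned tile_bytes = 8;              // mode 0: 1bpp 8x8
    unsigned tiles = length > 0 ? length : 1;   // max(1, length)
    assert(tiles * tile_bytes <= sizeof(tiledata));
    for(unsigned n = 0; n < tiles * tile_bytes; n++) tiledata[n] = next_byte();
    // Same access pattern as the 1bpp deinterleave: one byte per row,
    // stepping through the pool by the raw length.
    for(unsigned row = 0, sp = 0; row < 8; row++, sp += length) {
      output[row] = tiledata[sp];
    }
    size = tile_bytes;
    pos = 0;
  }

  // Models a read of $4800: refill only when the output buffer is empty.
  std::uint8_t read4800() {
    if(pos == size) refill();
    return output[pos++];
  }
};
```

With length = 2 and a decompressor that simply counts up from zero, the first eight reads come back as every second byte of the first sixteen decompressed bytes, and the ninth read triggers the next refill.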
After doing all of this, the behavior of $4807 finally becomes rather simple:
void SPC7110::deinterleave_1bpp(unsigned length) {
  uint8 *target = dcu_output, *source = dcu_tiledata;
  for(unsigned row = 0, sp = 0; row < 8; row++) {
    target[row] = source[sp];
    sp += length;
  }
}

void SPC7110::deinterleave_2bpp(unsigned length) {
  uint8 *target = dcu_output, *source = dcu_tiledata;
  for(unsigned row = 0, sp = 0; row < 8; row++) {
    target[row * 2 + 0] = source[sp + 0];
    target[row * 2 + 1] = source[sp + 1];
    sp += 2 * length;
  }
}

void SPC7110::deinterleave_4bpp(unsigned length) {
  uint8 *target = dcu_output, *source = dcu_tiledata;
  for(unsigned row = 0, sp = 0; row < 8; row++) {
    target[row * 2 + 0] = source[sp + 0];
    target[row * 2 + 1] = source[sp + 1];
    target[row * 2 + 16] = source[sp + 16];
    target[row * 2 + 17] = source[sp + 17];
    //the purpose of this is that every time the address crosses over a 16-byte boundary, we add 16 more (so it crosses a 32-byte boundary always.)
    //this allows us to never copy the same source word to more than one target word
    sp = sp + 2 * length + 16 * ((sp + 2 * length) / 16 - sp / 16);
  }
}
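To see what the 4bpp address update actually does, here is a small standalone sketch that records the value of sp at the start of each row. `offsets_4bpp` is a hypothetical helper written for this illustration; the update expression is copied from deinterleave_4bpp unchanged.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: returns sp at the start of each of the 8 rows,
// using the same update expression as deinterleave_4bpp.
std::vector<unsigned> offsets_4bpp(unsigned length) {
  assert(length <= 255);  // $4807 is an 8-bit register
  std::vector<unsigned> result;
  for(unsigned row = 0, sp = 0; row < 8; row++) {
    result.push_back(sp);
    // Advance by 2 * length, then add 16 for every 16-byte boundary crossed,
    // so sp always stays in the first 16 bytes of a 32-byte tile.
    sp = sp + 2 * length + 16 * ((sp + 2 * length) / 16 - sp / 16);
  }
  return result;
}
```

For length = 1 this gives 0, 2, 4, ..., 14, which is a plain copy of one tile. For length = 2 it gives 0, 4, 8, 12, 32, 36, 40, 44: every second row of tile 0 followed by every second row of tile 1. Since each row also reads sp + 16 and sp + 17, the second half of each 32-byte tile is picked up by the same pass.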
Now here's the thing. Deinterleave is the best name I can come up with for what this is doing.
The actual effect of using it on data appears to be to -interleave- it, though. My only theory is that this is meant for compressing data that was interleaved in its raw form.
My guess is that these functions somehow shrink the compressed data in certain cases; but I honestly can't see how.
So by all means ... if anyone can make sense of how this would -ever- be useful for -anything-, please let me know, so that we can name the function better.
...
Anyway, I still need to write emulation of the invalid BCD increment behavior on the RTC, and test some data port behaviors (want to verify the control settings register.)
Once that's done, I'll try and get a final doc up on how this all works. It'll be a monster, though. This chip is the epitome of edge case behavior.
> You've seen my pseudo-code decompression functions, haven't you?
They're a bit tricky to follow. I've actually already done a lot of simplifications on the original algorithm (which was designed to show you how the chip works), but it's tricky code.
> Timing might also be a problem, at least when using DMAs, which expect the decompressed bytes to be ready within 8 clks. But as far as I understood, you are already using non-DMA reads for that purpose? Does that make a visible difference? More distorted results with DMA, and less distorted without DMA?
Yes, I was using direct reads. When it comes to $4807=#$ff with mode 2, that would seem to require the SPC7110 to really go into hyperdrive. That's basically 255 bytes that need to be decompressed for every read to $4800.
I am betting heavily that a DMA will result in seeing the same value repeated when you read $4800 too quickly, but I haven't gotten to that yet. Kind of afraid of that, to be honest. We likely need the decompression context to be threaded, and consume cycles for each byte output.
> Should I conclude the Super-FX is well known ? Because this chip always fascinated me and I always wondered how it works internally.
Well, we are using the official mnemonics for the SuperFX opcodes, so take from that what you will. The SuperFX has less added functionality than the SA-1. With the latter, it seems like Nintendo was holding a contest for who could come up with the most functions to add to the chip. It's so bad that ~30-40% of the chip's functionality is never even used by ANY SA-1 game. Plus, the SFX doesn't allow both the CPU and the chip itself to access the same memory at the same time. The SA-1 does, by stalling the opcode cycles on the SA-1 side when there would otherwise be a conflict. I'm sure both SFX and SA-1 are missing a lot of edge case behavior, though.
Once I get the last bits of the RTC BCD and data port behavior in, I'll be happy enough to consider the SPC7110 better emulated than the SFX and SA-1.