lidnariq wrote: ↑Thu Sep 03, 2020 5:03 pm
Since then we've had people using HDMA to stream stereo BRR at 32kHz, so there's not too much impact on the main CPU. (I think. I can't find a link as easily)
I'm unaware of anybody actually doing it, but the method I proposed is here:
viewtopic.php?f=12&t=14634&p=178343#p178343
Due to real life being rather busy, my hobbies have been on a very slow burn, so I'm just now getting around to trying to write an audio engine. This will be part of it, and hopefully I'll be able to test it in the relatively near future.
_aitchFactor wrote: ↑Thu Sep 03, 2020 4:45 pm
High-bandwidth HDMA streaming is a new concept to me. How fast is it and are there any particular caveats to it?
There's a game called N-Warp Daisakusen that you may have heard of, by d4s. I've looked through the source code, and it looks like it uses HDMA to stream audio, writing to all four I/O ports once per scanline in data bursts lasting multiple scanlines, and using a cycle-counted loop on the SPC700 to get the data from the ports while remaining in step with the DMA unit.
This method can send four bytes per scanline while the transfer is active. However, perhaps due to synchronization issues with the unreliable ceramic oscillator used by the APU, d4s seems to have prioritized speed of data reading above all else, so as to leave as much timing margin as possible in the pickup loop. This results in a method where the SPC700 simply dumps the data on the stack during the transfer. After the transfer, it then has to pick up the data from the stack and put it where it's supposed to go. This limits the average data rate, because you can't be sending more data while the previous data burst is being copied to its destination. Also, the fixed-length pickup loop raises compatibility questions when using long data bursts, because there's no way to re-synchronize, or even detect a problem, during a burst.
When I say "high-bandwidth" HDMA streaming, I'm referring to a method that writes the data from the I/O ports directly to the desired address, eliminating the secondary data move loop. Such a method should additionally be capable of supporting long data bursts of at least 30-40 scanlines, with multiple bursts per frame. The method I've proposed uses self-modifying code on the SPC700 side to write the base destination addresses into the mov instructions in the pickup loop before a burst. Since my method is more timing-sensitive, partly due to the longer pickup process, it uses a hot-swappable delay section to pad the pickup loop to a specific length based on the observed clock ratio between the two processors (an HDMA timing pulse each frame should do nicely if you set up the SPC700's timers right, although you might want to put the frame length through a lowpass filter before using it).
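To illustrate the lowpass step only, here is a minimal sketch in Python (for readability, not SPC700 assembly). It is a single-pole filter built from a shift and a subtract, which is the sort of thing an 8-bit CPU can do cheaply. The function name, the shift amount, and the sample tick counts are all assumptions for illustration, not part of the proposed engine.

```python
def lowpass_frame_length(samples, shift=3):
    """Smooth per-frame tick counts with a shift-based exponential average.

    `samples` are measured frame lengths in timer ticks (hypothetical units).
    The accumulator holds the filtered value scaled by 2**shift.
    """
    if not samples:
        return []
    acc = samples[0] << shift          # seed with the first measurement
    filtered = []
    for s in samples:
        acc += s - (acc >> shift)      # move 1/2**shift of the way toward s
        filtered.append(acc >> shift)
    return filtered


# Illustrative numbers only: one noisy frame barely moves the estimate.
smoothed = lowpass_frame_length([1065, 1065, 1080, 1065])
```

A larger shift rejects more jitter but tracks thermal drift more slowly, so the right value would have to be found by experiment.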
This sort of method should be pretty light on the S-CPU. If I understand correctly, you can use indirect-addressed HDMA to send whole blocks straight from ROM with a couple of writes to the HDMA table.
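As a rough sketch of what such a table might look like, the Python below packs indirect HDMA entries as I understand the format: a line-count byte (bit 7 set means "transfer on every line of this entry") followed by a 16-bit pointer, with a zero byte ending the table. The helper names, the addresses, and the idea of using a non-repeat entry for the space between bursts are assumptions for illustration; the bank byte lives in a separate register and is not shown.

```python
import struct

REPEAT = 0x80  # bit 7 of the line-count byte


def burst_entry(lines, addr):
    """Entry that transfers one unit on each of `lines` scanlines."""
    if not 1 <= lines <= 127:
        raise ValueError("repeat-mode entries cover 1..127 lines")
    return struct.pack("<BH", REPEAT | lines, addr & 0xFFFF)


def gap_entry(lines, addr):
    """Entry that transfers one unit on its first line, then idles."""
    if not 1 <= lines <= 127:
        raise ValueError("this sketch limits gap entries to 1..127 lines")
    return struct.pack("<BH", lines, addr & 0xFFFF)


def build_table(entries):
    """Concatenate entries and append the terminating zero byte."""
    return b"".join(entries) + b"\x00"


# Hypothetical layout: a 36-line burst from $8000, then an 8-line gap whose
# single transfer comes from a (hypothetical) command block at $9000.
table = build_table([burst_entry(36, 0x8000), gap_entry(8, 0x9000)])
```

With a layout like this, pointing a burst at a different block of sample data only means rewriting the two pointer bytes of its entry.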
...
How fast is it? That depends on how hard you want to hit the SPC700. It should be possible to fit several 27- or 36-line data bursts into one 224-line frame, with several scanlines in between bursts for the music engine to do tasks unrelated to the streaming. This could get you 600-620 bytes per frame easily enough, which on NTSC is enough for two 32 kHz BRR streams (Mozart in stereo) or three 22 kHz streams (Ken+Ryu+announcer).
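The per-frame figures follow from the BRR format, which packs 16 samples into 9 bytes. A quick Python check, assuming an NTSC frame rate of about 60.0988 Hz and taking "22 kHz" to mean 22050 Hz (both assumptions on my part):

```python
BRR_BLOCK_BYTES = 9
BRR_BLOCK_SAMPLES = 16
NTSC_FRAME_HZ = 60.0988  # approximate, assumed


def brr_bytes_per_frame(sample_rate_hz, streams=1, frame_rate_hz=NTSC_FRAME_HZ):
    """Average bytes per video frame needed to feed `streams` BRR streams."""
    bytes_per_second = sample_rate_hz * BRR_BLOCK_BYTES / BRR_BLOCK_SAMPLES
    return streams * bytes_per_second / frame_rate_hz


stereo_32k = brr_bytes_per_frame(32000, streams=2)   # just under 600
triple_22k = brr_bytes_per_frame(22050, streams=3)   # just under 620
```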
The theoretical limit should be somewhere north of 900 bytes per frame in overscan mode, or somewhat less than that in 224-line mode. However, using the whole frame restricts the music engine to VBlank and therefore 60 Hz (or 50 Hz), which could cause timing issues in certain scenarios.
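For a feel for the numbers, the sketch below computes the raw ceiling at four bytes per scanline and the payload for a frame split into equal bursts separated by gaps. It ignores command and sync overhead, and the 8-line gap is purely an assumed value, so treat the results as upper bounds rather than achievable rates.

```python
BYTES_PER_LINE = 4  # one byte per I/O port per scanline


def frame_ceiling(active_lines):
    """Raw bytes per frame if every active scanline carried data."""
    return BYTES_PER_LINE * active_lines


def burst_plan(active_lines, burst_lines, gap_lines):
    """Return (burst count, payload bytes) for equal bursts with gaps between."""
    bursts = (active_lines + gap_lines) // (burst_lines + gap_lines)
    return bursts, bursts * burst_lines * BYTES_PER_LINE


ceiling_224 = frame_ceiling(224)        # 896
ceiling_239 = frame_ceiling(239)        # 956
plan_36 = burst_plan(224, 36, 8)        # (5, 720)
plan_27 = burst_plan(224, 27, 8)        # (6, 648)
```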
In the case of something like Doom, the active frame is shorter for PPU DMA bandwidth reasons. Since you can't casually mix DMA and HDMA without breaking compatibility with early-model consoles, you can't use any of the extended VBlank for HDMA audio streaming. And because the overhead (and gap size) is largely fixed, this means the data rate shrinks somewhat faster than the screen size. I figure the equivalent of four or five 11 kHz streams should still fit, given the display size in the existing port. (I believe Doom on PC uses 11 kHz for all its sound effects, with the exception of the Super Shotgun which doesn't show up until Doom II.)
...
Caveats? Well, first there's the fact that AFAIK no one has tried this, so it might still run into a showstopper. No one has brought one up since I posted my idea, and I haven't thought of any, but you never really know until you try it for real...
There's also the fact that while it doesn't load down the 5A22 much, it ties up the SPC700 perhaps even more than a conventional high-bandwidth occasional-sync CPU-to-CPU transfer would, because it has to waste about a quarter of each scanline waiting for the next data shot. Streaming three 22 kHz BRR sound effects at the same time is about 620 bytes per frame on NTSC, or 740 on PAL, which is roughly 60% of the SPC700's compute time, and there's overhead on top of that. Sadly, there is no way to automate data pickup on the SPC700 side.
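The 60% figure can be reproduced with a back-of-the-envelope calculation, assuming the pickup loop occupies the SPC700 for the whole of every scanline that carries data, and taking 262 total lines per frame on NTSC and 312 on PAL:

```python
def spc700_busy_fraction(bytes_per_frame, total_lines, bytes_per_line=4):
    """Fraction of the frame the SPC700 spends inside the pickup loop."""
    busy_lines = bytes_per_frame / bytes_per_line
    return busy_lines / total_lines


ntsc_load = spc700_busy_fraction(620, 262)  # about 0.59
pal_load = spc700_busy_fraction(740, 312)   # about 0.59
```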
This sort of method is limited to no more than 256 bytes in a single burst (and I think the method I've posted is only good for 252) because the index registers on the SPC700 are 8-bit. This is not a huge issue, because it's hard to reliably stay in sync much longer than that anyway.
It might be good for transfers to be a multiple of 9 bytes, at least when transferring sample data, because that's the size of a BRR block. If a single data burst isn't a multiple of 9 bytes, you end up with annoying constraints on ring buffer size in ARAM. Even as it stands, you have to make sure the buffer fits a whole number of transfers in it - add the requirement that it be a multiple of 9 bytes when the transfers aren't, and your options for small buffer sizes get pretty restricted.
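To make the buffer constraint concrete: a buffer that holds a whole number of bursts and a whole number of BRR blocks has to be a multiple of the least common multiple of the two sizes. A small Python sketch (the helper names are mine):

```python
from math import gcd

BRR_BLOCK_BYTES = 9


def min_buffer_bytes(burst_bytes):
    """Smallest size that is a multiple of both the burst and a BRR block."""
    return burst_bytes * BRR_BLOCK_BYTES // gcd(burst_bytes, BRR_BLOCK_BYTES)


def valid_buffer_sizes(burst_bytes, limit):
    """All usable ring buffer sizes up to `limit` bytes."""
    step = min_buffer_bytes(burst_bytes)
    return list(range(step, limit + 1, step))


aligned = valid_buffer_sizes(108, 500)     # 27-line burst: [108, 216, 324, 432]
misaligned = valid_buffer_sizes(100, 500)  # 25-line burst: [] (smallest is 900)
```

Bursts of 27 and 36 scanlines carry 108 and 144 bytes, both multiples of 9, so any whole number of bursts works. A 25-line burst (100 bytes) would force the buffer to a multiple of 900 bytes.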
Also, you have to be careful with how you tell the SPC700 other things (music control, sound effects, etc.). Obviously the S-CPU can't be writing to the I/O ports while HDMA is active, because it will corrupt the transfer. So you have to either include any additional instructions in the HDMA table itself or do your general audio control communications while HDMA is not running (which could mean during VBlank, and generally you want VBlank for video DMA). Receiving data from the SPC700 is easy if it's 4 bytes or less and you know when it gets written, because AFAIK reading the I/O ports doesn't disrupt anything, but for extended or unscheduled communications the same considerations apply: either use the audio HDMA channel to request data from the SPC700 and use a second HDMA channel to receive it, or do the I/O manually outside the range of the HDMA.
As mentioned above, if you trim the screen to get more bandwidth to VRAM (as Doom does), you leave yourself less room for HDMA. This hits harder if you're trying to keep some space between bursts to allow a high-precision music engine.
If you're trying to leave space between bursts, you ABSOLUTELY NEED to ensure that the music engine loop cannot take longer than the space you've allotted. HDMA won't wait for confirmation from the SPC700. If you miss the "prepare to receive data" command, you will miss the data burst, and you will probably end up misinterpreting data as instructions and doing something stupid. It might be feasible to design a leaner "quick loop" for this purpose that only handles stuff that can't wait until VBlank.
Finally, there's the sync issue. Not only do you have to make sure your pickup loop isn't going to desync during a burst, you have to make sure that if you're playing a streamed sound, you don't get buffer overrun or underrun. There are a couple of possible ways to handle this: 1) APU-side sync, where the pitch is adjusted to match the incoming data rate, or 2) CPU-side sync, where the amount of data per frame can be adjusted to stay in step with the playback. If you're just doing a 32 kHz stereo demo, you might be able to get away with using all of ARAM as a giant ring buffer, meaning sync probably won't become an issue for at least a couple of minutes regardless, but if you're using small streaming buffers to save ARAM in a game scenario you'll want to pay attention to this.
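As a rough way to size the problem, the sketch below estimates how long an uncorrected stream lasts before the buffer overruns or underruns. The mismatch figure is a made-up illustration, not a measured value for any console, and the slack assumes the buffer starts half full.

```python
def seconds_until_buffer_fault(slack_bytes, stream_bytes_per_second, mismatch_ppm):
    """Time until an uncorrected rate mismatch uses up `slack_bytes`.

    mismatch_ppm is the relative difference between the delivery rate and
    the playback rate, in parts per million (assumed, not measured).
    """
    drift = stream_bytes_per_second * abs(mismatch_ppm) / 1_000_000
    if drift == 0:
        return float("inf")
    return slack_bytes / drift


# Hypothetical: 32 kHz stereo BRR (36000 bytes/s), 30000 bytes of slack,
# and a 1000 ppm (0.1%) mismatch between the two clocks.
big_buffer = seconds_until_buffer_fault(30000, 36000, 1000)
# Same stream and mismatch with only 216 bytes of slack.
small_buffer = seconds_until_buffer_fault(216, 36000, 1000)
```

Under those assumed numbers the large buffer lasts several minutes while the small one faults within seconds, which is why small streaming buffers need one of the two sync schemes above.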
Also, sync pulse handling is very important if you want any of the above to work properly. The SPC700 has to have a reasonably precise idea of what the clock ratio is before you start trying to stream something, so the audio HDMA channel may have to run all the time regardless of what the game is doing, just so the sync pulse will fire on the same scanline every frame like clockwork. Just measuring the clock ratio at boot is dicey because of thermal drift, so it's probably wise to keep it a live measurement.
This is going to be fun...