Okay, let's try this again, a little less cryptic this time (and better code too, IMO):
I'm trying to get enough bandwidth for 32 kHz stereo streaming or better while preserving sub-frame audio processing resolution and not loading the S-CPU down very much. This is apropos of the Street Fighter Alpha 2 discussion earlier in the thread (three 22 kHz streams would be great), but I have applications of my own in mind for it.
So I've tried to design an HDMA-based approach that uses single commands (well, pairs of them) to set up block transfers with no fine-grained handshaking, so that the S-CPU can feed the APU data in chunks during active display while the APU does other processing in between the chunks.
In my concept, the HDMA pattern would consist of a series of data blocks, each one consisting of (a) a "data incoming" command, (b) a gap long enough for the audio engine to notice said command and begin polling I/O, (c) a "data start" command, along with the data length in scanlines and a data ID for multi-stream support, and (d) four bytes of data per line, for the number of lines given in (c). Possibly also (e) a "no action" command, so the APU doesn't misinterpret part of the last data shot as some other command...
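The block layout above can be sketched as a per-scanline schedule of port writes. This is only a rough Python model, not S-CPU code: the command values (`CMD_DATA_INCOMING`, `CMD_DATA_START`, `CMD_NO_ACTION`) and the helper name `build_block` are made up for illustration, since no actual values have been assigned yet. The only things taken from the design are the order of the steps, four bytes per line, and the data ID and line count riding along with the "data start" command in ports 1 and 2.

```python
# Hypothetical command codes; the design does not fix actual values yet.
CMD_DATA_INCOMING = 0xF0
CMD_DATA_START = 0xF1
CMD_NO_ACTION = 0x00

BYTES_PER_LINE = 4  # one HDMA write to $2140-$2143 per scanline


def build_block(data, stream_id, gap_lines):
    """Return one data block as a list with one entry per scanline.

    Each entry is either 4 bytes to write to the I/O ports on that line,
    or None for "no write" (the ports simply keep their last value).
    """
    if len(data) == 0 or len(data) % BYTES_PER_LINE != 0:
        raise ValueError("data must be a non-empty multiple of 4 bytes")
    lines = len(data) // BYTES_PER_LINE
    if lines > 255:
        raise ValueError("line count must fit in one byte")

    schedule = []
    # (a) "data incoming" command
    schedule.append(bytes([CMD_DATA_INCOMING, 0, 0, 0]))
    # (b) gap so the audio engine can notice the command and start polling
    schedule.extend([None] * gap_lines)
    # (c) "data start" command, data ID in port 1, length in lines in port 2
    schedule.append(bytes([CMD_DATA_START, stream_id, lines, 0]))
    # (d) four bytes of data per line
    for i in range(0, len(data), BYTES_PER_LINE):
        schedule.append(bytes(data[i:i + BYTES_PER_LINE]))
    # (e) trailing "no action" command
    schedule.append(bytes([CMD_NO_ACTION, 0, 0, 0]))
    return schedule


block = build_block(bytes(range(128)), stream_id=2, gap_lines=9)
```

With a 128-byte chunk and a 9-line gap, the block occupies 1 + 9 + 1 + 32 + 1 = 44 scanlines, which makes the per-chunk overhead easy to see.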
...
My first attempt, a bit upthread, used a lot of zero-page memory and was limited to three sample buffers at a time, because it relied on 16-bit code modification at runtime. Last night I modified the code to make better use of the index registers, freeing up a significant chunk of direct page and allowing up to 6 simultaneous buffers, although each buffer now has to start on a page boundary. I also moved the buffer metadata to page one, freeing up the rest of zero page (not sure how much that matters).
Unfortunately the new code imposes an additional restriction on the streaming format. The use of X as the low byte of the streaming buffer pointer means the data chunk size has to divide evenly into 256 bytes; otherwise I have to deal with overflow in the pickup loop, and there's no time for that. The buffer size should probably also be a multiple of 9 bytes (one BRR block), unless the streaming data is already formatted with the buffer size in mind, studded with end-of-sample bits and padded with zeroes... actually that sounds like a good idea regardless...
The key question, which at the moment is totally outside my expertise, is how long a high-granularity but full-featured audio engine can be expected to take at most between I/O port checks. If my math is correct, an engine that turns around in 9 scanlines or so should allow up to 640 bytes per frame in 32-line (128-byte) chunks, with five engine slots per frame plus whatever fits in VBlank (roughly 4 at max length) for a total of about 37% of the total compute time (i.e., streaming eats 63%). An engine that turns around in 3 scanlines (which may be unrealistic) could allow the same bandwidth in 16-line chunks, with ten engine slots per active display period (in this case streaming eats 70% of total compute time). Paired 16-line chunks (two chunks back-to-back with no processing in between) could do it given 6 scanlines of turnaround time, leaving room for five of those 6-line engine slots in active display. I haven't yet cycle-counted past the end of the streaming routine (partly because I haven't written anything more than this yet), so these numbers are approximate.
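As a sanity check on the bandwidth side of those numbers, here is a small Python sketch that converts a per-frame byte budget into a BRR sample rate. It assumes a rounded 60 frames per second (NTSC) and the standard BRR packing of 16 samples per 9-byte block; the function names are just for illustration, and nothing here accounts for command or turnaround overhead.

```python
BRR_BLOCK_BYTES = 9      # one BRR block...
BRR_BLOCK_SAMPLES = 16   # ...decodes to 16 samples
FRAME_RATE_HZ = 60       # assumption: NTSC, rounded


def bytes_per_frame(chunk_lines, chunks_per_frame, bytes_per_line=4):
    """Payload bytes delivered per frame (data lines only, no overhead)."""
    return chunk_lines * bytes_per_line * chunks_per_frame


def brr_sample_rate(frame_bytes, channels=1, frame_rate=FRAME_RATE_HZ):
    """Sustainable sample rate per channel for a given per-frame budget."""
    bytes_per_second = frame_bytes * frame_rate
    samples_per_second = bytes_per_second * BRR_BLOCK_SAMPLES / BRR_BLOCK_BYTES
    return samples_per_second / channels


budget = bytes_per_frame(chunk_lines=32, chunks_per_frame=5)
stereo_rate = brr_sample_rate(budget, channels=2)
three_stream_rate = brr_sample_rate(budget, channels=3)
```

Five 32-line chunks (or ten 16-line chunks) give 640 bytes per frame, which works out to about 34.1 kHz per channel in stereo, or about 22.8 kHz each for three mono streams, so both targets fit with a little room to spare under these assumptions.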
Any thoughts? Keep in mind I haven't ever coded for the APU before, and this mess hasn't even been assembled, never mind run...
Code:
; BUFFER METADATA STRUCTURE (WIP):
; byte 0-1: current buffer write position
; byte 2: buffer start page
; byte 3-4: buffer end address
; byte 5: data ID
; In other words, using six buffers burns 36 bytes of direct page. If using
; zero page for this is acceptable, the SETP/CLRP instructions can be removed
; and the timing headroom goes from ~8 cycles to ~12.
; HDMA STREAMING CODE:
data_incoming_HDMA:
mov A, #data_start_HDMA ; 2 cycles - load data start flag value
- cbne $F4, - ; 7 cycles - listen for the write
; This point is reached roughly 3-9 cycles after $2140 is written, assuming
; CBNE loads the comparison value before the branch target.
mov A, $F5 ; 3 cycles - load data ID number
mov X, $F6 ; 3 cycles - load chunk size in scanlines
; TOTALS: 6 cycles since start code noticed in $F4, 9-15 cycles since $2140 written
; Find the buffer to which the data ID was assigned when the APU sent the data
; request (or processed a streaming SFX request from the S-CPU):
setp ; 2 cycles - switch to page one (optional)
cbne buf6_id, buf5check ; 7/5 cycles - check buffer 6, proceed to next if no match
mov Y, #$04 ; waste 0/2 cycles
- dbnz Y, - ; waste 0/22 cycles
cmp A, (X) ; waste 0/3 cycles
mov Y, #buf6 ; 0/2 cycles - load direct-page address for buffer 6 data
jmp buf_found ; 0/3 cycles - skip ahead
buf5check:
cbne buf5_id, buf4check ; 7/5 cycles - check buffer 5, proceed to next if no match
mov Y, #$03 ; waste 0/2 cycles
- dbnz Y, - ; waste 0/16 cycles
cmp A, (X) ; waste 0/3 cycles
mov Y, #buf5 ; 0/2 cycles - load direct-page address for buffer 5 data
jmp buf_found ; 0/3 cycles - skip ahead
buf4check:
cbne buf4_id, buf3check ; 7/5 cycles
cmp (X), (Y) ; waste 0/5 cycles
cmp (X), (Y) ; waste 0/5 cycles
nop ; waste 0/2 cycles
nop ; waste 0/2 cycles
mov Y, #buf4 ; 0/2 cycles
jmp buf_found ; 0/3 cycles
buf3check:
cbne buf3_id, buf2check ; 7/5 cycles
cmp (X), (Y) ; waste 0/5 cycles
nop ; waste 0/2 cycles
mov Y, #buf3 ; 0/2 cycles
jmp buf_found ; 0/3 cycles
buf2check:
cbne buf2_id, buf1 ; 7/5 cycles
mov Y, #buf2 ; 0/2 cycles
jmp buf_found ; 0/3 cycles
buf1:
mov Y, #buf1 ; 2/0 cycles
buf_found:
; TOTALS: 39-40 since buffer check started, 48-55 since $2140 written
; If no assigned buffer was found, data will be sent to buffer #1. Now the
; data pickup loop must be rewritten to target the selected buffer:
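mov A, $01+Y ; 4 cycles - get high byte of buffer pointer (caveat: the SPC700 has no
; dp+Y mode for MOV A, so this likely assembles as !abs+Y: 5 cycles, and unaffected by SETP)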
mov !(get_data_HDMA+4), A ; 5 cycles - write buffer page address
mov !(get_data_HDMA+9), A ; 5 cycles
mov !(get_data_HDMA+14), A ; 5 cycles
mov !(get_data_HDMA+19), A ; 5 cycles
; TOTALS: 24 cycles since buf_found, 72-79 since start flag written to $2140
; The index registers will now be set up for the loop. The buffer metadata
; pointer will be saved for later, and X and Y will be loaded with the low
; byte of the buffer pointer and the chunk size in scanlines, respectively:
mov A, X ; 2 cycles - move chunk size from X to A
mov X, $00+Y ; 4 cycles - get low byte of buffer pointer
push Y ; 4 cycles - store buffer pointer address
mov Y, A ; 2 cycles - get chunk size in scanlines
clrp ; 2 cycles - switch back to page zero (if using page one)
; TOTALS: 14 cycles since loop rewritten, 86-93 since start flag written to $2140
; Ideally, one scanline should be almost exactly 65 cycles long. The port reads
; are between cycles 3 and 30 past this point, putting them between 24 cycles after
; the first HDMA write and about 8 cycles before the fourth one on the next line.
; That should be good for at least several scanlines regardless of clock drift, no?
; STREAMING LOOP:
get_data_HDMA:
mov A, $F4 ; 3 cycles - get byte 0 of the data shot
mov !$0000+X, A ; 6 cycles - write it to the current buffer position
mov A, $F5 ; 3 cycles - get byte 1
mov !$0001+X, A ; 6 cycles - write it to the current buffer position plus one
mov A, $F6 ; 3 cycles - get byte 2
mov !$0002+X, A ; 6 cycles
mov A, $F7 ; 3 cycles - get byte 3
mov !$0003+X, A ; 6 cycles
inc X ; 2 cycles - increment the current buffer position four times
inc X ; 2 cycles
inc X ; 2 cycles
inc X ; 2 cycles
cmp (X), (Y) ; waste 5 cycles
cmp (X), (Y) ; waste 5 cycles
cmp (X), (Y) ; waste 5 cycles
dbnz Y, get_data_HDMA ; 6/4 cycles - repeat for next scanline, or exit if done
; TOTAL: 65 cycles
; The final loop ends ~19-26 cycles after the first byte would be written on the line
; immediately following the last line of the data chunk.
; Now it remains only to store X back in the zero page data structure and check for
; page rollover and end-of-buffer, updating the high byte of the buffer pointer as
; appropriate:
end_data_HDMA:
setp ; 2 cycles - switch to page one (if using page one for buffer metadata)
mov A, X ; 2 cycles - load the new buffer address low byte from X
pop X ; 4 cycles - pick up the buffer pointer address
mov (X), A ; 4 cycles - store the new buffer pointer low byte
bne + ; 4/2 cycles - check if X had rolled over to zero (POP doesn't affect flags)
inc $01+X ; 5 cycles - increment high byte of buffer pointer
+ mov A, $01+X ; 4 cycles - pick up high byte
cbne $04+X, done_HDMA ; 8/6 cycles - check high byte against buffer end address
mov A, (X) ; 0/3 cycles - pick up low byte
cbne $03+X, done_HDMA ; 0/8/6 cycles - check low byte against buffer end address
mov A, $02+X ; 0/0/4 cycles - if end of buffer reached, load buffer start page
mov $01+X, A ; 0/0/5 cycles - store to buffer pointer high byte
done_HDMA:
clrp ; 2 cycles - switch back to page zero (if using page one)
; This code ends ~49-75 cycles after the first non-chunk HDMA slot. In other words, it
; brackets the second slot, unless it will be more than 16 cycles until the next read.
; Which is quite probable. And that means the next read will get whatever was written
; TWO scanlines after the last data shot. Or, simply put, there are two scanlines of
; overhead after the chunk ends.
I've taken a cursory look at Super SNESMod, and at the APU code from N-Warp Daisakusen. The latter is interesting because it's doing almost exactly what I'm trying to do, but it's handled differently and seems to have some disadvantages compared with my approach (though to be fair, it is a field-proven capability, while mine is very much not). It also uses 66 cycles instead of 65 for the loop, but that seems to be a PAL thing.
Now that I think about it, if I wanted my code to be able to handle 32-line chunks on PAL, I'd probably have to partially unroll the pickup loop to take 131 cycles per two scanlines. PAL is nominally 65.632 cycles per scanline, vs. 65.033 on NTSC, give or take quite a bit (nearly 0.2 as I understand it), so over a chunk that long the timing would be unreliable with any single-line loop... and if I need two instances of the pickup code, it will take 20 extra cycles to overwrite the high byte, so I'm back to 3 buffers...
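For reference, the per-scanline figures and the resulting drift can be reproduced with a few lines of Python. This is a sketch under assumed nominal values that are not stated above: a 1.024 MHz SPC700 clock, 1364 master clocks per scanline, and master clocks of 21.477272 MHz (NTSC) and 21.281370 MHz (PAL). Real consoles deviate from these, so treat the results as ballpark only.

```python
SPC700_HZ = 1_024_000           # assumed nominal SPC700 clock
MASTER_CLOCKS_PER_LINE = 1364   # assumed nominal scanline length
NTSC_MASTER_HZ = 21_477_272     # assumed nominal NTSC master clock
PAL_MASTER_HZ = 21_281_370      # assumed nominal PAL master clock


def apu_cycles_per_line(master_hz):
    """SPC700 cycles that elapse during one scanline."""
    return MASTER_CLOCKS_PER_LINE * SPC700_HZ / master_hz


def drift_after(lines, avg_loop_cycles, master_hz):
    """Cycles by which the pickup loop has run ahead of the HDMA writes
    after the given number of scanlines (positive = loop is early)."""
    return lines * (apu_cycles_per_line(master_hz) - avg_loop_cycles)


ntsc = apu_cycles_per_line(NTSC_MASTER_HZ)
pal = apu_cycles_per_line(PAL_MASTER_HZ)
ntsc_drift_65 = drift_after(32, 65, NTSC_MASTER_HZ)
pal_drift_65 = drift_after(32, 65, PAL_MASTER_HZ)
pal_drift_131 = drift_after(32, 131 / 2, PAL_MASTER_HZ)
```

Over a 32-line chunk, a fixed 65-cycle loop drifts by roughly 1 cycle on NTSC but roughly 20 cycles on PAL, while 131 cycles per two scanlines brings the PAL figure down to roughly 4 cycles.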
Wait... I have 15 cycles in that loop during which nothing whatsoever is happening:
Code:
; FOR PAL, REPLACE 15-CYCLE TIME DELAY IN DATA PICKUP LOOP WITH:
mov A, Y ; 2 cycles
and A, #$01 ; 2 cycles
beq + ; 4/2 cycles
cmp A, (X) ; 0/3 cycles
+ nop ; 2 cycles
cmp (X), (Y) ; 5 cycles
That's either 15 or 16 cycles depending on the low bit of the line counter. Problem solved.
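To see how well the alternating delay tracks PAL timing, here is a rough Python model of the patched loop. It assumes the rest of the loop body stays at 50 cycles (the 65-cycle loop minus the original 15-cycle delay), uses the nominal 65.632 cycles per PAL scanline quoted above, and ignores the shorter not-taken DBNZ on the final pass; the function names are only for illustration.

```python
FIXED_LOOP_CYCLES = 50          # 65-cycle loop minus the original 15-cycle delay
PAL_CYCLES_PER_LINE = 65.632    # nominal figure quoted above


def delay_cycles(y):
    """Length of the replacement delay for line counter value y."""
    if y & 1 == 0:
        # mov A,Y (2) + and (2) + beq taken (4) + nop (2) + cmp (X),(Y) (5)
        return 2 + 2 + 4 + 2 + 5
    # mov A,Y (2) + and (2) + beq not taken (2) + cmp A,(X) (3) + nop (2) + cmp (5)
    return 2 + 2 + 2 + 3 + 2 + 5


def worst_drift(chunk_lines, cycles_per_line=PAL_CYCLES_PER_LINE):
    """Largest cumulative timing error (in cycles) seen during one chunk."""
    drift = 0.0
    worst = 0.0
    for y in range(chunk_lines, 0, -1):   # Y counts down, as DBNZ does
        drift += cycles_per_line - (FIXED_LOOP_CYCLES + delay_cycles(y))
        worst = max(worst, abs(drift))
    return worst


pal_worst_32 = worst_drift(32)
```

Under these assumptions the error over a 32-line chunk peaks at about 4.6 cycles, compared with about 20 cycles for a fixed 65-cycle loop, which is well inside the window between port writes.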