Enjoying your froyo?

Post by Sik »

Espozo wrote:You can stick as many sprites as you want, and the horizontal shrink for one sprite can change, and the ones on the right of it will move over accordingly, but they won't shrink also. You manually have to horizontal shrink all of them.
This is in large part because sprites are always 16 pixels wide, so the shrink value only has to range from 1 pixel to 16 pixels (it's only 4 bits long). Because each sprite keeps its own horizontal shrink value, you still keep the ability to shrink with pixel granularity (precisely because the shrink values don't all have to be identical).

Would have been easier if the shrink value was larger and only the first sprite had to be set, but oh well, probably hardware implementation details (given it uses a look-up table internally to do the shrinking, keeping it as small as possible was probably a good idea).
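To illustrate why per-sprite shrink values keep pixel granularity, here is a minimal Python sketch. It assumes a chain of 16-pixel-wide sprites whose drawn widths can each be set from 1 to 16 pixels, as described above. The function name and the even-spreading strategy are hypothetical, not something taken from the hardware or from any game.

```python
def chain_widths(total_width, sprite_count):
    """Split a target width (in pixels) across a chain of 16-pixel sprites.

    Each sprite may be drawn 1 to 16 pixels wide, so the chain as a whole can
    take any width from sprite_count to 16 * sprite_count in 1-pixel steps.
    """
    if not sprite_count <= total_width <= 16 * sprite_count:
        raise ValueError("width not reachable with this many sprites")
    base, extra = divmod(total_width, sprite_count)
    # The first `extra` sprites are drawn one pixel wider than the rest.
    return [base + 1 if i < extra else base for i in range(sprite_count)]


# A 3-sprite object (48 pixels at full size) shrunk to 37 pixels wide:
print(chain_widths(37, 3))  # [13, 12, 12]
```

If every sprite had to share one value, the same object could only be 3, 6, 9, ... 48 pixels wide.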

Post by 93143 »

93143 wrote:I figured a single scanline was probably too fast to fit a high-precision music engine in around the data streaming, but two might just work, and three would be easier. At three bytes every three scanlines, 24 kHz mono just barely fits in a frame.
Okay, that's a load of rubbish. Try this on for size:

Code:

; A streaming HDMA table consists of multiple data chunks, with gaps to allow
; the APU to do tasks mid-frame.  Each data chunk is preceded by a heads-up
; to the APU, followed by a gap long enough to guarantee that the audio
; engine will finish what it was doing, notice the incoming data flag, and
; begin polling port 0 for the data start flag.  After this gap, the data
; start flag, data ID number (for multiple stream capability) and data length
; in scanlines is sent, and on the next scanline the data chunk begins.

data_incoming_HDMA:
	mov A, #data_start_HDMA	; 2 cycles - load data start flag value
-  cbne $F4, -					; 7 cycles - listen for the write

; This point is reached roughly 3-9 cycles after $2140 is written, assuming
; CBNE loads the comparison value before the branch target.
	mov A, $F5					 ; 3 cycles - load data ID number
	mov temp, $F6				 ; 5 cycles - load chunk size in scanlines
	
; The APU should have assigned the data ID to a buffer when it sent the
; data request, so all I have to do is find the buffer in question.
	cbne buf1_id, buf2check	; 7/5 cycles - check buffer 1, skip next two instructions if no match
	mov X, #buf1				  ; 0/2 cycles - load direct-page address for buffer 1 data
	jmp buf_found				 ; 0/3 cycles - skip ahead
buf2check:
	cbne buf2_id, buf3		  ; 7/5 cycles - check buffer 2, skip next two instructions if no match
	mov X, #buf2				  ; 0/2 cycles - load direct-page address for buffer 2 data
	jmp buf_found				 ; 0/3 cycles - skip next instruction
buf3:
	mov X, #buf3				  ; 2/0 cycles - load direct-page address for buffer 3 data
; TOTALS:  18-25 since start flag noticed in $F4, 21-34 since $2140 written

; Okay, I've got the buffer (or, if none was assigned, I'll be corrupting buffer #3).
; Now I need to rewrite some MOV instructions in the streaming loop with the desired
; absolute addresses:

buf_found:
	mov A, (X)						 ; 3 cycles - get low byte of buffer address
	mov Y, $01+X					  ; 4 cycles - get high byte of buffer address
	movw (get_data_HDMA+3), YA	; 5 cycles - write buffer address
	clrc								 ; 2 cycles - probably unnecessary, since ADDW does not add the carry in
	addw YA, one					  ; 5 cycles - I've wasted a byte of zero page memory on a constant
	movw (get_data_HDMA+8), YA	; 5 cycles
	addw YA, one					  ; 5 cycles
	movw (get_data_HDMA+13), YA  ; 5 cycles
	addw YA, one					  ; 5 cycles
	movw (get_data_HDMA+18), YA  ; 5 cycles
; TOTALS:  44 cycles since buf_found, 65-78 since start flag written to $2140
	
; Okay, with that done, I just need to get the chunk size, zero X, and start the read loop:
	mov Y, temp					; 3 cycles - pick up chunk size in scanlines
	push X						  ; 4 cycles - store buffer pointer address
	mov X, #$00					; 2 cycles - set X to zero
	jmp !get_data_HDMA		  ; 3 cycles - goto streaming loop in zero page
; TOTALS:  12 cycles since loop rewritten, 77-90 since start flag written to $2140

; Ideally, one scanline should be almost exactly 65 cycles long.  The port reads
; are between cycles 3 and 30 past this point, putting them between 15 cycles after
; the first HDMA write and about 11 cycles before the fourth one on the next line.
; That should be good for at least several scanlines regardless of clock drift, no?

;===================================================================

; STREAMING LOOP IN ZERO PAGE:

get_data_HDMA:
	mov A, $F4					 ; 3 cycles
	mov !buf+X, A				 ; 6 cycles
	mov A, $F5					 ; 3 cycles
	mov !(buf+1)+X, A			; 6 cycles
	mov A, $F6					 ; 3 cycles
	mov !(buf+2)+X, A			; 6 cycles
	mov A, $F7					 ; 3 cycles
	mov !(buf+3)+X, A			; 6 cycles
	inc X							; 2 cycles
	inc X							; 2 cycles
	inc X							; 2 cycles
	inc X							; 2 cycles
	cmp (X), (Y)				  ; waste 5 cycles
	cmp (X), (Y)				  ; waste 5 cycles
	cmp (X), (Y)				  ; waste 5 cycles
	dbnz Y, get_data_HDMA	  ; 6/4 cycles
; TOTAL:  65 cycles
	jmp !end_data_HDMA		  ; 3 cycles

; ZERO PAGE: 32 bytes code, 21 bytes buffer metadata, 2 bytes misc. storage
;            = 55 bytes total, or ~21%.  Maybe I should be using page 1 for this...

;===================================================================

end_data_HDMA:
	mov A, X						; 2 cycles - load the buffer address index into A
	pop X							; 4 cycles - pick up the buffer pointer address
	clrc							 ; 2 cycles - clear carry
	adc A, (X)					 ; 3 cycles - add the index to the low byte of the buffer pointer
	mov (X), A					 ; 4 cycles - store the result back
	mov A, Y						; 2 cycles - Y should be zero
	adc A, $01+X				  ; 4 cycles - add zero to the high byte of the buffer pointer, with carry
	mov $01+X, A				  ; 5 cycles - store the result back
	cbne $05+X, done_HDMA	  ; 8/6 cycles - check high byte against buffer end address
	mov A, (X)					 ; 0/3 cycles - pick up low byte
	cbne $04+X, done_HDMA	  ; 0/8/6 cycles check low byte against buffer end address
	mov A, $02+X				  ; 0/0/4 cycles - if end of buffer reached, load buffer start address low byte
	mov (X)+, A					; 0/0/4 cycles - store to buffer pointer low byte and increment X
	mov A, $02+X				  ; 0/0/4 cycles - load buffer start address high byte
	mov (X), A					 ; 0/0/4 cycles - store to buffer pointer high byte
done_HDMA:
Caveat: I haven't tried this code, so I don't know if it's even correct, never mind if it works or not.
Last edited by 93143 on Thu Jan 28, 2016 2:50 pm, edited 3 times in total.

Post by Drew Sebastino (formerly Espozo) »

Well, it's good enough that it has sprite shrinking anyway. You know, can you really do much in terms of raster effects on the Neo Geo? I mean, you could pull off the Axelay effect if you were to change the first sprite's vertical shrinking value, but shrinking horizontally as well would probably take far more bandwidth than what's available.
Sik wrote:probably hardware implementation details
They didn't seem to have any trouble implementing other random features, like the hardware animation.

If it's just a lookup table though, why don't other systems have the same thing? I mean, no other system really has 512 sized sprites, so the lookup table wouldn't need to be that big, just like 32x32 or 64x64. You would also need it horizontally though.
Sik wrote:you still keep the ability to shrink with pixel granularity
Do some systems that support scrolling not support it pixel per pixel? Well, I mean, I guess you could say the Atari 2600 has "sprite scaling". :lol:

I think it's kind of funny that this topic was brought up again. I'm sad that I never got to try the Splatoon yogurt though, and I'm also sad that there's no new free junk coming out for it. :( The success of that game in Japan is pretty inkredible though. It sucks, because all the Japanese people occupy all the higher ranks in rank battle and my internet connection is about a solid second behind them.
93143 wrote:Okay, that's a load of rubbish. Try this on for size:
What am I looking at? :lol:

Post by 93143 »

That is an attempt at an APU-side HDMA streaming engine capable of approaching four bytes per scanline, with the ability to smoothly back off on the bandwidth in such a way that the audio engine can achieve good time resolution by doing processing in between data chunks. This should work regardless of how long the engine cycles are (within limits - an audio engine that takes most of a frame just to turn over once ain't gonna fit).

The idea is that a single control word need not indicate only a single shot of HDMA data. Here, there's an "incoming" command, which tells the APU to start polling $F4 as soon as it notices, a "start" command, which is bundled with some metadata and tells the APU to actually start receiving data, and an arbitrary number of actual data shots (the number is part of the metadata that came with the "start" command). This pattern can be repeated any number of times per frame. I got the idea from the streaming engine in Super SNESMod, which looks like it relies on timed code to minimize handshaking requirements.

I have a plan that requires both high-bandwidth HDMA streaming (well beyond 32 kHz mono) and high-granularity audio engine timing (one frame is nearly 17 ms, which I don't expect to be acceptable for what I'm trying to do). I'm hoping this code makes enough sense that something like it can actually work. (This is my first attempt at writing SPC700 code; the earlier example in this thread doesn't count.)

...

It just occurred to me that if the audio engine can be relied on to have set up the streaming buffer pointer beforehand, it could also set up the streaming code, and it shouldn't be necessary to do the code modification in between the "start" command and the beginning of data pickup. Moving it to afterwards would allow more than three streaming buffers and remove the requirement for the data pickup loop to be in direct page, but it would eat roughly an extra scanline of compute time after every data chunk, and it would require a separate data pickup loop for each buffer...

With more robust use of X, I could reduce the amount of code modification required (in fact I might try that; I think I could get the loop out of direct page), but I don't see how to eliminate it without either limiting buffer size to 256 bytes (which is bad for 32 kHz because a frame is 300 bytes) or using multiple data pickup loops for each buffer...
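For reference, the 300-byte figure can be reproduced with a little arithmetic. This is only a sketch: it assumes a nominal 60 frames per second and mono BRR data (9 bytes per 16 samples), and the function name is made up for illustration.

```python
BRR_BLOCK_BYTES = 9      # one BRR block: 1 header byte + 8 data bytes
BRR_BLOCK_SAMPLES = 16   # samples decoded from each block


def brr_bytes_per_frame(sample_rate_hz, frames_per_second=60):
    """Bytes of mono BRR data consumed per video frame."""
    return sample_rate_hz * BRR_BLOCK_BYTES / (BRR_BLOCK_SAMPLES * frames_per_second)


print(brr_bytes_per_frame(32000))  # 300.0
```

That is more than one 256-byte page, which is why an 8-bit index alone cannot cover a frame of 32 kHz data.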

Post by Sik »

Espozo wrote:If it's just a lookup table though, why don't other systems have the same thing? I mean, no other system really has 512 sized sprites, so the lookup table wouldn't need to be that big, just like 32x32 or 64x64. You would also need it horizontally though.
Sega was going to implement it in the Mega Drive VDP, but ran out of die space (・~・) I suppose Space Harrier II and Super Thunder Blade were originally meant to use this feature?

And the problem with look-up tables is that, well, they take up a considerable amount of die space which made them really expensive compared to other stuff.

Post by Drew Sebastino »

Sik wrote:Sega was going to implement it in the Mega Drive VDP, but ran out of die space (・~・)
Same with everything else? :lol: What was the Mega Drive originally supposed to be, an arcade machine?

Post by psycopathicteen »

Espozo wrote:
Sik wrote:Sega was going to implement it in the Mega Drive VDP, but ran out of die space (・~・)
Same with everything else? :lol: What was the Mega Drive originally supposed to be, an arcade machine?
It sounds like the Mega Drive was originally supposed to be a prototype Super Famicom.

Post by Sik »

More like they were trying to make something that could reasonably handle their then-current arcade games (including the Super Scaler ones). That didn't work out, as you can see =P The Sega CD was originally an attempt to include the missing hardware, then somebody decided to add a CD drive (that was a last-minute change), and then Digital Pictures lured Sega into thinking that FMV games were more important than, you know, all the other improvements >.<

Post by Drew Sebastino »

Sik wrote:then Digital Pictures lured Sega into thinking that FMV games were more important than, you know, all the other improvements >.<
:lol: It all still has to go through the VDP, though, which partially explains why the video looked like crap. (512 colors total isn't the greatest, but the bigger offender was the four 4bpp color palettes. I imagine they could have saved bandwidth and made the video larger if they cropped off the top and bottom of the screen, but I guess they liked the look of those ugly borders.) If only they knew about the phantom bitmap trick...
Sik wrote:More like they were trying to make something that could reasonably handle their then-current arcade games (including the Super Scaler ones).
If only they could have gone a couple of years into the future to see this: https://www.youtube.com/watch?v=MTzyz2TgGls

Post by Sik »

Espozo wrote:I imagine they could have saved bandwidth and made the video larger if they cropped off the top and bottom of the screen, but I guess they liked the look of those ugly borders.
Eh, bandwidth wasn't the issue (you can easily get 15 FPS without any sort of cheating). The problem was that videos had to be decompressed, since a 1x drive wasn't exactly capable of streaming uncompressed video (I made some calculations, and it'd have gone at like 2 FPS if they tried that). This is also why the video size kept increasing over time: they simply improved the codecs being used (case in point, that one is nearly fullscreen).

(EDIT: also thinking about it, how much uncompressed video can you cram into a CD, really?)
Espozo wrote:If only they could have gone a couple of years into the future to see this: https://www.youtube.com/watch?v=MTzyz2TgGls
To be fair, Sega wasn't aiming for kids either (a large part of why they could actually take over the NES in the US, their target audiences didn't exactly overlap).

Also what's at 1:01, F-Zero Kart? =P

Post by 93143 »

Okay, let's try this again, a little less cryptic this time (and better code too, IMO):

I'm trying to get enough bandwidth for 32 kHz stereo streaming or better while preserving sub-frame audio processing resolution and not loading the S-CPU down very much. This is apropos of the Street Fighter Alpha 2 discussion earlier in the thread (three 22 kHz streams would be great), but I have applications of my own in mind for it.

So what I've done is I've tried to design an approach using HDMA, that uses single commands (well, pairs of) to set up block transfers with no fine-grained handshaking, so as to be able to feed the APU data in chunks during active display while allowing it to do other processing in between the chunks.

In my concept, the HDMA pattern would consist of a series of data blocks, each one consisting of (a) a "data incoming" command, (b) a gap long enough for the audio engine to notice said command and begin polling I/O, (c) a "data start" command, along with the data length in scanlines and a data ID for multi-stream support, and (d) four bytes of data per line, for the number of lines given in (c). Possibly also (e) a "no action" command, so the APU doesn't misinterpret part of the last data shot as some other command...
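The block layout (a) through (e) can be sketched on the S-CPU side as a per-scanline list of port writes. This is a Python model only, not real HDMA table encoding. The command values are hypothetical, since the post does not define them, and the byte order of the "data start" line simply follows the port order the APU code reads ($F4 flag, $F5 data ID, $F6 length).

```python
# Hypothetical command values; the real ones are not given in the post.
CMD_INCOMING = 0xFE
CMD_START = 0xFD
CMD_NONE = 0x00


def build_block(data, data_id, gap_lines, final_no_action=True):
    """Return one entry per scanline for a single data block.

    Each entry is the 4 bytes written to ports $2140-$2143 on that line,
    or None for a line on which nothing is written.
    """
    if len(data) % 4 != 0:
        raise ValueError("data must be a whole number of 4-byte shots")
    data_lines = len(data) // 4
    if not 1 <= data_lines <= 255:
        raise ValueError("line count must fit in one byte")
    lines = [(CMD_INCOMING, 0, 0, 0)]                   # (a) heads-up to the APU
    lines += [None] * gap_lines                         # (b) gap for the engine to notice
    lines.append((CMD_START, data_id, data_lines, 0))   # (c) start flag, ID, length
    for i in range(0, len(data), 4):                    # (d) four bytes per line
        lines.append(tuple(data[i:i + 4]))
    if final_no_action:
        lines.append((CMD_NONE, 0, 0, 0))               # (e) optional "no action"
    return lines


block = build_block(bytes(range(8)), data_id=1, gap_lines=9)
print(len(block))  # 14
```

A whole frame would then be several such blocks back to back, with the gaps sized to the audio engine's worst-case turnaround.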

...

My first attempt, a bit upthread, used a lot of zero-page memory and was limited to three sample buffers at a time, because it relied on 16-bit code modification at runtime. Last night I modified the code to make better use of the index registers, freeing up a significant chunk of direct page and allowing up to 6 simultaneous buffers, although each buffer now has to start on a page boundary. I also moved the buffer metadata to page one, freeing up the rest of zero page (not sure how much that matters).

Unfortunately the new code imposes an additional restriction on the streaming format. The use of X as the low byte of the streaming buffer pointer means the data chunk size has to divide evenly into 256 bytes; otherwise I have to deal with overflow in the pickup loop, and there's no time for that. The buffer should probably also be a multiple of 9 bytes, unless the streaming data is already formatted with the buffer size in mind, studded with end-of-sample bits and padded with zeroes... actually that sounds like a good idea regardless...
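A quick way to see which chunk sizes satisfy the divide-into-256 restriction, as a sketch. The names are illustrative. The last line assumes the buffer must also hold a whole number of chunks, which is one reading of the end-of-buffer check and not something the post states.

```python
import math

BYTES_PER_LINE = 4   # one 4-byte shot to the APU ports per scanline
PAGE = 256           # X wraps every 256 bytes
BRR_BLOCK_BYTES = 9


def valid_chunk_lines(max_lines=64):
    """Chunk lengths (in scanlines) whose byte size divides a 256-byte page."""
    return [n for n in range(1, max_lines + 1) if PAGE % (n * BYTES_PER_LINE) == 0]


print(valid_chunk_lines())  # [1, 2, 4, 8, 16, 32, 64]

# Smallest buffer that is both a whole number of 128-byte chunks and of BRR blocks:
chunk_bytes = 32 * BYTES_PER_LINE
print(chunk_bytes * BRR_BLOCK_BYTES // math.gcd(chunk_bytes, BRR_BLOCK_BYTES))  # 1152
```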

The key question, which at the moment is totally outside my expertise, is how long a high-granularity but full-featured audio engine can be expected to take at most between I/O port checks. If my math is correct, an engine that turns around in 9 scanlines or so should allow up to 640 bytes per frame in 32-line (128-byte) chunks, with five engine slots per frame plus whatever fits in VBlank (roughly 4 at max length) for a total of about 37% of the total compute time (ie: streaming eats 63%). An engine that turns around in 3 scanlines (which may be unrealistic) could allow the same bandwidth in 16-line chunks, with ten engine slots per active display period (in this case streaming eats 70% of total compute time). Paired 16-line chunks (two chunks back-to-back with no processing in between) could do it given 6 scanlines of turnaround time, leaving room for five of those 6-line engine slots in active display. I haven't yet cycle-counted past the end of the streaming routine (partly because I haven't written anything more than this yet), so these numbers are approximate.
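As a rough check on those percentages, here is a sketch that counts scanlines. It assumes 262 lines per NTSC frame and charges each chunk one extra line for the "data start" command; the real per-chunk overhead has not been cycle-counted, so the overhead parameter is a guess.

```python
LINES_PER_FRAME = 262   # NTSC, non-interlaced
BYTES_PER_LINE = 4


def streaming_budget(chunks_per_frame, chunk_lines, overhead_lines_per_chunk=1):
    """Return (payload bytes per frame, fraction of the frame spent streaming)."""
    payload = chunks_per_frame * chunk_lines * BYTES_PER_LINE
    busy_lines = chunks_per_frame * (chunk_lines + overhead_lines_per_chunk)
    return payload, busy_lines / LINES_PER_FRAME


payload, share = streaming_budget(5, 32)
print(payload, round(share * 100))  # 640 63
```

With ten 16-line chunks the same function gives about 65% at one overhead line per chunk and about 69% at two, so the 70% figure presumably includes a bit more per-chunk overhead than this model does.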

Any thoughts? Keep in mind I haven't ever coded for the APU before, and this mess hasn't even been assembled, never mind run...

Code:

; BUFFER METADATA STRUCTURE (WIP):
; byte 0-1:  current buffer write position
; byte 2:  buffer start page
; byte 3-4:  buffer end address
; byte 5: data ID
; In other words, using six buffers burns 36 bytes of direct page.  If using
; zero page for this is acceptable, the SETP/CLRP instructions can be removed
; and the timing headroom goes from ~8 cycles to ~12.

; HDMA STREAMING CODE:
data_incoming_HDMA:
	mov A, #data_start_HDMA	; 2 cycles - load data start flag value
-  cbne $F4, -					; 7 cycles - listen for the write

; This point is reached roughly 3-9 cycles after $2140 is written, assuming
; CBNE loads the comparison value before the branch target.
	mov A, $F5					 ; 3 cycles - load data ID number
	mov X, $F6					 ; 3 cycles - load chunk size in scanlines
; TOTALS:  6 cycles since start code noticed in $F4, 9-15 cycles since $2140 written

; Find the buffer to which the data ID was assigned when the APU sent the data
; request (or processed a streaming SFX request from the S-CPU):
	setp							 ; 2 cycles - switch to page one (optional)
	cbne buf6_id, buf5check	; 7/5 cycles - check buffer 6, proceed to next if no match
	mov Y, #$04					; waste 0/2 cycles
-  dbnz Y, -					  ; waste 0/22 cycles
	cmp A, (X)					 ; waste 0/3 cycles
	mov Y, #buf6				  ; 0/2 cycles - load direct-page address for buffer 6 data
	jmp buf_found				 ; 0/3 cycles - skip ahead
buf5check:
	cbne buf5_id, buf4check	; 7/5 cycles - check buffer 5, proceed to next if no match
	mov Y, #$03					; waste 0/2 cycles
-  dbnz Y, -					  ; waste 0/16 cycles
	cmp A, (X)					 ; waste 0/3 cycles
	mov Y, #buf5				  ; 0/2 cycles - load direct-page address for buffer 5 data
	jmp buf_found				 ; 0/3 cycles - skip ahead
buf4check:
	cbne buf4_id, buf3check	; 7/5 cycles
	cmp (X), (Y)				  ; waste 0/5 cycles
	cmp (X), (Y)				  ; waste 0/5 cycles
	nop							  ; waste 0/2 cycles
	nop							  ; waste 0/2 cycles
	mov Y, #buf4				  ; 0/2 cycles
	jmp buf_found				 ; 0/3 cycles
buf3check:
	cbne buf3_id, buf2check	; 7/5 cycles
	cmp (X), (Y)				  ; waste 0/5 cycles
	nop							  ; waste 0/2 cycles
	mov Y, #buf3				  ; 0/2 cycles
	jmp buf_found				 ; 0/3 cycles
buf2check:
	cbne buf2_id, buf1		  ; 7/5 cycles
	mov Y, #buf2				  ; 0/2 cycles
	jmp buf_found				 ; 0/3 cycles
buf1:
	mov Y, #buf1				  ; 2/0 cycles
buf_found:
; TOTALS:  39-40 since buffer check started, 48-55 since $2140 written

; If no assigned buffer was found, data will be sent to buffer #1.  Now the
; data pickup loop must be rewritten to target the selected buffer.  The SPC700
; has no "mov A, dp+Y" mode, so the high byte is read with absolute+Y addressing,
; which ignores the P flag; the page is therefore spelled out ($0101 for page-one
; metadata, $0001 if the metadata lives in page zero):
	mov A, !$0101+Y				  ; 5 cycles - get high byte of buffer pointer
	mov !(get_data_HDMA+4), A	 ; 5 cycles - write buffer page address
	mov !(get_data_HDMA+9), A	 ; 5 cycles
	mov !(get_data_HDMA+14), A   ; 5 cycles
	mov !(get_data_HDMA+19), A   ; 5 cycles
; TOTALS:  25 cycles since buf_found, 73-80 since start flag written to $2140

; The index registers will now be set up for the loop.  The buffer metadata
; pointer will be saved for later, and X and Y will be loaded with the low
; byte of the buffer pointer and the chunk size in scanlines, respectively:
	mov A, X						; 2 cycles - move chunk size from X to A
	mov X, $00+Y				  ; 4 cycles - get low byte of buffer pointer
	push Y						  ; 4 cycles - store buffer pointer address
	mov Y, A						; 2 cycles - get chunk size in scanlines
	clrp							 ; 2 cycles - switch back to page zero (if using page one)
; TOTALS:  14 cycles since loop rewritten, 87-94 since start flag written to $2140

; Ideally, one scanline should be almost exactly 65 cycles long.  The port reads
; are between cycles 3 and 30 past this point, putting them between 25 cycles after
; the first HDMA write and about 7 cycles before the fourth one on the next line.
; That should be good for at least several scanlines regardless of clock drift, no?

; STREAMING LOOP:
get_data_HDMA:
	mov A, $F4					 ; 3 cycles - get byte 0 of the data shot
	mov !$0000+X, A			  ; 6 cycles - write it to the current buffer position
	mov A, $F5					 ; 3 cycles - get byte 1
	mov !$0001+X, A			  ; 6 cycles - write it to the current buffer position plus one
	mov A, $F6					 ; 3 cycles - get byte 2
	mov !$0002+X, A			  ; 6 cycles
	mov A, $F7					 ; 3 cycles - get byte 3
	mov !$0003+X, A			  ; 6 cycles
	inc X							; 2 cycles - increment the current buffer position four times
	inc X							; 2 cycles
	inc X							; 2 cycles
	inc X							; 2 cycles
	cmp (X), (Y)				  ; waste 5 cycles
	cmp (X), (Y)				  ; waste 5 cycles
	cmp (X), (Y)				  ; waste 5 cycles
	dbnz Y, get_data_HDMA	  ; 6/4 cycles - repeat for next scanline, or exit if done
; TOTAL:  65 cycles

; The final loop ends ~19-26 cycles after the first byte would be written on the line
; immediately following the last line of the data chunk.

; Now it remains only to store X back in the zero page data structure and check for
; page rollover and end-of-buffer, updating the high byte of the buffer pointer as
; appropriate:
end_data_HDMA:
	setp							 ; 2 cycles - switch to page one (if using page one for buffer metadata)
	mov A, X						; 2 cycles - load the new buffer address low byte from X
	pop X							; 4 cycles - pick up the buffer pointer address
	mov (X), A					 ; 4 cycles - store the new buffer pointer low byte
	bne +							; 4/2 cycles - check if X had rolled over to zero (POP doesn't affect flags)
	inc $01+X					  ; 5 cycles - increment high byte of buffer pointer
+  mov A, $01+X				  ; 4 cycles - pick up high byte
	cbne $04+X, done_HDMA	  ; 8/6 cycles - check high byte against buffer end address
	mov A, (X)					 ; 0/3 cycles - pick up low byte
	cbne $03+X, done_HDMA	  ; 0/8/6 cycles check low byte against buffer end address
	mov A, $02+X				  ; 0/0/4 cycles - if end of buffer reached, load buffer start page
	mov $01+X, A				  ; 0/0/5 cycles - store to buffer pointer high byte
done_HDMA:
	clrp							 ; 2 cycles - switch back to page zero (if using page one)
; This code ends ~49-75 cycles after the first non-chunk HDMA slot.  In other words, it
; brackets the second slot, unless it will be more than 16 cycles until the next read.
; Which is quite probable.  And that means the next read will get whatever was written
; TWO scanlines after the last data shot.  Or, simply put, there are two scanlines of
; overhead after the chunk ends.
I've taken a cursory look at Super SNESMod, and at the APU code from N-Warp Daisakusen. The latter is interesting because it's doing almost exactly what I'm trying to do, but it's handled differently and seems to have some disadvantages compared with my approach (though to be fair, it is a field-proven capability, while mine is very much not). It also uses 66 cycles instead of 65 for the loop, but that seems to be a PAL thing.

Now that I think about it, if I wanted my code to be able to handle 32-line chunks on PAL, I'd probably have to partially unroll the pickup loop to take 131 cycles per two scanlines. PAL is nominally 65.632 cycles per scanline, vs. 65.033 on NTSC, give or take quite a bit (nearly 0.2 as I understand it), so over a chunk that long the timing would be unreliable with any single-line loop... and if I need two instances of the pickup code, it will take 20 extra cycles to overwrite the high byte, so I'm back to 3 buffers...

Wait... I have 15 cycles in that loop during which nothing whatsoever is happening:

Code:

; FOR PAL, REPLACE 15-CYCLE TIME DELAY IN DATA PICKUP LOOP WITH:
	mov A, Y		; 2 cycles
	and A, #$01	; 2 cycles
	beq +			; 4/2 cycles
	cmp A, (X)	 ; 0/3 cycles
+  nop			  ; 2 cycles
	cmp (X), (Y)  ; 5 cycles
That's either 15 or 16 cycles depending on the low bit of the line counter. Problem solved.
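To put numbers on the drift, here is a small sketch using the nominal cycles-per-scanline figures quoted above. It ignores the roughly 0.2-cycle tolerance mentioned earlier, and the function name is illustrative only.

```python
def drift(cycles_per_line, loop_cycles, lines):
    """Accumulated timing error (in SPC700 cycles) after `lines` scanlines.

    loop_cycles is a repeating sequence of per-line loop lengths.
    """
    total_loop = sum(loop_cycles[i % len(loop_cycles)] for i in range(lines))
    return cycles_per_line * lines - total_loop


PAL, NTSC = 65.632, 65.033   # nominal cycles per scanline, from the post

print(round(drift(NTSC, [65], 32), 1))      # 1.1
print(round(drift(PAL, [65], 32), 1))       # 20.2
print(round(drift(PAL, [66], 32), 1))       # -11.8
print(round(drift(PAL, [65, 66], 32), 1))   # 4.2
```

Over a 32-line chunk, a fixed 65- or 66-cycle loop drifts by more than the few cycles of headroom on PAL, while the alternating 65/66 loop stays within about 4 cycles.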

Post by psycopathicteen »

On the subject of Capcom beat 'em ups, is there a reason for the 3-enemy limit in the SNES Final Fight games other than perceived CPU speed?

Post by tepples »

My first guess would be CHR RAM limits. The Super NES sprite page is 128x256 pixels, and that has to cover the player, the player's pickup weapon, enemies, and hit sparks. Though DMA is pretty fast at filling it, it takes about three frames without letterboxing to refill it, which is why Final Fight has letterboxing.
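A back-of-the-envelope version of that estimate, as a sketch. It assumes commonly cited NTSC figures (1364 master cycles per scanline, about 40 of them lost to DRAM refresh, 8 master cycles per DMA byte, 262 lines per frame) and ignores DMA setup time and any other VBlank work, so the real budget is somewhat smaller.

```python
import math

MASTER_CYCLES_PER_LINE = 1364    # assumed NTSC scanline length
REFRESH_CYCLES_PER_LINE = 40     # assumed DRAM refresh pause per line
MASTER_CYCLES_PER_DMA_BYTE = 8
LINES_PER_FRAME = 262            # NTSC

SPRITE_PAGE_BYTES = 128 * 256 * 4 // 8   # 128x256 pixels at 4 bits per pixel


def dma_bytes_per_frame(visible_lines):
    """Upper bound on bytes DMA can move while the picture is blanked."""
    blank_lines = LINES_PER_FRAME - visible_lines
    usable = MASTER_CYCLES_PER_LINE - REFRESH_CYCLES_PER_LINE
    return blank_lines * usable / MASTER_CYCLES_PER_DMA_BYTE


full_height = dma_bytes_per_frame(224)
print(SPRITE_PAGE_BYTES, math.ceil(SPRITE_PAGE_BYTES / full_height))  # 16384 3
```

Passing a smaller `visible_lines` value models letterboxing: every blanked line adds roughly 165 bytes of DMA time per frame.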

My second guess would be overdraw. The Super NES's maximum total sprite width is 272 pixels (34 8x1 pixel slivers) per scanline. Say you have Haggar and three Andore, and they all decide to do an attack where they spread their arms to be 64 pixels wide. Then you've used up most of the available slivers. It's also why Mighty Final Fight for NES used much smaller sprites.
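The sliver count in that scenario works out as follows. This is a simplified sketch: it takes the 34-sliver limit from the paragraph above and treats each character as a single 64-pixel-wide span, ignoring how the frames are actually cut into hardware sprites.

```python
SLIVERS_PER_LINE = 34   # 8x1 pixel slivers per scanline, from the post
SLIVER_WIDTH = 8


def slivers_used(widths_px):
    """Slivers consumed on one scanline by sprites of the given pixel widths."""
    return sum(-(-w // SLIVER_WIDTH) for w in widths_px)   # ceiling division


fighters = [64, 64, 64, 64]   # Haggar plus three Andore, arms spread
used = slivers_used(fighters)
print(used, SLIVERS_PER_LINE - used)  # 32 2
```

That leaves two slivers (16 pixels) on those scanlines for weapons and hit sparks.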

Post by Sik »

Memory is definitely the issue. I doubt sprite overflow was much of a worry; the scenario tepples described is probably not that common and can usually be ignored =P

Post by psycopathicteen »

You could have variable sized slots, and reorganize the slots when an enemy dies.
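One way to picture that idea is the toy allocator below. Everything in it is hypothetical (the class, the slot sizes, the 512-tile budget for a 128x256 sprite page); it only shows how closing the gap after a death can make room that fixed slots would not.

```python
class SlotTable:
    """Variable-sized tile slots packed back to back in a fixed tile budget."""

    def __init__(self, capacity_tiles):
        self.capacity = capacity_tiles
        self.slots = []   # list of (owner, size_in_tiles), in VRAM order

    def used(self):
        return sum(size for _, size in self.slots)

    def allocate(self, owner, size):
        """Append a slot; return its starting tile index, or None if full."""
        start = self.used()
        if start + size > self.capacity:
            return None
        self.slots.append((owner, size))
        return start

    def free(self, owner):
        """Drop an owner's slot and close the gap.

        Returns (owner, old_start, new_start) for every slot that moved, since
        each of those needs its tiles re-uploaded and its tile numbers patched.
        """
        moved, kept = [], []
        old_start = new_start = 0
        for name, size in self.slots:
            if name != owner:
                if old_start != new_start:
                    moved.append((name, old_start, new_start))
                kept.append((name, size))
                new_start += size
            old_start += size
        self.slots = kept
        return moved


table = SlotTable(512)                  # 128x256 pixels = 512 tiles of 8x8
table.allocate("player", 128)
table.allocate("enemy_a", 96)
table.allocate("enemy_b", 160)
print(table.allocate("enemy_c", 160))   # None
print(table.free("enemy_a"))            # [('enemy_b', 224, 128)]
print(table.allocate("enemy_c", 160))   # 288
```

The cost is the extra DMA for every slot that moves, which competes with the animation uploads discussed above.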