How fast is dynamic sprite loading?

psycopathicteen
Posts: 3001
Joined: Wed May 19, 2010 6:12 pm

How fast is dynamic sprite loading?

Post by psycopathicteen »

In this post (http://forums.nesdev.com/viewtopic.php?p=149852#p149852), tepples wrote: "Is this Battletoads?"

Some people are fans of CHR ROM because it allows rapid switching of tiles for smooth animation of the player character. But in Kirby's Adventure, it ends up causing a lot of duplication because all frames of all enemies on screen at once need to fit in the same 2K bank of enemy tiles. So instead, I'm a fan of the Battletoads technique of loading sprite tiles into video memory as they're needed. I've already described how this works on Game Boy Advance, but the NES has far less video memory bandwidth and thus needs a somewhat cleverer technique.

The engine I'm developing for this project has four object slots in video memory: one for the hero and three for enemies. These occupy CHR RAM $1800-$19FF, $1A00-$1BFF, $1C00-$1DFF, and $1E00-$1FFF. Each slot is divided into a pair of 16-tile buffers, plus several variables in main RAM:
  • Current cel: The cel ID currently being displayed in this slot.
  • Next cel: The cel ID whose tile data needs to be loaded into the back buffer of this slot.
  • Current buffer: Whether the slot's first or second buffer is its front buffer.
  • Information about what data has been loaded into each buffer of each slot.
In addition, a set of request flags controls which sprites should be switched to the next cel as soon as they are completely loaded.

On each frame that doesn't have any updates to tiles or map caused by scrolling, the sprite cel loader finds pieces of a cel to load. It prioritizes slots whose request bit is set, switching buffers and clearing the request bit if the cel is ready and loading a piece into the VRAM transfer buffer if not. Up to 8 tiles can be copied in each frame (NTSC without extended blanking). If a particular cel uses all 16 tiles, its update is split across two frames.

If there is still no scheduled VRAM transfer after the loader has processed all request bits, it loads pieces of the next cel speculatively. Speculative loading sets the next cel to the frame most likely to follow a slot's current cel, such as the next cel of a walk cycle. I count about five mispredicts per second on average, usually when an enemy spawns or when the player takes an unpredicted action, such as jumping, stopping a walk, beginning a punch combo, allowing a punch combo to expire, or taking a hit. A mispredict may delay loading a cel for a frame or two. Otherwise, speculative loading puts a cel into VRAM just when it is needed, allowing the player and enemies to be animated at an acceptable frame rate.
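As a rough illustration of the double-buffered slot described above, here is a minimal Python model of one slot. It is only a sketch under assumptions: the names `Slot`, `load_step`, and `cel_sizes` are made up for this example, it handles a single slot, and it leaves out the prioritization across slots and the speculative choice of the next cel.

```python
from dataclasses import dataclass, field

TILES_PER_FRAME = 8  # per the post: NTSC without extended blanking


@dataclass
class Slot:
    current_cel: int = 0   # cel ID being displayed
    next_cel: int = 0      # cel ID wanted in the back buffer
    front: int = 0         # which of the two buffers is the front buffer
    request: bool = False  # switch to next_cel as soon as it is loaded
    # loaded[b] = (cel ID held by buffer b, tiles copied so far)
    loaded: list = field(default_factory=lambda: [(0, 16), (None, 0)])


def load_step(slot, cel_sizes):
    """Run one frame of loading for one slot; return tiles copied this frame."""
    back = slot.front ^ 1
    cel, done = slot.loaded[back]
    if cel != slot.next_cel:
        # Back buffer holds something else (e.g. a mispredict): start over.
        cel, done = slot.next_cel, 0
    size = cel_sizes[cel]
    if done >= size:
        slot.loaded[back] = (cel, done)
        if slot.request:
            # Cel is fully loaded and was requested: flip the buffers.
            slot.front = back
            slot.current_cel = cel
            slot.request = False
        return 0
    copied = min(TILES_PER_FRAME, size - done)
    slot.loaded[back] = (cel, done + copied)
    return copied
```

In this model a requested 16-tile cel takes two calls to load (8 tiles each) and a third call to flip the buffers, which matches the "split across two frames" behavior in the post.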

The metasprite drawing code uses values $00-$7F normally for constant tiles. It uses $80-$8F for these switchable slots, ORing in the start tile of the current buffer of the slot being drawn.
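The tile-number translation can be sketched in Python as follows. This assumes 8x8 sprites fetched from the pattern table at $1000, so that CHR RAM address $1800 corresponds to tile $80; the function names are hypothetical and not taken from the engine.

```python
SLOT_BASE_TILE = 0x80    # assumption: $1800 in the $1000 pattern table is tile $80
TILES_PER_SLOT = 0x20    # $200 bytes per slot / 16 bytes per tile
TILES_PER_BUFFER = 0x10  # each slot is a pair of 16-tile buffers


def buffer_start_tile(slot, buf):
    """First tile number of buffer `buf` (0 or 1) in slot `slot` (0 to 3)."""
    return SLOT_BASE_TILE + slot * TILES_PER_SLOT + buf * TILES_PER_BUFFER


def resolve_tile(tile, slot, buf):
    """Map a metasprite tile value to the tile number written to OAM."""
    if tile < 0x80:
        return tile  # constant tile, used as-is
    if tile > 0x8F:
        raise ValueError("switchable tiles are $80-$8F")
    # The start tile has bit 7 set and a zero low nibble, so OR is enough.
    return tile | buffer_start_tile(slot, buf)
```

For example, tile $83 drawn from slot 2 with buffer 1 in front resolves to $D3, because that buffer starts at tile $D0.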
Is the NES really that bad with sprites? I wrote down a quick loading routine and counted the cycles and ended up with:

Code: Select all

-;
lda ({tile_address}),y	//5
sta {vram_port}		//4 9
iny			//2 11
cpy #$10		//2 13
bne -			//3 16
It would take only 2048 cycles to upload 8 tiles, and vblank is more than 4096 cycles long.
tepples
Posts: 22345
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Looking for NES Coder ( Paid )

Post by tepples »

psycopathicteen wrote:Is the NES really that bad with sprites? I wrote down a quick loading routine and counted the cycles and ended up with:

Code: Select all

-;
lda ({tile_address}),y	//5
sta {vram_port}		//4 9
iny			//2 11
cpy #$10		//2 13
bne -			//3 16
It would take only 2048 cycles to upload 8 tiles, and vblank is more than 4096 cycles long.
Vblank on NTSC NES is closer to 2270 cycles long because the NES PPU always runs in 240-line mode. This also needs to include about 600 cycles of other tasks, such as OAM DMA and setting the scroll position. So the pattern loading routine is unrolled by a factor of 16 and always copies from a buffer in an otherwise unused part of the stack page ($0100-$017F).
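The arithmetic behind those numbers can be checked with a few lines of Python. This is a back-of-the-envelope sketch: it uses the NTSC figures of 341 PPU dots per scanline, 20 vblank scanlines, and 3 PPU dots per CPU cycle, takes the "about 600 cycles" of other tasks from the post, and ignores loop overhead and VRAM address setup.

```python
PPU_DOTS_PER_SCANLINE = 341
VBLANK_SCANLINES = 20
PPU_DOTS_PER_CPU_CYCLE = 3  # NTSC
BYTES_PER_TILE = 16


def vblank_cpu_cycles():
    """CPU cycles available during NTSC vertical blanking."""
    return PPU_DOTS_PER_SCANLINE * VBLANK_SCANLINES / PPU_DOTS_PER_CPU_CYCLE


def tiles_per_vblank(cycles_per_byte, overhead=600):
    """Whole tiles that fit after reserving `overhead` cycles for other tasks."""
    budget = vblank_cpu_cycles() - overhead
    return int(budget // (cycles_per_byte * BYTES_PER_TILE))


print(round(vblank_cpu_cycles()))  # about 2273
print(tiles_per_vblank(16))        # the 16-cycle-per-byte loop quoted above
print(tiles_per_vblank(8))         # an 8-cycle-per-byte unrolled copy
```

With these assumptions the 16-cycle-per-byte loop fits 6 tiles per vblank, and an 8-cycle-per-byte copy fits 13.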
psycopathicteen
Posts: 3001
Joined: Wed May 19, 2010 6:12 pm

Re: Looking for NES Coder ( Paid )

Post by psycopathicteen »

I thought most games used forced blank.
tokumaru
Posts: 12106
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: Looking for NES Coder ( Paid )

Post by tokumaru »

psycopathicteen wrote:

Code: Select all

-;
lda ({tile_address}),y	//5
sta {vram_port}		//4 9
iny			//2 11
cpy #$10		//2 13
bne -			//3 16
You really shouldn't compare and branch every byte, when you know that each tile is 16 bytes. Unrolling this loop to copy 16 bytes at a time already represents a big speed boost. Still, having to increment Y for every byte and using indirect indexed addressing is too slow for my taste. I'd rather interleave the bytes and use indexed addressing with increasing base addresses in an unrolled loop, or even buffer the tiles in RAM beforehand and copy them to VRAM with an unrolled loop.
psycopathicteen wrote:It would take only 2048 cycles to upload 8 tiles, and vblank is more than 4096 cycles long.
As has been pointed out, your math is a little off. With only 2273 cycles of VBlank, you have to do better than this if you expect to animate objects and update other things, such as backgrounds, palettes and OAM.
psycopathicteen wrote:I thought most games used forced blank.
Most games don't! The ones that do are usually unlicensed.
tepples
Posts: 22345
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Looking for NES Coder ( Paid )

Post by tepples »

This is what happens in a typical unrolled tile copy, at 140 cycles per 16-byte tile:

Code: Select all

vram_copybuf = $0100
PPUDATA = $2007

; prep code omitted
; carry is clear at this point
copyloop:
  .repeat 16, I
    lda vram_copybuf+I,x
    sta PPUDATA
  .endrepeat
  txa
  adc #16
  tax
  cpx vram_copylen
  bcc copyloop
; fixup code omitted
The .repeat block in ca65 expands into this:

Code: Select all

  lda $0100,x
  sta $2007
  lda $0101,x
  sta $2007
  lda $0102,x
  sta $2007
  ; ...
  lda $010F,x
  sta $2007
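The 140-cycle figure can be reproduced by adding up standard 6502 instruction timings. The sketch below assumes that vram_copylen lives in zero page (so cpx takes 3 cycles) and that lda a,x never crosses a page, which holds for a buffer kept within $0100-$017F; neither assumption is stated in the post.

```python
CYCLES = {
    "lda abs,x": 4,  # no page crossing assumed
    "sta abs": 4,
    "txa": 2,
    "adc #imm": 2,
    "tax": 2,
    "cpx zp": 3,     # assumption: vram_copylen is in zero page
    "bcc taken": 3,
    "bcc not taken": 2,
}


def cycles_per_tile():
    """Cycles for one pass of the unrolled loop, with the branch taken."""
    copy = 16 * (CYCLES["lda abs,x"] + CYCLES["sta abs"])
    loop = (CYCLES["txa"] + CYCLES["adc #imm"] + CYCLES["tax"]
            + CYCLES["cpx zp"] + CYCLES["bcc taken"])
    return copy + loop


def cycles_for_tiles(n):
    """Total for n tiles; the final bcc falls through and is 1 cycle cheaper."""
    saved = CYCLES["bcc taken"] - CYCLES["bcc not taken"]
    return n * cycles_per_tile() - saved


print(cycles_per_tile())    # 140
print(cycles_for_tiles(8))  # 1119
```

Eight tiles per frame therefore cost about 1120 cycles, not counting the omitted prep and fixup code.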
rainwarrior
Posts: 8062
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada

Re: Looking for NES Coder ( Paid )

Post by rainwarrior »

If you store your data to upload on the stack and unroll your code, you can easily get 8 cycles per byte:

Code: Select all

.repeat 16
    pla ; 4 cycles
    sta $2007 ; 4 cycles
.endrepeat
If you want to write a generator to unroll and store your tiles as code in ROM, you can get down to 6 cycles or less per byte:

Code: Select all

    lda #$05 ; 2 cycles
    sta $2007 ; 4 cycles
    ldx #$39 ; 2 cycles
    stx $2007 ; 4 cycles
    ldy #$73 ; 2 cycles
    sty $2007 ; 4 cycles
    ...
You can also order the choice of register to make loads redundant (e.g. after lda #$00 you can sta $2007 for many bytes of zeroes), saving 2 more cycles each time. (You probably wouldn't do this in combination with a forced vblank, though, since you'd normally need a consistent cycle count for that.)

You can also dynamically build this code in RAM if you want to save ROM space, at the expense of extra setup time outside of vblank.
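A generator along those lines might look like the following Python sketch. It is one possible approach, not a description of any existing tool: `generate_upload` is a hypothetical name, and the register choice is a simple heuristic (reuse A, X, or Y if one already holds the byte, otherwise reload whichever was loaded longest ago).

```python
def generate_upload(data):
    """Return (assembly lines, cycle count) for unrolled writes to $2007."""
    regs = {"a": None, "x": None, "y": None}  # value currently in each register
    order = ["a", "x", "y"]                   # least recently loaded first
    lines = []
    cycles = 0
    for value in data:
        # Reuse a register that already holds this byte, if there is one.
        reg = next((r for r in order if regs[r] == value), None)
        if reg is None:
            reg = order.pop(0)
            order.append(reg)
            regs[reg] = value
            lines.append(f"  ld{reg} #${value:02X}")  # immediate load: 2 cycles
            cycles += 2
        lines.append(f"  st{reg} $2007")              # absolute store: 4 cycles
        cycles += 4
    return lines, cycles


lines, cycles = generate_upload(bytes([0x05, 0x39, 0x73]))
print("\n".join(lines))
```

For the three bytes above this emits the same lda/ldx/ldy sequence as the example in the post. A tile of 16 zero bytes costs one load plus 16 stores (66 cycles), while a tile of 16 distinct bytes costs the full 96.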


As for games that use forced vblank, there are very few. If you have bankable CHR-ROM, there's generally no need for it; it's mostly just for animating tiles with CHR-RAM. Not a lot of games actually did that.
rainwarrior
Posts: 8062
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada

Re: Looking for NES Coder ( Paid )

Post by rainwarrior »

Tepples, I hope your carry is clear before that adc #16.
tepples
Posts: 22345
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Looking for NES Coder ( Paid )

Post by tepples »

rainwarrior wrote:Tepples, I hope your carry is clear before that adc #16.
The prep code clears it.

And PLA is as slow as LDA a,X.
tokumaru
Posts: 12106
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: Looking for NES Coder ( Paid )

Post by tokumaru »

rainwarrior wrote:If you store your data to upload on the stack and unroll your code, you can easily get 8 cycles per byte:
You can also get 8 cycles per byte straight off the ROM if you're OK with copying groups of tiles instead of single tiles, and interleaving the bytes of all the groups (creating structures of arrays, as the 6502 likes it). For example, if using groups of 64 bytes (4 tiles) you could address 16KB of CHR data with an 8-bit index:

Code: Select all

; ca65 syntax: the repeat counter I selects byte I of every group,
; and X selects which 64-byte group to copy
.repeat 64, I
	lda $8000 + I * 256, x
	sta $2007
.endrepeat
This would work well for UNROM for example.
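To make the interleaved layout concrete, here is a Python sketch of how a build script could arrange the data. The names are hypothetical; the only thing taken from the post is the layout itself, where byte k of group g sits at offset k*256 + g within the 16 KB bank, so that an indexed load from $8000 + k*256 with X = g fetches it.

```python
GROUP_SIZE = 64   # bytes per group (4 tiles)
NUM_GROUPS = 256  # addressable with an 8-bit index


def interleave(groups):
    """Pack up to 256 groups of 64 bytes into a 16 KB structure-of-arrays bank."""
    if len(groups) > NUM_GROUPS:
        raise ValueError("at most 256 groups fit an 8-bit index")
    bank = bytearray(GROUP_SIZE * NUM_GROUPS)
    for g, group in enumerate(groups):
        if len(group) != GROUP_SIZE:
            raise ValueError("each group must be exactly 64 bytes")
        for k, value in enumerate(group):
            bank[k * NUM_GROUPS + g] = value
    return bytes(bank)


def read_group(bank, g):
    """Read group g back in the order the unrolled copy would send it."""
    return bytes(bank[k * NUM_GROUPS + g] for k in range(GROUP_SIZE))
```

The resulting 16384 bytes would be placed at the start of a 16 KB PRG bank; `read_group` is only there to check that the round trip returns the original group.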
rainwarrior wrote:If you want to write a generator to unroll and store your tiles as code in ROM, you can get down to 6 cycles or less per byte:
That's something I considered doing for a handful of animated objects, as well as the main character. Definitely not for all the graphics in a game.
rainwarrior wrote:You can also dynamically build this code in RAM if you want to save ROM space, at the expense of extra setup time outside of vblank.
I have to say I'm not a fan of spending so much time just preparing data like that.

BTW, I just noticed we've had this conversation before.

Anyway, you know what would've been sweet? If there was an option to select $2004 or $2007 as the target for DMA writes. It wouldn't do much for name table updates (besides allowing a full background update in a single frame), but it would've been a great help for managing CHR-RAM. I know it's silly to think of what could have been... the console is what it is and we must accept its limitations, but wouldn't it be nice if a mapper could add this feature?
rainwarrior
Posts: 8062
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada

Re: Looking for NES Coder ( Paid )

Post by rainwarrior »

tokumaru wrote:Anyway, you know what would've been sweet? If there was an option to select $2004 or $2007 as the target for DMA writes. It wouldn't do much for name table updates (besides allowing a full background update in a single frame), but it would've been a great help for managing CHR-RAM. I know it's silly to think of what could have been... the console is what it is and we must accept its limitations, but wouldn't it be nice if a mapper could add this feature?
Wasn't that basically what the dual WRAM/CHR-RAM mapper idea was for?
tokumaru
Posts: 12106
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: Looking for NES Coder ( Paid )

Post by tokumaru »

rainwarrior wrote:Wasn't that basically what the dual WRAM/CHR-RAM mapper idea was for?
That was nice, but way too complicated to implement, IMO. A DMA feature built from the ground up would be complicated too, I know. Being able to reuse the existing DMA functionality but routing writes to $2007 instead would be the really cool thing I think, but that's probably not possible.
psycopathicteen
Posts: 3001
Joined: Wed May 19, 2010 6:12 pm

Re: How fast is dynamic sprite loading?

Post by psycopathicteen »

So it could do 16 tiles per frame even without forced blank. So that means that if DKC got ported to the NES, the sprites would be half their size, half the amount, and half the framerate.
rainwarrior
Posts: 8062
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada

Re: How fast is dynamic sprite loading?

Post by rainwarrior »

psycopathicteen wrote:if DKC got ported to the NES
The NES has bankable CHR-ROM solutions, though. Why not just use that? They probably would have used that on SNES if it was capable.
tepples
Posts: 22345
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: How fast is dynamic sprite loading?

Post by tepples »

rainwarrior wrote:
psycopathicteen wrote:if DKC got ported to the NES
The NES has bankable CHR-ROM solutions, though. Why not just use that?
The four windows of MMC3 work for the player and three enemies at once. If there are more independently animated enemies, you have to group enemies into enemy sets and duplicate each enemy's sprite tiles in the tile bank associated with each enemy set in which it appears, as Kirby's Adventure does. This is part of why Teenage Mutant Ninja Turtles II stops the scroll so often, so that the two players never encounter more than two distinct enemy types at once.
rainwarrior
Posts: 8062
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada

Re: How fast is dynamic sprite loading?

Post by rainwarrior »

You could make a mapper that divides it as fine as you need? 16 slots would allow 16 characters with up to 16 tiles each (even though you could only display half of it in any given frame). If your characters aren't overlapping vertically, you could also use the MMC3's scanline counter to multiplex its existing 4 banks.

Also, we're forgetting that Hummer Team already ported Donkey Kong Country to the NES:
https://www.youtube.com/watch?v=fBeD-kEHy3E

(As you might have guessed, it uses 1k CHR-ROM banking.)