Updating VRAM via stack

Posted: Tue Nov 08, 2011 9:04 am
by Wave
I think this has been discussed several times in different topics, but I would like to know if it's a good decision.

I would like to avoid complexity in my general-purpose framework, so I'm wondering about updating the palette every frame (from a buffer).

Assuming I steal the buffer from the stack page (up to 160 bytes):

Having 33 bytes (32 bytes of palette + background color) at $0100, for example, then in the NMI I could do something like:

lda #$3F
sta PPU.ADDR
lda #$00
sta PPU.ADDR

tsx
stx _stackTmp
ldx #$00
txs

//33 times this
pla
sta PPU.IO

ldx _stackTmp
txs
This would copy the whole palette (plus set the background color). Is it a good idea, or would it have any major drawbacks (assuming interrupts are off or will never happen)?

If it works I could use it to update nametable or attribute table too.

Posted: Tue Nov 08, 2011 9:31 am
by Kasumi
You don't need a 33rd byte for the background color. The universal background color is set by the first byte of the first background palette. The first byte of the other background palettes isn't even displayed. Just like the first byte in each of the sprite palettes doesn't matter because they're always drawn transparent.

Using the stack is a great idea and you seem to be going about it correctly. I created a general purpose stream format for writing to the PPU, which seems to be what you want to do. I can write an address, a number of bytes, and the bytes themselves to the stack and my NMI routine will write them to the PPU.

If all you plan to do is update the palettes: Nothing really bad will happen if interrupts are off or will never happen except, of course, the palette not updating. There wouldn't be a crash or anything with what you have so far. If you plan to add nametable updates and have a variable to determine when they're ready, you may end up with an infinite loop if interrupts are disabled for some reason.

Is there any specific reason you would ever need to disable the nmi happening at the start of each frame, though?

Edit2: Ah, right, a general purpose framework. I would just discourage the user from disabling the NMI interrupts. I think anyone interested in something that updates the PPU in an easy way would heed this warning, and anyone who wanted to mess with it would know what they were getting into.

Edit: Oh yeah. You probably want to transfer #$FF to the stack pointer, not #$00. When pla executes with the stack pointer at $00, it actually pulls from RAM location $0101, not $0100. The stack pointer points to where the next byte will be pushed, not where it will be pulled from.
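
To make the off-by-one concrete, here is a tiny Python model (not 6502 code) of what PLA does: it increments the stack pointer first, then reads from page $01. The `pull` helper and the `ram` list are made-up names for illustration only.

```python
# Minimal model of the 6502 PLA instruction: the stack pointer is
# incremented first (wrapping within 8 bits), then the byte is read
# from page $01. `pull` and `ram` are illustrative names only.

def pull(ram, sp):
    """Return (value, new_sp) the way PLA would."""
    sp = (sp + 1) & 0xFF
    return ram[0x0100 + sp], sp

ram = [0] * 0x0800
ram[0x0100] = 0xAA  # first byte of a buffer placed at $0100
ram[0x0101] = 0xBB  # second byte of the buffer

first_with_sp_00, _ = pull(ram, 0x00)  # reads $0101, skipping $0100
first_with_sp_ff, _ = pull(ram, 0xFF)  # wraps to $00 and reads $0100
```

So with the stack pointer loaded with $FF, the first pull returns the byte at $0100, which is what the buffer layout in the first post expects.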

Posted: Tue Nov 08, 2011 9:51 am
by Dwedit
There seem to be three different "very fast" ways of writing to PPU memory.

There's the LDA nnnn,X \ STA $2007 method, the PLA \ STA $2007 method, and the LDA #xx \ STA $2007 method.

To do the LDA nnnn,X \ STA $2007 method, you write a series of writes like this:
;x starts at zero
LDA nnnn,x
STA $2007
LDA nnnn+1,x
STA $2007
... unroll to 32 times
LDA nnnn+31,x
STA $2007

Then if you have more bytes to write, you can increase X by 32 and repeat the loop.
This takes 4 cycles for the LDA nnnn,X instruction (or 5 cycles if you cross a 256 byte boundary, don't do that), and 4 cycles for the PPU write to $2007.


PLA \ STA $2007 is the same speed as LDA nnnn,X \ STA $2007. The advantage I see to using PLA is that you don't need to increase X every once in a while to continue looping, and you get smaller code. The only disadvantage is that you need to be more careful about disabling interrupts and NMIs. And don't forget to read $2002 to clear the VBL flag at some point before re-enabling NMIs, otherwise you trigger a second NMI in the same vblank.

Battletoads uses PLA \ STA $2007 for the nametables, and LDY nnnn,X \ STY $2007 for the graphics data. Battletoads also happens to cross pages much of the time, costing a couple scanlines of draw time.

To get faster, you'd need a sequence of LDA #xx \ STA $2007 instructions sitting in RAM, with the immediate values set using self-modifying code. MC Kids uses this method to update the nametables. 32 writes = 160 bytes of code in RAM. Since MC Kids has WRAM, it can afford the RAM usage of this method.
It takes 2 cycles for the immediate load, and 4 cycles for the $2007 store, so you get 6 cycles per PPU write, instead of 8 cycles.
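
The arithmetic behind those figures, written out as a short Python sketch. The constant and function names are made up for illustration; the cycle counts are the ones quoted above, assuming no page crossing on the indexed load.

```python
# Cycle and size figures quoted in the post, per byte written to $2007.
STA_ABS_CYCLES = 4      # STA $2007
LDA_ABS_X_CYCLES = 4    # LDA nnnn,X with no page crossing
PLA_CYCLES = 4          # PLA
LDA_IMM_CYCLES = 2      # LDA #xx

def cycles(load_cycles, count):
    """Total cycles for `count` unrolled load/store pairs."""
    return (load_cycles + STA_ABS_CYCLES) * count

indexed_32 = cycles(LDA_ABS_X_CYCLES, 32)    # LDA nnnn,X \ STA $2007
stack_32 = cycles(PLA_CYCLES, 32)            # PLA \ STA $2007
immediate_32 = cycles(LDA_IMM_CYCLES, 32)    # LDA #xx \ STA $2007

# LDA #xx is 2 bytes of code and STA $2007 is 3 bytes.
immediate_code_bytes = (2 + 3) * 32
```

For 32 writes that works out to 256 cycles for either of the first two methods and 192 cycles for the immediate method, at the cost of 160 bytes of code in RAM.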

Posted: Tue Nov 08, 2011 10:57 am
by tokumaru
Kasumi wrote: The universal background color is set by the first byte of the first background palette. The first byte of the other background palettes isn't even displayed. Just like the first byte in each of the sprite palettes doesn't matter because they're always drawn transparent.
Actually, the first color of the first sprite palette will be your background color if you write all 32 bytes sequentially, because $3F10 is a mirror of $3F00. I'm too lazy to check, but I'm almost sure this is the case.
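
A rough Python model of that mirroring, assuming the usual palette RAM layout where $3F10/$3F14/$3F18/$3F1C fold onto $3F00/$3F04/$3F08/$3F0C. The function names are hypothetical, and the test buffer simply stores the value n in byte n so the result is easy to read.

```python
def palette_index(addr):
    """Map a PPU address in $3F00-$3FFF to a palette RAM cell,
    folding $3F10/$3F14/$3F18/$3F1C onto $3F00/$3F04/$3F08/$3F0C."""
    i = addr & 0x1F
    if i & 0x13 == 0x10:   # sprite palette entry 0 of any palette
        i &= 0x0F
    return i

def write_sequential(data, start=0x3F00):
    """Model writing `data` to $2007 with a +1 address increment."""
    pal = [None] * 32
    for offset, value in enumerate(data):
        pal[palette_index(start + offset)] = value
    return pal

buffer = list(range(32))          # byte n of the buffer holds the value n
pal = write_sequential(buffer)
background = pal[palette_index(0x3F00)]
```

In this model `background` ends up holding byte 16 of the buffer (the first sprite palette byte), not byte 0, because the later write lands on the same cell.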

BTW, I'm of the opinion that using 32 bytes to define the palette is a bit of a waste... I use only 25 (3 colors for each palette * 8 + the background color), and I repeat the background color for all the palettes. In this case, the stack code could look like this:

	pla
	tax ;keep a copy of the background color in X

	;repeat this 8 times
	stx $2007 ;color 0
	pla
	sta $2007 ;color 1
	pla
	sta $2007 ;color 2
	pla
	sta $2007 ;color 3
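
For anyone who wants to check the byte order, this is a small Python sketch of what the routine above sends to $2007 from a 25-byte buffer. `expand_palette` is a made-up name, and the layout (background color first, then 8 groups of 3 colors) is taken from the description above.

```python
def expand_palette(compact):
    """compact[0] is the shared background color, followed by 8 groups
    of 3 colors. Returns the 32 bytes written to $2007, in order."""
    if len(compact) != 25:
        raise ValueError("expected 25 bytes")
    background = compact[0]
    out = []
    for p in range(8):
        out.append(background)                       # color 0 (from X)
        out.extend(compact[1 + 3 * p : 4 + 3 * p])   # colors 1-3 (pulled)
    return out

# Example: $0F as the background color, then placeholder values 1..24.
writes = expand_palette([0x0F] + list(range(1, 25)))
```
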
EDIT: Heh, this talk about fast ways to write to VRAM gave me a crazy idea... Since I'm using CHR-RAM and my main character has lots of animations, I would like to update its patterns as fast as possible. I don't have any RAM to spare for long LDA #$XX / STA $2007 chains, but I might have the ROM to store the character's graphics that way. I know it's crazy to expand the graphics to 5 times the original size, but since only a portion of the game's graphics will be stored that way this might not be so bad.

Posted: Tue Nov 08, 2011 11:26 am
by Shiru
Since you don't really need to write data for the 4th, 8th and 12th palette entries ($3F04, $3F08 and $3F0C), you can skip them by reading $2007. I.e. four writes, read, three writes, repeat from read.

Posted: Tue Nov 08, 2011 12:24 pm
by Wave
Kasumi wrote: You don't need a 33rd byte for the background color. The universal background color is set by the first byte of the first background palette. The first byte of the other background palettes isn't even displayed. Just like the first byte in each of the sprite palettes doesn't matter because they're always drawn transparent.

I think I'm gonna pass on the 33rd write; I'll just make a label for the background color and write it directly.
I don't think I have a reason to disable NMI; I was thinking about the other interrupts (they could mess up the stack).
And thanks for the stack pointer tip, I didn't know that.
Dwedit wrote: There seem to be three different "very fast" ways of writing to PPU memory.

I read about the LDA nnnn,X \ STA $2007 method just after writing this post. What I don't understand is: why do you "need to increase X every once in a while to continue looping" to copy more than 32 bytes?

Posted: Tue Nov 08, 2011 12:44 pm
by Dwedit
If you have a loop unrolled 32 times, and want to copy more than 32 bytes, you add 32 to X every time it loops. (32 is just an example number, but it comes up a lot)

like this

;Y = number of times to copy 32 bytes
loop:
lda buffer,x
sta $2007
lda buffer+1,x
sta $2007
...
lda buffer+31,x
sta $2007
dey
beq exitLoop
txa
clc
adc #32
tax
jmp loop
exitLoop:

This lets you use the same code to copy any multiple of 32 bytes from any X within the buffer. When you've copied 32 bytes and want to continue, you add 32 to X and copy some more bytes out.
Of course, you don't need to do this if you're using the stack pull method, but you can only transfer bytes from the stack page.

Posted: Tue Nov 08, 2011 2:12 pm
by Kasumi
I was thinking about the other interrupts (could mess up stack).
I didn't think of them at the time. :oops: You can get around this "the poor man's way" by starting your stream at $0103 (or even higher up, if you think an interrupt will happen inside an interrupt inside your NMI). That way, if the absolute worst case happens and an IRQ occurs immediately after you transfer #$02 to the stack pointer, the program counter and processor status flags will be pushed to $0102-$0100, where they won't corrupt your stream, and the stack pointer wraps to $FF. Of course, the beginning of the data stream can still be corrupted, but that should only affect data you have already pulled from the stack. Unless I'm wrong again.
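
A quick Python model of that worst case, assuming standard 6502 interrupt entry (push PC high, PC low, then status, decrementing the stack pointer after each push). The helper names, the return address and the status value are all made up for illustration.

```python
def push(ram, sp, value):
    """Model a 6502 push: store at $0100+SP, then decrement SP (8-bit wrap)."""
    ram[0x0100 + sp] = value
    return (sp - 1) & 0xFF

def irq_entry(ram, sp, pc, status):
    """Model the three pushes done on IRQ entry; returns the new SP."""
    sp = push(ram, sp, pc >> 8)      # PC high byte
    sp = push(ram, sp, pc & 0xFF)    # PC low byte
    sp = push(ram, sp, status)       # processor status
    return sp

ram = [0] * 0x0800
stream = [0x20, 0x00, 0x01, 0x00, 0x55]   # placeholder stream bytes
for i, b in enumerate(stream):
    ram[0x0103 + i] = b                   # stream starts at $0103

# IRQ hits right after the stack pointer was set to $02.
sp_after = irq_entry(ram, 0x02, pc=0xC123, status=0x24)
```

In this model the three pushed bytes land in $0102-$0100, the stack pointer wraps to $FF, and the stream at $0103 onward is untouched.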
tokumaru wrote: Actually, the first color of the first sprite palette will be your background color... I'm too lazy to check, but I'm almost sure this is the case.
You're absolutely right, I just checked. Sorry for that bit of misinformation.

I think I finally understand the point of the 33rd write. It was to rewrite the background color after the byte from the sprite palette overwrote it. Writing to it separately isn't needed either, though. Tokumaru's code shows that.
BTW, I'm of the opinion that using 32 bytes to define the palette is a bit of a waste... I use only 25 (3 colors for each palette * 8 + the background color), and I repeat the background color for all the palettes.
Heh. I had even written this out in my post, but deleted it. It saves only 7 bytes, and 28 cycles. I'm often told I go too far with that sort of thing.

For reference, here's my NMI routine. It has two formats.

The first thing it does is pull a "check byte". If the check byte doesn't set the negative flag, it means the check byte is actually the high byte of an address to write a string of bytes to. It writes this to $2006, then it pulls the second part of the address and writes that to $2006. The next byte is the number of bytes to write. The next byte is whether the PPU should increment by 1 or 32. Following that are the actual data bytes to write to the PPU.

If the check byte does set the negative flag, it checks if the byte is equal to #$FF. The stream ends on a "check byte" that is #$FF.

If the negative flag is set but the byte is not #$FF, a "One byte per address" (OBPA) stream is starting. The check byte is not otherwise used for this type of stream. So it pulls another byte, which is the number of bytes to write minus one (we'll call it Z); 0 means there is one byte to write. The next Z+1 bytes are the bytes to write to the PPU. The next (Z+1)*2 bytes are the high and low bytes of the addresses the corresponding bytes need to be written to.

It has unrolled code for this type of stream.

OBPA mode is used for y attributes of course, but it could also be used for updating only a few palette colors or whatever else isn't sequential. It fails right now if you need to write more than 10 bytes; you have to add more obpa macro invocations (and jump table entries) to handle more than that.
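
As a cross-check of the format described above, here is a hedged Python sketch of a decoder for the same stream. It is only a model: `decode_stream` is a made-up name, the increment byte is assumed to be $00 for +1 and $04 for +32 (it gets ORed into the $2000 mirror), and a count of 0 is assumed to mean 256 because of the DEY/BNE loop.

```python
def decode_stream(stream):
    """Return the (ppu_address, value) writes described by `stream`."""
    writes = []
    pos = 0
    while True:
        check = stream[pos]
        pos += 1
        if check == 0xFF:                   # end-of-stream marker
            return writes
        if check & 0x80:                    # OBPA block
            count = stream[pos] + 1         # stored as count minus one
            pos += 1
            values = stream[pos:pos + count]
            pos += count
            for value in values:            # one (high, low) pair per byte
                addr = (stream[pos] << 8) | stream[pos + 1]
                pos += 2
                writes.append((addr, value))
        else:                               # sequential block
            addr = (check << 8) | stream[pos]
            count = stream[pos + 1] or 256  # assumed: 0 means 256
            step = 32 if stream[pos + 2] & 0x04 else 1
            pos += 3
            for value in stream[pos:pos + count]:
                writes.append((addr, value))
                addr = (addr + step) & 0x3FFF
            pos += count

example = [
    0x3F, 0x00, 0x02, 0x00, 0x0F, 0x30,               # 2 bytes at $3F00, +1
    0x80, 0x01, 0xAA, 0xBB, 0x23, 0xC0, 0x23, 0xC8,   # OBPA: 2 bytes
    0xFF,                                             # end of stream
]
writes = decode_stream(example)
```

Note that in an OBPA block all the data bytes come first and all the address pairs follow, which is why the real routine has to move the stack pointer past the data before it starts pulling addresses.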

Apologies in advance for the nesasm format.

;Note: The NMI jumps to the "NMI" label and NOT the "NMI.minus" label.
;ppustream is $0100.
NMI.minus:
	cmp #$FF
	beq spriteDMA.stackres
	
	pla;Loads the number of bytes to write (minus one)
	tay
	
	lda obpa.jmplow,y
	sta <nmiaddrlow
	
	lda obpa.jmphigh,y
	sta <nmiaddrhigh
	
	tsx;If ppustream,x is loaded, you'd get the number of bytes to write(minus one)
	
	txa;Since we need to add the number of bytes in the stream to the
	;current address to get the index location of the addresses
	
	tay;The current index location is needed for y

	
	sec;Adds one extra to make up for the one missing since the jmp
	adc PPUstream,x;But this still only gives us the index location of the
	;last byte in the byte stream
	;Since we didn't start from the first byte in the byte stream
	tax

	txs;But since the stack reads the NEXT byte, we don't need to add one.

	iny; y now contains the start of the byte stream
	
	jmp [nmiaddrlow];jumps to the unrolled loop
NMI:;2270 cycles?
	sta <nmia;Storing the registers so when this returns from the interrupt
	stx <nmix;A, X, and Y can be reloaded so the expected values will be there
	sty <nmiy;rather than the ones the nmi used
	
	;One should probably use the stack to back up A, X, and Y. I don't because... I don't.
	
	lda <safetiles;A flag that tells if the stream is fully written. If it's not
	bpl spriteDMA;We only sprite DMA
	
	tsx
	stx <nmistack
	ldx #$FF
	txs
	
	inx
	stx <safetiles
nmitileloopstart:
	pla
	bmi NMI.minus;High bit set: OBPA stream or the #$FF end marker
	sta $2006;High bit clear: it's the high byte of an address
	
	pla;Byte 2 of the address
	sta $2006
	
	pla;Number of Bytes to write
	tay
	
	lda <PPUmirror
	and #%11111011
	sta <PPUmirror
	
	pla
	ora <PPUmirror
	sta <PPUmirror
	sta $2000

nmitileloop:
	pla
	sta $2007
	dey
	bne nmitileloop
	beq nmitileloopstart
	
spriteDMA.stackres:
	ldx <nmistack
	txs

spriteDMA:
	ldy #$00	; Must be done before a sprite DMA
	sty $2003  ; Must be done before a sprite DMA

	lda #$07
	sta $4014
	
	;sta $401F;remove
	
	lda <PPUmirror
	and #%11111100
	sta $2000
	sta <PPUmirror
	
	lda <scrollxhigh
	and #%00000001
	beq nminametablexsetskip
	
	ora <PPUmirror
	sta <PPUmirror
	
nminametablexsetskip:

	lda <scrollyscreenhigh
	and #%00000001
	beq nminametableysetskip
	
	asl a
	
	ora <PPUmirror
	sta <PPUmirror
	
nminametableysetskip:

	lda <PPUmirror
	sta $2000
	
	lda <scrollxlow
	sta $2005
	
	lda <scrollyscreenlow
	sta $2005

	

	
	lda #$FF
	sta <vblank
	
	lda <nmia
	ldx <nmix
	ldy <nmiy
	
	rti
	
	.macro obpabody
	pla
	sta $2006
	
	pla
	sta $2006
	
	lda PPUstream,y
	sta $2007
	iny
	
	.endm
	
;obpa = one byte per address
obpa.10:
	obpabody
obpa.9:
	obpabody
obpa.8:
	obpabody
obpa.7:
	obpabody
obpa.6:
	obpabody
obpa.5:
	obpabody
obpa.4:
	obpabody
obpa.3:
	obpabody
obpa.2:
	obpabody
obpa.1:
	obpabody
NMIreturntostream:
	jmp nmitileloopstart
obpa.jmplow:
	.db low(obpa.1)
	.db low(obpa.2)
	.db low(obpa.3)
	.db low(obpa.4)
	.db low(obpa.5)
	.db low(obpa.6)
	.db low(obpa.7)
	.db low(obpa.8)
	.db low(obpa.9)
	.db low(obpa.10)
obpa.jmphigh:
	.db high(obpa.1)
	.db high(obpa.2)
	.db high(obpa.3)
	.db high(obpa.4)
	.db high(obpa.5)
	.db high(obpa.6)
	.db high(obpa.7)
	.db high(obpa.8)
	.db high(obpa.9)
	.db high(obpa.10)
There are ways to make it better I'm sure, like not changing how the PPU increments for every regular stream, or using the check byte for OBPA by anding out the high bit and using that to specify the number of bytes. I could also partially unroll the regular stream format. Still, I'm pretty happy with it right now. If any part of it is unclear or stupid, let me know. I didn't really clean it up for posting, but it does work and is fast enough to scroll 8 pixels in each direction in the same frame.