New technique for pushing video data faster

Discuss technical or other issues relating to programming the Nintendo Entertainment System, Famicom, or compatible systems.


slembcke
Posts: 172
Joined: Fri Nov 24, 2017 2:40 pm
Location: Minnesota

Re: New technique for pushing video data faster

Post by slembcke »

RTS Chaining, I think they meant this:
http://wiki.nesdev.com/w/index.php/RTS_Trick

I was experimenting with this stuff yesterday and had pretty good luck. Other than it being non-obvious, it's pretty easy to do. Basically you're just writing a 2-byte command token instead of just one. You can easily avoid interfering with the stack if you write to the very bottom. My terminator function restores the stack, so to execute the buffer I just write the terminator func and the current stack pointer, then reset the stack pointer to $FF and return. Voila! I like it.
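In case it helps anyone, here's roughly what I mean (a sketch from memory — label names are made up, and it assumes NMI can't fire mid-buffer):

Code: Select all

; buffer lives at the bottom of the stack page, $0100 and up:
;   <cmd-1, >cmd-1, args..., <cmd-1, >cmd-1, args..., terminator entry
run_buffer:
	ldy buf_len          ; next free byte in the buffer
	lda #<(terminator-1)
	sta $0100,y          ; append the terminator entry...
	lda #>(terminator-1)
	sta $0101,y
	tsx
	txa
	sta $0102,y          ; ...plus the current stack pointer
	ldx #$FF
	txs                  ; S wraps, so the first pull comes from $0100
	rts                  ; "return" into the first command

terminator:
	pla                  ; the saved stack pointer
	tax
	txs                  ; real stack restored
	rts                  ; back to run_buffer's caller

Each command routine PLAs its arguments and ends with RTS, which chains straight into the next entry.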
pubby wrote:One trick that can come in handy for stuff like this is using jsr to push addresses onto the stack. It's slightly faster than using lda+pha, especially if you're pushing multiple addresses in a row.
Say what? How does that work? Won't you end up jumping to the address you push? Not sure I understand.
Drakim
Posts: 85
Joined: Mon Apr 04, 2016 3:19 am

Re: New technique for pushing video data faster

Post by Drakim »

slembcke wrote:I was experimenting with this stuff yesterday and had pretty good luck. Other than it being non-obvious, it's pretty easy to do. Basically you're just writing a 2-byte command token instead of just one. You can easily avoid interfering with the stack if you write to the very bottom. My terminator function restores the stack, so to execute the buffer I just write the terminator func and the current stack pointer, then reset the stack pointer to $FF and return. Voila! I like it.
Yup, that's the beauty of it. If you add more stuff later in the frame you simply overwrite the terminator function and write a new terminator function at the new offset in the stack.

I've decided to use 127 bytes for my video stack so that I can use the BMI instruction to quickly check if the video stack pointer has become too large each time I try to push in more.

I've also been experimenting with making the terminator function's address symmetrical to speed it up further (something like $C0C0), since then you can just mindlessly fill the bottom of the stack in advance (with an unrolled loop), and not have to bother appending it after having written your "2 byte command token". It speeds up high-video-update situations but slows down low-video-update situations, which is generally a good tradeoff.
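To illustrate the pre-fill idea (addresses are hypothetical — the point is that the terminator must sit at an address like $C0C1, so that terminator-1 = $C0C0 and both bytes are equal):

Code: Select all

; assume the terminator is placed at $C0C1, so the value we
; store (terminator minus one) is $C0C0 - both bytes are $C0
prefill_stack:
	lda #$C0
	sta $0100      ; unrolled - every byte pair in the buffer
	sta $0101      ; now reads as a terminator entry
	sta $0102
	sta $0103
	; ...repeat for the rest of the 127-byte video stack
	rts

Wherever you stop appending commands, the next two bytes already point at the terminator.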

Another cool trick I've mentioned earlier is that you can jump an arbitrary length into any of your subroutines as a starting point, which can be used in very neat ways.
slembcke wrote:
pubby wrote:One trick that can come in handy for stuff like this is using jsr to push addresses onto the stack. It's slightly faster than using lda+pha, especially if you're pushing multiple addresses in a row.
Say what? How does that work? Won't you end up jumping to the address you push? Not sure I understand.
Yeah, I've been twisting my head trying to figure out how it would work, but I just can't see it. Does he somehow exit out of the routines with a JMP? But even if he did that it would be the wrong part of the stack that's filled up. :|
Drakim
Posts: 85
Joined: Mon Apr 04, 2016 3:19 am

Re: New technique for pushing video data faster

Post by Drakim »

Oziphantom wrote: Nah

Put your On X table at the bottom of the Stack so $100 + then you do a 1 byte stack dispatch

get X to have value x2
txs
rts

If all your code runs in 1 frame then the stack position should be fixed per frame, so you can always just restore the same point at the end of the NMI.
You are going to have to explain that again, I'm not able to follow. :oops:
If you go over a frame and interrupt yourself X_X
Now that's a pretty good point. I guess I should either make really sure that no interrupt can happen while my video routine does its work, or that interrupts are disabled while it works.
The other catch with RTS chaining is you can't use the stack, not really a problem if you are just doing speed code from an NMI, but as a general system it's not very practical
That's not completely true, if you leave yourself some room you can still use the faux video stack as a regular stack as long as you are careful. JSR, RTS, PHA and PLA will work just as expected as long as you don't push the stack past the $0/$FF point.
and making a self mod JSR chain(+6 per call) gives almost as much speed without the limitations.
It increases the static cost of each video "segment" from 6 to 12 cycles, and you lose the ability to cram in extra values for each video subroutine to pull out with PLA, and each "jump entry" in your queue takes 4 bytes of RAM instead of 2.

Not so sure I agree it's almost as good, it seems like an inferior technique (but with the added bonus of being able to place it anywhere in RAM, or even have ROM versions for cutscenes and the like where things are exactly the same)
Bregalad
Posts: 8036
Joined: Fri Nov 12, 2004 2:49 pm
Location: Caen, France

Re: New technique for pushing video data faster

Post by Bregalad »

Drakim wrote: While what you are saying is true in the strict technical sense, I'm not sure I agree with your overall point.
Well there's nothing to agree or disagree upon. I was just trying to explain why using the stack as a VRAM buffer is not so common even though we all agree it's a cool trick. My point was that while this trick is indeed very cool, it's neither the fastest nor the easiest. To which you respond, "but this trick is so cool" - I never said it wasn't!
But if you are making an actual real game, which sometimes has a screen full of text, sometimes scrolls horizontally, sometimes vertically, sometimes cycles colors for shiny treasures, sometimes prints one and one letter in dialog, I'm not so certain you could beat the RTS technique.
Doing it with normal buffers without any optimisation allows a bandwidth of approx. 80 bytes + sprite DMA, which is enough to update one line of nametable plus one line of attribute table plus palettes, or two lines of nametables plus attribute table but without palettes. With some partially unrolled loops or zero-page usage, it's easily possible to update two lines + attribute table + palettes + sprite DMA, with only light optimization. So the only application where further optimization is needed is for CHR-RAM updates, or maybe if updating a really large NT area in a single frame is needed.
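To spell out the arithmetic behind that figure (NTSC, numbers rounded):

Code: Select all

; vblank budget:           ~2273 CPU cycles (NTSC)
; sprite DMA:               -514
; left for VRAM updates:   ~1759
;
; a plain buffer loop (lda buf,y / sta $2007 / iny / bne)
; runs ~13-15 cycles per byte, so ~110-130 bytes in theory;
; subtract the per-update overhead of setting $2006 addresses
; and lengths and you land around the 80-byte mark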
tepples
Posts: 22345
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: New technique for pushing video data faster

Post by tepples »

Drakim wrote:Adding another subroutine is super trivial, you don't have to dig into popslide's source code to carefully figure out where to insert your new functionality. So I'd personally grade it easier and faster.
Popslide uses the same packet format as Super Mario Bros., which is described on both Data Crystal and NESdev Wiki. There are four packet types (horizontal literal, horizontal run, vertical literal, and vertical run). It's designed to be generic, where you can make your own subroutine that adds packets in a particular shape to the list without having to touch the updater code. The stuff in nstripe.s is mostly convenience for forming packets based on shapes that arose during the development of The Curse of Possum Hollow.

Having only one interpreter also simplifies the NMI handler, which is important to reduce the size of code in the fixed bank or the pseudo-fixed repeating area in a game using 32K switching.
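For reference, the packet header looks roughly like this (from memory — see the Data Crystal and NESdev Wiki pages for the authoritative layout):

Code: Select all

; byte 0    high byte of PPU address ($00 ends the list)
; byte 1    low byte of PPU address
; byte 2    bit 7: run (repeat one data byte) vs. literal
;           bit 6: vertical (+32 stride) vs. horizontal
;           bits 0-5: length in PPU writes
; byte 3+   data: 'length' bytes for a literal, 1 byte for a run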
pubby
Posts: 555
Joined: Thu Mar 31, 2016 11:15 am

Re: New technique for pushing video data faster

Post by pubby »

Drakim wrote:Yeah, I've been twisting my head trying to figure out how it would work, but I just can't see it. Does he somehow exit out of the routines with a JMP? But even if he did that it would be the wrong part of the stack that's filled up. :|
It's just a jmp to a jsr which jumps back, and is only useful in a few very specific situations. I did use it in my music engine though.

Code: Select all

    jmp foo
continue:

; ...

foo:
    jsr continue
    ; some code
    rts
Rahsennor
Posts: 476
Joined: Thu Aug 20, 2015 3:09 am

Re: New technique for pushing video data faster

Post by Rahsennor »

FrankenGraphics wrote:You rang?
Sorry...
Drakim wrote:Are you sure you got it off the wiki? I can't find a single instance of the phrase "RTS chaining" or even "chaining" on the wiki. Heck, I can't even find "RTS chaining" related to programming anywhere on the web. Are you sure that's the proper name for the technique, and that you aren't just mixing it up with the "RTS Trick"? (or is my google-fu too weak?)

Edit: Oh wait, I read that wrong, you called it RTS chaining. Any idea what it was called in the wiki?
I can't find it on the wiki now either. My devlog notes the technique on 2016-05-18 as "RTS-display-list-on-stack" and then never mentions it again, but it's definitely right there in the source:

Code: Select all

.org $C000

bank .byte 0,1,2,3,4,5,6,7 ; UNROM conflict-avoidance table

irq:
reset:
	lda #0
	sta $2000  ; disable NMI
	sta $2001  ; disable rendering
	sta bank+0 ; switch in bank 0
	jmp init


.dsb $FF00 - *

; copy 0-32 bytes to the PPU
nmi_copy:
	nop ; offset for RTS (I'm lazy)
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20 ; PLA/STA x32
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20 ; 4 bytes each
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
.byte $68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20,$68,$8D,$07,$20
	pla
	sta $2006
	pla
	sta $2006
	rts

; write 32 zeroes to the PPU
nmi_zero:
	nop
	ldy #8
	lda #0
 nmi_zero1:
	sta $2007
	sta $2007
	sta $2007
	sta $2007
	dey
	bne nmi_zero1
	pla
	sta $2006
	pla
	sta $2006
	rts

; close the RTS chain
nmi_end:
	nop

	lda scroll_x
	sta $2005
	lda scroll_y
	sta $2005
	lda #$88     ; TODO: nametable select
	sta $2000

	ldx nmi_s
	txs
	ldy nmi_y
	ldx nmi_x
	lda nmi_a
nmi_abort:
	rti

nmi:
	lsr nmi_flag
	bcc nmi_abort
	sta nmi_a
	stx nmi_x
	sty nmi_y
	tsx
	stx nmi_s
	lda #>oam
	sta $4014
	ldx #$FF
	txs
	pla
	sta $2006
	pla
	sta $2006
	rts

.dsb $FFFA - *       ; pad to interrupt vectors

.word nmi
.word reset
.word irq
I also used this method on PC a fair bit, back when I was doing assembler coding for DOS and so on, so maybe I mentally inserted it into the "RTS Trick" page as a matter of course?

EDIT: I also saw Tokumaru's thread the day it was posted, so I may have got it from there. My treatment of $2006 is identical so it's likely that was it.
Drakim
Posts: 85
Joined: Mon Apr 04, 2016 3:19 am

Re: New technique for pushing video data faster

Post by Drakim »

Bregalad wrote:Well there's nothing to agree or disagree upon. I was just trying to explain why using the stack as a VRAM buffer is not so common even though we all agree it's a cool trick. My point was that while this trick is indeed very cool, it's neither the fastest nor the easiest. To which you respond, "but this trick is so cool" - I never said it wasn't!
Ah, my bad, sorry!
Doing it with normal buffers without any optimisation allows a bandwidth of approx. 80 bytes + sprite DMA, which is enough to update one line of nametable plus one line of attribute table plus palettes, or two lines of nametables plus attribute table but without palettes. With some partially unrolled loops or zero-page usage, it's easily possible to update two lines + attribute table + palettes + sprite DMA, with only light optimization. So the only application where further optimization is needed is for CHR-RAM updates, or maybe if updating a really large NT area in a single frame is needed.
Thanks, that's a pretty good point. It didn't occur to me that a game's scope might allow it to push everything to VRAM on every frame without hitting the ceiling; my mental idea of how much you could push was much, much lower.
pubby wrote:It's just a jmp to a jsr which jumps back, and is only useful in a few very specific situations. I did use it in my music engine though.
Ah, I see. That does indeed shave off 1 cycle (10 vs 9).

If you REALLY wanna go crazy on the savings you can do LDA, PHA, PHA if the subroutine is placed on an address where it's high and low byte are identical, then you have it down to 8 cycles. :D
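In other words (address is hypothetical — the routine just has to start one past a "symmetrical" address):

Code: Select all

; works if the routine starts at e.g. $C0C1, so that the
; pushed address (target minus one) is $C0C0
	lda #$C0
	pha            ; high byte
	pha            ; low byte - 2+3+3 = 8 cycles total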
tokumaru
Posts: 12106
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: New technique for pushing video data faster

Post by tokumaru »

I definitely posted a method exactly like this a while back.
Oziphantom
Posts: 1163
Joined: Tue Feb 07, 2017 2:03 am

Re: New technique for pushing video data faster

Post by Oziphantom »

Drakim wrote:
Oziphantom wrote: Nah

Put your On X table at the bottom of the Stack so $100 + then you do a 1 byte stack dispatch

get X to have value x2
txs
rts

If all your code runs in 1 frame then the stack position should be fixed per frame, so you can always just restore the same point at the end of the NMI.
You are going to have to explain that again, I'm not able to follow. :oops:
It's a one-byte dispatch method, i.e. your $100 looks like
$100 <MyFunc-1
$101 >MyFunc-1
$102 <OtherFunc-1
$103 >OtherFunc-1

then if you want to dispatch OtherFunc you do
LDX #2
TXS
RTS
however, normally you would not load X with an immediate value but from some "on X" variable.
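Sketched out (labels invented; note that RTS pulls from S+1, so in practice the table starts one byte up, at $0101):

Code: Select all

; $0101 <MyFunc-1
; $0102 >MyFunc-1
; $0103 <OtherFunc-1
; $0104 >OtherFunc-1
	ldx action     ; 0 dispatches MyFunc, 2 dispatches OtherFunc
	txs            ; S = index; first pull comes from $0101+index
	rts            ; jumps into the selected routine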
The other catch with RTS chaining is you can't use the stack, not really a problem if you are just doing speed code from an NMI, but as a general system it's not very practical
That's not completely true, if you leave yourself some room you can still use the faux video stack as a regular stack as long as you are careful. JSR, RTS, PHA and PLA will work just as expected as long as you don't push the stack past the $0/$FF point.
and making a self mod JSR chain(+6 per call) gives almost as much speed without the limitations.
It increases the static cost of each video "segment" from 6 to 12 cycles, and you lose the ability to cram in extra values for each video subroutine to pull out with PLA, and each "jump entry" in your queue takes 4 bytes of RAM instead of 2.

Not so sure I agree it's almost as good, it seems like an inferior technique (but with the added bonus of being able to place it anywhere in RAM, or even have ROM versions for cutscenes and the like where things are exactly the same)
I was imagining that you would have a mostly fixed "display list", since things move but don't get added to or removed from the screen that often. Thus you save time by not regenerating the whole list each frame. If you are regenerating it each frame then sure, any other stack use will trash stuff, but you don't care as you are going to rebuild it before the next NMI. However, if you FRAME out and don't get all the stack rebuilt before the NMI and some of your other systems have "trashed" the display list X_X
Drakim
Posts: 85
Joined: Mon Apr 04, 2016 3:19 am

Re: New technique for pushing video data faster

Post by Drakim »

tokumaru wrote:I definitely posted a method exactly like this a while back.
Indeed you did, so I'm calling it tokumaru's video stack from now on, if you don't mind :D

In your thread you asked for improvements, and there are a couple:

1. Your stack format is 2 bytes for the desired vram address, and then 2 bytes for the desired subroutine, chaining that for as much as you need, and then ending with 2 bytes for a fake vram address $0000 and then 2 bytes for the terminator function that restores the stack.

You can improve this by putting the vram address AFTER the desired subroutine address in your stack, and letting the subroutine itself PLA out the vram address and set it. Why? Because we can create hundreds of different subroutines if we desire, to do specialized tasks. For instance, we can make one to change the background fill color in the PPU, which is always $3F00, so it's better to write LDA #$3F, LDA #$00 than to use PLA, PLA. Not only is it faster cyclewise during the vblank, but it uses up less of the faux stack.

But even more important than that....

2. To loop back and continue to the next subroutine, you use "JMP UpdateVRAM -> PLAx2 vram address -> RTS", but now that the vram address isn't in the way you can drop all that and just use a normal "RTS" at the end of each subroutine, and it will magically jump to the next subroutine! This shaves off a LOT of cycles and simplifies the code as well.
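To illustrate both points with a concrete example (routine name invented), here's a specialized routine that hardcodes its own address and ends with a bare RTS:

Code: Select all

set_bg_color:
	lda #$3F       ; hardcoded $3F00 - no PLAs for the address
	sta $2006
	lda #$00
	sta $2006
	pla            ; the color, queued right after this routine's address
	sta $2007
	rts            ; chains straight into the next queued routine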
tokumaru
Posts: 12106
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: New technique for pushing video data faster

Post by tokumaru »

Drakim wrote:Indeed you did, so I'm calling it tokumaru's video stack from now on, if you don't mind :D
If this is supposed to be immortalized in the wiki or something I don't mind being credited somewhere, but having it named after me is a bit too much IMO. :mrgreen:
1. Your stack format is 2 bytes for the desired vram address, and then 2 bytes for the desired subroutine, chaining that for as much as you need, and then ending with 2 bytes for a fake vram address $0000 and then 2 bytes for the terminator function that restores the stack.

You can improve this by putting the vram address AFTER the desired subroutine address in your stack, and letting the subroutine itself PLA out the vram address and set it.
The reason I did it like that was so I could reuse the same 32x unrolled loop of PLA+STA to transfer any number of bytes from 1 to 32, by simply JSRing to the middle of the sequence. If I were to set the address inside the routine, I'd need 32 separate routines for the same functionality without any loss of performance.
Why? Because we can create hundreds of different subroutines if we desire, to do specialized tasks.
I did end up ditching the generic approach in favor of completely specialized routines myself, but aside from the rare instances where you'd use hardcoded addresses, my old method could do specialized routines too, which could set other VRAM addresses if needed.
2. To loop back and continue to the next subroutine, you use "JMP UpdateVRAM -> PLAx2 vram address -> RTS", but now that the vram address isn't in the way you can drop all that and just use a normal "RTS" at the end of each subroutine, and it will magically jump to the next subroutine! This shaves off a LOT of cycles and simplifies the code as well.
Sure, that's much cleaner and straightforward, but like I said, setting the VRAM address before calling the next routine is merely a ROM saving feature so that the same generic transfer routine can deal with data lengths from 1 to 32. The superfluous address setting at the end of the chain will waste a few cycles, but that's all.

I wasn't 100% satisfied with that solution either, but someone looking for a completely generic VRAM update system of "copy X bytes from the stack to PPU address $XXXX" that doesn't have much ROM space might want to go with that.

I ended up switching to completely specialized update routines, so I have complete freedom to store the data in whatever format is more convenient for each task (many times outside of the stack, even), but I still have the RTS directly call the next update routine.
Drakim
Posts: 85
Joined: Mon Apr 04, 2016 3:19 am

Re: New technique for pushing video data faster

Post by Drakim »

tokumaru wrote:The reason I did it like that was so I could reuse the same 32x unrolled loop of PLA+STA to transfer any number of bytes from 1 to 32, by simply JSRing to the middle of the sequence. If I were to set the address inside the routine, I'd need 32 separate routines for the same functionality without any loss of performance.
You can solve this by first queuing up a specialized "set address" subroutine, and then queuing up the unrolled loop subroutine right after (plus the offset to get the desired amount of loops). It adds 6 cycles (the extra RTS) to do this, but just the unnecessary JMP statement at the end of the old solution costs 3 cycles, so this technique's unrolled loop only really costs 6 - 3 = 3 cycles more. As long as you have a few more video operations queued up afterwards, you end up using fewer cycles, by avoiding the extra JMP each time.
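In other words, the queue for a partial unrolled copy would look like this (labels invented):

Code: Select all

; queued, in pull order:
;   <set_address-1, >set_address-1, addr_hi, addr_lo,
;   <(copy32-1+offset), >(copy32-1+offset), data bytes...
set_address:
	pla            ; high byte first, as $2006 expects
	sta $2006
	pla
	sta $2006
	rts            ; 6 cycles - chains into the unrolled copy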
tokumaru
Posts: 12106
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: New technique for pushing video data faster

Post by tokumaru »

But the extra JMP and the bogus VRAM address are used only once, and during a not-so-precious time (i.e. they can safely take place during the pre-render scanline), whereas an extra 6 cycles per update routine call when CPU time is at a premium (i.e. vblank) is much worse IMO.

Note that I'm not arguing that this is the best possible solution, just that there might be demand out there for all of these solutions. The VRAM address presetting is for those who're looking for a fast generic vblank handler but don't have much ROM to spare (e.g. NROM projects with more VRAM updates than typical).
Drakim
Posts: 85
Joined: Mon Apr 04, 2016 3:19 am

Re: New technique for pushing video data faster

Post by Drakim »

I think I'm misunderstanding you, or you are misunderstanding me. :wink:

Here is the gist of what I understood you to be saying:
We have to arrange the stack so that it's [videoaddress] first and [subroutineaddress-1] afterwards. When it's vblank time, we pull out the [videoaddress], use it to set $2006 correctly, and then we use RTS to jump to our [subroutineaddress-1], which now begins pushing data into $2007. When it's done, it calls JMP UpdateVRAM to begin the process again.

The reason we cannot do the opposite order, beginning by using RTS to jump to [subroutineaddress-1], and then have the subroutine itself pull the [videoaddress] and set $2006 correctly, and then begin pushing data into $2007, is because the subroutine might be an unrolled loop, and we might want to jump to [subroutineaddress-1+60] to skip a bunch of the unrolled loop.

If we did that, we'd be jumping past the part of the subroutine that pulls out the [videoaddress] and set $2006 correctly. Therefore that part cannot be inside the subroutine.
And here is the gist of what I'm saying:
To solve this issue, we simply have to avoid setting the address in the unrolled loop's subroutine. We do this by setting the stack like this: [subroutineSetAddress-1][videoaddress][subroutineUnrolledLoop-1+60]

What happens now is that our first RTS jumps us to subroutineSetAddress, which uses two PLA to set $2006 correctly, and then calls RTS, which jumps us to subroutineUnrolledLoop (plus the offset!) with the address already set.

This means we are using an additional RTS instruction for unrolled loops, which costs 6 more cycles, but we save 3 cycles for every single video subroutine by avoiding the "JMP UpdateVRAM", which means it more than makes up for it, and we end up using fewer cycles in the long run.
Edit: Actually, re-reading a couple of times, what do you mean "the extra JMP and the bogus VRAM address are used only once"? How does that work? What exactly do you do to end your video subroutines if you are only using the JMP once during the entire sequence? Don't you have to do it once per subroutine?