Stack juggling? Is this dangerous?

Discuss technical or other issues relating to programming the Nintendo Entertainment System, Famicom, or compatible systems. See the NESdev wiki for more information.

Moderator: Moderators

Post Reply
User avatar
za909
Posts: 249
Joined: Fri Jan 24, 2014 9:05 am
Location: Mijn hart woont al in Nederland

Stack juggling? Is this dangerous?

Post by za909 »

Hi, the first question is, does the thing I'm doing have a better name?

I have designed a PPU update buffer structure for my game in the $0100 page, and as I was starting to work on the NMI side of this system I realized that I need to suppress interrupts while I am working with another stack. Or do I?

Initially I thought the page would be divided up neatly like this:

Code: Select all

$0100-$017F for the nametable update buffer (S is set to $FF and values are pulled into A)
$0180-$01FF for the normal stack
But then I thought I could "cheat the system" and allow IRQs to get through if I offset the start of the nametable buffer enough so that the information that gets pushed in the IRQ routine can never cause an accidental stack underflow and corrupt the bottom of the normal stack. So in the end my page 1 structure looks like this:

Code: Select all

$0100-$0103 safe zone for IRQs (the IRQ routine only pushes A to the stack in addition to the mandatory P and return address)
$0104-$017F for the nametable update buffer (S is set to $03 and values are pulled into A)
$0180-$01FF for the normal stack
So any IRQ that happens while I am pulling data from the buffer will only overwrite the safety area or data that I have already accessed and don't need anymore. The buffer will be filled starting from $0104 again next time.

This seems like an unusual method of using the stack and I'm wondering how dangerous, error prone, or common it is. Is there some established name for what I'm doing? Thanks!
User avatar
Dwedit
Posts: 4924
Joined: Fri Nov 19, 2004 7:35 pm
Contact:

Re: Stack juggling? Is this dangerous?

Post by Dwedit »

Why do you need interrupts during the time when NMI is handled?
Perhaps it's for synchronizing the DMC IRQ, or something that involves a mapper timer and audio, but otherwise you probably won't need to handle them.

Also PLA isn't all that fast compared to LDA abs,X, they're both 4 cycle instructions. You save code size by using PLA in your unrolled VRAM update code, but it's probably not significantly faster.
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: Stack juggling? Is this dangerous?

Post by tepples »

A term I've seen used is "stack copy". I coined the name "popslide" for a stack copy that uses a computed jump into the middle of an unrolled loop, on analogy with "NOP slide" and "clockslide".

In the case of my own Popslide library, I need the NMI handler to keep time during blank screen transitions, which means I need to leave enough space at the top of the stack for the NMI handler to push registers. That's why I have it use $0108-$01BF for the update buffer and $01C0-$01FF for the stack proper.
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: Stack juggling? Is this dangerous?

Post by tokumaru »

My VRAM update system is similar to what you're describing. I'm setting aside 212 bytes of stack space for my update buffer, and I use it not only for data but also for the addresses (minus 1) of the subroutines that will copy the data. To kick off the updates, I simply place the stack pointer at the bottom of the VRAM update buffer and do an RTS to execute the first update. Each update ends with an RTS, which calls the next update. At the top of the buffer there's the address (minus 1) of the code to execute after all updates are done, which is where the stack pointer is restored to it's original value.

I normally don't bother with the stack being disturbed by interrupts during VRAM updates, because I don't use stack instructions to fill the update buffer, only to consume it, something that's normally done in the NMI handler, so the CPU's I flag prevents any IRQs from being serviced.

I do occasionally have to use the VRAM update system outside of the NMI handler though, and in those cases it is indeed possible for an IRQ to interrupt the update. My original solution was the same as yours - adding a few bytes of padding at the beginning of the buffer - but later I realized that I'd never need all 212 bytes in these situations, so whenever I need to use this system in the main thread I just skip a number of bytes at the beginning of the buffer (enough for all of my IRQ handlers to function correctly) before loading it with the actual update data, and then I start the updates from that offset location instead of from the beginning of the buffer.
User avatar
Jarhmander
Formerly ~J-@D!~
Posts: 569
Joined: Sun Mar 12, 2006 12:36 am
Location: Rive nord de Montréal

Re: Stack juggling? Is this dangerous?

Post by Jarhmander »

Dwedit wrote: Wed Jun 07, 2023 9:33 am Also PLA isn't all that fast compared to LDA abs,X, they're both 4 cycle instructions. You save code size by using PLA in your unrolled VRAM update code, but it's probably not significantly faster.
Yeah, but PLA increment S, so the next element is accessible with the next PLA; with LDA abs, X you have to increment X, which is 2 more cycles.
((λ (x) (x x)) (λ (x) (x x)))
lidnariq
Posts: 11432
Joined: Sun Apr 13, 2008 11:12 am

Re: Stack juggling? Is this dangerous?

Post by lidnariq »

You have the same unrolled code in both cases, either
pla / sta $2007 / pla / sta $2007 / pla / sta $2007
or
lda addr,x / sta $2007 / lda addr+1,x / sta $2007 / lda addr+2 / sta $2007
so that you don't need the inx
User avatar
Jarhmander
Formerly ~J-@D!~
Posts: 569
Joined: Sun Mar 12, 2006 12:36 am
Location: Rive nord de Montréal

Re: Stack juggling? Is this dangerous?

Post by Jarhmander »

Wait, what X is for? Doesn't it need to be adjusted in the loop at some point? Or... The unrolled loop is big enough not to loop?.. hmm...
((λ (x) (x x)) (λ (x) (x x)))
lidnariq
Posts: 11432
Joined: Sun Apr 13, 2008 11:12 am

Re: Stack juggling? Is this dangerous?

Post by lidnariq »

X is so that you can choose an arbitrary starting point inside the buffer while holding the transfer length in the program counter. Same as S and PC in the stack-based version of the same.
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: Stack juggling? Is this dangerous?

Post by tokumaru »

LDA ABS, X is still significant less convenient than PLA, since you have to calculate the values of X that you're gonna need (using CPU time), store them somewhere (using memory), and finally spend time setting up X before each transfer, slightly slowing down your transfers.
lidnariq
Posts: 11432
Joined: Sun Apr 13, 2008 11:12 am

Re: Stack juggling? Is this dangerous?

Post by lidnariq »

It's the exact same as pla.

In the other case you have to "calculate the values of S that you're going to need (using CPU time), store them somewhere (using memory), and finally spend time setting up S before each transfer, slightly slowing down your transfers".
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: Stack juggling? Is this dangerous?

Post by tokumaru »

I don't have to calculate anything for S. The buffer is always read from the beginning, so I can just set S to a (constant) value for the first transfer - from then on, S will be automatically incremented after each byte read until all transfers are over.

As for filling the buffer, I only need one variable to keep track of what the next free position in the buffer is, which I increment by the number of bytes I'm reserving for each transfer.

The ABS, X is way trickier to set up. Let's say that my buffer is 128 bytes long, and, to simplify things, that I can transfer at most 4 bytes at a time (in an actual game you'd most likely be able to do 32), meaning I have the following unrolled loop:

Code: Select all

lda buffer+0, x
sta $2007
lda buffer+1, x
sta $2007
lda buffer+2, x
sta $2007
lda buffer+3, x
Calculating a value for X is easy if you're copying exactly 4 bytes, since it'll just be the offset of the data you're copying. But if you only want to copy, say, 2 bytes, you're gonna be jumping to the middle of that unrolled loop, where the base address used is buffer+2, making it impossible to access the beginning of the list.

To fix that, we actually have to write the unrolled loop using addresses lower than where the buffer actually is, so that we can access the beginning of the buffer even when jumping to the middle of the unrolled loop:

Code: Select all

lda buffer-128, x
sta $2007
lda buffer-127, x
sta $2007
lda buffer-126, x
sta $2007
lda buffer-125, x
We can now use the following formula to calculate the values of X needed for each transfer: 128 + position - length. In the case of the previous example (transferring 2 bytes from the beginning of the buffer), X would have to be 128 + 0 - 2 = 126. When the lda buffer-126, x instruction is executed, it will load the first byte in the buffer, as expected, and lda buffer-125, x will read the second byte.

Unless I'm missing something really obvious here, if you want to use an ABS, X unrolled loop without increments, compares or branches, you absolutely need to calculate the values of X based on the position and the length of each block of data you're transferring, and also store these values somewhere and load them back for the actual transfers, slightly slowing the process down compared to the stack-based approach.
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: Stack juggling? Is this dangerous?

Post by tokumaru »

The difference is that PLA uses a global index (the S register) to read from the buffer, so it doesn't matter in which order the transfers are or how long they are, you're always gonna grab the next buffered byte no matter what.

With ABS, X instructions, you're simulating an auto-incrementing index by using instructions with increasing base addresses, but that's relative to the previous instruction only, and doesn't carry over to the next transfer.

The ABS, X method is a valid technique for sure, but the PLA method is more straightforward and has the potential to be slightly faster.
Drag
Posts: 1615
Joined: Mon Sep 27, 2004 2:57 pm
Contact:

Re: Stack juggling? Is this dangerous?

Post by Drag »

This is all describing Duff's Device.

The way I use LDA ABS,X for it is:

Code: Select all

Y = Length of transfer
X = Buffer position
N = Size of unrolled loop

WHILE (Y >= N)
    LDA ppubuffer,X
    STA $2007
    LDA ppubuffer+1,X
    STA $2007
    ...
    LDA ppubuffer+(N-1),X
    STA $2007
    Y = Y-N
    X = X+N
END WHILE
WHILE (Y > 0)
    LDA ppubuffer,X
    STA $2007
    X++
    Y--
END WHILE
I actually grabbed this technique from StarTropics 2, and it works well enough even though there's more optimal ways to do this.
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: Stack juggling? Is this dangerous?

Post by tokumaru »

As far as efficiency goes, I've accepted the 8 cycles per byte cap as the maximum that the NES can reasonably do. The way I do it, every byte in the buffer does indeed cost 8 cycles to consume, including the VRAM addresses and transfer code addresses, so each transfer reliably uses 4 + length bytes and executes in (4 + length) * 8 cycles, which is pretty straightforward to manage. I made a point to not add any extra overhead between transfers in order to keep the timing consistent regardless of how many transfers there are.

To do any better than 8 cycles per byte, one would have to either use ZP for the buffer and somehow read from arbitrary parts of it without using indexed addressing (7 cycles per byte), or use self modifiable code, at the cost of expanding the data by a factor of 5 and having it laid out in a very inconvenient way (6 cycles per byte).
Post Reply