I don't have to calculate anything for S. The buffer is always read from the beginning, so I can just set S to a (constant) value for the first transfer - from then on, S will be automatically incremented after each byte read until all transfers are over.
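For reference, here's a minimal sketch of what the stack-based read looks like. It assumes the buffer sits in page $01 at a hypothetical label stack_buffer, and that saved_s is a spare zero-page byte (both names are made up for illustration). Nothing may use the stack (JSR, interrupts) while S is pointing at the data:

```asm
; Sketch only: stack_buffer and saved_s are hypothetical names.
  tsx
  stx saved_s              ; remember the real stack pointer
  ldx #<(stack_buffer - 1) ; PLA increments S before it reads
  txs
  pla                      ; reads stack_buffer+0
  sta $2007
  pla                      ; reads stack_buffer+1
  sta $2007
  ; ...more PLA/STA pairs; S simply keeps advancing across transfers...
  ldx saved_s
  txs                      ; restore the real stack pointer
```

S is only set once, before the first byte, which is why no per-transfer value has to be calculated.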
As for filling the buffer, I only need one variable to keep track of what the next free position in the buffer is, which I increment by the number of bytes I'm reserving for each transfer.
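That bookkeeping could look something like the sketch below. The routine name reserve and the variable buffer_pos are hypothetical, and there's no overflow check, so the caller is assumed to know the data fits:

```asm
; Sketch only: reserve and buffer_pos are hypothetical names.
; In:  A = number of bytes to reserve
; Out: Y = offset of the first reserved byte
reserve:
  ldy buffer_pos   ; Y = start of the block being reserved
  clc
  adc buffer_pos   ; A = start + length
  sta buffer_pos   ; next free position for the following transfer
  rts
```

The caller then writes its data to buffer+0, y onwards.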
The ABS, X approach is way trickier to set up. Let's say that my buffer is 128 bytes long and, to simplify things, that I can transfer at most 4 bytes at a time (in an actual game you'd most likely be able to do 32), meaning I have the following unrolled loop:
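lda buffer+0, x ; entry point for a 4-byte transfer
sta $2007
lda buffer+1, x ; entry point for a 3-byte transfer
sta $2007
lda buffer+2, x ; entry point for a 2-byte transfer
sta $2007
lda buffer+3, x ; entry point for a 1-byte transfer
sta $2007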
Calculating a value for X is easy if you're copying exactly 4 bytes, since it'll just be the offset of the data you're copying. But if you only want to copy, say, 2 bytes, you're gonna be jumping to the middle of that unrolled loop, where the base address used is buffer+2. Since X can't be negative, that makes it impossible to access the beginning of the buffer.
To fix that, we actually have to write the unrolled loop using addresses lower than where the buffer actually is, so that we can access the beginning of the buffer even when jumping to the middle of the unrolled loop:
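lda buffer-128, x ; entry point for a 4-byte transfer
sta $2007
lda buffer-127, x ; entry point for a 3-byte transfer
sta $2007
lda buffer-126, x ; entry point for a 2-byte transfer
sta $2007
lda buffer-125, x ; entry point for a 1-byte transfer
sta $2007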
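We can now use the following formula to calculate the values of X needed for each transfer: 128 + position - (4 - length), where 4 is the number of entries in the unrolled loop and 4 - length is how many of them get skipped when jumping in. In the case of the previous example (transferring 2 bytes from the beginning of the buffer), X would have to be 128 + 0 - (4 - 2) = 126. When the lda buffer-126, x instruction is executed, it will load the first byte in the buffer, as expected, and lda buffer-125, x will read the second byte. A full 4-byte transfer enters at the top with X = 128 + position, so lda buffer-128, x reads buffer + position.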
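To spell out where X comes from for the 4-entry loop above: entering at the point that leaves length loads to run skips 4 - length entries, so the first load reads buffer - 128 + (4 - length) + X. Setting that equal to buffer + position gives X = 124 + position + length (126 for the 2-byte example). A rough sketch of that calculation, where position, length and transfer_x are hypothetical variable names:

```asm
; Sketch only: position, length and transfer_x are hypothetical names.
MAX_LEN = 4                ; entries in the unrolled loop

  lda position
  clc
  adc length               ; position + length <= 128, so carry stays clear
  adc #(128 - MAX_LEN)     ; X = 124 + position + length
  sta transfer_x           ; saved for the actual transfer
```

With a 128-byte buffer the result never exceeds 252, so it always fits in X.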
Unless I'm missing something really obvious here, if you want to use an ABS, X unrolled loop without increments, compares or branches, you absolutely need to calculate the values of X based on the position and the length of each block of data you're transferring, and also store these values somewhere and load them back for the actual transfers, slightly slowing the process down compared to the stack-based approach.