ppunroll - optimize PPU writes

Discuss technical or other issues relating to programming the Nintendo Entertainment System, Famicom, or compatible systems.

Moderator: Moderators

Post Reply
User avatar
miau
Posts: 189
Joined: Mon Nov 06, 2006 9:34 am
Location: Potsdam, Germany
Contact:

ppunroll - optimize PPU writes

Post by miau »

Hey there,

today I'll share a little command line tool, probably most useful if your project uses CHR RAM.

It does this:

Code: Select all

;data.chr
00 00 00 00 BB BB BC 00

Code: Select all

C:\>ppunroll data.chr
ldx #$00
stx $2007 ;00
stx $2007 ;00
stx $2007 ;00
stx $2007 ;00
ldy #$BB
sty $2007 ;BB
sty $2007 ;BB
iny
sty $2007 ;BC
stx $2007 ;00
;-O7
;size:29
;cycles:38
Personally, I use it for the animation of the main character in Banana Nana, freeing up a substantial amount of space for enemies in the sprite pattern table - while still leaving enough vblank time for animated background tiles.
ROM space does not seem to be much of an issue for homebrews nowadays, so why not!

Another use case: Let's say you have 70 cycles left in vblank...

Code: Select all

C:\>ppunroll -Oc 70 data.chr
ldx #$00
jsr stx4  ;00 00 00 00
ldy #$BB
jsr sty2  ;BB BB
iny
sty $2007 ;BC
stx $2007 ;00
;-O1
;size:17
;cycles:62
The tool automatically chooses a lower optimization level in this case, which stays below 70 cycles, but generates smaller code (see subroutines.asm in the zip archive).

Be warned, the output may often not be optimal, but it should be usable at least. I'll see if I can improve the algorithm one of these days.

Let me know if you find any bugs.

ATTENTION: Check out russelsprouts's post below for his tool, which should generate better output!
Attachments
ppunroll-1.0.zip
source and windows binary
(36.14 KiB) Downloaded 108 times
Last edited by miau on Sun Feb 14, 2016 1:12 pm, edited 3 times in total.
User avatar
thefox
Posts: 3139
Joined: Mon Jan 03, 2005 10:36 am
Location: Tampere, Finland
Contact:

Re: ppunroll - optimize PPU writes

Post by thefox »

miau wrote:

Code: Select all

;data.chr
00 00 00 BB BB AA BC 00
Looks like you made a mistake here (doesn't match output).
miau wrote:Download:
ppunroll-1.0 (C++ source and windows binary)
Might be better to use attachments for posterity.
Download STREEMERZ for NES from fauxgame.com! — Some other stuff I've done: fo.aspekt.fi
User avatar
miau
Posts: 189
Joined: Mon Nov 06, 2006 9:34 am
Location: Potsdam, Germany
Contact:

Re: ppunroll - optimize PPU writes

Post by miau »

Of course that had to happen! :)
Fixed and attached file instead.
User avatar
tokumaru
Posts: 12106
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: ppunroll - optimize PPU writes

Post by tokumaru »

Ah, I started working on this idea a few years back (thread), but never released a proper converter. Recently I even worked on a new encoder, which attempted to use other instructions (ASL, LSR, ROR, ROL, INX, DEX, INY, DEY) to save a little space, but the results were less than exciting so I dropped it. I'm still considering using this for animating the player's sprite in my current project, but I'm trying more conventional approaches first.
User avatar
miau
Posts: 189
Joined: Mon Nov 06, 2006 9:34 am
Location: Potsdam, Germany
Contact:

Re: ppunroll - optimize PPU writes

Post by miau »

Heh, I didn't even consider bit shifts. Some interesting contributions in that thread. It would be nice to have a tool that generates a perfect sequence of instructions or comes very close at least, but it seems to be quite a challenge indeed. I feel I'm not good enough of a programmer to pull that off in any reasonable amount of time and it probably wouldn't be worth it. Yet, if anyone else feels up to the task, I'd be extremely interested to see what he/she comes up wih.

I needed something to integrate into my build chain and this tool does nicely so far. If there's a need to squeeze off a few bytes for the final build, once all graphics are set in stone, one can still optimize manually.
User avatar
Myask
Posts: 965
Joined: Sat Jul 12, 2014 3:04 pm

Re: ppunroll - optimize PPU writes

Post by Myask »

Isn't NES code that does this into (W)RAM basically what one wants to do for optimizing PPU bandwidth?
User avatar
tokumaru
Posts: 12106
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: ppunroll - optimize PPU writes

Post by tokumaru »

Yes, but WRAM isn't always available, and filling a buffer like this is somewhat slow, since you only write every 5th byte, but if you have some ROM available, you can still benefit from the faster VRAM transfers for commonly used data, such as patterns.
russellsprouts
Posts: 53
Joined: Sun Jan 31, 2016 9:55 pm

Re: ppunroll - optimize PPU writes

Post by russellsprouts »

I thought about it, and I wrote a program that can find the shortest (in bytes) program that takes the least number of cycles. There is a linear algorithm, but unfortunately it has very high constants, so it is really slow -- probably impractical. I'll see if I can speed it up and release it. EDIT: Turns out the program does work on smallish inputs.

Basically, it performs a graph search to find the shortest path to writing all the bytes, where the length is decided by cycle count, and ties are broken by byte length. A node is a 4-tuple of index, a, x, and y, which represents the current state of the registers and the current index in the buffer. The start node is all registers uninitialized and an index of -1 (nothing written). The goal node is any node with index at the end of the buffer. An edge goes from one state to another if there is an operation to go from the state to the other. Each operation is something like "lda #00; sta PPUDATA" or "stx PPUDATA".

The graph is a DAG -- the index always increases by 1 when going from one node to another. DAGs have a O(V) shortest path algorithm -- visit each node in topographic order. The shortest path to the node is the shortest path from any incoming edge.

So, the algorithm looks like this:


Code: Select all

Initialize a queue with the start node.
While there are nodes in the queue:
  pop a node from the queue.
  if the node's index is the last in the input buffer, check if it is the shortest.
  otherwise,
  Find all of the possible next nodes (nodes that are reachable with one operation. Up to 3 -- it only makes sense to consider
  the shortest node for each register. If the value is already in a register, just store it)
  For each possible next node:
    add the node to the queue if it isn't in it already.
    update the path to the node if it is cheaper to get to the node through the current node.
This algorithm is linear based on the number of nodes. To show that it is linear on the number of input bytes, consider the possible states at index i. One of the registers must be set to the input byte at i. Another can be uninitialized, or one of the other 255 values. The third can be uninitialized, or one of the remaining 254 values. (Note that the registers will never be equal). That gives 3 * 256 * 255 = 195840 possible states. This *is* a constant, so the algorithm is technically linear. In practice, the constant will be lower, so it is practical for < 1000 byte inputs.

I've attached the code. I'm on a Linux machine at the moment, but I'll attach a Windows binary later I've attached a Windows binary. It should only take a second for most inputs.
Attachments
unroll.cpp
Fixed rol, ror instructions again.
(8.48 KiB) Downloaded 79 times
unroll.zip
(17.07 KiB) Downloaded 95 times
Last edited by russellsprouts on Thu Feb 18, 2016 10:22 am, edited 3 times in total.
User avatar
miau
Posts: 189
Joined: Mon Nov 06, 2006 9:34 am
Location: Potsdam, Germany
Contact:

Re: ppunroll - optimize PPU writes

Post by miau »

Awesome!!
I didn't get around to trying it, yet, but I've added a note to my original post so people will not miss it.
lidnariq
Posts: 10677
Joined: Sun Apr 13, 2008 11:12 am
Location: Seattle

Re: ppunroll - optimize PPU writes

Post by lidnariq »

russellsprouts wrote:I've attached the code.
These are wrong:

Code: Select all

    register_state rol() {
      return register_state(i+1, a << 1 | ((a >> 7) & 0x1), x, y, ai, xi, yi, ROL);
    }
    register_state ror() {
      return register_state(i+1, (a >> 1 & 0x7F) | ((a << 7) & 0x80), x, y, ai, xi, yi, ROR);
    }
—ROL and ROR rotate through the carry bit, not just within A.
russellsprouts
Posts: 53
Joined: Sun Jan 31, 2016 9:55 pm

Re: ppunroll - optimize PPU writes

Post by russellsprouts »

You're right. In that case, I will have to keep track of the current state of the carry flag, along with the possibility that it is uninitialized. I'll post fixed code soon.

EDIT: the fixed code is above.
lidnariq
Posts: 10677
Joined: Sun Apr 13, 2008 11:12 am
Location: Seattle

Re: ppunroll - optimize PPU writes

Post by lidnariq »

That's closer but still not quite right. :( ROL and ROR should update Carry the same as ASL and LSR.

Actually, now I'm just trying to elicit the behavior of "emitting a long series of ROL or ROR" with a byte sequence like FF 7F BF DF EF F7 FB FD FE FF and it's not cooperating, and I'm insufficiently C++-savvy to know why it seems to be not following the DAG down the path that involves the accumulator
russellsprouts
Posts: 53
Joined: Sun Jan 31, 2016 9:55 pm

Re: ppunroll - optimize PPU writes

Post by russellsprouts »

You're right again, it slipped my mind when I was making the update. I've fixed the source code above. (But the Windows binary is still the original version.)
FF 7F BF DF EF F7 FB FD FE FF
The reason it won't generate rotates in this example is because the rotate strategy takes 60 cycles and 41 bytes.
It always picks the fastest, breaking ties by number of bytes.
Best: cycles: 58, bytes: 47 (inflation factor 4.7, 5.8 cycles per byte)
ldy #255
sty $2007 ; 255
ldx #127
stx $2007 ; 127
ldx #191
stx $2007 ; 191
ldx #223
stx $2007 ; 223
ldx #239
stx $2007 ; 239
ldx #247
stx $2007 ; 247
ldx #251
stx $2007 ; 251
ldx #253
stx $2007 ; 253
inx
stx $2007 ; 254
sty $2007 ; 255
However, if you remove the last two bytes so that there is no possibility of using an increment or decrement, you get the result you were looking for:
Best: cycles: 48, bytes: 33 (inflation factor 4.125, 6 cycles per byte)
lda #255
sta $2007 ; 255
lsr
sta $2007 ; 127
ror
sta $2007 ; 191
ror
sta $2007 ; 223
ror
sta $2007 ; 239
ror
sta $2007 ; 247
ror
sta $2007 ; 251
ror
sta $2007 ; 253
Post Reply