6502 ASM trick
Has it occurred to anyone that it might be useful to have a page of ROM filled with values 0 through 255, so that you can perform operations between the accumulator and the index registers?
Instructions like ADC, SBC, AND, ORA and EOR all have "Absolute, X" and "Absolute, Y" modes, so if you point to the table with the values and use one of the index registers, the value fetched will be the same as the one in the index register, making it seem like the operation used the values of both registers as operands.
I guess I have used this trick before, but only for a small subset of the numbers. Now that I think of it, having the full table seems very useful, especially for avoiding temporary variables.
I don't know exactly why I started this topic, but I'm sure we can discuss other useful 6502 ASM tricks. I also like very much the one where you push an address (minus 1) to the stack and then use the RTS instruction to jump to that address. This can be useful for implementing jump tables, and I'm using this a lot in my game.
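A minimal sketch of the RTS trick used as a jump table, assuming ca65-style syntax. The names Dispatch, JumpTableLo, JumpTableHi, RoutineA and RoutineB are hypothetical, and X is assumed to hold the routine index.

```asm
; X = routine index (0, 1, ...). All names here are hypothetical.
Dispatch:
    lda JumpTableHi,x   ; RTS pulls the low byte first, so push the high byte first
    pha
    lda JumpTableLo,x
    pha
    rts                 ; pulls the address, adds 1, and continues there

; Each entry is the target address minus 1, because RTS increments
; the address it pulls from the stack.
JumpTableLo:
    .byte <(RoutineA-1), <(RoutineB-1)
JumpTableHi:
    .byte >(RoutineA-1), >(RoutineB-1)

RoutineA:
    rts
RoutineB:
    rts
```

If Dispatch is itself reached with JSR, the RTS at the end of the selected routine returns to Dispatch's caller.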
Great idea! It makes up for the 65xx's general lack of operations that combine A and X directly. Here's an asm summary in case not everyone gets it:
Code:
; Original code
stx temp
eor temp ; A = A EOR X
; New solution
eor table,x ; A = A EOR X
table:
.byte $00,$01,$02...$0F
.byte $10,$11,$12...$1F
...
.byte $F0,$F1,$F2...$FF
On the other hand, this only saves 2 clocks and 1 byte, so it'd have to be used more than 256 times or in a time-critical area to pay off.
EDIT: corrected clock count and major inefficiency in original code (tax?!? thanks tepples)
Last edited by blargg on Mon Nov 12, 2007 4:58 pm, edited 3 times in total.
tokumaru wrote: Has it occurred to anyone that it might be useful to have a page of ROM filled with values 0 through 255, so that you can perform operations between the accumulator and the index registers?
If you have the ROM space for such a table, and it's aligned, it saves a byte and two cycles over the temporary variable way:
Code:
A:
stx $FF ; 2b 3c
ora $FF ; 2b 3c
B:
ora table,x ; 3b 4c
tokumaru wrote: I also like very much the one where you push an address (minus 1) to the stack and then use the RTS instruction to jump to that address. This can be useful for implementing jump tables, and I'm using this a lot in my game.
It saves about four bytes off the temporary variable way to implement jump tables but is one cycle slower:
Code:
A:
lda hi
pha ; 1b 3c
lda lo
pha ; 1b 3c
rts ; 1b 6c
B:
lda hi
sta $01 ; 2b 3c
lda lo
sta $00 ; 2b 3c
jmp ($0000) ; 3b 5c
Last edited by tepples on Sun Nov 11, 2007 6:06 pm, edited 1 time in total.
You guys are right, the savings are not that incredible. But I always feel bad about using temp variables (because the code looks messy), and it's very hard not to do so with the 6502, which has very few work registers. I liked the illusion that it's possible to have these few operations between the accumulator and index registers.
256 bytes out of a whole game ROM is not such a high price to pay for cleaner code and slightly more speed. And this can be used for other things too, such as mapper writes on boards with bus conflicts.
And tepples, as far as I know, "ora table,x" takes 4 cycles to execute, not 5, as long as the table is perfectly aligned to a memory page. Or am I wrong?
About the jump tables, yeah, it depends if you're aiming at speed or size.
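The bus-conflict use can be sketched as follows. This assumes a discrete-logic mapper where any write to $8000-$FFFF selects a bank (UxROM-style boards are one example), and that the 256-byte table from this thread sits at the label "table" in ROM that is mapped in at the time of the write. SwitchBank is a hypothetical name.

```asm
; A = bank number on entry. SwitchBank is a hypothetical routine name.
SwitchBank:
    tax
    sta table,x     ; the ROM byte at table+X equals A, so the CPU and the
                    ; ROM drive the same value onto the bus during the write
    rts
```

Without the full table, code like this usually needs a small dedicated table holding just the valid bank numbers.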
tokumaru wrote: But I always feel bad about using temp variables (because the code looks messy), and it's very hard not to do so with the 6502, which has very few work registers.
Change your idea of messy. The main problem with zero page variables is when two routines try to use one at once. The best way to avoid this is to have temp variables that aren't used across subroutine calls, and aren't used by more than one thread at once (like main code and interrupt handler). But the 6502 has a ton of extended registers: 256 of them. That's why X and Y can't be used directly by arithmetic, only for indexing and counting. Embrace zero page!
Try coding for the Z80/8085/GB-Z80 for a while and you'll appreciate the elegance of the 65xx. Sure, you can do lots of register to register operations, but everything has a layer of bloat on it.
blargg wrote: The best way to avoid this is to have temp variables that aren't used across subroutine calls, and aren't used by more than one thread at once (like main code and interrupt handler).
OK, but how do you do that and still keep things looking nice? Saying that a byte can only be used by one subroutine is a waste of space, as that byte will probably be unused most of the time. And reusing bytes is not easy when you have many nested subroutines. For routines that need a few bytes of work RAM, you can have a few groups of bytes, each dedicated to a different depth, but then you can't go very deep. And recursion is out of the question. What do you guys do about this?
blargg wrote: But the 6502 has a ton of extended registers: 256 of them.
Fair enough. I've heard the argument that the 6502 has 256 bytes worth of registers, and I guess this is mostly right.
blargg wrote: Try coding for the Z80/8085/GB-Z80 for a while and you'll appreciate the elegance of the 65xx. Sure, you can do lots of register to register operations, but everything has a layer of bloat on it.
I've done very little Z80 work, but enjoyed the fact that I could perform some fairly complicated work without having to touch a byte of RAM. And those shadow registers... that feature has to be useful! I know that instructions take more CPU cycles than on a 6502 though, probably even more than equivalent 6502 code using zero page RAM.
By the way, this is a very interesting topic about tricks on the 6502.
I must say I absolutely love the 6502 way of doing things. For me it largely beats PIC, 8080 and Atmel so far, and maybe some other CPUs/MCUs I haven't tried yet.
I had never thought of having such a table of constants. I guess it's only worth using if you're short on temp variables and/or speed is very important. I think wasting 256 bytes is significant on the NES (unless you have maybe more than 256 KB of PRG ROM). It could work if you know the number is small enough (something like 0-16) and such a table is needed anyway (on a cart with bus conflicts).
I have run into a few temporary variable problems so far. I did a whole game engine with only 4 "Temp" variables and 4 "NMITemp" variables (used inside and outside NMI code separately, to avoid pushing the Temp variables or some other time-wasting thing in the NMI handler). The problems came when I wrote a routine that, for example, uses Temp1 and Temp2 and calls a routine that uses Temp3 and Temp4, which itself calls a routine that also uses Temp2 (and expects it to be fully available). This is a real pain to debug. Pushing Temp2 before calling the second routine is the way to go (or do it another way). It may be better to give explicit names to variables. The best solution could be an assembler that can undefine zero page variables for reuse, so that the same address can be used by two pieces of code if the programmer guarantees they will never call each other and that neither routine expects a particular value to be there when called.
I also never thought of the push-push-rts way to do an indirect jump; I always use jmp []. The main problem is that the RTS address is not the real address, and I never know how much should be added or subtracted to make it work. However, it becomes interesting if you use it a lot, as four bytes saved many times over may become significant. Plus the code looks messier (though on the other hand this can add to the geek factor).
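The "push Temp2 before calling" fix can be sketched like this, assuming Temp2 is a zero page variable. Outer and Middle are hypothetical routine names.

```asm
; Temp2 is assumed to be in zero page. Outer and Middle are hypothetical.
Outer:
    lda Temp2
    pha             ; save our Temp2 before calling code that overwrites it
    jsr Middle      ; Middle (or something it calls) reuses Temp2
    pla
    sta Temp2       ; restore our value
    rts
```

The save and restore cost 6 bytes and 13 cycles, and the PLA overwrites A, so any result Middle returns in A has to be stored somewhere first.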
tokumaru wrote: OK, but how do you do that and still keep things looking nice? Saying that a byte can only be used by one subroutine is a waste of space, as that byte will probably be unused most of the time. And reusing bytes is not easy when you have many nested subroutines.
Note my limitation of "that aren't used across subroutine calls". This rules out using them for loop counters, for example (if the loop makes a subroutine call). I admit setting up local variables on the stack is cumbersome and inefficient.
tokumaru wrote: I've done very little Z80 work, but enjoyed the fact that I could perform some fairly complicated work without having to touch a byte of RAM.
What's so bad about touching RAM? You're constantly reading it anyway for the opcodes.
tokumaru wrote: And those shadow registers... that feature has to be useful!
I think it's mainly to allow extremely quick interrupt response, where the handler just exchanges registers then processes the interrupt. It doesn't have to save the previous values, and it can keep values in the shadow registers across interrupt handlings. For normal code, it doesn't seem very useful because it swaps so much.
tokumaru wrote: I know that instructions take more CPU cycles than on a 6502 though, probably even more than equivalent 6502 code using zero page RAM.
That's one problem, always paying for those extras even when the 6502's lean register set would be sufficient. My main gripe is the inconsistencies that you constantly run into. I actually like the SPC-700 sound processor in the SNES a bit better than the 6502. It's like a 6502 with fewer limitations on X and Y, and many instructions to really treat direct (zero) page variables as first-class registers. Most arithmetic and move instructions can use a direct page variable just as easily as A.
Bregalad wrote: The main problem is that the RTS address is not the real address, and I never know how much should be added or subtracted to make it work.
Use RTI then:
Code:
lda #>addr
pha
lda #<addr
pha
php ; RTI will restore status, so push it now
rti
Celius
That's actually a really clever idea about the table thing. I never really thought about it. One thing I do that I'm very glad I thought of is that my NMI routine can do anything whenever it wants:
Code:
nmi:
jmp ($00)
jmp ($02)
jmp ($04)
jmp ($06)
jmp ($08)
jmp ($0A)
jmp ($0C)
jmp ($0E)
jmp ($10)
jmp ($12)
jmp ($14)
jmp ($16)
jmp ($18)
jmp ($1A)
jmp ($1C)
jmp ($1E)
lda #$00
sta $20 ; reset the low byte of the position pointer
rti
Return:
inc $20 ; advance the pointer in $20/$21 by 3 bytes,
inc $20 ; the size of one jmp instruction above
inc $20
jmp ($20) ; continue at the next jmp in the nmi routine
There may be a slight delay at the end of every routine, but I think it's worth it. There's nothing I hate more than doing a bunch of comparisons to have the NMI figure out what to do and when. Bytes $00-$21 are used up in zero page, and $00-$1F start out containing the high and low parts of the "Return" address. $20/$21 contain the high/low parts of wherever you are in the NMI routine. I personally am very, very happy with it. I almost considered it a trick, or cheating, when I first thought about it.
My solution to have a bunch of different NMI routines is this:
Code:
NMI:
jmp (NMIAddress)
That is all there is to the actual NMI routine indicated by the vector at the end of the ROM. The label "NMIAddress" points to a zero page location, and depending on where in the game we are, that location points to one of many different NMI routines:
Code:
NMITitleScreen:
(...)
rti
NMIPlayerSelect:
(...)
rti
NMIMainGame:
(...)
rti
NMIEndingSequence:
(...)
rti
I'm defining lots of "program modes" for my game, where each section (title screen, player select, title card screen, main game, bonus stage, etc) is represented by a program mode, that when initialized sets the address of the NMI routine it uses. All modes have triggers that enable the transition to other modes.
This way does not waste RAM (only 2 bytes are used to hold the address of the current NMI routine), and there is no speed penalty besides the time taken by the JMP instruction.
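A sketch of how a mode's initialization might install its NMI routine. EnterTitleScreen is a hypothetical name, and the PPUCTRL values written here are assumptions (a real game would write whatever value its other PPUCTRL bits require). NMIs are switched off during the update so that one cannot fire while the two-byte pointer is half-written.

```asm
PPUCTRL = $2000             ; NES PPU control register; bit 7 enables NMI

EnterTitleScreen:           ; hypothetical mode-init routine
    lda #$00
    sta PPUCTRL             ; NMI off while the pointer is inconsistent
    lda #<NMITitleScreen
    sta NMIAddress          ; low byte of the new handler
    lda #>NMITitleScreen
    sta NMIAddress+1        ; high byte of the new handler
    lda #$80                ; assumed value: NMI on, all other bits clear
    sta PPUCTRL
    rts
```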
Why not just have the main code wait in a loop until NMI fires and sets a global flag? Then you don't have to worry about taking too long before the next NMI and having it interrupt a previous invocation. Or maybe you are saying you do this, you just also have a settable NMI routine that does things that must be done every frame, even if that frame appears the same as the previous.
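A minimal sketch of that wait loop. It uses a counter rather than a flag so the main code never has to clear anything; nmi_count, WaitForNMI and WaitLoop are hypothetical names, and nmi_count is assumed to be one byte of RAM.

```asm
; nmi_count is a hypothetical one-byte variable in RAM.
NMI:
    inc nmi_count       ; INC leaves A, X and Y alone; RTI restores the flags
    rti

WaitForNMI:
    lda nmi_count
WaitLoop:
    cmp nmi_count       ; spin until the NMI handler changes the counter
    beq WaitLoop
    rts
```

The main loop would call WaitForNMI once per frame after finishing its game logic.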
blargg wrote: Why not just have the main code wait in a loop until NMI fires and sets a global flag?
I have other projects that require constant calculation in order to avoid severe slowdown. Waiting for the NMI would be a waste of time when you could already be preparing data for the next frame. Not in my current project though.
blargg wrote: Or maybe you are saying you do this, you just also have a settable NMI routine that does things that must be done every frame, even if that frame appears the same as the previous.
That is certainly true for the music routine, for example. And since I also enable rendering late in the frame, I need NMIs to always use the same amount of time. I can't ignore a single VBlank, or else the screen will jump.
Blargg, the method you describe is close to the one used in Final Fantasy, where the NMI just returns, doing absolutely nothing. The game is free to wait for an NMI whenever it wants without problems. The only real problem is that it's possible to miss an NMI completely.
tokumaru wrote: That is certainly true for the music routine, for example. And since I also enable rendering late in the frame, I need NMIs to always use the same amount of time. I can't ignore a single VBlank, or else the screen will jump.
In theory, Final Fantasy's music would also lag if the game did, but it never lags anyway. You can hear this in Final Fantasy II, however: when you change rooms, the music doesn't change (as in Final Fantasy) and it lags seriously (the game also seems to silence all channels for some reason, so the music stops and restarts a bit late on the next note). The same happens when entering or exiting the menu.
I also remember that Zelda and SMB happen to lag, music included. This looks extremely bad.
Final Fantasy III does exactly what tokumaru says: it has a "variable NMI vector", which is slightly better. Instead of wasting a jmp [xxx] instruction, the NMI vector points directly to RAM where a jmp instruction is stored (this takes less time). This instruction is also occasionally changed to an rti to ignore NMIs completely.
Celius: I don't understand anything about the method you described. It looks interesting, however. Could you try to clarify it a little, please?
By the way, I personally went the way of defining NMI the standard way (in ROM) and having it do the main graphics update and sound. That way the music never lags, and even if the game lags, the NMI will still update the screen as fast as it can. The main program can even synchronise on the graphics update flag (instead of the NMI flag), so if you want to update lots of graphics at once, it takes more than a frame and causes no problem.
This just sounds sort of logical, and as long as different parts of the game use the same format of graphic buffers, the same NMI handler can be used for the whole game. That would be suboptimal for really big games, I think (games with lots of unrelated minigames that each use the screen independently, or an RPG where battle, field, menus, etc. could be separated because each manages its screen completely differently).
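The "NMI vector points into RAM" idea can be sketched as below. This is only an illustration of the technique, not Final Fantasy III's actual code: the address $0300 and the names nmi_jump, SetNMIHandler and IgnoreNMI are all assumptions.

```asm
nmi_jump = $0300        ; hypothetical 3 bytes of RAM: opcode, target low, target high

SetNMIHandler:          ; hypothetical: A = target low byte, X = target high byte
    ldy #$40            ; RTI opcode
    sty nmi_jump        ; an NMI arriving mid-update just returns
    sta nmi_jump+1
    stx nmi_jump+2
    ldy #$4C            ; JMP absolute opcode
    sty nmi_jump        ; the new handler is live from here on
    rts

IgnoreNMI:
    lda #$40            ; RTI opcode: NMIs now return immediately
    sta nmi_jump
    rts

; In the vector table, the NMI entry would then be:
;   .word nmi_jump
```

JMP absolute takes 3 cycles against 5 for JMP indirect, which is where the time saving comes from. The three bytes have to be initialized before NMIs are first enabled.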
Celius
Bregalad wrote: Celius: I don't understand anything about the method you described. It looks interesting, however. Could you try to clarify it a little, please?
Oh, sorry about that. Let's pretend that there was an Indirect Absolute JSR instruction:
Code:
nmi:
jsr ($00)
jsr ($02)
jsr ($04)
...
jsr ($1E)
rti
Code:
nmi:
jsr ($00) ; The low/high parts of the label "ScreenUpdate" are stored in $00/$01.
jsr ($02) ; The low/high parts of the label "Control" are stored in $02/$03.
...
jsr ($1E) ; The low/high parts of the label "Nothing" are stored in $1E/$1F.
rti
ScreenUpdate:
.... ; We come here at the beginning of the NMI routine.
rts
Control:
.... ; We come here after ScreenUpdate.
rts
Nothing:
rts ; This is a blank routine we come to if we have nothing to do.
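Since the 6502 has no indirect JSR, one common way to get the same effect is a JSR to a small stub that does the indirect JMP. This is a sketch of that general idiom, not necessarily how the routine above is implemented; CallSlot0 and CallSlot1 are hypothetical names and only two slots are shown.

```asm
nmi:
    jsr CallSlot0       ; behaves like the pretend "jsr ($00)"
    jsr CallSlot1       ; behaves like the pretend "jsr ($02)"
    rti

CallSlot0:
    jmp ($0000)         ; the target's RTS returns to just after "jsr CallSlot0"
CallSlot1:
    jmp ($0002)
```

Each slot costs a 3-byte stub plus 11 cycles of overhead (6 for the JSR, 5 for the indirect JMP) before the target routine starts.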
Oh, your idea looks quite good! Maybe a little TOO flexible, but this can come in useful.
I guess there are plenty of ways to improve it. Have the NMI vector point to a ROM address which does jmp ($00); then the code at $00 would automatically do jmp ($02) when it's done, etc. The problem is that the order cannot be nested, and I guess you don't want to have this limitation. You can also have a big jsr xxxx jsr xxxx jsr xxxx table in RAM, have the NMI vector point to it, and just change the addresses as you wish. You can replace a jsr with a cmp or something to skip that routine without wasting time, you can change the first jsr to an rti to ignore the NMI completely, or you can replace whatever follows the last jsr you need with an rti, making a variable-length NMI routine (but still with a maximum length, of course).
You'd still want to push the registers on the stack before the first jsr.
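A sketch of patching such a RAM list, assuming it lives at a hypothetical address $0300 and holds three 3-byte slots followed by an rti. All routine names are made up for illustration. For simplicity it also assumes the handler has nothing to pull from the stack before its RTI.

```asm
nmi_list = $0300        ; hypothetical RAM copy of: jsr A / jsr B / jsr C / rti

SkipSecondRoutine:
    lda #$CD            ; CMP absolute: 3 bytes like JSR, only changes flags
    sta nmi_list+3      ; opcode byte of the second slot
    rts

RestoreSecondRoutine:
    lda #$20            ; JSR absolute
    sta nmi_list+3
    rts

EndAfterSecondRoutine:
    lda #$40            ; RTI
    sta nmi_list+6      ; opcode byte of the third slot: the list now ends here
    rts
```

The patched-in CMP still reads one byte from the routine's address, which is harmless when that address is ordinary code in ROM.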
I guess there is plenty way to improve it, have the NMI point in a ROM adress wich does jmp($00), then the code at $00 would automatically do jmp($02) when it's done, etc... The problem is that the order cannot be nested, and I guess you don't want to have this limitation. You can also have a big jsr xxxx jsr xxxx jsr xxxx table in RAM, have the NMI point to it, and just change the adress as you wish. You can replace the jsr by a cmp or something to skip the routine without wasting time, you can change the first jsr by rti to completely ignore the NMI, or you can just replace the adress after the last jsr by a rti, making a variable-lenght NMI routine (but still have a maximum of course).
You'd still want to push the registers on the stack before the first jsr.