6502 ASM trick

Celius
Posts: 2159
Joined: Sun Jun 05, 2005 2:04 pm
Location: Minneapolis, Minnesota, United States

Post by Celius »

Bregalad wrote: Speaking of 6502 tricks, I just realized how many times I had to do 4 consecutive ASL A or LSR A, so I made 2 routines that do them and return, and I call those instead, saving one byte every time this trick is used. I was able to save about 30 bytes doing that, which is great.
I was thinking about a trick for dividing or multiplying by 16. If you wanted to save about 3 cycles, you could do something like this:

ldx SomeVariable
lda Table,x

Table:
.db $00,$10,$20,$30,$40,$50,$60,$70,$80,$90,$A0,$B0,$C0,$D0,$E0,$F0

That would be good when you need to multiply 4-bit values by 16. You could also make a 256-byte table that holds those 16 values repeated every $10 bytes, so the index wouldn't need masking to 4 bits first, and you would still save about 3 cycles. The same could be applied to dividing, but it would pretty much require a 256-byte table. While it's a huge waste of ROM, it may end up saving you a scanline or two if you divide or multiply by 16 very frequently.
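A rough side-by-side of the two approaches, as a sketch only. It assumes ca65 syntax, a hypothetical zero-page variable `SomeVariable` holding 0-15, and a table that does not straddle a page boundary. The cycle counts in the comments are the standard 6502 timings for these addressing modes; the exact saving depends on whether the value starts in memory or is already in A.

```asm
SomeVariable = $10          ; hypothetical zero-page location, holds 0..15

Times16Shift:               ; 6 bytes of code, 11 cycles before the RTS
    lda SomeVariable        ; 3 cycles
    asl a                   ; 2 cycles
    asl a                   ; 2 cycles
    asl a                   ; 2 cycles
    asl a                   ; 2 cycles, A = value * 16
    rts

Times16Lookup:              ; 5 bytes of code + 16 bytes of table, 7 cycles
    ldx SomeVariable        ; 3 cycles
    lda Times16,x           ; 4 cycles (5 if the indexed read crosses a page)
    rts

Times16:
    .byte $00,$10,$20,$30,$40,$50,$60,$70
    .byte $80,$90,$A0,$B0,$C0,$D0,$E0,$F0
```

If the value is already in A, the lookup becomes `tax` + `lda Times16,x` (6 cycles) against 8 cycles for the four shifts, at the cost of clobbering X.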
tokumaru
Posts: 12106
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Post by tokumaru »

Bregalad wrote:That identity thing is a bit fun, it's true it allows easy and fast operations with index registers (adding, substract, and even and, or etc..)
Hey, I just thought of another very good use for the identity table:

Code:

	ldx identity, y

Code:

	ldy identity, x
These work like TYX and TXY, which obviously don't exist. I was just coding my game and felt the need to do a TYX, when I noticed this could in fact be done with the table. Seriously, for anyone that still thinks that this table is not worth the 256 bytes it uses: It really increases the functionality of X and Y, usually saving RAM that would be used as temporary storage, and saving ROM that would be used by the extra code needed to perform the same tasks. This table makes me feel like I gained a lot of new opcodes. =) If you can spare a bit of ROM, you really should use this table.

EDIT: I'm almost creating some macros named like the pseudo-opcodes resulting from the functionality provided by this table... It'd be like legal undocumented opcodes! =)
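For anyone who hasn't seen the table, here is one possible way to build and use it. This is only a sketch in ca65 syntax; the label `Identity` follows the posts above, while the macro names `m_tyx`, `m_txy` and `m_adx` are made up for illustration.

```asm
; 256-byte table where Identity[i] = i.
; Ideally placed at a page boundary so indexed reads never cost an extra cycle.
Identity:
.repeat 256, i
    .byte i
.endrepeat

.macro m_tyx                ; hypothetical pseudo-opcode: X = Y, A untouched
    ldx Identity,y
.endmacro

.macro m_txy                ; hypothetical pseudo-opcode: Y = X, A untouched
    ldy Identity,x
.endmacro

.macro m_adx                ; hypothetical pseudo-opcode: A = A + X + carry
    adc Identity,x
.endmacro

Example:
    ldy #$05
    m_tyx                   ; X = $05
    lda #$10
    clc
    m_adx                   ; A = $10 + $05 = $15
    rts
```

The same pattern gives `and Identity,x`, `ora Identity,y`, `cmp Identity,x` and so on, each 3 bytes and 4 cycles when no page is crossed.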
Bregalad
Posts: 8036
Joined: Fri Nov 12, 2004 2:49 pm
Location: Caen, France

Post by Bregalad »


tokumaru wrote: These work like TYX and TXY, which obviously don't exist. I was just coding my game and felt the need to do a TYX, when I noticed this could in fact be done with the table. Seriously, for anyone that still thinks that this table is not worth the 256 bytes it uses: It really increases the functionality of X and Y, usually saving RAM that would be used as temporary storage, and saving ROM that would be used by the extra code needed to perform the same tasks. This table makes me feel like I gained a lot of new opcodes. =) If you can spare a bit of ROM, you really should use this table.
Well, you need a couple of temporary storage variables ANYWAY, whatever you're going to do. I remember having a lot of headaches sticking with only 4 temporary variables, and whenever I need more I use differently named half-general-purpose variables.
Also, even if it could save a couple of bytes in the code in a couple of places, I guess it would be very rare to actually save 256 bytes that way. You'd only get there with something like an unrolled loop that uses this table inside. So memory-wise this isn't a good solution, but time-wise or ease-of-use-wise, maybe it is.

Also, TXA/TAY and TYA/TAX take 1 less byte than ldx Identity,Y and ldy Identity,X, and take exactly the same time, so I don't know why you'd want to do this. And yeah, it overwrites A, but usually within a single loop iteration you dedicate X and Y to one use each, so I don't see much point in the trick. The only case where it would be really useful is if you use an instruction like rol $xx,X, which can't be done with Y, and then sta [$xx],Y, which can't be done with X, but you want the same "index" and don't want to overwrite A in the process. In that case it's useful, but that's not really frequent.

Honestly, with 256 bytes you can have a very large additional level in your game or a new song with 3 tracks, which are much better uses than a stupid identity table.

@Celius: Yeah, your idea should be great for the other guys who want really fast code, but it's not great for me, since I want to save bytes even if that slows things down a little. Your trick uses 5 bytes instead of 4, or even instead of 3 if you have a subroutine that does 4 ASLs and an RTS (I do, and as mentioned above I use it more than 25 times in the whole code).
And if you want the equivalent table for LSR, you could have the assembler place a byte with $00, $01, $02, etc. every 16 bytes and fit 15 very small subroutines of 15 bytes or less interleaved in between. Things like a routine that polls $2002 and returns, or writes to the mapper while avoiding bus conflicts, etc. I'm pretty sure a complete game engine would have 15 routines that take 15 bytes or less.
never-obsolete
Posts: 403
Joined: Wed Sep 07, 2005 9:55 am
Location: Phoenix, AZ

Post by never-obsolete »

I just recently discovered the BIT trick:

Code:

Sub1:	ldx #00
		  .db $2C
Sub2:	ldx #07
		  .db $2C
Sub3:	ldx #11
		  stx somewhere
		  ; go about business
It's not terribly useful, but it has its moments.
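For readers who haven't met this one, an annotated version of the same idea may help. It is only a sketch in ca65 syntax, with `somewhere` as a hypothetical RAM location. The key fact is that $2C is the opcode for BIT absolute, so the CPU treats the two bytes of the following `ldx #imm` as a BIT operand and effectively skips it.

```asm
somewhere = $0300           ; hypothetical RAM location

Sub1:   ldx #0
        .byte $2C           ; BIT absolute: swallows the next 2 bytes (A2 07)
                            ; and executes as BIT $07A2
Sub2:   ldx #7              ; assembles to A2 07, only runs when entered at Sub2
        .byte $2C           ; same again: swallows A2 0B, executes as BIT $0BA2
Sub3:   ldx #11             ; assembles to A2 0B, only runs when entered at Sub3
        stx somewhere       ; shared tail, reached from all three entry points
        rts
```

Each skip costs 1 byte and 4 cycles, against 2 bytes for a branch or 3 for a JMP. The side effects are that BIT changes N, V and Z and performs a real read of the operand address. Here both addresses land in NES RAM or its mirrors, which is harmless, but an immediate value that formed the address of a register with read side effects would need more care.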
. That's just like, your opinion, man .
Bananmos
Posts: 551
Joined: Wed Mar 09, 2005 9:08 am

Post by Bananmos »

Can't resist bumping this old thread to mention I've found a use for combining BIT and the identity table.

I kind of really miss the BIT immediate instruction available on the 65C02. There are quite a few cases where I'd like to test certain bits in a byte with a bitwise AND without destroying the contents of the accumulator:

Code:

lda mapFlags,X
bit #FLAG1+FLAG2
beq :+
jsr DoSomething
:
bit #FLAG5+FLAG6
beq :+
jsr DoSomeOtherThing
:
But even though the bit immediate instruction is not there, it could be emulated with BIT absolute and an identity table, using 1 more byte and 2 more cycles:

Code:

lda mapFlags,X
bit Identity+FLAG1+FLAG2
beq :+
jsr DoSomething
:
bit Identity+FLAG5+FLAG6
beq :+
jsr DoSomeOtherThing
:
A more optimized way would of course be to reserve a few zeropage bytes for the combinations you really need to test, but that's not as generic. Though with a powerful enough macroassembler, I guess you could have a BIT immediate macro that employs the identity table as a safe fallback, but uses zeropage locations for the most popular combinations. :)
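The generic fallback part of such a macro could look roughly like this. It is a sketch in ca65 syntax; the macro name `bit_imm`, the flag values and the `mapFlags` address are all assumptions made for the example, and the zero-page fast path for popular combinations is left out.

```asm
FLAG1 = %00000001           ; hypothetical flag values
FLAG2 = %00000010
mapFlags = $0300            ; hypothetical table of flag bytes

Identity:                   ; Identity[i] = i
.repeat 256, i
    .byte i
.endrepeat

.macro bit_imm value        ; hypothetical macro name
    bit Identity + (value)  ; reads the table byte whose value is `value`
.endmacro

CheckFlags:
    lda mapFlags,x
    bit_imm FLAG1+FLAG2     ; Z = 1 if neither flag is set, A is preserved
    beq :+
    jsr DoSomething
:   rts

DoSomething:
    rts
```

One difference from the real 65C02 instruction: BIT immediate only affects Z, while BIT absolute also copies bits 7 and 6 of the memory byte (here, of the constant) into N and V.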
psycopathicteen
Posts: 3001
Joined: Wed May 19, 2010 6:12 pm

Post by psycopathicteen »

If you're doing 65816 in 16-bit mode, it would need to be a HiROM cart, in either a $40-$7D bank or a $80-$ED bank.

Post by never-obsolete »

Celius wrote:The same could be applied to dividing, but it would pretty much require a 256-byte table. While it's a huge waste of ROM, it may end up saving you a scanline or two if you divide or multiply by 16 very frequently.

Here's the code I use to divide 12-bit numbers by 16 with an 8-bit quotient (n = $000 to $FFF):

Code:

;	assume Y holds "n" lsb, X holds "n" msb, A will have result
	lda m16tbl_hi, Y
	ora m16tbl_lo, X
and the tables:

m16tbl_hi:

Code:

00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03
04 04 04 04 04 04 04 04 04 04 04 04 04 04 04 04
05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05
06 06 06 06 06 06 06 06 06 06 06 06 06 06 06 06
07 07 07 07 07 07 07 07 07 07 07 07 07 07 07 07
08 08 08 08 08 08 08 08 08 08 08 08 08 08 08 08
09 09 09 09 09 09 09 09 09 09 09 09 09 09 09 09
0A 0A 0A 0A 0A 0A 0A 0A 0A 0A 0A 0A 0A 0A 0A 0A
0B 0B 0B 0B 0B 0B 0B 0B 0B 0B 0B 0B 0B 0B 0B 0B
0C 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C 0C
0D 0D 0D 0D 0D 0D 0D 0D 0D 0D 0D 0D 0D 0D 0D 0D
0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E
0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F
m16tbl_lo:

Code:

00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
00 10 20 30 40 50 60 70 80 90 A0 B0 C0 D0 E0 F0
I think it can be extended to work with 16-bit numbers by using X to index into m16tbl_hi for the high byte. I haven't verified it though.
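A sketch of that 16-bit extension, which I have only checked by hand (for example n = $ABCD gives $0ABC). It assumes ca65 syntax, and `quotient_lo` / `quotient_hi` are hypothetical zero-page result bytes. The two tables are the same as the hex dumps above, just generated by the assembler.

```asm
quotient_lo = $10           ; hypothetical zero-page result bytes
quotient_hi = $11

; In:  Y = low byte of n, X = high byte of n
; Out: quotient_hi:quotient_lo = n / 16
Div16:
    lda m16tbl_hi,y         ; low byte >> 4
    ora m16tbl_lo,x         ; (high byte & $0F) << 4
    sta quotient_lo
    lda m16tbl_hi,x         ; high byte >> 4
    sta quotient_hi
    rts

m16tbl_hi:                  ; entry i = i >> 4
.repeat 256, i
    .byte i >> 4
.endrepeat

m16tbl_lo:                  ; entry i = (i & $0F) << 4
.repeat 256, i
    .byte (i & $0F) << 4
.endrepeat
```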

Post by Bregalad »

You know, Bananmos, it was a great idea to revive this thread.
How many times have I had to do something like

pha
and #$xx
...blah blah
pla
...blah blah

when the BIT instruction would have been better but couldn't be used because there is no BIT immediate!
Without going as far as having a 256-byte identity table, you can simply use .db with the constant you need (a single-byte table, in effect).
Since you'd usually do the BIT with only a single bit set, and bit 7 can be tested directly with the N flag, a 7-byte table should be enough for anything (in fact only 6 are necessary, as I'll explain below):
.db $01, $02, $04, $08, $10, $20, $40
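Used with BIT, such a table might look like this. A minimal sketch in ca65 syntax; `flags` and the label names are hypothetical.

```asm
flags = $10                 ; hypothetical zero-page variable

BitMask:
    .byte $01,$02,$04,$08,$10,$20,$40

TestBit3:
    lda flags
    bit BitMask+3           ; Z = 1 if (A AND $08) = 0, A is not modified
    beq Bit3Clear
    ; bit 3 is set here, and A still holds the full value
Bit3Clear:
    rts
```

Keep in mind that BIT also loads N and V from bits 7 and 6 of the mask byte, so those two flags no longer describe A afterwards.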


Also, I've found another trick which is very simple but worth mentioning. If you load a value into A, the only bit you can "quickly" test is bit 7, with the N flag.
But if you then use an ASL A instruction, you get C = bit 7 and N = bit 6, so you can quickly test 2 bits without using AND or BIT instructions! Pretty useful!
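In code, the ASL version could be sketched like this (ca65 syntax, with a hypothetical variable `flags` and made-up label names):

```asm
flags = $10                 ; hypothetical zero-page variable

TestTopBits:
    lda flags
    asl a                   ; C = old bit 7, N = old bit 6
    bcs Bit7Set
    bmi Bit6Set             ; only reached when bit 7 was clear
    rts                     ; neither bit was set
Bit7Set:
    rts
Bit6Set:
    rts
```

Note that A now holds the shifted value. If the value lives in memory and A must stay intact, `bit flags` is an alternative: it copies bit 7 of the memory byte into N and bit 6 into V.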

Another very simple but clever thing to keep in mind: if you wrote a subroutine that you're only going to call once in your code, replace it with a macro. You'll save 12 cycles (6 for the JSR, 6 for the RTS) and 4 bytes (assuming there were no branch instructions around the call). The code stays as structured as with a subroutine, and you can always change it back to a subroutine if you end up calling it somewhere else.

Finally, how many times do you call a subroutine and need more than 3 bytes of arguments (more than what A, X and Y can hold)?
Well, the solution to that comes from the SMB disassembly...

Code:

   jsr MyRoutine
   .dw Pointer1, Pointer2   ;4 bytes arguments

MyRoutine
   jsr GetArguments
  .....
   rts

GetArguments
   tsx
   lda $103,X     ;Get return address from the stack
   sta PtrL
   clc
   adc #$04     ;Add 4 to return address
   sta $103,X
   lda $104,X
   sta PtrH
   adc #$00
   sta $104,X
   ldy #$04       ;Copy arguments to Temp variables
-  lda [Ptr],Y     ;The address on the stack is the return address -1, so Y=1..4 reads the 4 argument bytes
   sta Temp-1,Y
   dey
   bne -
   rts
Sure, the GetArguments routine can be pretty long and big, but in the end it will save you, possibly hundreds of times, from doing something like:

Code:

    lda #BlahBlah
    sta Temp
    lda #BlahBlah
    ldy #BlahBlah
    ldx #Blah
    jsr Routine
So this saves approximately 6 bytes per call, which can add up to a lot if it's done frequently.

My GetArguments routine is 32 bytes, so if you use this trick more than 10 or so times it's definitely a gain.
However, you can't use variable arguments unless you place your code in RAM (which ends up being impractical on the NES, as you'd have to copy it there).

I wonder if there is any way to improve this argument scheme to save even more bytes; I have a feeling that it's possible.
Useless, lumbering half-wits don't scare us.
mic_
Posts: 922
Joined: Thu Oct 05, 2006 6:29 am

Post by mic_ »

Bregalad wrote: when the BIT instruction would have been better but couldn't be used because there is no BIT immediate!
The special case where you want to BIT #$80, #$40, #$20, etc. should be doable without destroying A:

Code:

lda foobar
bmi bit7_set
cmp #$40  ; we know that bit 7 wasn't set
bcs bit6_set
cmp #$20
bcs bit5_set
; and so on
bogax
Posts: 34
Joined: Wed Jul 30, 2008 12:03 am

Post by bogax »

Bregalad wrote:
However, you can't use variable arguments unless you place your code in RAM (which ends up being impractical on the NES, as you'd have to copy it there).

I wonder if there is any way to improve this argument scheme to save even more bytes; I have a feeling that it's possible.
Coincidentally, there's a similar discussion (inlining parameters) on
AtariAge.

I don't know if I'd call it an improvement, but I have used code something
like this.
The list is zero-terminated: the code consumes a pair of zeros and outputs
a single zero, and an unpaired zero terminates the list.
The first two bytes are the address of the target routine.
It only deals with the least significant byte of the return address, so the
parameter list can't straddle a page boundary.
Pointer needs to be initialized to zero.

Code:

GET_PARAMETERS
 pla
 tay
 pla
 pha                    ; put the high byte back
 sta pointer+1
 ldx #$00
 beq SKIP
LOOP 
 sta parameters,x
 inx
SKIP
 iny                    ; pointing one short first pass here fixes that 
 lda (pointer),y 
 bne LOOP     
 iny
 lda (pointer),y 
 beq LOOP     

 dey                    ; fix the return address guess we can't return to a
                        ;  break       
 tya 
 pha 
 jmp (parameters)
krzysiobal
Posts: 891
Joined: Sun Jun 12, 2011 12:06 pm
Location: Poland

Post by krzysiobal »

How to check if a number at mem_ptr is a power of 2?

Code:

ldx mem_ptr
dex
txa
and mem_ptr
beq power_of_two
;it is not a power of two
thefox
Posts: 3139
Joined: Mon Jan 03, 2005 10:36 am
Location: Tampere, Finland

Post by thefox »

krzysiobal wrote:How to check if a number at mem_ptr is a power of 2?

Code:

ldx mem_ptr
dex
txa
and mem_ptr
beq power_of_two
;it is not a power of two
As long as mem_ptr is not 0.
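The reason is that 0 - 1 wraps to $FF, and $FF AND 0 is 0, so zero passes the test. A guarded variant could be sketched as follows (ca65 syntax; `mem_ptr` is given a hypothetical address, and the label names are made up):

```asm
mem_ptr = $10               ; hypothetical zero-page variable

IsPowerOfTwo:
    ldx mem_ptr
    beq not_power_of_two    ; reject 0 before the n AND (n-1) test
    dex
    txa
    and mem_ptr             ; clears the lowest set bit of n
    beq power_of_two        ; nothing left: exactly one bit was set
not_power_of_two:
    rts
power_of_two:
    rts
```

The guard costs 2 bytes and 2 cycles when the branch is not taken.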
Download STREEMERZ for NES from fauxgame.com! — Some other stuff I've done: fo.aspekt.fi

Post by Bregalad »

Well, you could say that 2^(-infinity) = 0, therefore 0 is a power of two,
but that's a bit of cheating.

@bogax: I think your approach is interesting.
I like the idea of using pla instructions cleverly to retrieve the return address!
However, in many cases you'll need more than a single subroutine which uses the "advanced argument" system.
So of course you don't want to copy ALL the argument-fetching code into ALL the routines that use the system, or it will waste a lot of space and defeat the purpose of this advanced argument system.

This is why, in my example, the called routine starts by itself calling the argument-fetching routine, which reads 3 bytes ahead in the stack (it ignores its own return address and goes to the return address before that).

And yet I'm still sure there is a way to improve it...

Post by bogax »

Bregalad wrote:@bogax: I think your approach is interesting.
I like the idea of using pla instructions cleverly to retrieve the return address!
However, in many cases you'll need more than a single subroutine which uses the "advanced argument" system.
So of course you don't want to copy ALL the argument-fetching code into ALL the routines that use the system, or it will waste a lot of space and defeat the purpose of this advanced argument system.

This is why, in my example, the called routine starts by itself calling the argument-fetching routine, which reads 3 bytes ahead in the stack (it ignores its own return address and goes to the return address before that).

And yet I'm still sure there is a way to improve it...
I'm not sure I understand your objection.
Rather than JSR to a routine that then JSRs to the argument-fetching
code, you just JSR to the argument-fetching code, supplying it with
the address of the routine you want to JSR to as a parameter,
which the argument-fetching code then jumps to.
You shouldn't need to duplicate the argument-fetching code, but you
have to include the argument-fetching routine's address in each JSR
to a routine that uses the argument-fetching code (while saving
a JSR to the argument-fetching code in each of those routines).
And while it may save a little messing with the stack, it costs
25 cycles or so.
Like I said, I'm not sure it's any improvement,
but it will fetch a variable number of arguments.
Perhaps it would make more sense to do something closer to
your code, but pass it the number of arguments.

Post by Bregalad »

Sorry - my bad - I didn't think hard enough.

I haven't really decided whether "my" (really Nintendo's) solution or your solution is better.
I should take the time to analyze how many bytes each solution saves.

Anyway, I think your solution is very elegant, while mine would need a "jsr FetchArguments" at the start of every routine which needs arguments, so I think your solution is probably better.


I wonder if there should be a wiki page about 6502 asm optimisations, as it might be easier to find info from a wiki page than from a thread that is eventually going to be at the 432nd page even if the forums are fully preserved.