6502 ASM trick

Bregalad
Posts: 8036
Joined: Fri Nov 12, 2004 2:49 pm
Location: Caen, France

Post by Bregalad »

One thing that bothers me is that useless byte table:

Code:

   stx Temp
   clc
   adc Temp
   tax

would take about the same time as:

Code:

   tay
   clc
   adc ByteTable, x
   tax

and would save 256 bytes.
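To put numbers on "about the same time", here are the two snippets again with the standard 6502 cycle and byte counts. This assumes Temp lives in zero page and ByteTable starts on a page boundary, so the indexed read never pays the page-crossing penalty; neither assumption is stated in the thread.

```asm
; Version with a temporary (Temp assumed to be a zero-page variable)
	stx Temp		; 3 cycles, 2 bytes
	clc			; 2 cycles, 1 byte
	adc Temp		; 3 cycles, 2 bytes
	tax			; 2 cycles, 1 byte
				; total: 10 cycles, 6 bytes, plus 1 byte of RAM

; Version with the table (ByteTable assumed to be page-aligned)
	tay			; 2 cycles, 1 byte
	clc			; 2 cycles, 1 byte
	adc ByteTable,x		; 4 cycles, 3 bytes
	tax			; 2 cycles, 1 byte
				; total: 10 cycles, 6 bytes, plus the 256-byte table in ROM
```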

Also, why use indirect addressing in the Displacement routine? This could potentially waste cycles if a page boundary is crossed, but well... anyway, that was just a thought. It looks good for really fast transfers, as it would be almost as fast as self-modifying code without wasting a lot of RAM.
Using $300-$7ff (or $200-$6ff, or any other 1280-byte range) you're able to store a chain of exactly 256 lda #$xx/sta $2007 pairs (5 bytes each), which is cool.
tokumaru
Posts: 12106
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Post by tokumaru »

Bregalad wrote: One thing that bothers me is that useless byte table

In this case, yeah, there is no advantage in using it, but I used it because I have this table in my game and got used to using it for X and Y operations (as described in the first post of this topic). Since I have the table anyway, and in some cases it really does reduce the complexity of the code a lot, I used it here too, just to avoid the temp.

Bregalad wrote: Also, why use indirect addressing in the Displacement routine? This could potentially waste cycles if a page boundary is crossed, but well...

Indirect? Where? Do you mean indexed? If you think about this code a bit more, you'll see that LDA absolute can't be used for this, as it wouldn't allow for random access to the buffer. X is used to displace the start of the buffer around. As far as I know, the way I did it is the only way to have access to any part of the buffer, starting anywhere in the list of LDAs/STAs (depending on the number of bytes to copy) so that the copying process ends with the last byte you want to copy. If you know of a better way to do the same thing, I'd like to know.

Also, as long as you respect the rules I talked about (the limit of 128 bytes and the location of the buffer), there will be no page crossing.

EDIT: Just to make it clear, the beauty of this is that you can copy a variable amount of data from anywhere in the buffer, without having to manage an index register for every byte, and without having to check for an ending condition. This is what makes it fast: it's a flexible unrolled loop. With absolute addressing you'd only be able to read hardcoded positions of the buffer.
doynax
Posts: 162
Joined: Mon Nov 22, 2004 3:24 pm
Location: Sweden

Post by doynax »

I've actually been using something similar in my tile copying loop; it needs to be able to start and stop at arbitrary indices because a single transfer usually takes several frames. In my case I need to transfer up to 240 different characters from two separate buffers, which would correspond to about 53k of unrolled code. Perhaps a bit hefty ;)

Thankfully, with a bit of arithmetic you can get much the same benefits from a partially unrolled loop. The following code copies the buffer to VRAM, starting from x and continuing up to (but not including) limit. Unfortunately you lose as many bytes as the unrolling factor at the end of the array, so in this case we can only copy up to 248 bytes. Be sure to place those "wasted" bytes at the start of the page so that buffer-8 doesn't cross a page. Furthermore, the initialization code and especially the computed goto could be sped up a bit, but I'm thinking that it might as well be prepared outside of vblank anyway. Oh, and "identity" is tokumaru's byte table, which I've actually been using in my own code. It's damned nice having bytes to spare on such things for once :)

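Code:

copy:
	lda limit
	clc			; carry clear, so SBC subtracts one extra
	sbc identity,x		; A = limit - X - 1
	and #$07		; A = (count - 1) mod 8; AND leaves the carry alone
	tay			; Y picks the entry point into the unrolled loop
	adc identity,x		; carry still set from the SBC (limit > X), so A = X + Y + 1
	tax			; X = index just past the first partial group

	lda .enter_lo,y
	sta trampoline+0
	lda .enter_hi,y
	sta trampoline+1

	lda #$ff		; SBX computes (A AND X) - imm, so A must stay $ff
	jmp (trampoline)

.enter_lo:
	.byte <.enter7,<.enter6,<.enter5,<.enter4
	.byte <.enter3,<.enter2,<.enter1,<.enter0
.enter_hi:
	.byte >.enter7,>.enter6,>.enter5,>.enter4
	.byte >.enter3,>.enter2,>.enter1,>.enter0

.loop:
	sbx #-8			; undocumented opcode: X = (A AND X) + 8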

.enter0:
	ldy buffer-8,x
	sty $2007
.enter1:
	ldy buffer-7,x
	sty $2007
.enter2:
	ldy buffer-6,x
	sty $2007
.enter3:
	ldy buffer-5,x
	sty $2007
.enter4:
	ldy buffer-4,x
	sty $2007
.enter5:
	ldy buffer-3,x
	sty $2007
.enter6:
	ldy buffer-2,x
	sty $2007
.enter7:
	ldy buffer-1,x
	sty $2007

	cpx limit
	bne .loop

Post by tokumaru »

Yeah, the idea is very similar... I understand that you have to do things a bit differently, as the purposes aren't the same.

My motivation for this was my Sonic game: because of the fast scrolling, I must be able to write up to 128 bytes of Name Table data for the new columns and rows of metatiles that scroll in. I doubt it would otherwise be possible to write that much data within standard VBlank time, considering that there are still other things to write, such as Attribute Tables, sprite DMA, and so on.
Post by doynax »

tokumaru wrote: Yeah, the idea is very similar... I understand that you have to do things a bit differently, as the purposes aren't the same.

I'd say it's exactly the same thing; the only difference is that I'm copying up to 8k, which makes complete unrolling a tad impractical. But my "inner loop" works exactly the same way as your raw copying, well, except for the fact that you're copying downwards IIRC.
Perhaps even 768 bytes of code might become too much some day if the interrupt code has to run from the fixed bank.

tokumaru wrote: My motivation for this was my Sonic game: because of the fast scrolling, I must be able to write up to 128 bytes of Name Table data for the new columns and rows of metatiles that scroll in. I doubt it would otherwise be possible to write that much data within standard VBlank time, considering that there are still other things to write, such as Attribute Tables, sprite DMA, and so on.

Just how are you extending the blanking period, by the way? I had all kinds of issues with this, if you recall.

Post by tokumaru »

doynax wrote: Perhaps even 768 bytes of code might become too much some day if the interrupt code has to run from the fixed bank.

I see what you mean... I'm not worried about this because I have an 8KB ROM bank dedicated to the VBlank routines, enough to unroll everything.

doynax wrote: Just how are you extending the blanking period, by the way? I had all kinds of issues with this, if you recall.

I'm not. I was when I was using the MMC1 with CHR-RAM, so that I could copy 256 bytes to it every frame, in addition to all the other PPU operations. I simply had all my VBlank code use a fixed number of cycles (which was a bit hard, because I alternated some tasks, so I had to make sure that alternating tasks took the same amount of time), so I always enabled rendering at the same point, 16 scanlines past the start of the frame. It worked fine, except for a few sprite problems (I couldn't seem to find the exact moment to turn sprite rendering on, so I always had a few glitchy pixels at the top left corner of the screen), and the different dot crawl pattern that results from skipping the pre-render scanline, which may or may not bother people.

Because of how hard it was to keep the Vblank code constant-timed and because of the sprite glitches, I ended up switching to the MMC3 with CHR-ROM, and all PPU updates fit in the standard Vblank time now. I didn't have to sacrifice much, and ended up getting more usable frame time from not having to manually copy tiles to the PPU. I still blank the top 16 scanlines though, by having blank tiles switched in, and only switching the actual patterns when a mapper IRQ fires 16 scanlines into the frame. This is to avoid scrolling glitches usually visible when you use vertical mirroring and scroll vertically.

I know a lot of commercial games had visual glitches, but I just couldn't live with them in my game(s). If it is possible to get rid of them, I'll do everything I can! =)

Post by Bregalad »

Well, time-optimising is definitely the exact opposite of byte-optimising.
Lately I've had to do a lot of byte-optimising, where I could save 4 bytes or so just about everywhere in my game engine, and now you guys talk about 8k of unrolled code. That makes me crazy.

That identity thing is kind of fun. It's true that it allows easy and fast operations with index registers (adding, subtracting, and even AND, OR, etc.), and each time you use it you save 1 byte, but you still have to use this trick more than 256 times for it to really be worth it.

Speaking of 6502 tricks, I just figured out how many times I had to do 4 consecutive ASL A or LSR A, so I made two routines that do them and return, and I call those instead, saving one byte every time this trick is used. I was able to save about 30 bytes doing that, which is great.
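A minimal sketch of that byte-saving trick, using a hypothetical routine name (Asl4); the cycle counts are the standard 6502 timings.

```asm
; Shared helper: shift A left four times (multiply by 16)
Asl4:
	asl a
	asl a
	asl a
	asl a
	rts			; the routine itself costs 5 bytes, once

; Each call site shrinks from four 1-byte ASLs to one 3-byte JSR
	jsr Asl4		; 20 cycles (6 JSR + 8 shifts + 6 RTS) instead of 8
```

Since the helper takes 5 bytes, the trick breaks even at five call sites and saves one byte per call from the sixth onward, at a cost of 12 extra cycles each time.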

However, I had to do a lot of time-optimising for my mode-7 demo, as I perform raster-timed code and calculations at the same time.
Putting some code in RAM and modifying the operands of the instructions rocks, as you can add one level of indirection and save time (lda #$xx can be the equivalent of lda $xxxx, and lda $xxxx,Y can be the equivalent of lda [$xx],Y, plus you get the equivalent of the nonexistent lda [$xx],X).
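As an illustration of the operand-patching idea, here is a rough sketch of emulating the missing indirect,X read. It assumes the code has already been copied to RAM, and the labels (SrcLo, SrcHi, ReadByte) are made up for the example.

```asm
; Runs from RAM, so the operand of the LDA below can be rewritten.
	lda SrcLo
	sta ReadByte+1		; low byte of the absolute address
	lda SrcHi
	sta ReadByte+2		; high byte of the absolute address

ReadByte:
	lda $FFFF,x		; placeholder operand, patched by the stores above
```

Once patched, each read costs 4 cycles (5 on a page crossing) instead of the 5 or 6 of an indirect,Y read. The $FFFF placeholder keeps the assembler from choosing the shorter zero-page encoding.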

Post by tokumaru »

I know you are not using any kind of PRG bankswitching in your current project, so these tricks probably do not interest you much.

These tricks are for games with much larger PRG-ROM, which are usually more complex anyway. You'll hardly see an NROM or CNROM with complex scrolling and animations, large sprites, and so on...
MottZilla
Posts: 2835
Joined: Wed Dec 06, 2006 8:18 pm

Post by MottZilla »

You can definitely make a very fun game with NROM or CNROM. You may have heard of a little game called Super Mario Bros. :p

But seriously, there are a lot of fun games that use no mapper. At least I find them fun.
Celius
Posts: 2159
Joined: Sun Jun 05, 2005 2:04 pm
Location: Minneapolis, Minnesota, United States

Post by Celius »

Bregalad wrote: Speaking of 6502 tricks, I just figured out how many times I had to do 4 consecutive ASL A or LSR A, so I made two routines that do them and return, and I call those instead, saving one byte every time this trick is used. I was able to save about 30 bytes doing that, which is great.

I was thinking about a trick to do divisions by 16/multiplications by 16. If you wanted to save about 3 cycles, you could do something like this:

ldx SomeVariable
lda Table,x

Table:
.db $00,$10,$20,$30,$40,$50,$60,$70,$80,$90,$A0,$B0,$C0,$D0,$E0,$F0

That would be good when you need to multiply 4-bit values by 16. But you could also make a 256-byte table that holds those values every $10 bytes, so you could multiply 4-bit values by 16 and save 3 cycles. The same could be applied to dividing, but that would pretty much require a 256-byte table. While it's a huge waste of ROM, it may end up saving you a scanline or two if you divide or multiply by 16 very frequently.
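A possible layout for the division table mentioned above, written in ca65 syntax (the thread mixes several assemblers, so treat the directives as an assumption). Div16Table is a made-up name.

```asm
; 256-entry table where entry n holds n / 16 (the high nibble moved down)
Div16Table:
.repeat 256, n
	.byte n >> 4
.endrepeat

	ldx SomeVariable
	lda Div16Table,x	; A = SomeVariable / 16, same result as four LSR A
```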

Post by tokumaru »

Bregalad wrote: That identity thing is kind of fun. It's true that it allows easy and fast operations with index registers (adding, subtracting, and even AND, OR, etc.)

Hey, I just thought of another very good use for the identity table:

Code:

	ldx identity, y

Code:

	ldy identity, x

These work like TYX and TXY, which obviously don't exist. I was just coding my game and felt the need to do a TYX, when I noticed this could in fact be done with the table. Seriously, for anyone who still thinks that this table is not worth the 256 bytes it uses: it really increases the functionality of X and Y, usually saving RAM that would be used as temporary storage, and saving ROM that would be used by the extra code needed to perform the same tasks. This table makes me feel like I gained a lot of new opcodes. =) If you can spare a bit of ROM, you really should use this table.

EDIT: I'm close to creating some macros named after the pseudo-opcodes that this table provides... It'd be like legal undocumented opcodes! =)
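Such macros could look roughly like this in ca65 syntax. The macro names and the table-generation loop are illustrative guesses, not tokumaru's actual code, and on a 65816 target the names tyx and txy would clash with real instructions.

```asm
; The identity table: entry n holds the value n.
; Page-aligning it avoids page-crossing penalties on indexed reads.
identity:
.repeat 256, n
	.byte n
.endrepeat

.macro tyx			; X = Y, A untouched
	ldx identity,y
.endmacro

.macro txy			; Y = X, A untouched
	ldy identity,x
.endmacro

.macro adx			; hypothetical name: A = A + X + carry
	adc identity,x
.endmacro
```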

Post by Bregalad »


tokumaru wrote: These work like TYX and TXY, which obviously don't exist. I was just coding my game and felt the need to do a TYX, when I noticed this could in fact be done with the table. Seriously, for anyone who still thinks that this table is not worth the 256 bytes it uses: it really increases the functionality of X and Y, usually saving RAM that would be used as temporary storage, and saving ROM that would be used by the extra code needed to perform the same tasks. This table makes me feel like I gained a lot of new opcodes. =) If you can spare a bit of ROM, you really should use this table.

Well, you need a couple of temporary storage variables ANYWAY, whatever you're going to do. I remember having a lot of headaches sticking with only 4 temporary variables, and whenever I need more I use differently named semi-general-purpose variables.
Also, even if it could save a couple of bytes in the code in a couple of places, I guess it would be very rare to actually save 256 bytes that way. That would only happen if you had an unrolled loop that uses this table inside, or something like that. So memory-wise this isn't a good solution, but time-wise or ease-of-use-wise, maybe it is.

Also, TXA/TAY and TYA/TAX take 1 less byte than ldx Identity,Y and ldy Identity,X, and take the exact same time, so I don't know why you'd want to do this. And yeah, they overwrite A, but usually within a single loop iteration you dedicate X and Y to one use each, so I don't really see the point of the trick. The only case where it would be really useful is if you use an instruction like rol $xx,X, which can't be done with Y, and then sta [$xx],Y, which can't be done with X, but you want the same index for both and you don't want to overwrite A in the process. So yeah, in that case it's useful, but that's not really frequent.
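The one case described above, where the table earns its keep, would look something like this. Counters and Ptr are placeholder names for a zero-page array and a zero-page pointer.

```asm
	rol Counters,x		; ROL has a zero-page,X mode but no Y-indexed mode
	ldy identity,x		; Y = X, with A left untouched
	sta (Ptr),y		; (zp),Y stores have no X-indexed equivalent
```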

Honestly, with 256 bytes you can have a very large additional level in your game or a new piece of music with 3 tracks, which are much better uses than a stupid identity table.

@Celius: Yeah, your idea should be great for the other guys who want really fast code, but it's not great for me, since I want to save bytes even if that slows things down a little. Using your trick takes 5 bytes instead of 4, or even instead of 3 if you have a subroutine that does 4 ASLs and an RTS (I do, and as mentioned above I use it more than 25 times in the whole code).
And if you want the equivalent table for LSR, you could have the assembler place a byte with $00, $01, $02, etc. every 16 bytes, and manage to have 15 very small subroutines of 15 bytes or less interleaved in between. Things like a routine that polls $2002 and returns, or one that writes to the mapper while avoiding bus conflicts, etc. I'm pretty sure a complete game engine would have 15 routines that take 15 bytes or less.
never-obsolete
Posts: 403
Joined: Wed Sep 07, 2005 9:55 am
Location: Phoenix, AZ

Post by never-obsolete »

I just recently discovered the BIT trick:

Code:

Sub1:	ldx #00
	.db $2C
Sub2:	ldx #07
	.db $2C
Sub3:	ldx #11
	stx somewhere
	; go about business

It's not terribly useful, but it has its moments.
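For anyone who hasn't met the trick, here is how the bytes decode, assuming the listing above is assembled as written (#11 is decimal, so its operand byte is $0B).

```asm
; Bytes emitted before the STX:  A2 00  2C  A2 07  2C  A2 0B
;
; Entering at Sub1, the CPU executes:
;	ldx #$00		; A2 00
;	bit $07A2		; 2C A2 07 - swallows "ldx #07"
;	bit $0BA2		; 2C A2 0B - swallows "ldx #11"
;	stx somewhere		; X is still $00
;
; Entering at Sub2 runs ldx #$07, then bit $0BA2, then the STX.
; Entering at Sub3 runs ldx #$0B and the STX with nothing skipped.
```

Each $2C byte costs 1 byte (and 4 cycles) where a branch over the LDX would take 2 bytes or a JMP 3. The side effects are that BIT changes the N, V and Z flags and performs a real read of the swallowed address, which matters if that address happens to be a hardware register.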
Bananmos
Posts: 551
Joined: Wed Mar 09, 2005 9:08 am

Post by Bananmos »

Can't resist bumping this old thread to mention I've found a use for combining BIT and the identity table.

I kind of really miss the BIT immediate instruction available on the 65C02. There are quite a few cases where I'd like to test certain bits in a byte with a bitwise AND without destroying the contents of the accumulator:

Code:

	lda mapFlags,X
	bit #FLAG1+FLAG2
	beq :+
	jsr DoSomething
:
	bit #FLAG5+FLAG6
	beq :+
	jsr DoSomeOtherThing
:

But even though the BIT immediate instruction is not there, it can be emulated with BIT absolute and an identity table, using 1 more byte and 2 more cycles:

Code:

	lda mapFlags,X
	bit Identity+FLAG1+FLAG2
	beq :+
	jsr DoSomething
:
	bit Identity+FLAG5+FLAG6
	beq :+
	jsr DoSomeOtherThing
:

A more optimized way would of course be to reserve a few zeropage bytes for the combinations you really need to test, but that's not as generic. Though with a powerful enough macro assembler, I guess you could have a BIT immediate macro that employs the identity table as a safe fallback, but uses zeropage locations for the most popular combinations. :)
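A sketch of such a macro in ca65 syntax (which the `:+` labels above suggest). ZpMask03 and the choice of $03 as the "popular" mask are invented for the example, and the `.if` needs the value to be a constant already known when the macro is expanded.

```asm
; Emulated BIT #imm for the plain 6502
.macro bit_imm value
.if (value) = $03
	bit ZpMask03		; hypothetical zero-page byte preloaded with $03: 2 bytes, 3 cycles
.else
	bit Identity+(value)	; generic fallback through the identity table: 3 bytes, 4 cycles
.endif
.endmacro

	lda mapFlags,x
	bit_imm FLAG1+FLAG2
	beq :+
	jsr DoSomething
:
```

One difference from the real 65C02 instruction: BIT immediate only changes the Z flag, while both emulated forms also copy bits 7 and 6 of the mask into N and V.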
psycopathicteen
Posts: 3001
Joined: Wed May 19, 2010 6:12 pm

Post by psycopathicteen »

If you're doing 65816 in 16-bit mode, it would need to be a hiROM cart either at a $40-$7d bank or a $80-$ed bank.