tcc816 math optimization
Moderator: Moderators
Forum rules
- For making cartridges of your Super NES games, see Reproduction.
tcc816 math optimization
tcc816 is still one of the very few and the best free 65816 C compilers available, but it is certainly far from being perfect. One of things that make quite some trouble is that its math routines are really slow. It seems that they were done just to work somehow, without care for performance at all. I'm not too good in 65816 code, so I can improve it only moderately, but I think other people may suggest some routines and optimizations. Would be nice to join efforts to improve what we have for all snes-sdk or PVSnesLib users benefit.
Take a look at the existing code
Obvious improvements are to utilize 16 bit math where possible instead of 8 bit, and to unroll all loops, as few extra K of math code won't do big difference for SNES anyway. I'm sure there are more.
Take a look at the existing code
Obvious improvements are to utilize 16 bit math where possible instead of 8 bit, and to unroll all loops, as few extra K of math code won't do big difference for SNES anyway. I'm sure there are more.
Re: tcc816 math optimization
For 16/8 division & modulo and 8*8 multiplication (and maybe 16*8 multiplication?) there's hardware support on the SNES that could be used. See e.g. this code.
Re: tcc816 math optimization
I know, but the docs says it won't work in Mode7, and Mode7 is one of the most interesting and distinctive SNES features, as well as one that needs fast math.
Re: tcc816 math optimization
Hmm. Which doc is that?
Anyway, there's also the mode7 coefficient matrix math registers ($211B-$2120). And if you really need to do a lot of 2D or 3D coordinate calculation I would suggest looking at the DSP-1.
Anyway, there's also the mode7 coefficient matrix math registers ($211B-$2120). And if you really need to do a lot of 2D or 3D coordinate calculation I would suggest looking at the DSP-1.
Re: tcc816 math optimization
My mistake, it is related exactly to the matrix registers and the result that could be read from $2134-36. So at least part of the math code could be done using the HW resources. Still would be nice to improve remaing routines.
I know about DSP1, but I need to improve the compiled code performance itself, not a particular case.
I know about DSP1, but I need to improve the compiled code performance itself, not a particular case.
Re: tcc816 math optimization
Simple improvement for tcc__shldi3 (saves 5 cycles/iteration):
The same change could be applied for SHR.
Code: Select all
tcc__shldi3:
lda.b 6,s ; hi word
sta.b tcc__r1
lda.b 8,s ; shift count
beq +
tax
lda.b 4,s ; low word
- asl a
rol.b tcc__r1
dex
bne -
sta.b tcc__r0
rtl
+
lda.b 4,s ; low word
sta.b tcc__r0
rtl
Re: tcc816 math optimization
Thanks. However, it seems that this routine never gets called - not in Classic Kong, at least (~8000 lines of C code). Or am I missing something?
Re: tcc816 math optimization
tcc-816 optimizes some shits with immediate shift counts IIRC (unrolling).
Re: tcc816 math optimization
I checked how many times these math routines are called in Classic Kong, and results are unexpected (by me).
tcc__mul, tcc__div are called about 10 times.
tcc__udiv is called a bit more, still not much.
Anything else is never called. I have some modulo operations in the game, but none of routines that supposedly do modulo operation are called.
tcc__mul, tcc__div are called about 10 times.
tcc__udiv is called a bit more, still not much.
Anything else is never called. I have some modulo operations in the game, but none of routines that supposedly do modulo operation are called.
Re: tcc816 math optimization
If your modulo operations are "% some_immediate_power_of_two" then they will be (or should be) optimized to an AND operation.
Re: tcc816 math optimization
They are immediate not power of two (3, 6, 10, 100, 1000, 10000 etc), also variables. I don't rely on compiler and always use & or shifts in case of power of two values math.
Re: tcc816 math optimization
Shiru, doesn't tcc816 support generation of assembly listing? That's an easy way to find out how it implements those divisions/remainders.
Freudian slip?mic_ wrote:tcc-816 optimizes some shits with immediate shift counts IIRC (unrolling).
Download STREEMERZ for NES from fauxgame.com! — Some other stuff I've done: fo.aspekt.fi
Re: tcc816 math optimization
I wouldn't say it is easy. It generates assembly mess without any tabs and without C source in comments. Also, the code it generates is huge and unoptimal. Doable, of course, I just don't think it is that important right now how exactly it does (probably calls the division routine), I just didn't expect that most of the math routines are never used in a large program and that basically just three of them needs to be optimized.
Re: tcc816 math optimization
@thefox: Could be
Probably just work-induced afternoon coma though.
Here's what tcc-816 gives me:
Compiles to
And this
Compiles to
Here's what tcc-816 gives me:
Code: Select all
void opt_test3(unsigned int arg)
{
int j;
j /= 3;
j = arg % 10;
}
Code: Select all
opt_test3:
.ifgr __opt_test3_locals 0
tsa
sec
sbc #__opt_test3_locals
tas
.endif
lda -2 + __opt_test3_locals + 1,s
sta.b tcc__r0
tax
lda.w #3
jsr.l tcc__div ; <-- calls tcc_div
lda.b tcc__r9
sta -2 + __opt_test3_locals + 1,s
lda 3 + __opt_test3_locals + 1,s
sta.b tcc__r0
tax
lda.w #10
jsr.l tcc__udiv ; <-- calls tcc_udiv
stx.b tcc__r0
txa
sta -2 + __opt_test3_locals + 1,s
.ifgr __opt_test3_locals 0
tsa
clc
adc #__opt_test3_locals
tas
.endif
rtl
Code: Select all
void opt_test2(unsigned int arg)
{
int j;
j *= 7;
j /= 4;
j = arg % 16;
}
Code: Select all
opt_test2:
.ifgr __opt_test2_locals 0
tsa
sec
sbc #__opt_test2_locals
tas
.endif
lda -2 + __opt_test2_locals + 1,s
sta.b tcc__r0
asl a
asl a
asl a
sec
sbc.b tcc__r0 ; <-- strength reduction: j*7 converted to (j<<3)-j
sta -2 + __opt_test2_locals + 1,s
sta.b tcc__r0
cmp #$8000
ror tcc__r0
cmp #$8000
ror tcc__r0 ; <-- strength reduction: j/4 converted to j>>2
lda.b tcc__r0
sta -2 + __opt_test2_locals + 1,s
lda 3 + __opt_test2_locals + 1,s
and #15 ; <-- strength reduction: arg%16 converted to arg&15
sta.b tcc__r0
sta -2 + __opt_test2_locals + 1,s
.ifgr __opt_test2_locals 0
tsa
clc
adc #__opt_test2_locals
tas
.endif
rtl
Re: tcc816 math optimization
Don't have time to sit around counting cycles / working it all out, but I thought I'd share this. Taken directly from "Programming the 65816 including the 6502, 65C02, and 65802":
16-bit multiplication:
16-bit division w/ remainder:
An alternate solution -- and this is what most of us ended up doing on the 6502, 65c02, and 65816 universally -- is to generate a pre-calculated table of all values and simply do table lookups. It's honestly the fastest, aside from obvious multiple-of-2 cases (which we know aren't always the case in this scenario). 6502 folks do this all the time since 200+ cycles per multiply is considered extreme (bringing it down to around 90 cycles).
BTW, you could consider using $4202/3 for multiplication, but the end result may be slower than the above options since there's an 8-cycle delay between when $4203 is set and when you get your result in $4216/7. The same applies for division using $4204/5/6, with a 16-cycle delay after $4206 is set to when you can get your result from $4216/7. Most games I know of didn't use these registers because of that reason, but it depends ultimately on how much math you plan on doing. :-)
16-bit multiplication:
Code: Select all
; operand 1: 16-bits in DP location MCAND1
; operand 2: 16-bits in DP location MCAND2
; result: 16-bits in accmulator
; assumes native mode in use (all registers set to 16-bit modes, e.g. REP #$30)
MCAND1 = $80 ; DP location $80
MCAND2 = $82 ; DP location $82
mymult:
lda #0 ; initialise result
- ldx MCAND1 ; get operand 1
beq done ; if operand 1 is zero, done
lsr MCAND1 ; get right bit, operand 1
bcc + ; if clear, no addition to previous products
clc ; else add operand 2 to partial result
adc MCAND2
+ asl MCAND2 ; now shift operand 2 left for possible addition next time
bra -
done:
rts
Code: Select all
; 16-bit divide: X / A -> QUOTNT; remainder in X
; QUOTNT is a 16-bit direct page cell
; assumes native mode in use (all registers set to 16-bit modes, e.g. REP #$30)
; no special handling for divide by zero (returns $FFFF as quotient)
QUOTNT = $80 ; DP location $80
mydiv:
stz QUOTNT ; initialise quotient to 0
ldy #1 ; iniitalise shift count to 1
- asl a ; shift divisor; test leftmost bit
bcs + ; branch when get leftmost bit
iny ; else increment shift count
cpy #17 ; max count (all zeros in divisor)
bne - ; loop if not done
+ ror a ; push shifted-out bit back
; now divide by subtraction
- pha ; push divisor
txa ; get dividend into accumulator
sec
sbc 1,s ; subtract divisor from dividend
bcc + ; branch if can't subtract; dividend still in X
tax ; store new dividend; carry=1 for quotient
+ rol QUOTNT ; shift carry->quotient (1 for divide, 0 for not)
pla ; pull divisor
lsr a ; shift divisor right for next subtract
dey ; decrement count
bne - ; branch to repeat unless count is 0
rts
BTW, you could consider using $4202/3 for multiplication, but the end result may be slower than the above options since there's an 8-cycle delay between when $4203 is set and when you get your result in $4216/7. The same applies for division using $4204/5/6, with a 16-cycle delay after $4206 is set to when you can get your result from $4216/7. Most games I know of didn't use these registers because of that reason, but it depends ultimately on how much math you plan on doing. :-)