tcc816 math optimization

Discussion of hardware and software development for Super NES and Super Famicom.

Moderator: Moderators

Forum rules
  • For making cartridges of your Super NES games, see Reproduction.
Shiru
Posts: 1161
Joined: Sat Jan 23, 2010 11:41 pm

tcc816 math optimization

Post by Shiru »

tcc816 is still one of the very few and the best free 65816 C compilers available, but it is certainly far from being perfect. One of things that make quite some trouble is that its math routines are really slow. It seems that they were done just to work somehow, without care for performance at all. I'm not too good in 65816 code, so I can improve it only moderately, but I think other people may suggest some routines and optimizations. Would be nice to join efforts to improve what we have for all snes-sdk or PVSnesLib users benefit.

Take a look at the existing code

Obvious improvements are to utilize 16 bit math where possible instead of 8 bit, and to unroll all loops, as few extra K of math code won't do big difference for SNES anyway. I'm sure there are more.
mic_
Posts: 922
Joined: Thu Oct 05, 2006 6:29 am

Re: tcc816 math optimization

Post by mic_ »

For 16/8 division & modulo and 8*8 multiplication (and maybe 16*8 multiplication?) there's hardware support on the SNES that could be used. See e.g. this code.
Shiru
Posts: 1161
Joined: Sat Jan 23, 2010 11:41 pm

Re: tcc816 math optimization

Post by Shiru »

I know, but the docs says it won't work in Mode7, and Mode7 is one of the most interesting and distinctive SNES features, as well as one that needs fast math.
mic_
Posts: 922
Joined: Thu Oct 05, 2006 6:29 am

Re: tcc816 math optimization

Post by mic_ »

Hmm. Which doc is that?

Anyway, there's also the mode7 coefficient matrix math registers ($211B-$2120). And if you really need to do a lot of 2D or 3D coordinate calculation I would suggest looking at the DSP-1.
Shiru
Posts: 1161
Joined: Sat Jan 23, 2010 11:41 pm

Re: tcc816 math optimization

Post by Shiru »

My mistake, it is related exactly to the matrix registers and the result that could be read from $2134-36. So at least part of the math code could be done using the HW resources. Still would be nice to improve remaing routines.

I know about DSP1, but I need to improve the compiled code performance itself, not a particular case.
mic_
Posts: 922
Joined: Thu Oct 05, 2006 6:29 am

Re: tcc816 math optimization

Post by mic_ »

Simple improvement for tcc__shldi3 (saves 5 cycles/iteration):

Code: Select all

tcc__shldi3:
lda.b 6,s ; hi word
sta.b tcc__r1
lda.b 8,s ; shift count
beq +
tax
lda.b 4,s ; low word
- asl a
rol.b tcc__r1
dex
bne -
sta.b tcc__r0
rtl
+
lda.b 4,s ; low word
sta.b tcc__r0
 rtl
The same change could be applied for SHR.
Shiru
Posts: 1161
Joined: Sat Jan 23, 2010 11:41 pm

Re: tcc816 math optimization

Post by Shiru »

Thanks. However, it seems that this routine never gets called - not in Classic Kong, at least (~8000 lines of C code). Or am I missing something?
mic_
Posts: 922
Joined: Thu Oct 05, 2006 6:29 am

Re: tcc816 math optimization

Post by mic_ »

tcc-816 optimizes some shits with immediate shift counts IIRC (unrolling).
Shiru
Posts: 1161
Joined: Sat Jan 23, 2010 11:41 pm

Re: tcc816 math optimization

Post by Shiru »

I checked how many times these math routines are called in Classic Kong, and results are unexpected (by me).

tcc__mul, tcc__div are called about 10 times.
tcc__udiv is called a bit more, still not much.

Anything else is never called. I have some modulo operations in the game, but none of routines that supposedly do modulo operation are called.
mic_
Posts: 922
Joined: Thu Oct 05, 2006 6:29 am

Re: tcc816 math optimization

Post by mic_ »

If your modulo operations are "% some_immediate_power_of_two" then they will be (or should be) optimized to an AND operation.
Shiru
Posts: 1161
Joined: Sat Jan 23, 2010 11:41 pm

Re: tcc816 math optimization

Post by Shiru »

They are immediate not power of two (3, 6, 10, 100, 1000, 10000 etc), also variables. I don't rely on compiler and always use & or shifts in case of power of two values math.
User avatar
thefox
Posts: 3139
Joined: Mon Jan 03, 2005 10:36 am
Location: Tampere, Finland
Contact:

Re: tcc816 math optimization

Post by thefox »

Shiru, doesn't tcc816 support generation of assembly listing? That's an easy way to find out how it implements those divisions/remainders.
mic_ wrote:tcc-816 optimizes some shits with immediate shift counts IIRC (unrolling).
Freudian slip?
Download STREEMERZ for NES from fauxgame.com! — Some other stuff I've done: fo.aspekt.fi
Shiru
Posts: 1161
Joined: Sat Jan 23, 2010 11:41 pm

Re: tcc816 math optimization

Post by Shiru »

I wouldn't say it is easy. It generates assembly mess without any tabs and without C source in comments. Also, the code it generates is huge and unoptimal. Doable, of course, I just don't think it is that important right now how exactly it does (probably calls the division routine), I just didn't expect that most of the math routines are never used in a large program and that basically just three of them needs to be optimized.
mic_
Posts: 922
Joined: Thu Oct 05, 2006 6:29 am

Re: tcc816 math optimization

Post by mic_ »

@thefox: Could be :P Probably just work-induced afternoon coma though.

Here's what tcc-816 gives me:

Code: Select all

void opt_test3(unsigned int arg)
{
	int j;

	j /= 3;
	j = arg % 10;
}
Compiles to

Code: Select all

opt_test3:
.ifgr __opt_test3_locals 0
tsa
sec
sbc #__opt_test3_locals
tas
.endif
lda -2 + __opt_test3_locals + 1,s
sta.b tcc__r0
tax
lda.w #3
jsr.l tcc__div   ; <-- calls tcc_div
lda.b tcc__r9
sta -2 + __opt_test3_locals + 1,s
lda 3 + __opt_test3_locals + 1,s
sta.b tcc__r0
tax
lda.w #10
jsr.l tcc__udiv   ; <-- calls tcc_udiv
stx.b tcc__r0
txa
sta -2 + __opt_test3_locals + 1,s
.ifgr __opt_test3_locals 0
tsa
clc
adc #__opt_test3_locals
tas
.endif
rtl
And this

Code: Select all

void opt_test2(unsigned int arg)
{
	int j;

	j *= 7;
	j /= 4;
	j = arg % 16;
}
Compiles to

Code: Select all

opt_test2:
.ifgr __opt_test2_locals 0
tsa
sec
sbc #__opt_test2_locals
tas
.endif
lda -2 + __opt_test2_locals + 1,s
sta.b tcc__r0
asl a
asl a
asl a
sec
sbc.b tcc__r0     ; <-- strength reduction: j*7 converted to (j<<3)-j
sta -2 + __opt_test2_locals + 1,s
sta.b tcc__r0
cmp #$8000
ror tcc__r0
cmp #$8000
ror tcc__r0   ; <-- strength reduction: j/4 converted to j>>2
lda.b tcc__r0
sta -2 + __opt_test2_locals + 1,s
lda 3 + __opt_test2_locals + 1,s
and #15   ; <-- strength reduction: arg%16 converted to arg&15
sta.b tcc__r0
sta -2 + __opt_test2_locals + 1,s
.ifgr __opt_test2_locals 0
tsa
clc
adc #__opt_test2_locals
tas
.endif
rtl
User avatar
koitsu
Posts: 4203
Joined: Sun Sep 19, 2004 9:28 pm
Location: A world gone mad

Re: tcc816 math optimization

Post by koitsu »

Don't have time to sit around counting cycles / working it all out, but I thought I'd share this. Taken directly from "Programming the 65816 including the 6502, 65C02, and 65802":

16-bit multiplication:

Code: Select all

; operand 1: 16-bits in DP location MCAND1
; operand 2: 16-bits in DP location MCAND2
; result: 16-bits in accmulator
; assumes native mode in use (all registers set to 16-bit modes, e.g. REP #$30)

MCAND1 = $80     ; DP location $80
MCAND2 = $82     ; DP location $82

mymult:
  lda #0         ; initialise result

- ldx MCAND1     ; get operand 1
  beq done       ; if operand 1 is zero, done
  lsr MCAND1     ; get right bit, operand 1
  bcc +          ; if clear, no addition to previous products
  clc            ; else add operand 2 to partial result
  adc MCAND2

+ asl MCAND2     ; now shift operand 2 left for possible addition next time
  bra -

done:
  rts
16-bit division w/ remainder:

Code: Select all

; 16-bit divide: X / A -> QUOTNT; remainder in X
; QUOTNT is a 16-bit direct page cell
; assumes native mode in use (all registers set to 16-bit modes, e.g. REP #$30)
; no special handling for divide by zero (returns $FFFF as quotient)

QUOTNT = $80     ; DP location $80

mydiv:
  stz QUOTNT     ; initialise quotient to 0
  ldy #1         ; iniitalise shift count to 1

- asl a          ; shift divisor; test leftmost bit
  bcs +          ; branch when get leftmost bit
  iny            ; else increment shift count
  cpy #17        ; max count (all zeros in divisor)
  bne -          ; loop if not done

+ ror a          ; push shifted-out bit back

; now divide by subtraction
- pha            ; push divisor
  txa            ; get dividend into accumulator
  sec
  sbc 1,s        ; subtract divisor from dividend
  bcc +          ; branch if can't subtract; dividend still in X
  tax            ; store new dividend; carry=1 for quotient

+ rol QUOTNT     ; shift carry->quotient (1 for divide, 0 for not)
  pla            ; pull divisor
  lsr a          ; shift divisor right for next subtract
  dey            ; decrement count
  bne -          ; branch to repeat unless count is 0
  rts
An alternate solution -- and this is what most of us ended up doing on the 6502, 65c02, and 65816 universally -- is to generate a pre-calculated table of all values and simply do table lookups. It's honestly the fastest, aside from obvious multiple-of-2 cases (which we know aren't always the case in this scenario). 6502 folks do this all the time since 200+ cycles per multiply is considered extreme (bringing it down to around 90 cycles).

BTW, you could consider using $4202/3 for multiplication, but the end result may be slower than the above options since there's an 8-cycle delay between when $4203 is set and when you get your result in $4216/7. The same applies for division using $4204/5/6, with a 16-cycle delay after $4206 is set to when you can get your result from $4216/7. Most games I know of didn't use these registers because of that reason, but it depends ultimately on how much math you plan on doing. :-)
Post Reply