discarded low bit of high byte of multiplier in 16-bit multiply

Discussion of hardware and software development for Super NES and Super Famicom. See the SNESdev wiki for more information.

Moderator: Moderators

Forum rules
  • For making cartridges of your Super NES games, see Reproduction.
turboxray
Posts: 348
Joined: Thu Oct 31, 2019 12:56 am

Re: discarded low bit of high byte of multiplier in 16-bit multiply

Post by turboxray »

93143 wrote: Sun Jun 05, 2022 5:27 pm I made a couple of optimizations. It's now 73 or 75 fast cycles and 18 slow cycles, for a total of 91 or 93 cycles total, or 582 to 594 master cycles:
Nice! Just a nit pick, but shouldn't that be "97 or 99" fast cycles since you can't just add slow and fast cycles.
93143
Posts: 1718
Joined: Fri Jul 04, 2014 9:31 pm

Re: discarded low bit of high byte of multiplier in 16-bit multiply

Post by 93143 »

rainwarrior wrote: Sun Jun 05, 2022 7:15 pmI put it in a Mode 7 demo I'm working on, and it seems to be working perfectly. (Running it on my SNES right now, as FastROM.)
Awesome.
Thanks for writing this, I was actually just about to need to write basically the same thing myself, and this has been very convenient timing for me. :)
Well, that's nice to hear. You're welcome!
Maybe a test ROM that compares all results against a reference implementation would be a good idea...
You mean a reference implementation like the one earlier in the thread? (Initially I was thinking of a reference table from an offline calculation, but even the MSU1 is too small for such a thing...)

If you literally mean all results, that would take a while. Just my code alone would take ~33 hours to run 2^32 times. With a slower reference implementation checking each result, it could easily take a week. Perhaps it would be better to break it up into regions, and have the user able to select a region to test...

A bunch of random numbers, plus the obvious limiting cases, would be quicker to run. Perhaps that would be good enough - it would certainly be easier to rerun the test if a problem crops up, or if an optimization occurs to somebody...
turboxray wrote: Sun Jun 05, 2022 8:22 pm
93143 wrote: Sun Jun 05, 2022 5:27 pm I made a couple of optimizations. It's now 73 or 75 fast cycles and 18 slow cycles, for a total of 91 or 93 cycles total, or 582 to 594 master cycles:
Nice! Just a nit pick, but shouldn't that be "97 or 99" fast cycles since you can't just add slow and fast cycles.
The reason I did that is because we were talking about 6502 code upthread, and the Commodore 64 doesn't know about Fast and Slow cycles, or even XSlow cycles, because all its cycles are USlow. So they just quote their program lengths in cycles, and I followed suit, figuring it was a decent way to benchmark against other candidate implementations. I also provided the master cycle count as a measure of absolute program time.

In principle you could run this in SlowROM, and the cycles would be mostly Slow (not all - I think there would still be 15 or 16 fast cycles in there)...
User avatar
rainwarrior
Posts: 8735
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: discarded low bit of high byte of multiplier in 16-bit multiply

Post by rainwarrior »

Well, I wrote a quick test. Quick as in quick to code, not to execute. ;)

ROM: multest_mul16.sfc
Source: https://github.com/bbbradsmith/SNES_stu ... st_mul16.s

It increments one operand by a large prime each time, and the other it just goes through all 2^16 in order. After 2^32 tests it should print out "Pass!" but I think this will take about 110 hours to find out. Not sure if I wanna leave my SNES on for that long, but I'll leave it overnight for now.

Snes9x can probably get through it in fast forward in about an hour.

(Not that I really think there's going to be one special number out there where it'll fail. I just wanted to run a test for the sake of it, and if it's reasonable to do all inputs maybe I'll let it.)
Last edited by rainwarrior on Mon Jun 13, 2022 11:08 pm, edited 1 time in total.
93143
Posts: 1718
Joined: Fri Jul 04, 2014 9:31 pm

Re: discarded low bit of high byte of multiplier in 16-bit multiply

Post by 93143 »

I wrote it assuming that the 8 cycles I needed to wait for the result are CPU cycles, of whatever length they end up being. If you have to wait 64 master cycles, the low*high read is too soon and could fail for large inputs, but I'm pretty sure that's not the case.

Other than that, there might just be a bug that we've somehow missed during inspection and testing. I doubt there is, but if this method is to be published as an allegedly-working code sample, it's good to be as certain as possible...
User avatar
rainwarrior
Posts: 8735
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: discarded low bit of high byte of multiplier in 16-bit multiply

Post by rainwarrior »

Yeah, I made sure the test was FastROM for that reason as well. I had read that it was based on the CPU cycle count and not master cycles, so it should be fine in theory, but there's not much harm in giving it a thorough spin.
User avatar
rainwarrior
Posts: 8735
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: discarded low bit of high byte of multiplier in 16-bit multiply

Post by rainwarrior »

Snes9x made it through all cases with a pass. I think that confirms it's logically correct.

I stopped the SNES hardware test, but I think a few hours is probably far more than enough to confirm there isn't a timing issue.
creaothceann
Posts: 611
Joined: Mon Jan 23, 2006 7:47 am
Location: Germany
Contact:

Re: discarded low bit of high byte of multiplier in 16-bit multiply

Post by creaothceann »

How long would the bsnes core need to run?
My current setup:
Super Famicom ("2/1/3" SNS-CPU-GPM-02) → SCART → OSSC → StarTech USB3HDCAP → AmaRecTV 3.10
User avatar
rainwarrior
Posts: 8735
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: discarded low bit of high byte of multiplier in 16-bit multiply

Post by rainwarrior »

I dunno, you'd have to run it on your machine, put it in fast forward, and see how quickly the counter moves. Time it and multiply by 65536 to find out.
93143
Posts: 1718
Joined: Fri Jul 04, 2014 9:31 pm

Re: discarded low bit of high byte of multiplier in 16-bit multiply

Post by 93143 »

rainwarrior wrote: Mon Jun 06, 2022 10:54 am Snes9x made it through all cases with a pass. I think that confirms it's logically correct.

I stopped the SNES hardware test, but I think a few hours is probably far more than enough to confirm there isn't a timing issue.
Excellent. I've had it running on my SNES for a while now, and it still looks good. I think we can say with reasonable confidence that it works.

I don't use my SNES much during the week. Is there any hardware reason I shouldn't just leave it on?
User avatar
rainwarrior
Posts: 8735
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: discarded low bit of high byte of multiplier in 16-bit multiply

Post by rainwarrior »

If you don't mind the SNES being on for a long while, I don't think there's any other reason not to run it to completion. I just don't usually like leaving my vintage stuff on for days at a time.
93143
Posts: 1718
Joined: Fri Jul 04, 2014 9:31 pm

Re: discarded low bit of high byte of multiplier in 16-bit multiply

Post by 93143 »

All right, I'll try it.
93143
Posts: 1718
Joined: Fri Jul 04, 2014 9:31 pm

Re: discarded low bit of high byte of multiplier in 16-bit multiply

Post by 93143 »

One more tweak, based on the code here. (In fact the two methods are almost identical - I swear I didn't know that page existed...) This routine now takes 71 or 73 fast cycles and 17 slow cycles, totaling 562 or 574 master cycles.

Code: Select all

	ldx x0		; 2 fast, 1 slow - pick up low byte of multiplicand x
	stx $4202	; 4 fast - write low byte of x to non-triggering multiplicand input
	ldy y0		; 2 fast, 1 slow - pick up low byte of multiplicand y
	sty $4203	; 4 fast - write low byte of y to triggering input (low*low multiplication starts now)
	ldx y1		; 2 fast, 1 slow - pick up high byte of multiplicand y
	stz prod+2	; 2 fast, 2 slow - zero upper word of product
	lda $4216	; 5 fast - read result of low*low multiplication
	stx $4203	; 4 fast - write high byte of y to triggering input (low*high starts now)
	sta prod	; 2 fast, 2 slow - store result in low word of product
	ldx x1		; 2 fast, 1 slow - pick up high byte of multiplicand x
	lda $4216	; 5 fast - read result of low*high multiplication
	stx $4202	; 4 fast - write high byte of x to non-triggering input
	sty $4203	; 4 fast - write low byte of y to triggering input (high*low starts now)
	clc		; 2 fast - clear carry
	adc prod+1	; 2 fast, 2 slow - add result with high byte of previous result (cannot set carry)
	ldy y1		; 2 fast, 1 slow - pick up high byte of multiplicand y
	adc $4216	; 5 fast - add result of high*low multiplication to previous result (may set carry)
	sty $4203	; 4 fast - write high byte of y to triggering input (high*high starts now)
	sta prod+1	; 2 fast, 2 slow - store result in middle word of product
	lda prod+2	; 2 fast, 2 slow - load top word of product, including top byte of previous result
	bcc cc\@	; 2 or 3 fast - branch if carry clear
	adc #$00FF	; 3 fast - increment top byte of result (clears carry)
cc\@:	adc $4216	; 5 fast - add current result to result of high*high multiplication (cannot set carry)
	sta prod+2	; 2 fast, 2 slow - store result to top word of product
Pro Tip: Do not name variables after the registers you plan to load them into. It can blind you to simple optimizations like this one. Basically, if you load y1 into X, you can leave y0 in Y and save three cycles. I did have to start the low*high earlier to leave enough time for the calculation, but this doesn't affect anything else.

The relevant differences between their method and mine seem to be:

1) their method assumes X/Y are 16-bit before and after, which adds 6 fast cycles to twiddle the X bit.
2) their method seems to use .w on a couple of instructions, adding two fast cycles for (apparently) busywaiting purposes. This seems to be due to subtle differences in instruction order, which without the extra cycles would not leave enough time to let the multiplications complete.
3) their method is a subroutine rather than a macro, resulting in an extra 8 fast and 4 slow cycles for jsr and rts.

Come to think of it, perhaps the RAM variables in my method should be macro parameters; it would make it more suitable for batch processing and better justify the lack of sep/rep activity...
User avatar
rainwarrior
Posts: 8735
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: discarded low bit of high byte of multiplier in 16-bit multiply

Post by rainwarrior »

Post Reply