Nice! Just a nit pick, but shouldn't that be "97 or 99" fast cycles since you can't just add slow and fast cycles.
discarded low bit of high byte of multiplier in 16-bit multiply
Moderator: Moderators
Forum rules
- For making cartridges of your Super NES games, see Reproduction.
Re: discarded low bit of high byte of multiplier in 16-bit multiply
Re: discarded low bit of high byte of multiplier in 16-bit multiply
Awesome.rainwarrior wrote: ↑Sun Jun 05, 2022 7:15 pmI put it in a Mode 7 demo I'm working on, and it seems to be working perfectly. (Running it on my SNES right now, as FastROM.)
Well, that's nice to hear. You're welcome!Thanks for writing this, I was actually just about to need to write basically the same thing myself, and this has been very convenient timing for me.
You mean a reference implementation like the one earlier in the thread? (Initially I was thinking of a reference table from an offline calculation, but even the MSU1 is too small for such a thing...)Maybe a test ROM that compares all results against a reference implementation would be a good idea...
If you literally mean all results, that would take a while. Just my code alone would take ~33 hours to run 2^32 times. With a slower reference implementation checking each result, it could easily take a week. Perhaps it would be better to break it up into regions, and have the user able to select a region to test...
A bunch of random numbers, plus the obvious limiting cases, would be quicker to run. Perhaps that would be good enough - it would certainly be easier to rerun the test if a problem crops up, or if an optimization occurs to somebody...
The reason I did that is because we were talking about 6502 code upthread, and the Commodore 64 doesn't know about Fast and Slow cycles, or even XSlow cycles, because all its cycles are USlow. So they just quote their program lengths in cycles, and I followed suit, figuring it was a decent way to benchmark against other candidate implementations. I also provided the master cycle count as a measure of absolute program time.
In principle you could run this in SlowROM, and the cycles would be mostly Slow (not all - I think there would still be 15 or 16 fast cycles in there)...
- rainwarrior
- Posts: 8734
- Joined: Sun Jan 22, 2012 12:03 pm
- Location: Canada
- Contact:
Re: discarded low bit of high byte of multiplier in 16-bit multiply
Well, I wrote a quick test. Quick as in quick to code, not to execute.
ROM: multest_mul16.sfc
Source: https://github.com/bbbradsmith/SNES_stu ... st_mul16.s
It increments one operand by a large prime each time, and the other it just goes through all 2^16 in order. After 2^32 tests it should print out "Pass!" but I think this will take about 110 hours to find out. Not sure if I wanna leave my SNES on for that long, but I'll leave it overnight for now.
Snes9x can probably get through it in fast forward in about an hour.
(Not that I really think there's going to be one special number out there where it'll fail. I just wanted to run a test for the sake of it, and if it's reasonable to do all inputs maybe I'll let it.)
ROM: multest_mul16.sfc
Source: https://github.com/bbbradsmith/SNES_stu ... st_mul16.s
It increments one operand by a large prime each time, and the other it just goes through all 2^16 in order. After 2^32 tests it should print out "Pass!" but I think this will take about 110 hours to find out. Not sure if I wanna leave my SNES on for that long, but I'll leave it overnight for now.
Snes9x can probably get through it in fast forward in about an hour.
(Not that I really think there's going to be one special number out there where it'll fail. I just wanted to run a test for the sake of it, and if it's reasonable to do all inputs maybe I'll let it.)
Last edited by rainwarrior on Mon Jun 13, 2022 11:08 pm, edited 1 time in total.
Re: discarded low bit of high byte of multiplier in 16-bit multiply
I wrote it assuming that the 8 cycles I needed to wait for the result are CPU cycles, of whatever length they end up being. If you have to wait 64 master cycles, the low*high read is too soon and could fail for large inputs, but I'm pretty sure that's not the case.
Other than that, there might just be a bug that we've somehow missed during inspection and testing. I doubt there is, but if this method is to be published as an allegedly-working code sample, it's good to be as certain as possible...
Other than that, there might just be a bug that we've somehow missed during inspection and testing. I doubt there is, but if this method is to be published as an allegedly-working code sample, it's good to be as certain as possible...
- rainwarrior
- Posts: 8734
- Joined: Sun Jan 22, 2012 12:03 pm
- Location: Canada
- Contact:
Re: discarded low bit of high byte of multiplier in 16-bit multiply
Yeah, I made sure the test was FastROM for that reason as well. I had read that it was based on the CPU cycle count and not master cycles, so it should be fine in theory, but there's not much harm in giving it a thorough spin.
- rainwarrior
- Posts: 8734
- Joined: Sun Jan 22, 2012 12:03 pm
- Location: Canada
- Contact:
Re: discarded low bit of high byte of multiplier in 16-bit multiply
Snes9x made it through all cases with a pass. I think that confirms it's logically correct.
I stopped the SNES hardware test, but I think a few hours is probably far more than enough to confirm there isn't a timing issue.
I stopped the SNES hardware test, but I think a few hours is probably far more than enough to confirm there isn't a timing issue.
-
- Posts: 611
- Joined: Mon Jan 23, 2006 7:47 am
- Location: Germany
- Contact:
Re: discarded low bit of high byte of multiplier in 16-bit multiply
How long would the bsnes core need to run?
My current setup:
Super Famicom ("2/1/3" SNS-CPU-GPM-02) → SCART → OSSC → StarTech USB3HDCAP → AmaRecTV 3.10
Super Famicom ("2/1/3" SNS-CPU-GPM-02) → SCART → OSSC → StarTech USB3HDCAP → AmaRecTV 3.10
- rainwarrior
- Posts: 8734
- Joined: Sun Jan 22, 2012 12:03 pm
- Location: Canada
- Contact:
Re: discarded low bit of high byte of multiplier in 16-bit multiply
I dunno, you'd have to run it on your machine, put it in fast forward, and see how quickly the counter moves. Time it and multiply by 65536 to find out.
Re: discarded low bit of high byte of multiplier in 16-bit multiply
Excellent. I've had it running on my SNES for a while now, and it still looks good. I think we can say with reasonable confidence that it works.rainwarrior wrote: ↑Mon Jun 06, 2022 10:54 am Snes9x made it through all cases with a pass. I think that confirms it's logically correct.
I stopped the SNES hardware test, but I think a few hours is probably far more than enough to confirm there isn't a timing issue.
I don't use my SNES much during the week. Is there any hardware reason I shouldn't just leave it on?
- rainwarrior
- Posts: 8734
- Joined: Sun Jan 22, 2012 12:03 pm
- Location: Canada
- Contact:
Re: discarded low bit of high byte of multiplier in 16-bit multiply
If you don't mind the SNES being on for a long while, I don't think there's any other reason not to run it to completion. I just don't usually like leaving my vintage stuff on for days at a time.
Re: discarded low bit of high byte of multiplier in 16-bit multiply
All right, I'll try it.
Re: discarded low bit of high byte of multiplier in 16-bit multiply
One more tweak, based on the code here. (In fact the two methods are almost identical - I swear I didn't know that page existed...) This routine now takes 71 or 73 fast cycles and 17 slow cycles, totaling 562 or 574 master cycles.
Pro Tip: Do not name variables after the registers you plan to load them into. It can blind you to simple optimizations like this one. Basically, if you load y1 into X, you can leave y0 in Y and save three cycles. I did have to start the low*high earlier to leave enough time for the calculation, but this doesn't affect anything else.
The relevant differences between their method and mine seem to be:
1) their method assumes X/Y are 16-bit before and after, which adds 6 fast cycles to twiddle the X bit.
2) their method seems to use .w on a couple of instructions, adding two fast cycles for (apparently) busywaiting purposes. This seems to be due to subtle differences in instruction order, which without the extra cycles would not leave enough time to let the multiplications complete.
3) their method is a subroutine rather than a macro, resulting in an extra 8 fast and 4 slow cycles for jsr and rts.
Come to think of it, perhaps the RAM variables in my method should be macro parameters; it would make it more suitable for batch processing and better justify the lack of sep/rep activity...
Code: Select all
ldx x0 ; 2 fast, 1 slow - pick up low byte of multiplicand x
stx $4202 ; 4 fast - write low byte of x to non-triggering multiplicand input
ldy y0 ; 2 fast, 1 slow - pick up low byte of multiplicand y
sty $4203 ; 4 fast - write low byte of y to triggering input (low*low multiplication starts now)
ldx y1 ; 2 fast, 1 slow - pick up high byte of multiplicand y
stz prod+2 ; 2 fast, 2 slow - zero upper word of product
lda $4216 ; 5 fast - read result of low*low multiplication
stx $4203 ; 4 fast - write high byte of y to triggering input (low*high starts now)
sta prod ; 2 fast, 2 slow - store result in low word of product
ldx x1 ; 2 fast, 1 slow - pick up high byte of multiplicand x
lda $4216 ; 5 fast - read result of low*high multiplication
stx $4202 ; 4 fast - write high byte of x to non-triggering input
sty $4203 ; 4 fast - write low byte of y to triggering input (high*low starts now)
clc ; 2 fast - clear carry
adc prod+1 ; 2 fast, 2 slow - add result with high byte of previous result (cannot set carry)
ldy y1 ; 2 fast, 1 slow - pick up high byte of multiplicand y
adc $4216 ; 5 fast - add result of high*low multiplication to previous result (may set carry)
sty $4203 ; 4 fast - write high byte of y to triggering input (high*high starts now)
sta prod+1 ; 2 fast, 2 slow - store result in middle word of product
lda prod+2 ; 2 fast, 2 slow - load top word of product, including top byte of previous result
bcc cc\@ ; 2 or 3 fast - branch if carry clear
adc #$00FF ; 3 fast - increment top byte of result (clears carry)
cc\@: adc $4216 ; 5 fast - add current result to result of high*high multiplication (cannot set carry)
sta prod+2 ; 2 fast, 2 slow - store result to top word of product
The relevant differences between their method and mine seem to be:
1) their method assumes X/Y are 16-bit before and after, which adds 6 fast cycles to twiddle the X bit.
2) their method seems to use .w on a couple of instructions, adding two fast cycles for (apparently) busywaiting purposes. This seems to be due to subtle differences in instruction order, which without the extra cycles would not leave enough time to let the multiplications complete.
3) their method is a subroutine rather than a macro, resulting in an extra 8 fast and 4 slow cycles for jsr and rts.
Come to think of it, perhaps the RAM variables in my method should be macro parameters; it would make it more suitable for batch processing and better justify the lack of sep/rep activity...
- rainwarrior
- Posts: 8734
- Joined: Sun Jan 22, 2012 12:03 pm
- Location: Canada
- Contact:
Re: discarded low bit of high byte of multiplier in 16-bit multiply
Okay, I've updated the test ROM:
https://github.com/bbbradsmith/SNES_stu ... _mul16.sfc
https://github.com/bbbradsmith/SNES_stu ... _mul16.sfc