NeoOne wrote: Fri Oct 10, 2025 9:02 am
The first thing, I think, is that doing it the brute force way (every player bullet against every enemy) is actually good up to a certain number of objects.
Certainly, which is why I'd want to test each approach to see if it actually helps. The homing shots aren't especially numerous by themselves, so it's possible that just doing them naïvely could be optimal.
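For reference, a minimal C sketch of what "naïve" means here: every live shot tested against every live enemy with a plain box overlap check. The `Box` layout, the half-extent representation, and the function names are made up for illustration; this is not the game's actual code, and a real S-CPU or Super FX version would look quite different.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical object layout: centre position plus half-extents, in pixels. */
typedef struct {
    int16_t x, y;    /* centre */
    int16_t hw, hh;  /* half width, half height */
    bool    alive;
} Box;

/* Axis-aligned overlap test on centre/half-extent boxes. */
static bool boxes_overlap(const Box *a, const Box *b)
{
    int dx = a->x - b->x;
    int dy = a->y - b->y;
    if (dx < 0) dx = -dx;
    if (dy < 0) dy = -dy;
    return dx < a->hw + b->hw && dy < a->hh + b->hh;
}

/* Naive pass: every live shot against every live enemy.
 * A shot is consumed by the first enemy it touches.
 * Returns the number of hits. Cost is O(n_shots * n_enemies). */
static size_t collide_brute_force(Box *shots, size_t n_shots,
                                  const Box *enemies, size_t n_enemies)
{
    size_t hits = 0;
    for (size_t i = 0; i < n_shots; i++) {
        if (!shots[i].alive) continue;
        for (size_t j = 0; j < n_enemies; j++) {
            if (!enemies[j].alive) continue;
            if (boxes_overlap(&shots[i], &enemies[j])) {
                shots[i].alive = false;
                hits++;
                break;
            }
        }
    }
    return hits;
}
```

With a handful of homing shots the inner loop is short enough that the bookkeeping for anything smarter could easily cost more than it saves, which is why it needs measuring.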
(Not sure how many registers SNES CPU has? maybe it has zero page?)
I might try doing player bullet collisions on the S-CPU during boss fights, to free up a little more time on the Super FX, but it's the Super FX that's going to be doing the heavy-duty stuff with potentially 2000+ collisions per frame during stages.
The S-CPU has one accumulator and two index registers, and the instruction set is not orthogonal (though it's better than the 6502). You can't do math with the index registers. Direct page (a movable zero page) is generally the fastest way to do something like this.
The Super FX is a very different chip. It doesn't like accessing memory; it has 8-bit ROM and RAM buses with access times of 5 cycles per byte when the core is in high-speed mode. To minimize bus delays, it has a 512-byte instruction cache (no data access, but single-cycle instruction execution) and 16 16-bit general-purpose registers...
...well, kinda. R15 is the program counter, R14 is the ROM buffer pointer, and various other registers have special functions (at least R1, R2, R4, R6, R7, R8, R12, and R13 if I recall correctly) and sometimes restrictions, although you can still use them for math and for storing values. Also, R0 functions as a sort of accumulator, since the RISC-like instruction set is 8-bit and is already extremely cramped just specifying one register for the instructions that need it. The FROM, TO, and WITH prefix instructions are used to specify additional source and destination registers, but they reset after use, and if you don't use the prefixes, SREG and DREG default to R0, making it considerably quicker under some circumstances to do math and logic in or with R0.
I figured that for simple enemy bullets, I could probably just leave the player's position in registers during the entire move/collide/draw/despawn loop. I might make this into two loops, since some bullets require more complicated logic, and even the simplest ones are already overloading the I-cache such that some of the despawn code has to run directly from ROM.
The very simple horizontal shooter I am currently working on (it's really just a demo game I am optimizing) can have up to 80 enemies, 20 player bullets and 120 enemy bullets, and it works (currently) with brute force collision checks at 60fps.
NeoOne wrote: Thu Oct 23, 2025 10:31 am
it can now display (+ collision check) up to 140 enemies, 153 enemy bullets and 21 player bullets.
Very nice, and somewhat encouraging. In my case, though, I kinda need the collisions to be a secondary load, since I'm going to want the vast majority of the Super FX's compute time to go towards rendering bullets, often enemies, and sometimes backdrop elements. Even a Neo Geo doesn't have enough sprites to handle this game in hardware...
On this game - it would be possible for me to check half the player bullets one frame and half the next and no collisions would be missed.
There are probably lots of situations where I could do that, but I really don't want to because it's a port, and I'm trying to make it as accurate as possible. I'm already using the same PRNG, and it would be nice if it ended up somewhat replay-compatible.
I have never thought about Y sorting before, because I thought it would take a lot of time. So how fast can you sort the Y position of all your bullets/enemies (in display lines)?
I haven't tried it yet, and I'm intensely occupied with my job at the moment so I can't spare the brain power. I rather suspect it could be expensive. They're almost sorted, but not quite. I expect colliding with flights before individual bullets may be a better plan, and flights would be much easier to sort. Furthermore, since all the bullets in a flight are at roughly the same Y-coordinate, sorting individual bullets might be a waste of time.
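A rough C sketch of the "flights first" idea, under some assumptions that are mine rather than anything settled: a flight is treated as a group of bullets with one bounding rectangle enclosing all of them, and the `Rect` and `Flight` types and the function names are hypothetical. The target is tested against each flight's bounds, and individual bullets are only visited when that coarse test passes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Rectangle with inclusive minimum and exclusive maximum corners. */
typedef struct { int16_t x0, y0, x1, y1; } Rect;

/* Hypothetical "flight": a group of bullets plus a rectangle that
 * encloses all of them. Keeping bounds up to date is not shown. */
typedef struct {
    Rect        bounds;
    const Rect *bullets;
    size_t      count;
} Flight;

static bool rects_overlap(const Rect *a, const Rect *b)
{
    return a->x0 < b->x1 && b->x0 < a->x1 &&
           a->y0 < b->y1 && b->y0 < a->y1;
}

/* Returns true if the target touches any bullet in any flight.
 * *tests is incremented once per rectangle test, so the saving over
 * testing every bullet individually can be measured. */
static bool hit_by_flights(const Rect *target,
                           const Flight *flights, size_t n_flights,
                           size_t *tests)
{
    for (size_t f = 0; f < n_flights; f++) {
        (*tests)++;
        if (!rects_overlap(target, &flights[f].bounds))
            continue;  /* whole flight rejected with one test */
        for (size_t b = 0; b < flights[f].count; b++) {
            (*tests)++;
            if (rects_overlap(target, &flights[f].bullets[b]))
                return true;
        }
    }
    return false;
}
```

Whether this wins depends on how tight the flight bounds are; a flight that has spread across the whole screen gets no benefit from the coarse test.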
BTW I am currently reading the links you posted. Trying to understand it all. A lot of it is new to me. I like that you just went ahead and coded that 128 object example - while everyone was still talking about it!
Yeah, that was fun. I figured out most of the method while out on a walk. I still remember a particular pickup truck I walked past while thinking about it. I remember initially not liking the grid idea very much, but as you can see it grew on me...
NeoOne wrote: Sat Oct 11, 2025 6:09 am
But then 6502's addressing abilities are not as good and 8 bit values slow it down
NeoOne wrote: Thu Oct 23, 2025 10:31 am
on 68000 you have a MUL instruction to multiply and a DIV instruction to divide but on 6502 you have to write those out all in the little individual steps by adding or subtracting or whatever.
That's the nice part about the 65C816; it can operate in 16-bit mode, has a 16-bit ALU and can use 8-, 16-, or 24-bit addressing. You don't have to construct 16-bit operations out of 8-bit ones; it just takes an extra cycle to load or store a 16-bit value on the 8-bit bus. Plus the actual S-CPU has a bunch of custom bells and whistles, including an MMIO multiplier and divider. I think somebody roughly estimated once that the 3.58 MHz S-CPU is probably equivalent to a 5-6 MHz 68000. Weaker overall than the Mega Drive CPU, but not nearly as much weaker as the clock speed difference suggests.
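To illustrate the difference in C terms (a model only, not 65xx code): an 8-bit CPU has to build a 16-bit add out of two 8-bit adds linked by the carry flag, whereas a 16-bit ALU does it in one operation. The function names are made up for the example.

```c
#include <stdint.h>

/* 16-bit add assembled from two 8-bit adds with carry,
 * roughly what a plain 6502 has to do (low byte, then high byte). */
static uint16_t add16_via_8bit(uint16_t a, uint16_t b)
{
    uint16_t lo    = (uint16_t)((a & 0xFF) + (b & 0xFF)); /* low-byte add */
    uint8_t  carry = (uint8_t)(lo >> 8);                  /* carry out */
    uint8_t  hi    = (uint8_t)((a >> 8) + (b >> 8) + carry);
    return (uint16_t)(((uint16_t)hi << 8) | (lo & 0xFF));
}

/* The same add as a single 16-bit operation. */
static uint16_t add16_native(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}
```

Both give the same result; the point is only that the first version is several instructions on an 8-bit core and the second is one on a CPU with a 16-bit ALU.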
Fun fact: the Sony SPC700 (the 1 MHz 8-bit 6502 knockoff used as a sound CPU in the SNES) has MUL and DIV instructions. The Super FX has several multiply instructions, but it doesn't have a divide instruction; you have to use reciprocal tables.
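A small C model of what "use reciprocal tables" can look like, with the caveat that the table size, the 16-bit fixed-point scale, and the names below are choices made for this sketch and not taken from any real Super FX code. Each divisor's table entry is ceil(65536 / d), and the quotient is the product shifted right by 16. With both operands limited to 8 bits the rounding error stays below one, so the result matches integer division.

```c
#include <stdint.h>

#define RECIP_SHIFT       16
#define RECIP_MAX_DIVISOR 255

/* On real hardware this would be a precomputed table in ROM.
 * Entries are 32-bit here only because the entry for d = 1 is 65536. */
static uint32_t recip_table[RECIP_MAX_DIVISOR + 1];

static void recip_init(void)
{
    recip_table[0] = 0;  /* placeholder: division by zero is not handled */
    for (uint32_t d = 1; d <= RECIP_MAX_DIVISOR; d++) {
        /* ceil(2^16 / d) */
        recip_table[d] = ((1u << RECIP_SHIFT) + d - 1) / d;
    }
}

/* n / d as a table lookup, one multiply and one shift. */
static uint8_t div_u8(uint8_t n, uint8_t d)
{
    return (uint8_t)(((uint32_t)n * recip_table[d]) >> RECIP_SHIFT);
}
```

Wider operands need a wider scale or a correction step, otherwise the rounding in the table entry starts to leak into the quotient.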
aa-dav wrote: Sat Oct 18, 2025 7:05 pm
What is RISC really about? It's about pipeline.
Oddly enough, the 65C816 and the Super FX (usually considered RISC) have the same pipeline length (and width). One byte, just enough for an opcode.
The major difference is that the 65C816 discards the pipelined byte if a branch is taken, and the Super FX doesn't. This means that the Super FX kinda has branch prediction, but it's up to the programmer rather than being automatic at runtime.
Pokun wrote: Thu Oct 23, 2025 2:51 pm
In hindsight it should have been called Nintendo 32!
The PlayStation would have crushed it even harder than it did, coming out a year and a half late with a me-too name like that...
Well, since somebody brought up the Nintendo 64, I've got to say it again: I wish they'd done that respin to fix the memory interface bug. While they were at it, they might have realized that the additive blend mode was useless without clamping, and possibly even noticed the off-by-one multitexture bug. The CPU being too powerful was the least of its issues, and would have been at least partly solved by faster memory access (the RDRAM's data rate was faster than every piece of RAM in the PlayStation combined; the issue was latency, and if that article is right, a design error in the N64's chipset may have been largely to blame).
Pokun wrote: Sat Oct 25, 2025 2:01 pm
Doesn't the SNES CPU run two bytes per clock on the 8-bit data-bus effectively making it 16-bit?
I wish. Loading or storing a 16-bit value takes two cycles. 24-bit addresses as operands take 3 cycles. It's an 8-bit bus.
In fact, it's worse than that. The 6502 has this weird quirk whereby only half of a memory access cycle counts as a memory access, meaning the memory has to be twice as fast to keep up. This quirk seems to have been removed from the PC Engine's HuC6280, resulting in the ability to run at 7.16 MHz on fast ROM or RAM. Unfortunately the licensed 65C816 core in the Ricoh 5A22 (the S-CPU) was not given this treatment, so memory that should be capable of over 8 MHz is only good enough for 3.58 MHz, and memory that should be capable of 5 MHz is only good enough for 2.68 MHz (it appears they were at least smart enough to only extend the actual memory access part of the cycle to deal with slow memory; the dead half is always 3 master clocks regardless).
It does (or rather can, and sometimes does) access the bus every cycle, as opposed to every 4 cycles for a 68000, which means that despite the Mega Drive's 16-bit bus and much higher clock speed vs. the SNES, the actual peak bus throughput is almost identical. (Plus the S-CPU doesn't know what a wait state is, so nothing else in the system can prevent it from using the bus, something that is not true of the 68000.)
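The arithmetic behind "almost identical", as a back-of-the-envelope C sketch. The exact clock figures are the usual NTSC numbers filled in for the example (about 3.58 MHz for the S-CPU on fast memory, about 7.67 MHz for the Mega Drive's 68000); the helper name is made up, and this is theoretical peak only, ignoring refresh, DMA, and internal cycles.

```c
/* Assumed NTSC clock rates in Hz (supplied for this example). */
#define SNES_FAST_CPU_HZ 3579545.0  /* S-CPU on fast memory */
#define MD_68000_HZ      7670454.0  /* Mega Drive 68000 */

/* Peak bus throughput in bytes per second. */
static double peak_bytes_per_second(double clock_hz,
                                    double clocks_per_access,
                                    double bytes_per_access)
{
    return clock_hz / clocks_per_access * bytes_per_access;
}

/* S-CPU: one byte per cycle at best. */
static double snes_peak(void)
{
    return peak_bytes_per_second(SNES_FAST_CPU_HZ, 1.0, 1.0);
}

/* 68000: one 16-bit word every 4 clocks at best. */
static double megadrive_peak(void)
{
    return peak_bytes_per_second(MD_68000_HZ, 4.0, 2.0);
}
```

That works out to roughly 3.58 MB/s against roughly 3.84 MB/s, a difference of about 7% in the Mega Drive's favour.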
SNES DMA is also basically the same speed as the Mega Drive (at least in H32 mode, and H40 mode is only moderately faster) because the Mega Drive uses 8-bit VRAM, rendering half the bus useless when loading VRAM (the main task of DMA on both systems). It's genuinely weird how close these two systems' specs ended up, given how different all the design decisions were.