psycopathicteen wrote: I know a little speed trick that can be done on the 65816 but not on a 6502 or 68000. I like to move the direct page to the object slot of the object to be processed. This way I can easily reach everything in the object slot, and when I have two identical objects, it tricks the CPU into thinking it's writing into the same registers when it isn't.
Isn't that the same as base+constant (4,A4) addressing on a 68000? Or is that slow due to an extra memory access? In any case, direct page and absolute,X addressing are the same speed on a 65816 if the direct page isn't 256-byte aligned and the absolute address is page-aligned.
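For anyone who wants to poke at the direct page trick without an assembler, here is a rough model of it in Python rather than 65816 assembly. The slot size, field offsets and function names are all made up for illustration; the only thing taken from the post is that a direct page operand is a small offset added to the D register, so the same routine reaches whichever object D currently points at.

```python
# Toy model of 65816 direct page addressing: effective address = D + offset.
# SLOT_SIZE, OBJ_X, OBJ_XVEL and the helper names are hypothetical.
SLOT_SIZE = 32      # bytes per object slot (assumed)
OBJ_X = 0x00        # 16-bit X position inside a slot (assumed layout)
OBJ_XVEL = 0x02     # 16-bit X velocity inside a slot (assumed layout)


def read16(ram, d, offset):
    """Read a little-endian word at direct page offset `offset`."""
    addr = (d + offset) & 0xFFFF
    return ram[addr] | (ram[(addr + 1) & 0xFFFF] << 8)


def write16(ram, d, offset, value):
    """Write a little-endian word at direct page offset `offset`."""
    addr = (d + offset) & 0xFFFF
    ram[addr] = value & 0xFF
    ram[(addr + 1) & 0xFFFF] = (value >> 8) & 0xFF


def move_object(ram, d):
    """The same 'code' runs for every object; only D changes between calls."""
    x = read16(ram, d, OBJ_X)
    xvel = read16(ram, d, OBJ_XVEL)
    write16(ram, d, OBJ_X, (x + xvel) & 0xFFFF)


ram = bytearray(0x10000)
slots = [0x0800 + i * SLOT_SIZE for i in range(2)]   # two identical objects
for d, (x, v) in zip(slots, [(100, 3), (200, 0xFFFE)]):
    write16(ram, d, OBJ_X, x)
    write16(ram, d, OBJ_XVEL, v)

for d in slots:      # stands in for loading D with the slot address
    move_object(ram, d)

positions = [read16(ram, d, OBJ_X) for d in slots]
```

The routine never mentions a slot address; pointing D at a different slot is the whole trick.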
Thunder Force 4 = overhyped
-
psycopathicteen
- Posts: 3001
- Joined: Wed May 19, 2010 6:12 pm
Today's secret word please, Conky: gencycle
The Super NES dot rate is the same as that of the NES: 3/2 Fsc, or 945/176 = 5.369 MHz. The master clock rate is four times that, and cycles take six or eight master clocks depending on whether they access slow memory (RAM or slow ROM); fast ROM and internal operation cycles always take six clocks. This gives an effective CPU rate somewhere between 2.7 and 3.6 MHz.
Because of the 68000's state machine, we can consider the Genesis to have a master clock of 7.67 MHz and actually run at 1.92 MHz, where each cycle of the internal state machine takes 4 master clocks. But how exactly is the 7.67 MHz rate related to the pixel clock? It's slightly faster than the Amiga and TG16 clock speed of 2*Fsc = 7.159 MHz. The dot rate in 256px mode is the same as the NES and Super NES, and the dot rate in 320px mode is 5/4 that: 6.712 MHz. Is the 68000 clock supposed to equal 8/7 of this 320px dot rate, which would equal 10/7 times the 256px dot rate or 15/7*Fsc?
To derive a clock speed useful for comparison between the Super NES and Genesis, I derive an abstract unit called the gencycle (AAAAAAAAA!), which is 1/2 of a fast (6 clock) cycle (that is, 7.159 MHz) or 1/3 of a slow (8 clock) cycle (that is, 8.054 MHz). Over the long run, the average period of a gencycle should be close to that of a Genesis master clock, allowing 65816 instruction timings to be quoted in units nearly commensurate with 68000 timings.
This 12-clock 68000 instruction corresponds to 10 or 12 65816 gencycles, as seen here:
Ordinary direct page instruction: 10 gc
- opcode fetch: 2 gc (fast ROM)
- offset fetch: 2 gc (fast ROM)
- data low: 3 gc (slow RAM)
- data high: 3 gc (slow RAM) (if 16-bit M/X)

- opcode fetch: 2 gc (fast ROM)
- offset fetch: 2 gc (fast ROM)
- address generation: 2 gc (internal)
- data low: 3 gc (slow RAM)
- data high: 3 gc (slow RAM) (if 16-bit M/X)

- opcode fetch: 2 gc (fast ROM)
- address fetch low: 2 gc (fast ROM)
- address fetch high: 2 gc (fast ROM)
- data low byte: 3 gc (slow RAM)
- data high byte: 3 gc (slow RAM) (if 16-bit M/X)
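The arithmetic above can be checked with a few lines of Python. This is only a sketch of the bookkeeping: the clock ratios and the 2/2/3 gencycle costs come from the post, while the labels on the second and third access sequences are my guess at what the lists describe, and all names are made up.

```python
from fractions import Fraction

FSC = Fraction(315, 88)             # NTSC color subcarrier in MHz
DOT_256 = Fraction(3, 2) * FSC      # 945/176 MHz, shared by NES and Super NES
SNES_MASTER = 4 * DOT_256           # four times the dot rate
FAST_CLOCKS, SLOW_CLOCKS = 6, 8     # master clocks per CPU cycle

# A gencycle is 1/2 of a fast cycle or 1/3 of a slow cycle.
fast_gc_rate = SNES_MASTER / FAST_CLOCKS * 2
slow_gc_rate = SNES_MASTER / SLOW_CLOCKS * 3

# The rate asked about in the post: 8/7 of the 320px dot rate.
dot_320 = Fraction(5, 4) * DOT_256
candidate_68000_clock = Fraction(8, 7) * dot_320

# Gencycles per access type, as used in the lists above.
GC = {"fast": 2, "internal": 2, "slow": 3}

# Access sequences for a 16-bit operand; the second and third labels are assumed.
sequences = {
    "direct page": ["fast", "fast", "slow", "slow"],
    "direct page with address generation": ["fast", "fast", "internal", "slow", "slow"],
    "absolute": ["fast", "fast", "fast", "slow", "slow"],
}
totals = {name: sum(GC[a] for a in seq) for name, seq in sequences.items()}

for name, gc in totals.items():
    print(f"{name}: {gc} gc")
print(f"gencycle rate: {float(fast_gc_rate):.3f} to {float(slow_gc_rate):.3f} MHz")
print(f"8/7 of 320px dot rate: {float(candidate_68000_clock):.3f} MHz")
```

Using exact fractions keeps the 15/7*Fsc identity from being blurred by rounding.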
Last edited by tepples on Fri Apr 06, 2012 6:49 pm, edited 1 time in total.
psycopathicteen wrote: A little while ago I attempted to find what was causing slowdown in Gradius 3, and I found the oam clearing routine, and the way they dealt with the hi-oam was inefficient. Gradius 3 shuffles back and forth between the low-oam and high-oam, one sprite at a time. And in the case of the SNES it is arranged in a not very convenient way (one bit of the X coordinate and the size bit are packed with those of three other sprites).
See what I mean? It sounds like they wasted CPU cycles, causing more slowdown than was needed to achieve what had to be done. Here is a fun challenge, since you might still have notes: did you try rewriting the routine in optimized form and seeing what sort of performance boost you could get? That would be an impressive patch if you could significantly reduce slowdown in Gradius 3.
The way I manage the high-oam is when I am writing a sprite to the oam, I "AND #$01ff" the x-coordinate, and "ORA #$0200" for big sprites. Then I store the result in both the oam buffer, and a second list that is 544 bytes after the oam buffer. Then I overwrite the high-byte of the x-coordinate with the y-coordinate, and overwrite the high-byte of the y-coordinate with the attribute word. After all the game logic has been calculated, I use the table of 16-bit x-coordinates, and build the high-oam with it.
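Here is a minimal sketch of the last step, building the 32-byte high-oam from the table of 16-bit x words, written in Python for readability rather than as 65816 code. It assumes the usual high OAM layout (two bits per sprite, four sprites per byte, bit 0 = X bit 8, bit 1 = size); the function names are invented, and the 544-byte offset and the byte-overwriting part of the scheme are not modeled.

```python
def pack_x_word(x, big):
    """Mimic 'AND #$01ff' plus 'ORA #$0200' for big sprites."""
    word = x & 0x01FF
    if big:
        word |= 0x0200
    return word


def build_high_oam(x_words):
    """Pack bits 8-9 of each of the 128 x words into the 32-byte high table."""
    high = bytearray(32)
    for n, word in enumerate(x_words):
        two_bits = (word >> 8) & 0x03          # bit 0: X bit 8, bit 1: size
        high[n >> 2] |= two_bits << ((n & 3) * 2)
    return high


x_words = [0] * 128
x_words[0] = pack_x_word(-8, big=False)    # partly off the left edge, small
x_words[1] = pack_x_word(16, big=True)     # on screen, big
x_words[5] = pack_x_word(300, big=True)    # X bit 8 set, big
high = build_high_oam(x_words)
```

Because each sprite's two bits already sit together in bits 8-9 of its x word, the packing loop only needs a shift and an OR per sprite.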
Keep in mind though I think Gradius III was Konami's first Super Famicom game. It is not that strange that it would not be coded very efficiently.
http://www.emulationzone.org/consoles/snes/tech.htm
More idiots passing off incorrect technical information. They mentioned the cpu being "slow" 5 times, and wrote "slow" in all caps 4 of the 5 times.
Yeah, that page is horrid. Have you tried e-mailing corrections to the maintainer? Say the presence of DMA to VRAM is equivalent to Blast Processing, and my gencycle theory might help alleviate the capitalized SLOW-ness. Apparently the most current e-mail address is wacko413 at Hotmail.
The "64K at a time" part appears to have something to do with the data bank register, which is set with the PLB instruction. On the 68000, on the other hand, a pointer fits in a single register.
Last edited by tepples on Mon May 21, 2012 11:57 am, edited 1 time in total.
psycopathicteen wrote: The way I manage the high-oam is when I am writing a sprite to the oam, I "AND #$01ff" the x-coordinate, and "ORA #$0200" for big sprites. Then I store the result in both the oam buffer, and a second list that is 544 bytes after the oam buffer. Then I overwrite the high-byte of the x-coordinate with the y-coordinate, and overwrite the high-byte of the y-coordinate with the attribute word. After all the game logic has been calculated, I use the table of 16-bit x-coordinates, and build the high-oam with it.
In other words, you emulate Metal Combat's OBC1 in software. I'm using this too, but I've discovered that the 32-byte buffer can be generated overlapping the 512-byte buffer.
Meh, since this thread was bumped I may as well read it... and wow, there are so many problems here.
Also: memory accesses on the 68000 are probably the worst part of it. They're so pathetically slow I'd compare them to cache misses on modern CPUs, except you can't work around it. Avoid memory accesses at all costs, no matter what. Luckily there are enough registers that in practice RAM will pretty much get used only to store permanent state rather than variables in a subroutine, but yeah...
The biggest advantages of the 68000 are 1) having a much easier time manipulating large numbers (this is where most of the important optimizations come in) and 2) being a tad easier to program for. That said, I think people are underestimating the amount of time available in a frame: unless you're being wasteful, a frame is usually plenty of time for a game (my code tends to stay around vblank duration even when I'm sloppy...). This goes for both the 68000 and the 65816.
tepples wrote: But to what extent do direct MOVEs, adds and subtracts with a memory destination, hardware MULs and DIVs, and 32-bit addition and subtraction give 68000 the edge?
Hardware MUL and DIV don't give any edge at all, since they're so pathetically slow that nobody in their right mind would use them (and if you have raster effects, you outright can't use them, because they'll delay the interrupt and you risk missing the timing window; DIVU/DIVS can delay it by almost a third of a scanline).
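A quick back-of-the-envelope check of the "almost a third of a scanline" remark, as a Python sketch. The worst-case cycle counts and the figure of 3420 master clocks per scanline with the CPU clocked at master/7 are recalled from commonly cited 68000 and Mega Drive timing references, not from this thread, so treat every constant here as an assumption; effective address time is ignored.

```python
# Assumed worst-case 68000 cycle counts (register operands, no EA time).
WORST_CASE_CYCLES = {"MULU": 70, "MULS": 70, "DIVU": 140, "DIVS": 158}

MCLK_PER_LINE = 3420    # assumed master clocks per scanline
M68K_DIVIDER = 7        # assumed: 68000 clock = master clock / 7

cycles_per_line = MCLK_PER_LINE / M68K_DIVIDER

# Fraction of one scanline during which an interrupt could be held off.
fraction_of_line = {op: c / cycles_per_line for op, c in WORST_CASE_CYCLES.items()}

for op, frac in fraction_of_line.items():
    print(f"{op}: up to {WORST_CASE_CYCLES[op]} cycles, {frac:.1%} of a scanline")
```

Under these assumptions the divides land just under a third of a line and the multiplies around a seventh, which is the order of magnitude the post describes.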
32-bit operations help a lot, from experience. Yeah, most of the time you'll use 16-bit (in fact, maybe even 8-bit more so than 16-bit), but there are a lot of places where having 32-bit operations helps, and they're a lot faster than doing the equivalent with 16-bit operations. They can also be useful if you're dealing with a lot of data in one go.
Adds and subtracts to memory tend to be useful with counters and such. No need to waste time loading the value into a register when all you want is just to increment it. Direct moves fall under a similar usage, and also make things easier when you want to store constants.
PS: Thunder Force IV sometimes slows down with a single enemy on screen, so huh... =P (although that PCM playback cuts the music is a much bigger offense to me, honestly)
- rainwarrior
- Posts: 8062
- Joined: Sun Jan 22, 2012 12:03 pm
- Location: Canada
Hardware multiply is a huge boon if you want to do 3D transforms.
Ah, all those SEGA fanboys spreading bullshit lies about the SNES. That's nothing new. The fact is, the SNES has about the same processing power as the MD despite a much lower clock rate, plus better graphics and sound. I understand it's annoying, but there's no hope of ever getting them to shut up about the so-called superiority of the MD.
In any case, who cares about the superiority or inferiority of a system? The NES is inferior to so many gaming systems and I still love it.
Actually, the format of the data used by the SNES hardware tends to get in the way too and makes code slower than it should be. The sprite table is probably the prime example of this: adding a sprite gets cumbersome when two of its bits are in a different place in the table and their byte is shared with other sprites. There's also the fact that the only way the SPC can get any new data at all is through a busy loop with the 65816 (which is why so many games pause for a couple of seconds when switching the background music). Considering it's a sample-based chip, this can get really bad, since samples need a lot more memory than synthesized sounds.
Using a planar format can also hurt performance: modifying a single pixel requires touching four bytes in RAM (assuming the usual 4bpp formats). If it were packed (like the Mega Drive's), it'd require touching only one byte, and it could simplify the code logic as well (mode 7 makes this even easier since 1 byte = 1 pixel, but it has a very limited number of tiles, which reduces its usefulness). Granted, in practice probably only 1% of games are affected by this, but I assume this is the main reason why the SuperFX is so slow at rendering (for comparison, the 68000 alone on the Mega Drive can do something comparable to Star Fox).
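To make the four-bytes-versus-one point concrete, here is a small Python sketch that plots one pixel into a 32-byte 4bpp tile in each format and reports which bytes it had to touch. The layouts assumed are the usual ones (SNES: planes 0/1 interleaved per row in the first 16 bytes, planes 2/3 in the last 16; Mega Drive: 4 bytes per row, left pixel in the high nibble). The function names are invented for illustration.

```python
def set_pixel_planar(tile, x, y, color):
    """Plot into an SNES-style 4bpp planar tile; returns the offsets written."""
    mask = 0x80 >> x                    # leftmost pixel is bit 7
    touched = []
    for plane in range(4):
        offset = (plane // 2) * 16 + y * 2 + (plane & 1)
        if color & (1 << plane):
            tile[offset] |= mask
        else:
            tile[offset] &= ~mask & 0xFF
        touched.append(offset)
    return touched


def set_pixel_packed(tile, x, y, color):
    """Plot into a Mega Drive-style 4bpp packed tile; returns the offsets written."""
    offset = y * 4 + x // 2
    if x & 1:                           # odd x: low nibble
        tile[offset] = (tile[offset] & 0xF0) | (color & 0x0F)
    else:                               # even x: high nibble
        tile[offset] = (tile[offset] & 0x0F) | ((color & 0x0F) << 4)
    return [offset]


planar_tile = bytearray(32)
packed_tile = bytearray(32)
planar_touched = set_pixel_planar(planar_tile, 3, 2, 0xA)
packed_touched = set_pixel_packed(packed_tile, 3, 2, 0xA)
print(len(planar_touched), "bytes touched (planar) vs", len(packed_touched), "(packed)")
```

Each planar write is also a read-modify-write with a per-pixel mask, which is where the extra code logic comes from.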
Of course, in practice it's generally the sloppiness of the code that tends to affect performance the most (both systems are full of games that are slow as hell without any justification for it). Prime examples would be Sonic 2 on the Mega Drive, and Super Mario World on the SNES seems prone to slowing down easily (though I don't recall the game itself slowing down; it could be that mostly hacks are affected, but that game does have a reputation for slowing down easily for some reason).