Posted: Mon Aug 16, 2010 1:05 pm
by psycopathicteen
It's not the 65816 nor the PPU that bottlenecks the system, it's the interface between the 65816 and the PPU that bottlenecks both the 65816 and the PPU.

Posted: Mon Aug 16, 2010 6:51 pm
by MottZilla
blargg wrote:
Who knows, maybe Nintendo went with the 65816 rather than the 68000 because they had envisioned NES compatibility.
Or developer compatibility. By using a similar scheme as the NES, all the NES developers would be able to pick up the SNES more quickly. Same for the graphical scheme, which is quite similar.
That's a good point which I wasn't thinking of. Though I think that Nintendo was in such a powerful position after the Famicom that they didn't need to baby the developers. They should have just chosen the best CPU and other hardware components they could have.
psycopathicteen wrote:It's not the 65816 nor the PPU that bottlenecks the system, it's the interface between the 65816 and the PPU that bottlenecks both the 65816 and the PPU.
That depends on what you mean by bottleneck. The 65816 certainly is a barrier as you don't have a whole lot of CPU time. I'm not certain the CPU<->PPU interface (I assume you mean speed of DMA) is a huge issue. I mean everyone would love even faster DMA so you could update even more tiles and background data each frame.

Both SNES and Genesis despite their issues were able to produce amazing top quality games. There was a big thread about this sort of thing on Spritesmind forums, basically a Sega forum similar to Nesdev.

Posted: Mon Aug 16, 2010 7:31 pm
by blargg
MottZilla wrote:
blargg wrote:Or developer compatibility. By using a similar scheme as the NES, all the NES developers would be able to pick up the SNES more quickly.
That's a good point which I wasn't thinking of. Though I think that Nintendo was in such a powerful position after the Famicom that they didn't need to baby the developers. They should have just chosen the best CPU and other hardware components they could have.
It still would have slowed developers, no matter how much market force Nintendo had. Slowing their developers would weaken their position in relation to the competition. With the 65816, a lot of code could be ported very easily, with minimal changes.

Posted: Mon Aug 16, 2010 7:50 pm
by psycopathicteen
MottZilla wrote:
That depends on what you mean by bottleneck. The 65816 certainly is a barrier as you don't have a whole lot of CPU time.
Are you kidding me? You've seen how CPU intensive my game is, and that doesn't even take half the available CPU time, and I'm not doing that much optimization either.

Honestly people should do more experimenting with hardware and not just accept what is often repeated. Don't be afraid to try.
I'm not certain the CPU<->PPU interface (I assume you mean speed of DMA) is a huge issue. I mean everyone would love even faster DMA so you could update even more tiles and background data each frame.
Sort of. I was referring to the SNES's redundant register set. They could've had a lot fewer PPU hardware registers. It makes me pull my hair out whenever I need to use the PPU.

The way the OAM is organized requires a lot of extra CPU calculation. It's not running out of CPU time that I'm worried about, it's getting everything done that I worry about. The fact that the 65816 takes more instructions to do tasks is more of a problem than its clock speed. Especially when you also are required to do more tasks than on other systems. It could be clocked at 14 MHz and not make a difference for me, since I already have plenty of CPU time left to do CPU intensive calculations.

Posted: Mon Aug 16, 2010 8:23 pm
by ReaperSMS
A quick point about the CPU speeds.

65816: 21.477275/6 (FastROM), instructions are 2-9 cycles or so.
So, at 3.58MHz, the instruction rate ranges from 1.78M insns/sec to about 360K insns/sec

68000: Wikipedia says it's running at ~7.67MHz
Move instructions range from 4-34 cycles, giving instruction rates from 1.9M/s to 226K/sec.

The two are not in completely different leagues, despite what some might try to claim. Well written 816 code that takes proper advantage of the direct page and whatnot should run on par with 68k code running primarily from registers. The 68k does have an easier time working with 16 bit quantities, since there's no extra penalty on those over bytes.
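The arithmetic behind those rates can be checked with a quick sketch. The clock figures and cycle ranges are the ones quoted above; the function name is made up for illustration. Note that 9 cycles at 3.58MHz works out to roughly 398K insns/sec, so the "about 360K" figure corresponds to something closer to 10 cycles per instruction.

```python
def insn_rate_range(clock_hz, min_cycles, max_cycles):
    """Return (fastest, slowest) instructions per second for a cycle range."""
    return clock_hz / min_cycles, clock_hz / max_cycles

# Clock figures as quoted in the post (not independently measured).
SNES_FASTROM_HZ = 21_477_275 / 6   # master clock / 6, about 3.58 MHz
M68K_HZ = 7_670_000                # ~7.67 MHz

snes_fast, snes_slow = insn_rate_range(SNES_FASTROM_HZ, 2, 9)
m68k_fast, m68k_slow = insn_rate_range(M68K_HZ, 4, 34)

print(f"65816: {snes_fast / 1e6:.2f}M to {snes_slow / 1e3:.0f}K insns/sec")
print(f"68000: {m68k_fast / 1e6:.2f}M to {m68k_slow / 1e3:.0f}K insns/sec")
```

This only compares raw instruction throughput; it says nothing about how much work each instruction does.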

Posted: Mon Aug 16, 2010 8:56 pm
by Near
A fun thing to point out. I had heard 68k opcodes ate a lot more cycles, but never got raw instruction to instruction counts.

Overall the 68k could do more with its instructions, SNES cannot multiply, divide, shift by more than one place, or really do anything remotely interesting to its other two registers.

You waste a lot of time on the SNES juggling the index registers on the stack, and on switching register sizes around. However, you really can do some amazing low-level optimizations.

There's also the SA-1, if you count it. Out of I-RAM, it's two clocks per cycle, so you can triple your theoretical instruction counts. But more often than not you need lots of BW-RAM, which is four clocks per cycle. Could compare to the SVP to make that more fair.

CPU<>PPU being a bottleneck sounds silly. Maybe for 256x240 output on NTSC, sure. But you can always crop the screen vertically and get a lot more time, SFA2 and similar games do this. There's enough bandwidth to play full motion video that blows away even the Mega CD. Just not enough ROM space for it.

I have a hard time imagining which PPU registers they could have gotten rid of. There's a few missing bits, especially on the enables that tend to use five or six bits. But short of being stupid and unevenly bit-packing the whole thing, it's actually quite amazing how well utilized the 64-byte region was. The PPU can do a ton of things at the same time.

Posted: Mon Aug 16, 2010 9:57 pm
by TmEE
Byuu, from what I know you're kind of an authority when it comes to SNES, so can you give some numbers regarding VRAM bandwidths per line/frame with and without DMA ...?

On MD one can achieve ~18KB per frame in 320x240 or ~21KB per frame in 320x224, in 50Hz of course, using DMA. In 60Hz you're limited to ~11KB per frame using DMA. Those figures are the maximum you can transfer during a single frame without blanking anything... a fully blanked frame gives you ~59KB per frame in 50Hz

Posted: Tue Aug 17, 2010 5:36 am
by Sik
byuu wrote:There's also the SA-1, if you count it. Out of I-RAM, it's two clocks per cycle, so you can triple your theoretical instruction counts. But more often than not you need lots of BW-RAM, which is four clocks per cycle. Could compare to the SVP to make that more fair.
Comparing the SVP to the SuperFX would actually be more fair (isn't the SuperFX faster than the SA-1?).
byuu wrote:CPU<>PPU being a bottleneck sounds silly. Maybe for 256x240 output on NTSC, sure. But you can always crop the screen vertically and get a lot more time, SFA2 and similar games do this. There's enough bandwidth to play full motion video that blows away even the Mega CD. Just not enough ROM space for it.
No, there isn't, if you want to load an entire frame on the Mega Drive (assuming no cropping and such), you're looking for 4 frames per update (well, on NTSC at least, PAL has over double the amount of transfer rate in blank x_X). That's why you're meant to use sprites, tilemaps, etc. =P

Posted: Tue Aug 17, 2010 8:03 am
by tepples
Sik wrote:Comparing the SVP to the SuperFX would actually be more fair (isn't the SuperFX faster than the SA-1?).
Super FX is heavily specialized for 3D graphics, as is the Sega Virtua Processor. The SVP, which appears to be based on a Samsung SSP1601 core, is clocked at roughly the same speed as the FX2. SA-1, on the other hand, is a general-purpose application coprocessor, as are the SuperH family CPUs in the 32X, Saturn, and Dreamcast.
if you want to load an entire frame on the Mega Drive (assuming no cropping and such), you're looking for 4 frames per update (well, on NTSC at least, PAL has over double the amount of transfer rate in blank x_X). That's why you're meant to use sprites, tilemaps, etc. =P
A full frame in either console's 256-pixel-wide mode is 256x224x4bpp, or 28 KiB. I've been quoted 7 KiB per vblank for DMA copying on SNES too. But if you cut the full-motion video to a cinematic display aspect ratio with 160 lines, I don't see why you can't easily fit 30 fps with an external FMV decoder. Imagine a coprocessor that decodes WebM from an SD card soldered onto the cartridge.
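As a rough sketch of that frame-size arithmetic: the 4bpp depth and the 7 KiB per vblank budget are the figures quoted above, and the helper name is hypothetical.

```python
import math

def frame_bytes(width, height, bpp=4):
    """Size in bytes of one full bitmap frame at the given bit depth."""
    return width * height * bpp // 8

VBLANK_BUDGET = 7 * 1024  # the "7 KiB per vblank" figure, assumed fixed here

full = frame_bytes(256, 224)     # full 256x224 frame
cropped = frame_bytes(256, 160)  # cinematic 160-line frame

for name, size in [("256x224", full), ("256x160", cropped)]:
    vblanks = math.ceil(size / VBLANK_BUDGET)
    print(f"{name}: {size // 1024} KiB, {vblanks} vblanks at 7 KiB each")
```

The fixed 7 KiB budget is pessimistic for the cropped case, since the lines that are no longer displayed can also be spent on transfers.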

Posted: Tue Aug 17, 2010 8:16 am
by Near
Byuu, from what I know you're kind of an authority when it comes to SNES, so can you give some numbers regarding VRAM bandwidths per line/frame with and without DMA ...?
You don't even want to know the bandwidth without DMA.

But with it ... you have effectively 1324 cycles per scanline. DMA consumes 8 cycles per byte transferred. NTSC mode has 262 scanlines @ 60hz, PAL mode has 312 scanlines @ 50hz.

My numbers are in bytes per frame. So for your average 256x224 NTSC game, that gives you 6.28K/f. For your average PAL game, that gives you 11.9K/f. NTSC can use 256x240 at 3.6K/f, and PAL can use 256x224 at 14.56K/f.

You can however disable the screen using force blank, and take lines off the top and bottom, which allows you to increase bandwidth further. Say you cut an 8-pixel row off the top and bottom, which would be lost to overscan anyway, you can get 8.93K/f for NTSC. The video rendering trick cuts off even more lines, and uses page-swapping at 20-30fps to double or triple that rate.

Without the active display getting in your way, the maximum bandwidth is 43.36K/f, or 2.68M/s.
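Those per-frame numbers can be reproduced with a short sketch using only the constants given above (1324 cycles per scanline, 8 cycles per byte, 262 or 312 scanlines). The function name is hypothetical, K is taken as 1000 bytes, and the 11.9K PAL figure is assumed to mean 240 active lines, since that is what matches.

```python
CYCLES_PER_LINE = 1324   # effective cycles per scanline, as quoted
CYCLES_PER_BYTE = 8      # DMA cost per byte transferred
TOTAL_LINES = {"NTSC": 262, "PAL": 312}

def dma_bytes_per_frame(region, active_lines):
    """Bytes DMA can move in the scanlines that are not being displayed."""
    blank_lines = TOTAL_LINES[region] - active_lines
    return blank_lines * CYCLES_PER_LINE / CYCLES_PER_BYTE

cases = [
    ("NTSC", 224),  # average NTSC game
    ("PAL", 240),   # average PAL game (assumed 240 lines)
    ("NTSC", 240),
    ("PAL", 224),
    ("NTSC", 208),  # 8 pixel rows cut from top and bottom
    ("NTSC", 0),    # display fully blanked
]
for region, active in cases:
    kb = dma_bytes_per_frame(region, active) / 1000
    print(f"{region} {active:3d} active lines: {kb:.2f}K per frame")
```

The results agree with the figures above once those are read as truncated rather than rounded. The 2.68M/s figure appears to be the raw DMA rate (21.477MHz / 8); multiplying 43.36K by 60 frames gives about 2.60M/s.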
Comparing the SVP to the SuperFX would actually be more fair (isn't the SuperFX faster than the SA-1?).
The SuperFX is garbage. It is specialized to the point of absolute insanity. The SA-1 by comparison is a full general-purpose CPU with lots of nifty tools like bitmap<>bitplane conversion, H/V/counter IRQs, vector address override, RAM/ROM protect, two distinct DMA modes, etc.

Posted: Tue Aug 17, 2010 9:17 am
by psycopathicteen
I think I'll just have to cut 16 pixels off the top and bottom of the screen, since I'm trying to become more practical and less idealistic.

I'm even changing my metasprite routine so instead of using per-sprite x and y displacements and sprite attributes, the metasprites will just be a rectangle defined by how many blocks tall and how many blocks wide it is.

Posted: Tue Aug 17, 2010 10:22 am
by Sik
tepples wrote:The SVP, which appears to be based on a Samsung SSP1601 core, is clocked at roughly the same speed as the FX2.
I guess that explains this (make sure it seeks to 6:36):
http://www.youtube.com/watch?v=PklSWesc1dM#t=6m36s

Although somebody I know checked the cartridge and it seems to take the same clock used by the 68000... I guess there's some multiplier there, but if it uses that directly then WTF o_O (actually the SVP can use less instructions than the SFX to achieve the same result IIRC, so that probably also helps)
byuu wrote:The SuperFX is garbage. It is specialized to the point of absolute insanity. The SA-1 by comparison is a full general-purpose CPU with lots of nifty tools like bitmap<>bitplane conversion, H/V/counter IRQs, vector address override, RAM/ROM protect, two distinct DMA modes, etc.
Still my point stands, the SVP is more akin to the SuperFX than to the SA-1.

EDIT: talking about SA-1, these instructions are listed in the docs:
http://srb2town.sepwich.com/junk/lolinstructions.PNG
(just mentioning for the sake of it, couldn't miss the chance XD)

Posted: Tue Aug 17, 2010 10:40 am
by TmEE
byuu wrote:You don't even want to know the bandwidth without DMA.
Hehe, must be really slow then :P

I could decode and display around ten 320x240 4-bit BMP files on MD per second in 50Hz, so even without DMA, things are not too bad, but still quite a bit worse than possible with DMA.
byuu wrote:But with it ... you have effectively 1324 cycles per scanline. DMA consumes 8 cycles per byte transferred. NTSC mode has 262 scanlines @ 60hz, PAL mode has 312 scanlines @ 50hz.

My numbers are in bytes per frame. So for your average 256x224 NTSC game, that gives you 6.28K/f. For your average PAL game, that gives you 11.9K/f. NTSC can use 256x240 at 3.6K/f, and PAL can use 256x224 at 14.56K/f.

You can however disable the screen using force blank, and take lines off the top and bottom, which allows you to increase bandwidth further. Say you cut an 8-pixel row off the top and bottom, which would be lost to overscan anyway, you can get 8.93K/f for NTSC. The video rendering trick cuts off even more lines, and uses page-swapping at 20-30fps to double or triple that rate.

Without the active display getting in your way, the maximum bandwidth is 43.36K/f, or 2.68M/s.
This does not sound that bad at all IMO... it is still slower than the MD in 256-pixel modes (~9.5KB/f for 60Hz, ~18KB/f for 50Hz), but not by much.

And I gave wrong value for max per blanked frame... it is 63KB/f for 50Hz, 53KB/f for 60Hz, total of 3.1MBytes/sec

Posted: Tue Aug 17, 2010 11:17 am
by Near
Sik wrote:EDIT: talking about SA-1, these instructions are listed in the docs: http://srb2town.sepwich.com/junk/lolinstructions.PNG
(just mentioning for the sake of it, couldn't miss the chance XD)
Those are SuperFX instructions, but yes. I've tried to make a juvenile program, but haven't had much luck.

All of these are legal SFX opcodes: STOP CACHE TO WITH LOOP PLOT SWAP COLOR NOT ADD SUB MERGE AND LINK SEX LOB OR COLOR.

Maybe one of you can come up with a funnier sentence using only the above words. Bonus points if the sequence actually does something useful on SFX hardware.

Something like:

SexyPlot:
  STOP
  LINK
  WITH
  CACHE
  AND
  HAVE
  SEX
  OR
  ; ... I got nothin'

ShakespeareanFilter:
  TO
  STOP
  OR
  NOT
  TO
  STOP
  ; That is the question, that I might ask.
Yeah, see ... so close, yet so far :/

Posted: Tue Aug 17, 2010 11:23 am
by tepples
byuu wrote:Something like:

SexyPlot:
  STOP
  LINK
  WITH
  CACHE
  AND
  HAVE
  SEX
  OR
  ; ... I got nothin'
UNIX is the same way: not exactly safe for work (the UNIX counterpart to Windows chkdsk is close to an indecent word).