Way too slow...

Post by Cybergoth »

Hi there!

I'm currently working on a particle engine. I have a first draft ready, but unfortunately so far my efforts are way too slow.

My goal was to move 128 Tile pixels at 60Hz, but my current best approach moves only 96 pixels at 15Hz :oops:

I think my main VBLANK code is probably near perfect (the first loop of the NMI code does 96 PPU RAM updates per frame), but maybe I'm missing something substantial.

I have uploaded my source and binary, hoping that maybe someone here has an idea to make things faster :)

=> http://home.arcor.de/cybergoth/apocalypse.zip

Greetings,
Manuel

Post by Bregalad »

Well, I don't think it is too slow; in fact it is pretty cool. At any rate it is faster than my pseudo-Mode 7 demo, which took around 2-3 seconds to compute a frame or something like that (I didn't optimize my code).

Is only the pattern table being uploaded, or the name table too?
Useless, lumbering half-wits don't scare us.

Post by Cybergoth »

Bregalad wrote:Well, I don't think it is too slow; in fact it is pretty cool. At any rate it is faster than my pseudo-Mode 7 demo, which took around 2-3 seconds to compute a frame or something like that (I didn't optimize my code).
Thanks for the heads-up. I found the thread here and tried your demo, that's really impressive stuff! :shock:
Bregalad wrote:Is only the pattern table being uploaded, or the name table too?
Both, yes. I need 4 writes per dot: erase the old dot and its tile, then draw both anew. I just randomized dot positions for the moment and implemented the simplest movement scheme possible, but every dot can actually be freely(*) positioned.

*Right now I'm doing nothing against tile clashes, so the last drawn dot always "wins" the tile :wink:
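
The numbers quoted so far fit together as simple arithmetic. The sketch below is only an estimate: it assumes VBlank writes are the sole bottleneck and an NTSC frame rate of 60 Hz, and the function names are made up for illustration.

```python
# Rough throughput estimate for a VBlank-limited particle engine.
# Figures from the posts: 96 PPU writes per frame, 4 writes per dot.
# Assumptions: VBlank writes are the only bottleneck, NTSC timing (60 Hz).
import math

FRAME_RATE_HZ = 60  # NTSC frame rate (assumed)


def dots_per_frame(writes_per_frame: int, writes_per_dot: int) -> int:
    """How many dots can be fully erased and redrawn in one frame."""
    return writes_per_frame // writes_per_dot


def update_rate_hz(dots: int, writes_per_frame: int, writes_per_dot: int) -> float:
    """Refresh rate when all dots are updated in round-robin batches."""
    batch = dots_per_frame(writes_per_frame, writes_per_dot)
    frames_needed = math.ceil(dots / batch)
    return FRAME_RATE_HZ / frames_needed


def writes_needed_per_frame(dots: int, writes_per_dot: int) -> int:
    """PPU writes per frame required to update every dot every frame."""
    return dots * writes_per_dot


if __name__ == "__main__":
    print(dots_per_frame(96, 4))            # dots handled per frame
    print(update_rate_hz(96, 96, 4))        # refresh rate for 96 dots
    print(writes_needed_per_frame(128, 4))  # writes needed for the 60 Hz goal
```

Under these assumptions 96 writes cover 24 dots per frame, so 96 dots take 4 frames (15 Hz), while the 128-dots-at-60-Hz goal would need 512 writes every frame.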

Basically I have a few general directions of thought for speed-up:

1. Something like a mapper that allows toggling the same RAM between PPU and CPU. Does something like that exist?

2. Self-modifying code. Should definitely speed it up some, but could require up to 15*128 bytes of RAM.

3. Updating the PPU just-in-time. I assume as long as I'm ahead of the raster beam I can do whatever I want, regardless of whether it's still VBLANK or not? (Still, Y-sorting 128 objects during a frame might be impossible as well...)

4. Optimize by restricting/patternizing the dot movement. Something I will possibly do later.

Post by Bregalad »


Cybergoth wrote:1. Something like a mapper that allows toggling the same RAM between PPU and CPU. Does something like that exist?
No, this would not be electronically possible. Even if you used dual-port RAM there would still be issues. Maybe it would be possible if you inserted a modern, super-fast multi-megahertz DSP in the cartridge to multiplex RAM reads and writes for both chips in a transparent fashion.
Cybergoth wrote:3. Updating the PPU just-in-time. I assume as long as I'm ahead of the raster beam I can do whatever I want, regardless of whether it's still VBLANK or not? (Still, Y-sorting 128 objects during a frame might be impossible as well...)
No, you need to be in VBlank or forced blanking to access PPU RAM. You can, however, force blanking for part of the frame to get more RAM writes.

Post by tokumaru »

Cybergoth wrote:*Right now I'm doing nothing against tile clashes, so the last drawn dot always "wins" the tile :wink:
Ah, I thought I saw some of the pixels disappearing for small amounts of time...! =)
Cybergoth wrote:1. Something like a mapper that allows toggling the same RAM between PPU and CPU. Does something like that exist?
I think the MMC5 will let you access its internal RAM from the CPU address space even while the screen renders, but I think that memory can only be used for a name table. I might have got that wrong, as I haven't read about the MMC5 in a while, but I think it's not what you're looking for anyway.
Cybergoth wrote:2. Self-modifying code. Should definitely speed it up some, but could require up to 15*128 bytes of RAM.
That's nearly all of the internal RAM, but shouldn't be a problem if you use a cart with 8KB of extra RAM.
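
To put numbers on "nearly all of the internal RAM", here is a quick budget check. It assumes the usual 2 KB of internal RAM and that the zero page and stack page must stay free; `RESERVED_BYTES` and the helper names are illustrative, not taken from the engine's source.

```python
# RAM budget for the self-modifying-code idea.
# From the thread: up to 15 bytes of generated code per particle, 128 particles.
# Assumptions: 2 KB internal RAM, zero page and stack page kept free,
# and an 8 KB cartridge RAM as the alternative.

INTERNAL_RAM_BYTES = 2 * 1024
CART_RAM_BYTES = 8 * 1024
RESERVED_BYTES = 2 * 256  # zero page + stack page (assumed to stay free)


def code_footprint(bytes_per_particle: int, particles: int) -> int:
    """Total size of the generated update code."""
    return bytes_per_particle * particles


def fits(footprint: int, ram_bytes: int, reserved: int = 0) -> bool:
    """True if the code fits in RAM after setting aside reserved bytes."""
    return footprint <= ram_bytes - reserved


if __name__ == "__main__":
    size = code_footprint(15, 128)
    print(size)                                            # code size in bytes
    print(INTERNAL_RAM_BYTES - size)                       # bytes left in internal RAM
    print(fits(size, INTERNAL_RAM_BYTES, RESERVED_BYTES))  # internal RAM only
    print(fits(size, CART_RAM_BYTES))                      # with 8 KB cart RAM
```

With these figures the code needs 1920 of 2048 bytes, which leaves only 128 bytes, so it would not fit next to the zero page and stack but fits easily in 8 KB of cart RAM.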
Cybergoth wrote:3. Updating the PPU just-in-time. I assume as long as I'm ahead of the raster beam I can do whatever I want, regardless of whether it's still VBLANK or not?
Like Bregalad said, there is no way around that. The address register we use to write data to the PPU (accessed through $2006) is also used by the PPU during rendering, so accessing it outside of VBlank will corrupt whatever is being rendered at the moment. As Bregalad suggested, you can turn rendering off manually for a few extra scanlines of PPU access.

Post by Cybergoth »

tokumaru wrote:Ah, I thought I saw some of the pixels disappearing for small amounts of time...! =)
I'm undecided yet if I'm going to do something against it or just allow it to happen ;)
tokumaru wrote:I think the MMC5 will let you access its internal RAM from the CPU address space even while the screen renders, but I think that memory can only be used for a name table. I might have got that wrong, as I haven't read about the MMC5 in a while, but I think it's not what you're looking for anyway.
That mapper's quite a beast! :shock:

I'd assume that's a configuration that'll never be used for homebrew efforts, unless one is going to cannibalize original carts :shock: :shock: :shock:
tokumaru wrote:That's nearly all of the internal RAM, but shouldn't be a problem if you use a cart with 8KB of extra RAM.
While still in tech-demo stages I might just try it once to see how much speed up it provides. Probably not really worth the tradeoff :?
tokumaru wrote:Like Bregalad said, there is no way around that. The address register we use to write data to the PPU (accessed through $2006) is also used by the PPU during rendering, so accessing it outside of VBlank will corrupt whatever is being rendered at the moment.
That's quite interesting. I think I got that idea from reading some tech notes from Ian Bell coming with his Tank demo, where it says he's cycle counting in order to know the position of the beam or somesuch? Maybe I just misunderstood that part.
tokumaru wrote:As Bregalad suggested, you can turn rendering off manually for a few extra scanlines of PPU access.
Is the default VBLANK window already maxed out or can you already gain some cycles here without shrinking the resolution? I'm thinking about some unused overscan area or the missing top/bottom lines.

If so, is there some sample code available that maxes out vblank time?

Post by CartCollector »

To expand on what Bregalad said, this document shows exactly what the PPU does each scanline. It accesses VRAM with every cycle it has available except for one, and since 1 PPU cycle is 1/3 of a CPU cycle, that doesn't help you much. So it's impossible to access VRAM while a scanline is being rendered without messing up the video. However, you CAN access the PPU's internal registers, like the scroll registers, at certain times while a scanline is being rendered. The document, along with loopy's "The Skinny on NES Scrolling," tells you how to do so.
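
A small sketch of why that one spare cycle is useless. It assumes NTSC timing (341 PPU dots per scanline, 3 dots per CPU cycle) and that each PPU memory access takes 2 dots; the helper names are made up for illustration.

```python
# Converting between PPU dots and CPU cycles on an NTSC NES.
# Assumptions: 341 dots per scanline, 3 dots per CPU cycle,
# 2 dots per PPU memory access.
from fractions import Fraction

DOTS_PER_SCANLINE = 341
DOTS_PER_CPU_CYCLE = 3
DOTS_PER_MEMORY_ACCESS = 2   # assumed duration of one PPU memory access
STA_ABSOLUTE_CYCLES = 4      # 6502 STA absolute, e.g. a store to $2007


def dots_to_cpu_cycles(dots: int) -> Fraction:
    """Exact number of CPU cycles that elapse during the given PPU dots."""
    return Fraction(dots, DOTS_PER_CPU_CYCLE)


def cpu_cycles_to_dots(cycles: int) -> int:
    """PPU dots that elapse during the given CPU cycles."""
    return cycles * DOTS_PER_CPU_CYCLE


def spare_dots_per_scanline() -> int:
    """Dots left over once the scanline is filled with 2-dot memory accesses."""
    return DOTS_PER_SCANLINE % DOTS_PER_MEMORY_ACCESS


if __name__ == "__main__":
    print(spare_dots_per_scanline())                 # leftover dots per scanline
    print(dots_to_cpu_cycles(1))                     # CPU cycles in one dot
    print(cpu_cycles_to_dots(STA_ABSOLUTE_CYCLES))   # dots spanned by one store
    print(dots_to_cpu_cycles(DOTS_PER_SCANLINE))     # CPU cycles per scanline
```

A single store instruction spans 12 dots, so one free dot (a third of a CPU cycle) cannot hold even one write.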
tokumaru wrote:I think the MMC5 will let you access its internal RAM from the CPU address space even while the screen renders, but I think that memory can only be used for a name table. I might have got that wrong, as I haven't read about the MMC5 in a while, but I think it's not what you're looking for anyway.
Cybergoth wrote:That mapper's quite a beast! :shock:

I'd assume that's a configuration that'll never be used for homebrew efforts, unless one is going to cannibalize original carts :shock: :shock: :shock:
The MMC5's internal RAM, when it's used for enhancing the graphics, is usually used as a second name and attribute table. It allows you to use up to 16384 different tiles in the background at the same time, and also lets you use 1 palette per tile, instead of the 2 by 2 tile area that normal attribute tables use. It only works with one-screen mirroring though, and doesn't really help with sprites. More info is on the NESdevWiki.
Cybergoth wrote:Is the default VBLANK window already maxed out or can you already gain some cycles here without shrinking the resolution? I'm thinking about some unused overscan area or the missing top/bottom lines.

If so, is there some sample code available that maxes out vblank time?
Depends on the TV. If it's NTSC then some scanlines are probably being chopped off. Most emulators remove the first and last 8 scanlines, though real NTSC TVs might remove more or less, and they might remove more from the top than the bottom or vice versa. But 8 from the top and bottom is usually a safe bet. If it's PAL then it's displaying more scanlines than the NES does, approximately 260 scanlines according to the NESdevWiki. So with PAL you shouldn't remove any scanlines from the display if you don't have to. But then again with PAL you get 70 scanlines of VBlank, so you probably don't need to.
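
For a feel of how different the two VBlank windows are, here is a rough comparison. It assumes 341 PPU dots per scanline on both systems, 20 VBlank lines at 3 dots per CPU cycle on NTSC, and the 70 lines mentioned above at 3.2 dots per CPU cycle on PAL.

```python
# Approximate CPU cycles available in VBlank, NTSC versus PAL.
# Assumptions: 341 dots per scanline on both; NTSC has 20 VBlank lines
# at 3 dots per CPU cycle, PAL has 70 lines at 3.2 dots per CPU cycle.
from fractions import Fraction

DOTS_PER_SCANLINE = 341
NTSC_DOTS_PER_CPU_CYCLE = Fraction(3)
PAL_DOTS_PER_CPU_CYCLE = Fraction(16, 5)  # 3.2
NTSC_VBLANK_LINES = 20
PAL_VBLANK_LINES = 70


def vblank_cpu_cycles(lines: int, dots_per_cpu_cycle: Fraction) -> int:
    """Whole CPU cycles that fit in the given number of blanked scanlines."""
    return int(lines * DOTS_PER_SCANLINE / dots_per_cpu_cycle)


if __name__ == "__main__":
    ntsc = vblank_cpu_cycles(NTSC_VBLANK_LINES, NTSC_DOTS_PER_CPU_CYCLE)
    pal = vblank_cpu_cycles(PAL_VBLANK_LINES, PAL_DOTS_PER_CPU_CYCLE)
    print(ntsc)  # NTSC VBlank budget in CPU cycles
    print(pal)   # PAL VBlank budget in CPU cycles
```

Under these assumptions NTSC gives about 2273 CPU cycles and PAL about 7459, a bit more than three times as much.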

That document by Ian Bell probably refers to counting cycles so that the tank demo always enables and disables the PPU at the same scanlines. The reason he needed to do so was because he was using a mapper (UNROM I think) that didn't have scanline interrupts. If you use something like an MMC3 or better though it'll probably have the option to use scanline interrupts, which means you don't have to count cycles to know when a certain scanline is being rendered. The Tank demo is an example of a program that disables rendering for extended VBlank. So are Battletoads and this one test program by Celius. I forget what it's called though.

Post by tepples »

Cybergoth wrote:Is the default VBLANK window already maxed out or can you already gain some cycles here without shrinking the resolution? I'm thinking about some unused overscan area or the missing top/bottom lines.

If so, is there some sample code available that maxes out vblank time?
LJ65 turns off rendering about nine lines early so that it can blit a whole 200-byte playfield, plus OAM and the palette, in one NTSC vblank.
CartCollector wrote:This document shows exactly what the PPU does each scanline. It accesses VRAM with every cycle it has available except for one
Unless a mapper queues up (address, data) pairs to write to VRAM and executes the queue during times when the data read by the PPU doesn't matter. The document lists the following memory fetch phases when the PPU appears to ignore what it reads:
  • 125-128: Unused thirty-fourth sliver of the background
  • 129, 130, 133, 134, ..., 157, 158: Garbage nametable bytes
  • 169 and 170: PPU is frozen while waiting 5 dots for horizontal blanking to end
But then that might be almost as hard to build as MMC5, with independent front-side and back-side PPU address buses. And like MMC5, it might screw up on mostly-NES-compatible consoles using "NOAC" chipsets. If you really want to write to VRAM during rendering, then try programming for the TurboGrafx-16 or the Game Boy Advance.
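
For scale, the ignored fetch phases listed above can be enumerated and counted. This is only a sketch: it takes the phase numbers from the list as given, and the per-frame figure assumes 240 rendered scanlines and a hypothetical mapper that could use every such phase for one queued write.

```python
# Counting the memory fetch phases per scanline whose data the PPU ignores,
# using the phase numbers listed in the post.
# Assumption: 240 rendered scanlines per frame, one queued write per phase
# (a hypothetical mapper, not an existing one).

RENDERED_SCANLINES = 240  # assumed


def ignored_fetch_phases() -> list:
    """Phase numbers within a scanline where the fetched data is unused."""
    unused_sliver = list(range(125, 129))                  # 125-128
    garbage_nametable = [phase
                         for start in range(129, 158, 4)   # 129, 133, ..., 157
                         for phase in (start, start + 1)]
    frozen = [169, 170]
    return unused_sliver + garbage_nametable + frozen


if __name__ == "__main__":
    phases = ignored_fetch_phases()
    print(phases)
    print(len(phases))                        # ignored phases per scanline
    print(len(phases) * RENDERED_SCANLINES)   # hypothetical upper bound per frame
```

That comes to 22 phases per scanline, or at most 5280 per frame under the assumptions above.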

Post by Memblers »

The latest program I've written allows a lot of room for VRAM writes. But all the code running during the displayed part of the screen (and vblank) has to take the same number of cycles every frame. I haven't even used the sprite #0 hit yet, but the end part is just a delay loop, which would work fine with a sprite #0 hit. It's a totally wrong setup: what I'm doing takes several frames to finish the main loop, but it updates the screen all at once when it's ready. Having some fun with the nametables. :)

Using the sprite #0 hit detect is the common way to combine variable code with timed code. But it only works once per frame. An IRQ would be better, but generally only ASIC-based mappers have them.

Post by CartCollector »

tepples wrote:LJ65 turns off rendering about nine lines early so that it can blit a whole 200-byte playfield, plus OAM and the palette, in one NTSC vblank.
How do you know when to turn the screen back on? Is the blit guaranteed to not take over 30 scanlines of time? I get 30 from 20 normal VBlank + 9 extended VBlank + 1 scanline before normal VBlank that the PPU doesn't render but doesn't trigger the VBlank NMI for either.
tepples wrote:If you really want to write to VRAM during rendering, then try programming for the TurboGrafx-16 or the Game Boy Advance.
Or the Atari 800, Atari 5200, or Commodore 64, all of which were released before the Famicom ;)

Post by tepples »

CartCollector wrote:How do you know when to turn the screen back on? Is the blit guaranteed to not take over 30 scanlines of time?
Correct. I timed a 22-line update plus palette update plus OAM DMA, and it didn't exceed 3500 cycles, even with DPCM stealing cycles. That's 9 forced blank lines, 1 post-render line, 20 vblank lines, and the first 100 CPU cycles of the pre-render line. But then you do get most of the pre-render scanline too.
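
The window described here can be checked with a few lines of arithmetic. The sketch assumes NTSC timing of 341 PPU dots per scanline at 3 dots per CPU cycle; the function names are illustrative.

```python
# Size of the extended blanking window described in the post, in CPU cycles.
# Assumption: NTSC timing, 341 PPU dots per scanline, 3 dots per CPU cycle.
from fractions import Fraction

CYCLES_PER_SCANLINE = Fraction(341, 3)


def blanking_window_cycles(forced_blank_lines: int, post_render_lines: int,
                           vblank_lines: int, extra_cycles: int) -> int:
    """Whole CPU cycles in the blanked scanlines plus a few extra cycles."""
    lines = forced_blank_lines + post_render_lines + vblank_lines
    return int(lines * CYCLES_PER_SCANLINE) + extra_cycles


def update_fits(update_cycles: int, window_cycles: int) -> bool:
    """True if the measured update finishes inside the window."""
    return update_cycles <= window_cycles


if __name__ == "__main__":
    window = blanking_window_cycles(9, 1, 20, 100)
    print(window)                      # window size in CPU cycles
    print(update_fits(3500, window))   # measured worst case from the post
```

Thirty scanlines come to 3410 CPU cycles, so with the first 100 cycles of the pre-render line the window is about 3510 cycles, just enough for the 3500-cycle worst case.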

Post by Celius »

I wrote what I consider to be a pretty great PPU updating routine. It uses 12 scanlines of extended Vblank, but it can do a lot (in my mind).

Each frame, it does:

64 tile writes (row)
30 tile writes (column)
16 attribute writes (row)
8 attribute writes (column)
32 entry palette update
10 * (1 CHR RAM tile) or (6 Miscellaneous PPU writes)
Sprite DMA
Sets Scroll

Unfortunately, since I'm using extended Vblank, it has to take an exact number of cycles every frame. So it's about 3600 cycles for all of that, but it's worth it. And for each of the CHR RAM tile routines, one can decide to do 6 miscellaneous PPU writes instead if they so desire, which is handy if there's a BG update that's not related to scrolling over level data.
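
Adding up the list above gives a sense of how many bytes move per frame. The sketch assumes a CHR RAM tile is 16 bytes and that sprite DMA costs 513 CPU cycles; the average cost per write is only a crude estimate, since it also absorbs address setup and the scroll writes.

```python
# Bytes written per frame by the update routine listed in the post.
# Assumptions: one CHR RAM tile is 16 bytes, sprite (OAM) DMA takes
# 513 CPU cycles, and the whole routine takes about 3600 cycles.

FIXED_WRITES = {
    "tile row": 64,
    "tile column": 30,
    "attribute row": 16,
    "attribute column": 8,
    "palette": 32,
}
FLEXIBLE_SLOTS = 10     # each slot is one CHR tile or 6 misc writes
CHR_TILE_BYTES = 16     # 8x8 pixels at 2 bits per pixel
MISC_WRITES = 6
OAM_DMA_CYCLES = 513    # assumed; can be 514 depending on alignment
TOTAL_CYCLES = 3600


def total_writes(chr_tile_slots: int) -> int:
    """PPU data writes per frame when this many slots hold CHR tiles."""
    if not 0 <= chr_tile_slots <= FLEXIBLE_SLOTS:
        raise ValueError("chr_tile_slots must be between 0 and 10")
    misc_slots = FLEXIBLE_SLOTS - chr_tile_slots
    return (sum(FIXED_WRITES.values())
            + chr_tile_slots * CHR_TILE_BYTES
            + misc_slots * MISC_WRITES)


def average_cycles_per_write(chr_tile_slots: int) -> float:
    """Crude average: cycles outside sprite DMA divided by the write count."""
    return (TOTAL_CYCLES - OAM_DMA_CYCLES) / total_writes(chr_tile_slots)


if __name__ == "__main__":
    print(total_writes(10))   # all ten slots used for CHR tiles
    print(total_writes(0))    # all ten slots used for misc writes
    print(round(average_cycles_per_write(10), 2))
```

With all ten slots used for CHR tiles that is 310 bytes per frame, or roughly 10 CPU cycles per byte once sprite DMA is set aside.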

Post by Cybergoth »

Thanks guys for all your feedback. I'll see what I can do with all the ideas I got now for the next revision of the engine :)

Post by Cybergoth »

Cybergoth wrote:my current best approach moves only 96 pixels at 15Hz
By significantly changing that approach I managed to almost double the speed of my particle engine (without any vblank extension!) :D

Now it updates 94 particles at 30Hz.

=> http://home.arcor.de/cybergoth/apocalypse.zip

Post by Memblers »

Nice! Looks like a couple 'lemmings' blew up in outer space. :D