effecient 4-way scrolling

A place where you can keep others updated about your NES-related projects through screenshots, videos or information in general.

Moderator: Moderators

Post Reply
spaceharrier
Posts: 40
Joined: Wed Jan 19, 2022 9:52 am

effecient 4-way scrolling

Post by spaceharrier »

Wrote a little demo that implements 4-way scrolling and real time map 4 to 1 map metatile decompression. Boundary rollover was a pain in the ass on the y direction due to incongruency between the PPU's 0-240 scrolling rollover and how I packed map data.

Slightly buggy in one specific spot and I still need to add attribute table updates (and also mask off the top)

Estimate cycle count for the seam update and metatile fetching is estimated to be about 2500-3500 cycles (after including attribute table updating, which is still in my WIP build)

I'm working on updating sprites at the moment (with regards to camera scrolling) and that is a nghtmare. The best I've been able to do with code is to get 64 sprites evaluated for displaying/hiding, and coordinate translation to the x/y positions on the screen, in about 10,000 cycles. I'm trying to make this faster but even with lookup tables for some arithmetic operations, every extra instruction in the overhead adds up.

I'm also working on an entity system that evaluates entities for an area and marks whether or not do attempt to display them based on position evaluation, and also an event scripting system that will work on an entity ty entity basis) CPU cycles are getting tight. If anybody knows any good small, low-resource, interpretive scripting languages I can embed, that isn't a Forth interpreter, I'm all ears.
Attachments
what.nes
(32.02 KiB) Downloaded 93 times
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: effecient 4-way scrolling

Post by tokumaru »

The cycle count for for sprite drawing seems rather high... 10,000 cycles is about 1/3 of the entire frame! This can indeed be one of the most time-consuming parts of a game, but 1/3 of the total is way too much! Here are a few tips that may or may not help you with improving that:

1- Unless you're tight on ROM, don't waste time with grid-based sprite formats, just store explicit X and Y coordinates for each individual sprite. Simpler games tend to have sprites arranged in grids, but once you start optimizing blank space and overlapping sprites, coordinates hardly adhere to a grid, so it makes little sense to spend any time calculating coordinates automatically when you can just fetch them straight off of ROM.

2- Again, unless you're tight on ROM, don't bother with flipping X and Y coordinates on the fly, just save pre-flipped coordinates for each sprite along with the unflipped versions (making each sprite occupy 6 bytes instead of 4). It's always faster to just grab explicit/pre-computed values than calculating things at run time. Another option here would be to have 4 separate core loops in your meta sprite function, one for each possible flipping combination. This way you only need to make flipping-related decisions once for each object at the start of the function, not for every individual sprite.

3- Go to screen space as soon as possible. Right at the start, you should subtract the camera's coordinates from the object's coordinates to find the coordinates of the meta sprite's central spot on screen space (you can also compensate for the 1-scanline delay in sprite rendering here). After this, you can just add each sprite's coordinates to that central point in order to calculate the final OAM coordinates.

4- Use coarse sprite clipping. Once you add a sprite's coordinates to the central point, don't even save the high byte of the result anywhere, just use it to decide whether to output the sprite or not. If the high byte is not 0, you know that the sprite is off screen, so you can skip it. On the Y axis, some values < 256 are also off-screen, specially if you have a status bar, but you'll save a bit of time by ignoring those cases and outputting the sprites anyway. If you really want more precise clipping on the Y axis, you're gonna need extra compare and branch instructions for each sprite.

5- Handle sprite cycling immediately. Output your sprites in pseudo-random order right from the start, either by drawing objects in pseudo-random order and outputting sprites linearly, or by starting from a random OAM position and advancing a prime number of slots after each sprite. This way your OAM will already be shuffled when you're done processing sprites, no need to dedicate any more time to sprite cycling.

6- Clear OAM efficiently. To put unused sprites off screen, you only need to set the Y coordinate to a value > 239, you don't need to touch the other 3 bytes at all. Also, instead of hiding ALL 64 sprites beforehand, consider hiding only the unused ones AFTER you're finished outputting sprites. This will save you some time, because you'll only spend time hiding sprites when there are less sprites in use, meaning that the CPU usage is probably low, but when CPU usage is high and there are barely any free sprites left, you're not gonna spend much time on this.
spaceharrier
Posts: 40
Joined: Wed Jan 19, 2022 9:52 am

Re: effecient 4-way scrolling

Post by spaceharrier »

The system does more than just the positioning of the sprite relative to camera. A level can have an overloaded amount of sprites based on whether or not objects the sprites are attached to are in the camera's FOV. The OAM also gets rebuilt every frame with priority based on a vertical position sort so that objects closer to the front are drawn first. I've still got a few bells and whistles to try and speed things up.

My scrolling demo shows a side scrolling example, though I'm working on something that has a slight illusion of depth to things. I am attempting to stuff 10 pounds of fertilizer into a 5 pound bag.
Last edited by spaceharrier on Mon Feb 14, 2022 5:00 pm, edited 1 time in total.
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: effecient 4-way scrolling

Post by tokumaru »

spaceharrier wrote: Mon Feb 14, 2022 4:58 pmA level can have an overloaded amount of sprites based on whether or not objects the sprites are attached to are in the camera's FOV.
Since the NES doesn't have much memory or processing power, not that many objects can be active at once, and most games only keep in memory objects that are near the camera anyway, so they can assume that any object that's active has a good chance of being at least partially visible. If you somehow do have several active objects (say, over 32) away from the camera, I suppose you could a range check after converting their coordinates into screen space to eliminate the ones that are definitely off-screen, but doing that to so many objects will definitely eat a lot of CPU time.

In modern platforms you can certainly keep all objects of a level loaded in RAM at all times, and do a proximity test to determine whether they need to be updated and drawn, but most NES games will keep the number of active objects low, and have the scrolling system activate/deactivate objects according to how close they are to the camera.
The OAM also gets rebuilt every frame
That's very typical of NES games.
with priority based on a vertical position sort so that objects closer to the front are drawn first.
Depth sorting definitely complicates things... Ideally you'd want to only sort sprites that overlap, but still randomize the order of sprites that don't, so that sprite cycling is still a thing and frontmost sprites don't cause more distant ones to disappear permanently.
I am attempting to stuff 10 pounds of fertilizer into a 5 pound bag.
That's the only fun way of doing it.
calima
Posts: 1745
Joined: Tue Oct 06, 2015 10:16 am

Re: effecient 4-way scrolling

Post by calima »

Are you using a traditional sort? By using some RAM, it's possible to have a vertical priority in linear time.
spaceharrier
Posts: 40
Joined: Wed Jan 19, 2022 9:52 am

Re: effecient 4-way scrolling

Post by spaceharrier »

Right now, a bubble sort, because the total number of visible objects in camera view is usually 10 or less. I haven't done an analysis on something faster yet to see if the benefit is there.

I do like the idea of using the extra sram space and adding some sort of a bucket sorting method. It would also solve the issue of adding sprite flicker if more than objects are on a line. I estimate it would take about 2k.
User avatar
Jarhmander
Formerly ~J-@D!~
Posts: 569
Joined: Sun Mar 12, 2006 12:36 am
Location: Rive nord de Montréal

Re: effecient 4-way scrolling

Post by Jarhmander »

Bubble sort? Don't. That's the worst possible algorithm. Anything is faster than bubble sort. Your simplest, naive sorting algorithm (which would look like insertion sort) would actually be faster than bubble sort.
((λ (x) (x x)) (λ (x) (x x)))
User avatar
Quietust
Posts: 1920
Joined: Sun Sep 19, 2004 10:59 pm
Contact:

Re: effecient 4-way scrolling

Post by Quietust »

Jarhmander wrote: Tue Feb 15, 2022 7:49 pm Bubble sort? Don't. That's the worst possible algorithm. Anything is faster than bubble sort.
Not unless you count Bogosort, though I suppose that's not a real sorting algorithm...
Jarhmander wrote: Tue Feb 15, 2022 7:49 pm Your simplest, naive sorting algorithm (which would look like insertion sort) would actually be faster than bubble sort.
I was thinking that a basic selection sort might actually be faster than insertion sort in this specific case, just because the insertion operation has to move so much data around (which is especially slow since the 6502 takes 32 cycles to move four bytes forward to the next slot) while the comparison just has to look at two bytes (which is 8 cycles plus the branch).
Quietust, QMT Productions
P.S. If you don't get this note, let me know and I'll write you another.
User avatar
Jarhmander
Formerly ~J-@D!~
Posts: 569
Joined: Sun Mar 12, 2006 12:36 am
Location: Rive nord de Montréal

Re: effecient 4-way scrolling

Post by Jarhmander »

Quietust wrote: Tue Feb 15, 2022 9:16 pm
Jarhmander wrote: Tue Feb 15, 2022 7:49 pm Your simplest, naive sorting algorithm (which would look like insertion sort) would actually be faster than bubble sort.
I was thinking that a basic selection sort might actually be faster than insertion sort in this specific case, just because the insertion operation has to move so much data around (which is especially slow since the 6502 takes 32 cycles to move four bytes forward to the next slot) while the comparison just has to look at two bytes (which is 8 cycles plus the branch).
Interesting insight, that just proves that 1) I didn't implement any sort algorithm on the 6502, and 2) there are always variations in how a given algorithm perform, even with the same time complexity, in how it actually perform on a given machine.
((λ (x) (x x)) (λ (x) (x x)))
spaceharrier
Posts: 40
Joined: Wed Jan 19, 2022 9:52 am

Re: effecient 4-way scrolling

Post by spaceharrier »

After tinkering with things, I started adding a radix sort since I
could contain the memory needed to zero page, and because I needed to sort 11 bit values. It's about 4500 cycles to depth sort 32 actors. Some c64 developers have more complex radix sorts that could knock 1500 cycles or so and I've already spotted a few optimizations I can do.

So far, with no extra work ram, and a scenario with 32 visible actors being depth sorted and maxing out all 64 sprites, I'm looking at about 13000 cycles with less than half of the sram utilized. Obviously that is a worst case and most levels either won't have 32 actors visible simultaneously.

Combined with my 4 way map scrolling and overhead of inputs and I'm looking at about 60% cycle utilization in worst case scenario. I'd imagine situations with maybe 8 or so actors visible at once and half of the sprites displayed would be more realistic, and a swag estimate would be less than 8k.

It will be a while before I can show a demo of the worst case but I will release one along with the tools I am writing to pack level data when they are more mature.

Also hello people. I am a veteran embedded software engineer. I am adept at low level stuff, but I am standing on the shoulders of giants; The amount of reverse engineering, documentation, and even the quality of the in-emulator debuggers blows my mind.
Oziphantom
Posts: 1565
Joined: Tue Feb 07, 2017 2:03 am

Re: effecient 4-way scrolling

Post by Oziphantom »

bubble sort is fine for the sprite depths, as you have frame consistency, in that each entity won't move very far from frame to frame. So the number of swaps will be minor, and doing an insertion or shell etc won't really help much.

There aren't any drop in scripting languages for 6502 that would be suitable, BASIC, COMAL, FORTH etc all exist but they eat ram and rom. You typically make a custom language that meets the exact specs of what you need it to do.
spaceharrier
Posts: 40
Joined: Wed Jan 19, 2022 9:52 am

Re: effecient 4-way scrolling

Post by spaceharrier »

I reworked my current code to get things more effecient. I was able to get my display routine down to 8,000 cycles. This includes:

- Translating 32 actors in a single frame
- depth sorting 32 actors
- adjusting actor position relative to camera
- sprite occlusion (actor is rejected not in the camera's view)
- drawing all 32 actors (up to 64 sprites)

This is in addition to getting my map decompression down to about 3,0000 cycles with attribute updates.

This accounts for a total of about 11,000 cycles, not counting vblank, or roughly 40% of available CPU time.

Two optimized patterns I discovered, and probably is already well documented by people smarter than me, is using a few illegal opcodes to speed up some things:

In particular, atx, which performs a logical and of a and x registers and stores the result in x, has been useful in eliminating 2 cycles in places, and a combo of lax, axs for arithmetic, which eliminates needing to set or clear the carry flag. That might break compatibility on clone consoles, but you can't make an omelet without breaking a few eggs.

I'll try to have a demo in a week or 3. This is a hobby project so progress is whenever.
Post Reply