Got any tips for Early NES Emulator Development?

WedNESday
Posts: 1231
Joined: Thu Sep 15, 2005 9:23 am
Location: Berlin, Germany

Post by WedNESday »

MottZilla wrote:

Code: Select all

void ADC(unsigned char Value)
{
	unsigned char Carry=CPU_P&0x01;
	// Check for Carry
	if( (CPU_A + Value + Carry ) > 0xFF)	// Check if Carry will Result
	{
		CPU_SETC=1;
	}
	else
	{
		CPU_SETC=0;
	}

	// Check for Zero
	if( ((CPU_A + Value + Carry) & 0xFF)==0 )		// Check if the 8-bit Result will be Zero, Set Flag accordingly.
	{
		SetZero();
	}
	else
	{
		ClearZero();
	}

	// Check for Overflow
	CPU_TEMP=CPU_A + Value + Carry;
	CPU_SETV=0;
	if(!((CPU_A ^ Value)&0x80) && ((CPU_A ^ CPU_TEMP)&0x80))	// Operands share a sign that the result does not
		CPU_SETV=1;
	if(CPU_SETV)
	{
		SetOverflow();
	}
	else
	{
		ClearOverflow();
	}

	// Do ADC Operation
	CPU_A = CPU_A + Value + Carry;

	if(CPU_SETC==1)
	{
		SetCarry();
	}
	else
	{
		ClearCarry();
	}

}

void SBC(unsigned char Value)
{
	ADC(Value ^ 0xFF);
}
A simply hideous amount of code, if I may say so, for something as simple as ADC. Here is WedNESday's code:

Code: Select all

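// Signed sum, used only to detect overflow
CPU.TMP2 = (signed char)CPU.A + (signed char)CPU.Databus + CPU.CF;
if( CPU.TMP2 < -128 || CPU.TMP2 > 127 )
	CPU.OF = 0x40; else CPU.OF = 0x00;
// CPU.CF must be wider than 8 bits so that bit 8 survives for the shift below
CPU.NF = CPU.ZF = CPU.A = CPU.CF = CPU.A + CPU.Databus + CPU.CF;
CPU.CF >>= 8;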
No memory addressing is shown; CPU.Databus holds the byte fetched and CPU.TMP2 holds the signed sum. In my experience ifs and elses are what slow down an emulator the most, especially in any pixel-rendering functions. And as for calling ADC(Value ^ 0xFF) for your SBC code, you should really __forceinline everything to make sure that it is as fast as possible.
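
For anyone following along, the ADC(Value ^ 0xFF) trick for SBC can be sketched in a self-contained form. This is only an illustration, not anyone's emulator code from this thread: the `Cpu` struct and the names `adc`, `sbc` and `adc_self_check` are hypothetical, and it assumes binary mode only (the NES CPU has no working decimal mode).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical minimal CPU state, for illustration only.
struct Cpu {
    uint8_t a = 0;          // accumulator
    bool carry = false;     // C flag
    bool zero = false;      // Z flag
    bool overflow = false;  // V flag
    bool negative = false;  // N flag
};

// Binary-mode ADC: A = A + value + C, updating C, Z, V and N.
inline void adc(Cpu& cpu, uint8_t value) {
    const unsigned sum = cpu.a + value + (cpu.carry ? 1u : 0u);
    const uint8_t result = static_cast<uint8_t>(sum);
    cpu.carry = sum > 0xFF;
    // Overflow: both operands share a sign that the result does not.
    cpu.overflow = ((cpu.a ^ result) & (value ^ result) & 0x80) != 0;
    cpu.a = result;
    cpu.zero = (result == 0);
    cpu.negative = (result & 0x80) != 0;
}

// SBC is ADC of the one's complement: A - value - (1 - C).
inline void sbc(Cpu& cpu, uint8_t value) {
    adc(cpu, static_cast<uint8_t>(value ^ 0xFF));
}

// Small self-check: 0x50 + 0x50 overflows into the sign bit without a carry.
inline void adc_self_check() {
    Cpu cpu;
    cpu.a = 0x50;
    adc(cpu, 0x50);
    assert(cpu.a == 0xA0 && cpu.overflow && !cpu.carry);
}
```

The inverted operand works because value ^ 0xFF equals 255 - value, so A + (255 - value) + C is A - value - (1 - C) plus 256, and that extra 256 shows up as the carry out (carry set means no borrow).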
Disch
Posts: 1848
Joined: Wed Nov 10, 2004 6:47 pm

Post by Disch »

WedNESday wrote:you should really __forceinline everything to make sure that it is as fast as possible.
__forceinline is not a C++ keyword, but rather one of those "MSVS only" keywords that VC++ adds. I would recommend against using any sort of compiler add-on that isn't part of the standard (any function/keyword that is preceded by underscores should raise a red flag -- avoid all of them). "inline" should suffice... and is probably better than __forceinline anyway -- since inlining doesn't always produce faster code, and with 'inline' the compiler can still decline in those instances, whereas __forceinline takes that choice away.

Preferably, I would even #define calling conventions elsewhere in the code and use the #defines rather than using the calling convention directly. This way if you run into calling convention problems with other compilers or platforms you can easily change the #define and remove all related problems:

Code: Select all


// what I would recommend against
void inline BadExample()
{
}

// what I would recommend

// in some header file
#define NES_INLINE  inline

// in source
void NES_INLINE GoodExample()
{
}
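
A slightly fuller sketch of that header idea, for illustration. The `NES_FORCE_INLINE` switch and the `AddWithCarry` helper are hypothetical names invented here; `__forceinline` is the MSVC spelling and `__attribute__((always_inline))` is the GCC/Clang one. By default it falls back to plain standard `inline`.

```cpp
#include <cassert>

// Hypothetical opt-in switch: define NES_FORCE_INLINE before this header
// to request compiler-specific forced inlining instead of plain inline.
#if defined(NES_FORCE_INLINE) && defined(_MSC_VER)
    #define NES_INLINE __forceinline
#elif defined(NES_FORCE_INLINE) && (defined(__GNUC__) || defined(__clang__))
    #define NES_INLINE inline __attribute__((always_inline))
#else
    #define NES_INLINE inline   // standard C++, accepted by every compiler
#endif

// Opcode helper written against the macro rather than a raw keyword.
NES_INLINE int AddWithCarry(int a, int operand, int carry) {
    assert(carry == 0 || carry == 1);
    return a + operand + carry;   // 9-bit result; bit 8 is the carry out
}
```

Whoever builds the source can then flip one definition instead of editing every function. Putting the macro in front of the return type keeps the attribute form in a position both GCC and Clang accept.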
Zepper
Formerly Fx3
Posts: 3264
Joined: Fri Nov 12, 2004 4:59 pm
Location: Brazil

Post by Zepper »

Fx3's ADC code:

Code: Select all

CPUOP(ADC0)
   offset = cpu->A + value;
   if(cpu->status & C_BIT)
      offset++;
   cpu->status &= ~(C_BIT | V_BIT);
   if(offset & 0xFF00)
      cpu->status |= C_BIT;
   if((cpu->A ^ offset) & (value ^ offset) & 0x80)
      cpu->status |= V_BIT;
   cpu->A = (unsigned char)(offset);
   set_sz_flags(cpu->A);
OPEND
MottZilla
Posts: 2835
Joined: Wed Dec 06, 2006 8:18 pm

Post by MottZilla »

Yes well this is the first emulator I've ever written. Thus I'm not concerned with code being huge or slow, only understandable and functioning. If I really wanted to make it all small and efficient that'd be something to worry about later.
Zepper
Formerly Fx3
Posts: 3264
Joined: Fri Nov 12, 2004 4:59 pm
Location: Brazil

Post by Zepper »

MottZilla wrote:Yes well this is the first emulator I've ever written. Thus I'm not concerned with code being huge or slow, only understandable and functioning. If I really wanted to make it all small and efficient that'd be something to worry about later.
Right, I agree. Anyway, I sense a grain of salt in your commentary... we're just showing examples for you, as advice/tips only, which can target some optimization and I bet you're not allowed to kick off. :|
blargg
Posts: 3717
Joined: Mon Sep 27, 2004 8:33 am
Location: Central Texas, USA

Post by blargg »

Here's a fully portable version written for clarity. All variables are of type int. Who says efficiency and clarity are always at odds?

Code: Select all

overflow = ((a ^ 0x80) + (operand ^ 0x80) + carry - 0x80) & 0x100;
temp     = a + operand + carry;
carry    = temp >> 8;
a        = temp & 0xFF;
// update negative and zero flags based on a
// ...
EDIT: Actually, WedNESday's overflow checking is clearer. Untested:

Code: Select all

temp     = (int8_t) a + (int8_t) operand + carry; // signed sum, for overflow only
overflow = (temp < -128 || temp > 127);
temp     = a + operand + carry;                   // unsigned sum, for carry and result
carry    = temp >> 8;
a        = temp & 0xFF;
// update negative and zero flags based on a
// ...
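
The overflow tests posted in this thread look quite different, so here is a small sketch that compares them over every input. The function names are made up for this example; the three expressions follow the biased form, the signed range check and the XOR sign comparison shown in the posts above. It assumes the usual two's complement int.

```cpp
#include <cassert>
#include <cstdint>

// Biased form: flip the sign bits, add, then test bit 8.
inline bool overflow_biased(int a, int operand, int carry) {
    return (((a ^ 0x80) + (operand ^ 0x80) + carry - 0x80) & 0x100) != 0;
}

// Signed range check: does the true signed sum fit in 8 bits?
inline bool overflow_range(int a, int operand, int carry) {
    assert(carry == 0 || carry == 1);
    const int temp = static_cast<int8_t>(a) + static_cast<int8_t>(operand) + carry;
    return temp < -128 || temp > 127;
}

// Sign comparison: both operands differ in sign from the result.
inline bool overflow_xor(int a, int operand, int carry) {
    const int sum = a + operand + carry;
    return ((a ^ sum) & (operand ^ sum) & 0x80) != 0;
}

// Returns true if all three forms agree for every a, operand and carry.
inline bool overflow_forms_agree() {
    for (int a = 0; a < 256; ++a)
        for (int operand = 0; operand < 256; ++operand)
            for (int carry = 0; carry < 2; ++carry) {
                const bool expected = overflow_range(a, operand, carry);
                if (overflow_biased(a, operand, carry) != expected) return false;
                if (overflow_xor(a, operand, carry) != expected) return false;
            }
    return true;
}
```

An exhaustive loop like this covers only 131,072 cases, so it is a cheap way to check a rewritten flag calculation before trusting it in the CPU core.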
Last edited by blargg on Sat Mar 22, 2008 1:07 pm, edited 1 time in total.
MottZilla
Posts: 2835
Joined: Wed Dec 06, 2006 8:18 pm

Post by MottZilla »

Fx3 wrote:
MottZilla wrote:Yes well this is the first emulator I've ever written. Thus I'm not concerned with code being huge or slow, only understandable and functioning. If I really wanted to make it all small and efficient that'd be something to worry about later.
Right, I agree. Anyway, I sense a grain of salt in your commentary... we're just showing examples for you, as advice/tips only, which can target some optimization and I bet you're not allowed to kick off. :|
Not at all. I was mainly addressing WedNESday who was insulting my code. :p

I do appreciate all the help and insight.
Zepper
Formerly Fx3
Posts: 3264
Joined: Fri Nov 12, 2004 4:59 pm
Location: Brazil

Post by Zepper »

Hmm... It's criticism, as far as I can tell. ^_^;;
MottZilla
Posts: 2835
Joined: Wed Dec 06, 2006 8:18 pm

Post by MottZilla »

I suppose. I just took offense at the word hideous, considering this is a topic about n00b emulator tips. :p

My emulator runs a lot of games now, but I'm having issues with Sprite 0. I haven't had any luck so far implementing an accurate emulation of it, which is important for a lot of games. Right now I just sort of fake it so it works decently enough.

I also went and added support for AxROM (Mapper 7) so I could try out BattleToads. The scroll was very much messed up. Part of it seems to have to do with updating scroll at Sprite 0 in the non-standard way. I tried adding that in and it helped, but it wasn't always right. I didn't put too much work into it, though, since the Sprite 0 timing is bs anyway.

I'm not really sure what I want to do next. Ideally though it would be getting a cycle-accurate (or at least close to it) sprite 0 hit. That might require some more rendering adjustments and such. I had tried doing it one way that should have worked but for some reason didn't seem to help at all.
WedNESday
Posts: 1231
Joined: Thu Sep 15, 2005 9:23 am
Location: Berlin, Germany

Post by WedNESday »

MottZilla wrote:
Fx3 wrote:
MottZilla wrote:Yes well this is the first emulator I've ever written. Thus I'm not concerned with code being huge or slow, only understandable and functioning. If I really wanted to make it all small and efficient that'd be something to worry about later.
Right, I agree. Anyway, I sense a grain of salt in your commentary... we're just showing examples for you, as advice/tips only, which can target some optimization and I bet you're not allowed to kick off. :|
Not at all. I was mainly addressing WedNESday who was insulting my code. :p

I do appreciate all the help and insight.
I wasn't being insulting, it was just a bit shocking to see so much code, that's all. Btw Fx3's ADC code is just as big, but blargg's seems nice and small. And yes you're right, it's your first emulator so you don't have to worry too much about efficiency at this stage, just get the damn games to work, and worry about other things later. As for inline/__forceinline, since the opcodes are only called once in the emulator, __forceinline is the best option IMO.
Zepper
Formerly Fx3
Posts: 3264
Joined: Fri Nov 12, 2004 4:59 pm
Location: Brazil

Post by Zepper »

- I like to discuss programming skills and optimizations. Mr.Wed, considering the number of ADC opcodes in a game, well... it's quite rare if you compare it with LDA, for example. ;) And I don't think my code is as big as the previous one, heh ^_^;;
WedNESday
Posts: 1231
Joined: Thu Sep 15, 2005 9:23 am
Location: Berlin, Germany

Post by WedNESday »

Fx3 wrote:- I like to discuss programming skills and optimizations. Mr.Wed, considering the number of ADC opcodes in a game, well... it's quite rare if you compare it with LDA, for example. ;) And I don't think my code is as big as the previous one, heh ^_^;;
:D Of course it's not as big. And I know that certain opcodes are called more often than others, but I have spent ages constantly refining each opcode to make it as fast as I can. Btw Disch, since the opcodes are called only once in the CPU code, you would want them to be inlined. You don't need to worry too much about the code getting too big etc. That's only a concern if you are constantly calling the same inline function in too many places. If you write an opcode and only call it once inside of the switch(WhichOpcode) bit, then the code would be exactly the same size. I must admit, I stole the concept of __forceinline rather than just inline from Nintendulator anyway :lol: .
MottZilla
Posts: 2835
Joined: Wed Dec 06, 2006 8:18 pm

Post by MottZilla »

What does __forceinline do? It sounds to me like it replaces the function call with inline code? So it sounds to me like the compiler doesn't actually call your function but instead any part of code that uses it actually has it placed right in there? Let me know if my idea is close. :p

Anyway, curious what do you guys know about BattleToads? It seems to be one of the best games to test emulator accuracy and conveniently uses one of the simplest memory mappers.

I've read you need pretty accurate Sprite 0 Hit flag timing, and accurate timing in general. The game runs on my emulator, but the problem I'm having is the scroll offset. The game appears to use the scrolling technique that Loopy's document is about. I tried to implement mid-frame scrolling adjustments for BattleToads. It was close but not correct.

I do seem to have the name table switching by writing to $2006 correct, at least enough so that Super Mario's status bar doesn't flicker. But I'm not clear on how you change the scroll offset.

I think what I did was take the first write to $2006 and mask for the lower 2 bits, then shift those left by 3 bits. I saved that number until the second write. Then I would combine those 2 bits with the second write, masked for the upper 3 bits and shifted right 5 bits. Then I would set ScrollY to that value * 8.

As I said it was close and in some places it was correct but not everywhere.
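
Written out as code, the calculation described above looks roughly like the sketch below. The function names are hypothetical. It also shows the two fine-Y bits that the first $2006 write carries in its bits 4-5, which the description above doesn't use; that is one possible reason (not confirmed here) the result is only right in some places. It ignores when in the scanline the writes land.

```cpp
#include <cassert>
#include <cstdint>

// Coarse Y (tile row, 0-31) as described: the low 2 bits of the first write
// become bits 3-4, the top 3 bits of the second write become bits 0-2.
inline int coarse_y_from_2006(uint8_t first_write, uint8_t second_write) {
    return ((first_write & 0x03) << 3) | ((second_write & 0xE0) >> 5);
}

// Fine Y (pixel row within the tile): bits 4-5 of the first write.
// The third fine-Y bit is cleared by a $2006 write, so the range is 0-3.
inline int fine_y_from_2006(uint8_t first_write) {
    return (first_write >> 4) & 0x03;
}

// Pixel-level Y scroll combining both parts.
inline int scroll_y_from_2006(uint8_t first_write, uint8_t second_write) {
    const int y = coarse_y_from_2006(first_write, second_write) * 8
                + fine_y_from_2006(first_write);
    assert(y >= 0 && y < 256);
    return y;
}
```

For writes of $12 then $34, the coarse part alone gives 17 * 8 = $88, while including fine Y gives $89.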
Disch
Posts: 1848
Joined: Wed Nov 10, 2004 6:47 pm

Post by Disch »

WedNESday wrote:__forceinline is the best option IMO.
Just to reiterate...

__forceinline is not a standard C++ keyword. I'd say it's never a good idea to use it simply because it may give you trouble with compilers that don't support it -- which will make portability or even public source release somewhat problematic.

Plus inline (which is a standard keyword) does the same job. The only difference is that __forceinline overrides the compiler in cases where inlining isn't favorable. You say it's favorable for CPU functions and I don't disagree -- but the truth is you shouldn't substitute your judgement for the compiler's. Inline doesn't always mean faster... and in the off chance you happen to make a function inline where inlining reduces performance, the compiler will correct that (that's its job) -- whereas with __forceinline you end up screwing yourself.

Any situation where it is favorable to have stuff inlined, inline works just as well as __forceinline.

So yeah -- __forceinline should never be the way to go, IMO.

But again -- this is another reason to #define calling conventions, since whoever compiles your source can change the define to inline rather than __forceinline if they choose -- rather than having to go and change every function.

EDIT (MottZilla replied while I was typing)
MottZilla wrote:What does __forceinline do? It sounds to me like it replaces the function call with inline code? So it sounds to me like the compiler doesn't actually call your function but instead any part of code that uses it actually has it placed right in there? Let me know if my idea is close. :p
Yeah sounds like you have it right. Function inlining makes it so that when you call a function, it doesn't actually jump to that function -- rather, the function gets sort of copy/pasted into the area that calls it.

This is good because there's a little overhead for function calling (variables pushed on stack and whatnot) which is avoided if the function is inlined.

But it can also be bad because it can greatly bloat code size, which may cause the program to run slower.
MottZilla wrote:Anyway, curious what do you guys know about BattleToads?
It's very picky about timing. If your NMI isn't timed just right, or if your sprite 0 hit is a little off, the game can very easily deadlock on level 2. It's also picky about when in the scanline you reset the horizontal scroll and increment the Y scroll, etc. Doing these at the wrong times can cause it to deadlock.
MottZilla wrote:I do seem to have the name table switching by writing to $2006 correct, at least enough so that Super Mario's status bar doesn't flicker. But I'm not clear on how you change the scroll offset.
How it works is the PPU address set by $2006 is the same address that the PPU uses to fetch tiles to render. During rendering, every time the PPU fetches a tile it increments the address so that the next tile to be displayed is pointed to. I'm not really sure it helps to think of it in terms of scroll offset.

For example... if the game sets the PPU address to $1234 by writing to $2006, this means that the next tile fetched comes from $2234 ($0234 + $2000) and with a fine Y scroll of 1 ($1000 >> 12). In effect this translates to:

Y scroll: $89
X scroll: $A0 (to $A7... depending on the fine X scroll set by $2005)
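
The $1234 example above can be reproduced with a short decoding sketch. The struct and function names are hypothetical, invented for this illustration; the bit layout is the one the example implies (fine Y in bits 12-14, nametable in bits 10-11, tile row in bits 5-9, tile column in bits 0-4).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view of the 15-bit PPU address as scroll components.
struct PpuScroll {
    int fine_y;     // bits 12-14: pixel row within the tile
    int nametable;  // bits 10-11: which nametable
    int coarse_y;   // bits 5-9: tile row
    int coarse_x;   // bits 0-4: tile column
};

inline PpuScroll decode_ppu_address(uint16_t addr) {
    assert(addr < 0x8000);  // only 15 bits are meaningful
    return PpuScroll{ (addr >> 12) & 7, (addr >> 10) & 3,
                      (addr >> 5) & 31, addr & 31 };
}

// Address of the nametable byte that will be fetched next.
inline uint16_t tile_fetch_address(uint16_t addr) {
    return static_cast<uint16_t>(0x2000 | (addr & 0x0FFF));
}

// Pixel scroll values within the selected nametable.
inline int scroll_y(const PpuScroll& s) { return s.coarse_y * 8 + s.fine_y; }
inline int scroll_x(const PpuScroll& s) { return s.coarse_x * 8; }  // plus fine X from $2005
```

Decoding $1234 this way gives tile row 17 and tile column 20 with fine Y 1, which is where the $89 and $A0 figures come from.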
blargg
Posts: 3717
Joined: Mon Sep 27, 2004 8:33 am
Location: Central Texas, USA

Post by blargg »

WedNESday wrote:I have spent ages constantly refining each opcode to make it as fast as I can.
Only a handful of opcodes are used regularly, while some are virtually never used. Optimization effort is best spent on the former.
WedNESday wrote:Btw Disch, since the opcodes are called only once in the CPU code, you would want them to be inlined. You don't need to worry too much about the code getting too big etc. That's only a concern if you are constantly calling the same inline function in too many places. If you write an opcode and only call it once inside of the switch(WhichOpcode) bit, then the code would be exactly the same size.
The inline version would probably even be smaller, since the outline one would have function call overhead. But, you sometimes want a once-called function outlined, if it's used rarely. If it were inline, it'd use more of the cache since its beginning and end would be kept in the cache by the often-used code around it, and branches over it would have to hit a different cache line. It might also stress the optimizer out enough that it can't optimize other parts of the function as well. As Disch says below, by using regular inline, you allow the compiler to detect things like this.
Disch wrote:[...] the truth is you shouldn't substitute your judgement for the compiler's. Inline doesn't always mean faster... and in the off chance you happen to make a function inline where inlining reduces performance, the compiler will correct that (that's its job) -- whereas with __forceinline you end up screwing yourself.
I'd normally say that the programmer sometimes knows best, but in this case, you have a very good point. With profile-guided optimization (PGO), the compiler really can decide best what should be inlined.

But like I always say, with optimization, the only authority is how something affects the speed of the actual program. If something really does speed up your program, then it's good. WedNESday, do you have any numbers for speedups you've gotten with your techniques?