cycle for cycle stuff

Discuss emulation of the Nintendo Entertainment System and Famicom.
mozz
Posts: 94
Joined: Mon Mar 06, 2006 3:42 pm
Location: Montreal, Canada

Post by mozz »

byuu wrote:mozz: I do allow breakpoints to be set in both the CPU and APU, I also allow them both to be stepped cycle by cycle in real time. See bsnes v0.013's debugger for an example of this.
Of course. I just wanted to emphasize that in an implementation that was geared towards performance, you wouldn't have to run in lock-step mode unless the user actually set breakpoints in both. Even with breakpoints set for one chip, you could still run it ahead of the other one to batch up work.
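The batching idea could look something like this in code: a hypothetical catch-up scheme where the CPU runs ahead on its own timestamp and the APU is only synchronized when the CPU actually touches it. All names here are invented for illustration, not taken from any real emulator.

Code: Select all

```c
#include <assert.h>

/* Each chip keeps its own timestamp; the APU is only caught up
   when the CPU touches an APU register. Illustrative names only. */
typedef struct {
    long cpu_time;  /* cycles the CPU has executed */
    long apu_time;  /* cycles the APU has executed */
} Machine;

/* Run the APU until it reaches the CPU's current time. */
static void apu_catch_up(Machine *m)
{
    while (m->apu_time < m->cpu_time)
        m->apu_time++;   /* stand-in for "emulate one APU cycle" */
}

/* A CPU write that hits an APU register: sync first, so the APU
   observes the write at the correct moment in its own timeline. */
static void cpu_write_apu(Machine *m, int reg, int value)
{
    (void)reg; (void)value;  /* register latching omitted in this sketch */
    apu_catch_up(m);
}
```

With no breakpoints set, the CPU can batch up hundreds of cycles before `cpu_write_apu` forces a sync; with breakpoints on both chips you fall back to lock-step.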

I fully realize that sort of thing makes no sense in the context of Bsnes. I think Bsnes is great, by the way. I've looked over some of its code and it's pretty readable (very readable by the standards of most emulator code! :wink:)
blargg wrote:Oh well, I was hoping for actual problem cases for the catch-up method. I agree that a straightforward design like you describe is great for ease-of-implementation, comprehension, and simplicity. If high efficiency wasn't a prime goal, I would not use catch-up. And like you say, all emulators are at a higher level than the console itself. All take liberty by merely emulating the behavior of the hardware at some level, rather than emulating the hardware itself. Putting aside the extremely low-level operation at the molecular and finer levels, practical things like bad cartridge connections and power glitches require a much lower level of emulation than all emulators currently achieve.
Well, you could take the point of view that what you are emulating is an abstract machine--an artificial construct that can be described in its entirety with mathematical precision. The Nintendo Entertainment System hardware is one implementation of this abstract machine. An emulator is a different implementation. Of course there is some uncertainty when you have multiple chips on independent clocks (like the SNES does). Maybe that defies the "mathematical precision" and you really do have to think of it as emulating two different interacting abstract machines. Hmm...
WedNESday
Posts: 1312
Joined: Thu Sep 15, 2005 9:23 am
Location: London, England

Post by WedNESday »

Topic drift...
byuu wrote:
Ok, today I finally finished the new cycle-for-cycle accurate 6502 emulator ... Boy, nothing could prepare me for it. On my P4 2.2GHz I had 60FPS, fully 30 times slower than the previous core, which had 1800FPS in the same situation.
Then you seriously botched something up
Oh you betcha. I have just done a test of the following:

Code: Select all

inline void Operation69()	// ADC #immediate, one call per CPU cycle
{
	switch(CPU.Cycle)
	{
		case 0:	// cycle 1: step past the opcode byte
			CPU.PC++;
			CPU.Cycle++;
			break;
		case 1:	// cycle 2: read the operand and perform the add
			CPU.P &= 0x3D;	// clear N, V and Z; keep C as carry-in
			if( CPU.A + CPU.Memory[CPU.PC] + (CPU.P & 0x01) > 0xFF )	// carry out?
				CPU.TMP = 1; else CPU.TMP = 0;
			if( (char)CPU.A + (char)CPU.Memory[CPU.PC] + (CPU.P & 0x01) < -128 || (char)CPU.A + (char)CPU.Memory[CPU.PC] + (CPU.P & 0x01) > 127 )
				CPU.P += 0x40;	// signed overflow: set V
			CPU.A += CPU.Memory[CPU.PC] + (CPU.P & 0x01);
			CPU.P &= 0xFE;	// clear the old carry...
			if( !CPU.A )
				CPU.P += 0x02;	// set Z if the result is zero
			CPU.P += CPU.TMP + (CPU.A & 0x80);	// ...then set the new C and N
			CPU.PC++;
			CPU.Cycle = 0;	// instruction complete
			break;
	}
	CPU.CC++;
}
vs.

Code: Select all

inline void Operation69()
{
	CPU.PC++;
	CPU.CC++;
	UpdatePPU();
	UpdateAPU();
	UpdateMapping();

	CPU.Byte = CPU.Memory[CPU.PC];
	CPU.TMP = (char)CPU.A + (char)CPU.Byte + CPU.CF;
	if( CPU.TMP < -128 || CPU.TMP > 127 )
		CPU.VF = 0x40; else CPU.VF = 0x00;
	if( CPU.A + CPU.Byte + CPU.CF > 0xFF )
		CPU.TMP2 = 1; else CPU.TMP2 = 0;
	CPU.NF = CPU.ZF = CPU.A += CPU.Byte + CPU.CF;
	CPU.CF = CPU.TMP2;
	CPU.PC++;
	CPU.CC++;
	UpdatePPU();
	UpdateAPU();
	UpdateMapping();
}
The latter of the two was twice as fast.
ReaperSMS
Posts: 174
Joined: Sun Sep 19, 2004 11:07 pm

Post by ReaperSMS »

The sticky part of a catch-up scheme is synching the visible environment changes between processors. Queueing up writes works to a point, but anything that can respond asynchronously takes extra effort.

Scanline interrupts and such are rather predictable, but that of course adds another layer of complexity to the matter.

Sound/GUI updates should be handled as asynchronously as possible anyway, for the reasons mentioned.

Any coop multitasking system is going to be rather system specific, though the details can be abstracted out to minimize the unportable bits.

Coroutines might be another approach as well, though implementing them in C is somewhat tricky. Essentially, when the CPU talks to the APU, it jumps over to the APU code, which does its thing, then jumps back, keeping track of the state for both. It's relatively similar to cooperative multitasking, except you specify the thread to jump to explicitly.

As for emulating full-blown multiprocessor systems accurately, there are almost always going to be tradeoffs made, if only because emulating the clock skew and propagation delay between chips is a bit extreme for this stuff. It's better to sit down and think about the system, and determine what exactly each processor can see, what events modify that view, and structure things around that.
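The write-queueing mentioned above could be sketched as timestamped entries that the lagging chip replays in order. This is a toy model with invented names, not code from any real emulator.

Code: Select all

```c
#include <assert.h>
#include <stdint.h>

/* The CPU runs ahead and logs writes aimed at the other chip; when
   that chip catches up, it replays each write at its recorded time. */
typedef struct { long time; uint16_t addr; uint8_t value; } QueuedWrite;

#define QCAP 64
static QueuedWrite queue[QCAP];
static int qlen;

static void queue_write(long t, uint16_t a, uint8_t v)
{
    queue[qlen].time = t; queue[qlen].addr = a; queue[qlen].value = v;
    qlen++;
}

/* Replay all writes up to and including time `now` into a toy
   register file; returns how many writes were applied. */
static int replay_until(long now, uint8_t regs[], int *consumed)
{
    int n = 0;
    while (*consumed < qlen && queue[*consumed].time <= now) {
        regs[queue[*consumed].addr & 0xF] = queue[*consumed].value;
        (*consumed)++; n++;
    }
    return n;
}
```

The catch, as noted, is anything that responds asynchronously: a queued write can't change what the other chip already did before the replay point.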
WedNESday
Posts: 1312
Joined: Thu Sep 15, 2005 9:23 am
Location: London, England

Post by WedNESday »

I am gonna make WedNESday cycle accurate. But there are a few things that I would like to make sure of first.

1. Is it worth updating the Data Bus after each cycle? I can't see how this would be useful in a 6502 emulator.

2. What is the 'Address Bus' that some documents keep referring to? Example:

Code: Select all

BRK

        #  address R/W description
       --- ------- --- -----------------------------------------------
        1    PC     R  fetch opcode, increment PC
        2    PC     R  read next instruction byte (and throw it away),
                       increment PC
        3  $0100,S  W  push PCH on stack (with B flag set), decrement S
        4  $0100,S  W  push PCL on stack, decrement S
        5  $0100,S  W  push P on stack, decrement S
        6   $FFFE   R  fetch PCL
        7   $FFFF   R  fetch PCH
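That table could translate into a one-call-per-cycle routine along these lines. The `rd`/`wr` helpers, the struct layout, and the simplified stack and flag handling are all hypothetical, but the per-cycle bus activity follows the table.

Code: Select all

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t pc;
    uint8_t  s, p, cycle;
    uint8_t  mem[0x10000];
} Cpu6502;

static uint8_t rd(Cpu6502 *c, uint16_t a) { return c->mem[a]; }
static void    wr(Cpu6502 *c, uint16_t a, uint8_t v) { c->mem[a] = v; }

/* One BRK cycle per call, mirroring the 7-cycle table above.
   Returns 1 when the instruction has finished. */
static int brk_cycle(Cpu6502 *c)
{
    switch (c->cycle++) {
    case 0: rd(c, c->pc++); break;                        /* fetch opcode        */
    case 1: rd(c, c->pc++); break;                        /* dummy read, inc PC  */
    case 2: wr(c, 0x0100 + c->s--, c->pc >> 8); break;    /* push PCH            */
    case 3: wr(c, 0x0100 + c->s--, c->pc & 0xFF); break;  /* push PCL            */
    case 4: wr(c, 0x0100 + c->s--, c->p | 0x10); break;   /* push P with B set   */
    case 5: c->pc = (c->pc & 0xFF00) | rd(c, 0xFFFE); break;        /* fetch PCL */
    case 6: c->pc = (c->pc & 0x00FF) | (rd(c, 0xFFFF) << 8);        /* fetch PCH */
            c->cycle = 0; return 1;
    }
    return 0;
}
```

Every cycle performs exactly one bus access, which is the property the per-cycle `UpdatePPU()`-style schemes in this thread rely on.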
3. I am quite confused by the 'pipelining' concept. Some documents say that the previous opcode is executed when the new one is fetched. Is this true? Or is the following correct:

Code: Select all

inline void ADC()
{
	PC++;
	CC++;

	A += ...	// Executed ADC NOW not when next opcode is fetched
	CC++;
}
Thanks in advance.
mozz
Posts: 94
Joined: Mon Mar 06, 2006 3:42 pm
Location: Montreal, Canada

Post by mozz »

WedNESday wrote:2. What is the 'Address Bus' that some documents keep referring to.
It just refers to the signalling lines that are used by the CPU to tell other bus participants (e.g. a memory controller?) what address it wants to read or write to.

Think of a "bus transaction" as follows: one party (the CPU) supplies an address on the address lines, and indicates whether it wants to read or write. If writing, it also supplies a "value" on the data lines. If reading, the other party (the memory controller?) is responsible for setting the data lines. One party writes to them and the other party reads from them.
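As a toy model, a bus transaction could be represented like this. The names are illustrative; real hardware does this with tri-state signal lines, not a struct.

Code: Select all

```c
#include <assert.h>
#include <stdint.h>

/* One bus transaction: address lines, data lines, and an R/W flag. */
typedef struct {
    uint16_t address;   /* address lines        */
    uint8_t  data;      /* data lines           */
    int      write;     /* R/W line: 1 = write  */
} Bus;

static uint8_t ram[0x800];  /* the "other party": 2KB of plain RAM */

/* The addressed device responds to the transaction. */
static void bus_cycle(Bus *b)
{
    if (b->write)
        ram[b->address & 0x7FF] = b->data;  /* device latches the data lines */
    else
        b->data = ram[b->address & 0x7FF];  /* device drives the data lines  */
}
```

A mapper or PPU in an emulator is just another responder hooked into `bus_cycle`, watching addresses instead of RAM.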
WedNESday wrote:3. I am quite confused by the 'pipelining' concept. Some documents say that the previous opcode is executed when the new one is fetched. Is this true?
Yes. But for practical purposes, you can pretend it isn't true.

"Pipelining" means dividing your logic that implements an instruction into multiple "stages". If you had three instructions A, B, C and a three-stage pipeline, then in cycle 1 instruction A enters the first stage. In cycle 2 instruction A moves on to the second stage, and instruction B enters the first stage. In cycle 3 instruction A enters the third stage, B enters the second stage and C enters the first stage.

The "latency" of instruction A was 3, because it had to pass through 3 pipeline stages (taking one cycle each) before it was completely finished. However, the "throughput" of this pipeline (assuming all instructions use all 3 stages and have latency of 3 cycles) would be 1 instruction per cycle! In other words, by splitting up the task of executing an instruction into 3 stages, we manage to be doing 3 things at once instead of just one thing. So our throughput is 3 times as high.
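That latency/throughput arithmetic can be checked with the usual formula: with no stalls, N instructions through an S-stage pipeline take S + (N - 1) cycles. A minimal sketch:

Code: Select all

```c
#include <assert.h>

/* Cycles for n instructions through an s-stage pipeline, assuming every
   instruction uses every stage and there are no stalls: the first one
   takes s cycles (its latency), and each later one completes one cycle
   after the previous (throughput of one per cycle). */
static long pipeline_cycles(long s, long n)
{
    return n > 0 ? s + (n - 1) : 0;
}
```

For S = 3 and N = 1000 that gives 1002 cycles, essentially one instruction per cycle, versus 3000 cycles if each instruction had to finish before the next one started.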

Now, it's not always so rosy. For one thing, instructions take different numbers of cycles and sometimes need to participate in only some of the pipeline stages. Also, sometimes there are bottlenecks: for example, the 80386 had a pipeline but the first stage had to share some resources with the second stage. So an instruction could only 'issue' into the pipeline every two cycles, and the minimum effective cost of each instruction was 2 cycles. The 486 fixed this and many of its instructions could then be executed effectively in one cycle, making it perform noticeably better.

Imagine a branch instruction enters the first stage of your pipeline, but you won't figure out if the branch is 'taken' or 'not taken' until stage 3 of the pipeline. What do you do on the next cycle? The branch moves up to stage 2 of the pipeline, but what should go into stage 1? You don't know yet if it's the instruction right after the branch, or the instruction at the branch target address. (Even if you knew, you'd have to fetch that byte before you could decode it.) Most processors have 'branch prediction' logic to try and figure out well in advance whether a branch is likely to be taken or not. They then assume that is the case and keep filling the pipeline with the predicted instructions. If it turns out it guessed wrong, it has to flush the parts of the pipeline that contain wrongly-predicted instructions, causing a "pipeline stall". On modern x86 processors, most branch instructions are predicted correctly 80% to 90% of the time. The rest of the time, you take a performance hit that could be anywhere from 10 to 25 cycles. [Edit: I think I forgot to mention that modern x86 processors have the equivalent of 15-25 stages in their pipelines. Modern GPUs in graphics cards have even more stages---dozens and dozens!]


Okay, so getting back to the 6502... think of it as having a "pipeline of depth two". The first stage performs a memory access (read or write), and the second stage performs the operation of the instruction. The first and second cycles of a 6502 instruction are consecutive reads. During the second cycle, it is also decoding the opcode and deciding what to do with it. Maybe the second read was wasted (e.g. an implied operand), or maybe it was a Direct Offset or an Absolute Address Low byte, or something else we would have needed to read anyway. Take the example of an implied operand, though. The *third* cycle of the instruction is when it's actually executed! But the instruction is only two cycles long, you say! And you're sort of right---that *third* cycle is the same cycle as when the opcode fetch of the next instruction is occurring:

Code: Select all

    A-1.  fetch A's opcode      (and, finish last instruction)
    A-2.  fetch next byte        (and decode A)
    B-1.  fetch B's opcode      (***and execute A)
    B-2.  fetch next byte        (and decode B, which turns out to be a Zero Page insn)
    B-3.  read from Zero Page   (and...idle)
    C-1.  fetch C's opcode     (and execute B)
But if you write these effects sequentially, does it really matter how you divide them up?

Even if it's really doing this:

Code: Select all

    (finish last instruction) and
    A-1.  fetch A's opcode
----------------------------
    (decode A) and
    A-2.  fetch next byte
----------------------------
    (execute A) and
    B-1.  fetch B's opcode
----------------------------
    (decode B, which turns out to be a Zero Page insn), and
    B-2.  fetch next byte
you can for the most part think of it instead as doing this:

Code: Select all

    ....
    (finish last instruction)
----------------------------
    A-1.  fetch A's opcode, and
    (decode A)
----------------------------
    A-2.  fetch next byte, and
    (execute A)
----------------------------
    B-1.  fetch B's opcode, and
    (decode B, which turns out to be a Zero Page insn)
----------------------------
    B-2.  fetch next byte, and
    ....
I'm no expert so there might be errors in this. But you get the general idea.
WedNESday
Posts: 1312
Joined: Thu Sep 15, 2005 9:23 am
Location: London, England

Post by WedNESday »

Very helpful, thanks. Does anyone know if resetting the NES clears the pipelines?
tepples
Posts: 23006
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Post by tepples »

Reset is one of the three interrupts (the others being IRQ and NMI). Any interrupt will clear the pipeline.
User avatar
Zepper
Formerly Fx3
Posts: 3262
Joined: Fri Nov 12, 2004 4:59 pm
Location: Brazil

Post by Zepper »

What about this...

1. Trigger NMI/IRQ if pending.
2. Fetch next instruction.
3. Decode the addressing mode & process the proper bytes (1 or 2).
4. Jump to the instruction using the addressing mode byte/word as argument.

Note: I could make great use of jump tables. A lot of opcodes with different addressing modes work as if they were only one.
mozz
Posts: 94
Joined: Mon Mar 06, 2006 3:42 pm
Location: Montreal, Canada

Post by mozz »

By the way, I don't think you should actually try to emulate the pipelines explicitly in your code! That would almost certainly be overkill.

Remember, external devices can ONLY interact with the CPU through memory accesses and the IRQ/NMI/RESET pins. So while a device (such as a memory mapper, or the PPU) is able to see reads and writes, it is NOT able to see exactly when the 6502 is doing the "work" of executing the instruction. That would be internal calculations such as adding a temporary value (perhaps read in the previous cycle) to A.

So as an emulator author, you just have to make sure the correct "internal operations" are performed sometime during the instruction after the inputs become available and before the output is needed. The 6502 actually performs them in parallel with bus operations and uses the 2-stage pipeline. But emulator authors should not need to do that.

This sort of timing stuff can get very complicated. If it helps you think about it, you could draw out a sort of "timeline of events" for an instruction. Divide it up into cycles, and mark down when the different actions (reads, writes, internal computations) are occurring. For anything that's not externally visible, there is no fixed time you have to do it at --- just dependencies on earlier events, and maybe later events that depend on this one.
Zepper wrote:Note: I could make great use of jump tables. A lot of opcodes with different addressing modes work as if they would be only 1.
Be aware that there is usually a performance penalty for jumping through a jump table. That is called a "dynamic branch" and modern processors will be very slow unless it is highly predictable, i.e. 95% of the time you are jumping to the same address. Opcode dispatch is almost the worst case for dynamic branch, being random enough that the CPU can almost never predict the correct address. For example, if you were to write your core so that each instruction did TWO table dispatches instead of just one, you would probably find that the whole core was now 10-20% slower! Of course it depends how fast the rest of your instruction emulation is... since the 6502 performs a read or write (which takes you many host cycles worth of work to emulate) on *every* cycle, maybe you're paying enough for that already that the dispatch overhead is kinda small in comparison.

I'll recommend this: try to do the table lookup into a register well in advance, and leave it there untouched for a few dozen cycles before you actually do the indirect jump. One of my coworkers once went for a seminar at Intel about their VTune product, and he asked about dynamic branches and was told that on Pentium 2's and Pentium 3's, a jump through a register is "essentially free" if the address sits in the register for at least 40 cycles before the jump. My takeaway from that story was that 40 cycles is long enough for the write to the register to retire from the ROB and be written back to the architecturally-visible register file. Anyway, if the value has been in the register long enough, the branch predictor will just use it as the predicted address. So try to read it as early as possible.
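That advice about loading the target early could look like this in a sketch. The names are invented, and whether the predictor actually benefits depends on the microarchitecture, but the shape of the code is the point.

Code: Select all

```c
#include <assert.h>

/* Read the handler pointer out of the table as soon as the opcode is
   known, do unrelated work, and only then make the indirect call. */

static int cycles;

static void op_nop(void) { cycles += 2; }
static void op_adc(void) { cycles += 2; }

typedef void (*Handler)(void);
static const Handler table[2] = { op_nop, op_adc };

static void step(int opcode)
{
    Handler h = table[opcode & 1]; /* fetched early, then left alone */
    cycles++;                      /* other work before the jump     */
    h();                           /* the indirect call comes last   */
}
```

In a real core the "other work" would be things like advancing the PPU and APU clocks, which conveniently takes many host cycles anyway.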

[Edit: I realize now this part of my post was disingenuous, because I'm actually working on generating cores that do in fact divide most instructions into two halves and do two dynamic branches to execute them. The first routine handles most of the addressing mode, and the second routine includes the various ALU ops and so on. It should considerably reduce code size (I think I'm about the only person on the planet who cares about that these days! hehe), and to avoid it being too slow I'll be doing exactly what I recommended above---in fact my two dispatch tables are combined into one table and my dispatch routine will load BOTH addresses into registers as early as possible. It will then do a few other things before branching to the first register. The second register will be untouched throughout the first handler for the instruction (the address mode code) unless it needs to be spilled by the handler for a nontrivial memory access such as a PPU access. So hopefully, the jump-through-a-register which connects the "first half" of my instruction to its "second half" will be essentially free most of the time. I'm kind of more interested in how much I can optimize code size anyway...]
User avatar
blargg
Posts: 3715
Joined: Mon Sep 27, 2004 8:33 am
Location: Central Texas, USA

Post by blargg »

mozz wrote:I'm kind of more interested in how much I can optimize code size anyways...
If you can get under around 4K of machine code + 1K for the jump table, you have my C++-based core beat. :)
User avatar
Zepper
Formerly Fx3
Posts: 3262
Joined: Fri Nov 12, 2004 4:59 pm
Location: Brazil

Post by Zepper »

As far as I know, there are 3 "options" for handling opcode emulation...

a) a LARGE case statement. You know, around 256 entries of case 00h...FFh inside an infinite loop. Pros? Cons?

b) jump tables. You use the opcodes as small "pieces", and then jump to the proper "piece" carrying an argument (byte or word). It makes the code much smaller. Pros? Cons?

c) inlined subroutines. The standard thing, newbie style perhaps? You code each opcode into a separate function, maybe inlined &| static. Each opcode is accessed through a pointer, something like (*function_op[opcode_num])(argument); Pros? Cons?

Actually, I use b. No clue if a small core could be worse than a large core, like the case statement.
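Option (c) might be sketched like this, with a 256-entry pointer table and one small function per opcode. All names here are hypothetical.

Code: Select all

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t a, x; uint16_t pc; } Regs;
typedef void (*OpFn)(Regs *, uint8_t);

static void op_lda_imm(Regs *r, uint8_t arg) { r->a = arg; r->pc += 2; }
static void op_nop(Regs *r, uint8_t arg)     { (void)arg; r->pc += 1; }

static OpFn op_table[256];

static void init_table(void)
{
    for (int i = 0; i < 256; i++) op_table[i] = op_nop;  /* default stub */
    op_table[0xA9] = op_lda_imm;                         /* LDA #imm     */
}

/* Fetch the opcode and jump through the table, passing the next byte
   as a ready-made argument (option b's "piece + argument" idea). */
static void dispatch(Regs *r, const uint8_t *mem)
{
    uint8_t op = mem[r->pc];
    op_table[op](r, mem[r->pc + 1]);
}
```

The tradeoff mozz described above applies to every style that ends in an indirect jump: a giant switch, a jump table, and a function-pointer table all compile down to some form of dynamic branch, so the differences are mostly in code size and readability.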
WedNESday
Posts: 1312
Joined: Thu Sep 15, 2005 9:23 am
Location: London, England

Post by WedNESday »

Here's mine:

Code: Select all

__forceinline void OperationCode69()
{
	CPU.PC++;
	CPU.CC++;
	// PPU/APU
	// ADC
	CPU.PC++;
	CPU.CC++;
	// PPU/APU
}

__forceinline void OperationCode()
{
	switch(CPU.Memory[CPU.PC])
	{
		case 0x69: OperationCode69(); return;
	}
}
Note the return, not break, at the end of the opcode fetch. As long as there is no code after the switch, you can exit the function immediately. I know that without macros the cycle-for-cycle memory accessing will make the code large, but never mind. Is there a faster method of doing this kind of thing?
User avatar
baisoku
Posts: 121
Joined: Thu Nov 11, 2004 5:30 am
Location: San Francisco, CA

Post by baisoku »

WedNESday wrote:Here's mine:

Code: Select all

__forceinline void OperationCode()
{
	switch(CPU.Memory[CPU.PC])
	{
		case 0x69: OperationCode69(); return;
	}
}
Note that this generates the exact same code as:

Code: Select all

__forceinline void OperationCode()
{
	switch(CPU.Memory[CPU.PC])
	{
		case 0x69: OperationCode69(); break;
		case 0x70: OperationCode70(); break;
	}
}
So you're not actually gaining anything.
...patience...
WedNESday
Posts: 1312
Joined: Thu Sep 15, 2005 9:23 am
Location: London, England

Post by WedNESday »

The return would cause the function to exit straight away, whilst the break would cause the program to exit the switch and then return from the function. Surely the version without the break is faster?
User avatar
Zepper
Formerly Fx3
Posts: 3262
Joined: Fri Nov 12, 2004 4:59 pm
Location: Brazil

Post by Zepper »

Compile both versions and compare the binaries using a hex editor. ^_^;;..