cycle for cycle stuff
I don't get Quietust's original comment in the first place. What he described is executing partial instructions, and it will handle the case of reading $2002 just after the VBL flag is cleared (as will any method which communicates the time of the memory read on the fourth instruction clock, even just read_memory( addr, timestamp + 3 )).
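The timestamp-passing idea mentioned above might look roughly like this. This is a hedged sketch, not code from any real emulator: `read_2002`, `ppu_catch_up`, and the globals are all invented names, and the PPU stepping itself is elided.

```c
/* Sketch: pass the exact CPU cycle of the read so the PPU can be
   caught up before the VBL flag is sampled. All names are invented. */

static long ppu_time;   /* CPU cycle the PPU has been emulated up to */
static int  vbl_flag;   /* current state of the VBL flag */

static void ppu_catch_up( long timestamp )
{
    /* advance the PPU to 'timestamp'; a real implementation would
       emulate three PPU dots per CPU cycle here and update vbl_flag
       at the correct dot (stepping elided in this sketch) */
    while ( ppu_time < timestamp )
        ppu_time++;
}

static int read_2002( long timestamp )
{
    ppu_catch_up( timestamp );   /* flag is now exact for this cycle */
    int result = vbl_flag ? 0x80 : 0x00;
    vbl_flag = 0;                /* reading $2002 clears the VBL flag */
    return result;
}
```

Because the timestamp travels with the read, the PPU only has to be caught up when its state is actually observed.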
When someone says "executing partial instructions", I think of the ability to halt the CPU in the middle of an instruction. My approach is not the same - though it does emulate the individual cycles, it still must execute one full instruction at a time.
Quietust, QMT Productions
P.S. If you don't get this note, let me know and I'll write you another.
It would return $00.
As Q mentioned, his core is cycle-accurate, but execution is instruction-granular, so if you tell it to execute N cycles, it will likely execute slightly fewer or slightly more, depending on where the instruction boundary falls. In the grand scheme of things, keeping the rest of the system in sync with this is trivial.
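That instruction-granular slice loop can be sketched as follows. Names are invented, and the stand-in `execute_instruction` always returns 4 cycles purely for illustration (a real core returns the actual 2-7 cycle cost of each opcode).

```c
/* Sketch of instruction-granular execution: run until at least
   'target' cycles have elapsed; we may overshoot by a few cycles
   because instructions are atomic. All names are invented. */

static long cpu_cycles;

/* stand-in: pretend every instruction costs 4 cycles */
static int execute_instruction( void )
{
    return 4;
}

static long run_until( long target )
{
    while ( cpu_cycles < target )
        cpu_cycles += execute_instruction();
    return cpu_cycles - target;   /* overshoot, credited to next slice */
}
```

The returned overshoot is what makes "slightly more" harmless: the next time slice simply starts that many cycles in.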
I wrote a CPU core that could halt in the middle of an instruction. To say it was slow would be a vast understatement. It wasn't terribly useful aside from the novelty of the idea.
"When someone says 'executing partial instructions', I think of the ability to halt the CPU in the middle of an instruction."

If you're running the PPU and APU each CPU clock, then the CPU emulator is halting in the middle of an instruction:
Code: Select all
int opcode = read_mem( pc++ );
run_ppu_and_apu(); // halts here
switch ( opcode )
{
case 0xAD: // LDA abs
    {
        int lo = read_mem( pc++ );
        run_ppu_and_apu(); // halts here
        int hi = read_mem( pc++ );
        run_ppu_and_apu(); // halts here
        a = read_mem( (hi << 8) | lo );
        run_ppu_and_apu(); // halts here
        set_nz( a );
        break;
    }
...
}
Code: Select all
case 0xAD: // LDA abs
    switch ( phase++ )
    {
    case 0: lo = read_mem( pc++ ); break;
    case 1: hi = read_mem( pc++ ); break;
    case 2: a = read_mem( (hi << 8) | lo ); break;
    case 3: set_nz( a ); opcode = read_mem( pc++ ); phase = 0; break;
    }
    break;
Code: Select all
while ( clocks_remain-- )
{
    run_one_cpu_clock();
    run_ppu_and_apu();
}
I understand what you are saying, but I prefer my method. First of all, the 6502 emulator that I wrote will also be used in some of my other emulators (e.g. Atari 2600). Secondly, it is easier to handle interrupts this way, and I can also emulate the BRK bug (I am going for MAXIMUM accuracy here, baby!). Thirdly, there is the simplicity of it all, as I am emulating what the NES does exactly. I also don't have the function overheads/increased .exe size from updating the PPU/APU like Nintendulator does. I have inlined all of the opcodes for maximum speed.
Example:
Code: Select all
inline void OpticCodeAD()
{
    switch(CPU.Cycle)
    {
    case 0:
        CPU.PC++;
        CPU.Cycle++;
        break;
    case 1:
        CPU.TMP2 = CPU.Memory[CPU.PC];
        CPU.PC++;
        CPU.Cycle++;
        break;
    case 2:
        CPU.TMP2 += (CPU.Memory[CPU.PC] << 8);
        CPU.PC++;
        CPU.Cycle++;
        break;
    case 3:
        CPU.A = CPU.Memory[CPU.TMP2];
        CPU.P &= 0x7D;
        if( !CPU.A )
            CPU.P += 0x02;
        CPU.P += (CPU.A & 0x80);
        CPU.Cycle = 0;
        break;
    }
    CPU.CC++;
}
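The interrupt-handling benefit mentioned above could be sketched like this. This is a hedged illustration with invented names, not WedNESday's actual code; real 6502 interrupt polling happens at specific points inside each instruction, which is exactly what a per-cycle core makes easy to model.

```c
/* Sketch: because the core steps one cycle at a time, the IRQ/NMI
   lines can be sampled on every cycle, and an interrupt sequence
   begins on the next instruction boundary (cycle == 0).
   All names here are invented for illustration. */

struct cpu_state
{
    int cycle;        /* position within the current instruction */
    int irq_line;     /* level-sensitive IRQ input */
    int nmi_pending;  /* latched edge-sensitive NMI */
    int in_isr;       /* stand-in for pushing PC/P and jumping */
};

static void clock_interrupts( struct cpu_state *cpu )
{
    /* called once per emulated CPU cycle */
    if ( cpu->cycle == 0 && ( cpu->nmi_pending || cpu->irq_line ) )
    {
        cpu->in_isr = 1;     /* a real core would start the 7-cycle
                                interrupt sequence here */
        cpu->nmi_pending = 0;
    }
}
```

An instruction-granular core can get the same timing right, but it has to reason about where inside the instruction the line changed; here the sampling point falls out of the loop structure.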
"Thirdly is the simplicity of it all, as I am emulating what the NES does exactly."

The NES works via electrons moving in transistors (or something even more basic, if you want to go to a subatomic level). An emulator doesn't emulate this. Most work at a higher level, emulating the behavior of the CPU instructions.
"I also don't have the function overheads/increased .exe size from updating the PPU/APU like Nintendulator does. I have inlined all of the opcodes for maximum speed."

What you show above is probably slower, since it adds lots of branching and function calls. But programmer intuition has never been what determines the speed of code. What does your profiler say?
Yeah, yeah, I meant at a higher level anyway. I know that this method will make my emulator very slow due to the switch/case branching, but since I have inlined every opcode there are actually no function calls at all. As for my profiler, I don't have one, but my probation officer says that if I don't keep my nose clean, it'll be back to the state pen. for me.
- Guest
blargg: Technically, yes, it's stopping in the middle of the instruction. As far as performance goes, there are drastic differences between the two.
WdNESday: if you think about it, there is no logical difference between the two approaches. If the other parts are implemented properly, the CPU won't be able to tell the difference, and the only thing you get out of that approach is a slight cleanup in the outer loop running the CPU core. Going down that route for the purpose of personal curiosity is fine, but keep in mind that you get no technical benefit, and a slowdown of about 100x compared to the instruction-granular with cycle-accurate side effects approach.
Thanks for the advice. I know that it will make it slower, but I am implementing this because the core will also be used for other consoles/computers in other emulators. Also, I like the simplicity of it.
For example, the (NTSC) VBlank time is 2273(.3) cc's. If we are on cycle 2272 and STA Absolute is executed, then the first cycle wouldn't need any PPU drawing/fetching, but the others would. Observe:
Code: Select all
for( int cc = 0; cc < 2273; cc++ )
{
    FetchOpcode();
}
for( int cc = 0; cc < 29393; cc++ )
{
    FetchOpcode();
    Draw3Pixels();
}
This way, if we are in a VBlank period the PPU won't need any checking. My method ensures that there are no wasted calls to Draw3Pixels(). Also observe the following (let's say that we are on a different console/computer):
Let's pretend that after case 1 was executed there was some kind of automatic bankswitching that meant that a different high byte was fetched. This method would ensure that the correct byte is fetched.
Code: Select all
inline void OpticCodeAD()
{
    switch(CPU.Cycle)
    {
    case 0:
        CPU.PC++;
        CPU.Cycle++;
        break;
    case 1:
        CPU.TMP2 = CPU.Memory[CPU.PC];
        CPU.PC++;
        CPU.Cycle++;
        break;
    case 2:
        CPU.TMP2 += (CPU.Memory[CPU.PC] << 8);
        CPU.PC++;
        CPU.Cycle++;
        break;
    case 3:
        CPU.A = CPU.Memory[CPU.TMP2];
        CPU.P &= 0x7D;
        if( !CPU.A )
            CPU.P += 0x02;
        CPU.P += (CPU.A & 0x80);
        CPU.Cycle = 0;
        break;
    }
    CPU.CC++;
}
-deleted-
Last edited by Zepper on Sun Jun 21, 2009 8:16 pm, edited 1 time in total.
Zepper
RockNES author
There's nothing stopping instruction-granular execution from handling the situation you mention regarding a timed bankswitch, if implemented correctly.
If the switch is timed, then it should be updated per CPU cycle like the rest of the hardware, and the memory accesses should account for the side effects possibly invalidating direct fetches, so the memory fetch should go through code rather than direct array access.
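Routing fetches through code when a bankswitch is timed might look like the sketch below. The names (`switch_time`, `bank_before`, `bank_after`) are invented for illustration and don't correspond to any particular mapper.

```c
#include <stdint.h>

/* Sketch: a timed bankswitch means the byte fetched depends on *when*
   the fetch happens, so reads consult the timestamp instead of
   indexing a flat memory array. All names are invented. */

static long          switch_time;   /* CPU cycle at which the bank changes */
static const uint8_t *bank_before;  /* ROM mapped in before the switch */
static const uint8_t *bank_after;   /* ROM mapped in after the switch */

static uint8_t fetch( uint16_t addr, long timestamp )
{
    const uint8_t *bank = ( timestamp < switch_time ) ? bank_before
                                                      : bank_after;
    return bank[addr & 0x3FFF];     /* 16 KB bank window */
}
```

With this, an instruction-granular core still fetches the correct byte: it only has to pass each memory access its true cycle number, exactly as discussed earlier in the thread.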
Any NES CPU emulator which includes the timestamp of memory accesses can be used as the basis for a "cycle-accurate" NES emulator. The general rule is, any number of hardware modules can be emulated on an as-needed ("catch-up") basis as long as the future effects of all but one module on the others can easily be predicted in advance. This is the case for the NES, where the CPU is the only entity whose future effect can only be determined by doing the actual emulation.
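That catch-up rule can be sketched as a time-slice loop. All names below are invented stand-ins; in particular the per-module `run_until` functions just record how far each module has been brought up to date, where a real implementation would emulate PPU dots and APU clocks over that interval.

```c
/* Sketch of catch-up ("as-needed") scheduling: the CPU runs ahead,
   and the other modules are only brought up to date when their state
   is observed or at the end of a time slice. Invented names. */

static long cpu_time, ppu_time, apu_time;

/* stand-ins: a real implementation would emulate the module's
   activity from its current time up to 't' */
static void ppu_run_until( long t ) { if ( ppu_time < t ) ppu_time = t; }
static void apu_run_until( long t ) { if ( apu_time < t ) apu_time = t; }

static void end_time_slice( long slice_length )
{
    cpu_time += slice_length;   /* stand-in for running CPU instructions */
    ppu_run_until( cpu_time );  /* catch everything up at the boundary */
    apu_run_until( cpu_time );
}
```

Since the PPU and APU can't unpredictably change the CPU's behavior mid-slice (their observable effects are reached only through timestamped reads), deferring them like this loses no accuracy.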
OK, today I finally finished the new cycle-for-cycle accurate 6502 emulator. I immediately hooked it up to WedNESday to test it out. I didn't bother to include any PPU/APU accesses, memory mapping/trapping, or blitting, and used a 1x window, as I just wanted a rough estimate of how slow the core was.
Boy, nothing could prepare me for it.
On my P4 2.2 GHz I had 60 FPS, a full 30 times slower than the previous core, which got 1800 FPS in the same situation.
Please don't say I told you so. I did listen to you guys and I always agreed with you all the way; it was just that I wanted to give it a try because no one had done it before.
- lord_Chile
a question
What is the name of Quietust's emulator? Did you release it?
Good day to nesdev people. Lord..
Author of nothing =P
UTFSM Sansano programmer.. lord_Chile
Saludos a la Sede JMC de la UTFSM... Viña del Mar, CHILE