I have a working 2A03 emulator, but would like to optimize it as much as possible. Blargg's website has some good information, but there are a couple of points that are still unclear.
* The same addressing modes are re-used numerous times. For instance, LDA ($nn),Y will use the same effective address calculation as ORA ($nn),Y. In general, on an x86 platform, is it faster to inline the effective address calculation (thus minimizing CALL/RET overhead), or is it faster to use subroutines for each form of calculation (thus minimizing code size and making better use of the L1 cache)?
* One suggestion I've heard is to not calculate the N and Z flags on every opcode that sets them (almost all of them), but instead to simply keep a variable that contains the last data byte that affected N/Z, and only parse the flags when needed. Therefore, BEQ/BNE would simply check whether the last data byte was 0, and BMI/BPL would check whether bit 7 was set, and it would only be necessary to change the flags into 2A03 format for PHP or interrupts. But, if this method is used, how can the emulator handle setting N and Z simultaneously via BIT, PLP, or RTI?
6502 emulation optimization
-
dvdmth
- Posts: 354
- Joined: Wed Mar 22, 2006 8:00 am
Inline is generally faster, although you might want to use a profiler to test different kinds of optimization. If you don't know how to use a profiler, find out.
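As a rough illustration of the inlining option, here is a minimal C sketch in which the ($nn),Y effective address calculation is written once as a `static inline` helper and shared by LDA and ORA. The names `mem`, `ea_ind_y`, `lda_ind_y`, and `ora_ind_y` are hypothetical, the memory model is a flat array, and page-cross cycle penalties are left out. Whether the compiler actually inlines the helper, and whether that wins, is something only a profiler can confirm.

```c
#include <stdint.h>

/* Assumption: a flat 64 KB array stands in for the real memory map. */
static uint8_t mem[0x10000];
static uint8_t reg_a, reg_y;

/* ($nn),Y: read a 16-bit pointer from zero page, then add Y.
   The pointer's high byte wraps within zero page, as on the 6502. */
static inline uint16_t ea_ind_y(uint8_t zp, uint8_t y)
{
    uint16_t base = (uint16_t)(mem[zp] | (mem[(uint8_t)(zp + 1)] << 8));
    return (uint16_t)(base + y);
}

/* Both opcodes share the same address calculation. */
static void lda_ind_y(uint8_t zp) { reg_a = mem[ea_ind_y(zp, reg_y)]; }
static void ora_ind_y(uint8_t zp) { reg_a |= mem[ea_ind_y(zp, reg_y)]; }
```

Writing the calculation as one helper keeps the source in a single place while leaving the inline-versus-call decision to the compiler, so the two layouts can be compared without rewriting every opcode.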
To handle BIT, PLP, and RTI, you need a way to set N and Z arbitrarily. One way I can think of is to use a 16-bit (or larger) variable to hold the N/Z result. Treat the N flag as set if bit 7 or bit 15 is set in the variable (nz & 0x8080), and treat the Z flag as set if bits 0-7 are all zero ((nz & 0xFF) == 0). For most opcodes, simply set the N/Z value to the 8-bit operation result (making sure bit 15 never gets set accidentally), and for opcodes such as BIT, store the N result in bit 15 and the inverse of Z in bit 0, while leaving the other bits clear. (This is not necessarily the best way to do it, but it's one way I can think of that it can be done.)
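The 16-bit scheme described above could be sketched in C roughly as follows. All function names here (`set_nz`, `set_nz_flags`, `flag_n`, `flag_z`, `op_bit`, `pack_nz`) are hypothetical, and the V flag that BIT also sets is omitted to keep the example short.

```c
#include <stdint.h>

/* Deferred N/Z holder: low byte is the last result, bit 15 is a forced N. */
static uint16_t nz;

/* Ordinary opcodes: store the 8-bit result, so bit 15 stays clear. */
static void set_nz(uint8_t result) { nz = result; }

/* BIT, PLP, RTI: N goes in bit 15, the inverse of Z goes in bit 0. */
static void set_nz_flags(int n, int z)
{
    nz = (uint16_t)((n ? 0x8000u : 0u) | (z ? 0u : 1u));
}

/* Tests used by BMI/BPL and BEQ/BNE. */
static int flag_n(void) { return (nz & 0x8080u) != 0; }
static int flag_z(void) { return (nz & 0x00FFu) == 0; }

/* BIT: Z comes from A & M, N comes from bit 7 of M (V not modeled here). */
static void op_bit(uint8_t a, uint8_t m)
{
    set_nz_flags((m & 0x80) != 0, (a & m) == 0);
}

/* Only PHP and interrupts need the packed form: N is bit 7 of P, Z is bit 1. */
static uint8_t pack_nz(void)
{
    return (uint8_t)((flag_n() ? 0x80 : 0x00) | (flag_z() ? 0x02 : 0x00));
}
```

The case a single result byte cannot represent is N and Z both set, for example BIT with A = $00 and M = $80. Here `nz` becomes 0x8000, so the low byte is zero (Z set) while bit 15 supplies N.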
"Last version was better," says Floyd. "More bugs. Bugs make game fun."
-
Dwedit
- Posts: 5256
- Joined: Fri Nov 19, 2004 7:35 pm
-
randilyn
- Posts: 16
- Joined: Tue Mar 28, 2006 6:22 am
-
Zepper
- Formerly Fx3
- Posts: 3262
- Joined: Fri Nov 12, 2004 4:59 pm
- Location: Brazil
I don't know if this method is the fastest, or even 'compiler-friendly', but I use jumps (goto). First, I wrote my core instruction by instruction, separated by addressing mode. Later, I started to optimize it by removing redundant code blocks wherever they could be executed by a similar block from another instruction. It's something like...
- It starts inside a case statement for the addressing mode. Once the argument is fetched (immediate, byte, word...), it jumps into the proper block to execute the instruction. CPUOP() is a jump label, and OPEND is a goto op_end. If you're good enough, you'll notice this might also work for a giant case statement, but I never tried it.
//ADDRESSING #1 (offset)
//Parameter: offset (unsigned short)
CPUOP(ORA1)
value = readvalue(offset);
_doCPUOP(ORA0);
CPUOP(ASL1)
value = readvalue(offset);
ASL(value);
writevalue(offset, value);
OPEND
Zepper
RockNES author
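For readers unfamiliar with the goto style described in this post, here is a minimal self-contained C sketch of the general idea. It is not RockNES code: the `Cpu` struct, the `step` function, and the label names are assumptions made for illustration, the operand is passed in already decoded, and ASL's carry flag is left out.

```c
#include <stdint.h>

/* Hypothetical CPU state with a flat 64 KB memory array. */
typedef struct {
    uint8_t  a;
    uint16_t nz;            /* last result that affected N/Z */
    uint8_t  mem[0x10000];
} Cpu;

enum { ORA_IMM = 0x09, ORA_ABS = 0x0D, ASL_ABS = 0x0E };

/* Each case only fetches the operand for its addressing mode, then
   jumps to a body shared by every form of the same instruction. */
static void step(Cpu *c, uint8_t opcode, uint16_t operand)
{
    uint8_t value;

    switch (opcode) {
    case ORA_IMM:                       /* operand is the value itself */
        value = (uint8_t)operand;
        goto do_ora;
    case ORA_ABS:                       /* operand is an address */
        value = c->mem[operand];
        goto do_ora;
    case ASL_ABS:                       /* read-modify-write, carry omitted */
        value = (uint8_t)(c->mem[operand] << 1);
        c->mem[operand] = value;
        c->nz = value;
        goto op_end;
    default:
        goto op_end;
    }

do_ora:                                 /* shared ORA body */
    c->a |= value;
    c->nz = c->a;
op_end:
    return;
}
```

The ORA body exists only once no matter how many addressing modes reach it, which is the redundancy removal the post describes. Whether this beats plain inlined cases depends on the compiler and is worth measuring.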
-
tepples
- Posts: 22993
- Joined: Sun Sep 19, 2004 11:12 pm
- Location: NE Indiana, USA (NTSC)
randilyn wrote: The CPU portion of the 2A03 cores used in modern emulators is easily one of the fastest components in the emulation performance-wise, and isn't likely to really impact speed enough to be worth optimizing beyond the [obvious] basics.
Unless your target platform contains hardware that accelerates the PPU and half the PSG. Loopy would be familiar with these.