I've been writing an emulator in Java, where one of the primary targets is low-power mobile-like devices. While I can do the core CPU/PPU emulation at ~750fps on my desktop, I'm only getting ~15fps on the slowest targets.
I've already put in a lot of optimization work, and at this rate I don't see enough opportunities left. I'm wondering what corners I can cut to get the serious improvements I need with the smallest impact on overall compatibility. While I would love to have a perfectly compatible emulator, I've pretty much resigned myself to the fact that I'll need to break some things to reach playable speeds on current-gen hardware.
Any suggestions?
Performance suggestions
Dwedit wrote: I could explain the algorithm I use in PocketNES if you're interested.
I'm interested. If you don't mind typing it up, I'd love to hear how you're doing this.
Thanks
get nemulator
http://nemulator.com
tepples: For a given scanline, I spend about 4x more time in the PPU than in the CPU. I was a bit surprised that it was that close, given that you have to render 256 pixels per 25 CPU instructions or so. That's after spending most of my effort trying to make the PPU faster, though.
Dwedit: The CPU and PPU are standard J2SE with different frontends for different targets, like Swing and Android. I'd love to hear about your speedhacks.
James wrote:
  Dwedit wrote: I could explain the algorithm I use in PocketNES if you're interested.
  I'm interested. If you don't mind typing it up, I'd love to hear how you're doing this.
  Thanks
Same here.
Zepper
RockNES author
PocketNES implements two main kinds of speed hacks: jump hack and branch hack. Both in effect freeze the emulated CPU until an interrupt occurs. There are four sources of interrupts in the NES: NMI from the PPU, completion IRQ from the DMC, timer IRQ from the APU frame counter, and an IRQ from the mapper. Barring certain kinds of heavy raster effects, you won't get more than about two interrupts per frame, so you can freeze the CPU for a relatively long time.
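As a rough Java sketch of the "freeze until the next interrupt" idea: assume each of the four interrupt sources can report the CPU cycle at which it will next fire. The class `IdleSkipper`, its method names, and the `NEVER` marker are hypothetical names for illustration, not PocketNES code.

```java
final class IdleSkipper {
    /** Hypothetical marker meaning "this source has nothing scheduled". */
    static final long NEVER = Long.MAX_VALUE;

    /** Earliest cycle at which any of the four interrupt sources fires. */
    static long nextInterruptCycle(long nmi, long dmcIrq, long frameIrq, long mapperIrq) {
        return Math.min(Math.min(nmi, dmcIrq), Math.min(frameIrq, mapperIrq));
    }

    /**
     * Cycle the frozen CPU should jump ahead to: the next interrupt,
     * capped at the end of the frame so the frame loop still terminates.
     * Never moves the clock backwards.
     */
    static long skipTo(long now, long nextInterrupt, long frameEnd) {
        long target = Math.min(nextInterrupt, frameEnd);
        return Math.max(now, target);
    }
}
```

Once the emulator knows the CPU is idle, it would set its cycle counter to `skipTo(...)` instead of executing instructions one by one. The PPU and APU still have to be caught up to that cycle.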
The "jump" hack is for games that use the "superloop" structure, such as Super Mario Bros. In these games, the entire game runs as NMI and IRQ handlers. The NMI handler updates VRAM and then runs the next frame of game logic.
Code:
; initialize the registers and the game loop variables
; for the first time, and once that's done, just
; jump in place forever
forever:
  jmp forever

nmihandler:
  pha
  txa
  pha
  ; ...
  pla
  tax
  pla
  rti
For this, the CPU can just stop until the next interrupt and then adjust its timing based on which cycle of the JMP instruction the interrupt hit.
The other is for games that repeatedly read a variable that the NMI handler updates and branch based on it. For example, LJ65, Concentration Room, Lawn Mower, and Thwaite all use this structure:
Code:
  ; ...
  lda retraces
nmiwaitloop:
  cmp retraces
  beq nmiwaitloop
  ; ...

nmihandler:
  inc retraces
  rti
Some games' NMI handlers are much longer than this, for example doing all the VRAM and audio updates and signaling at the end that NMI has occurred. But it illustrates the sort of tight loop that a "branch" speed hack exploits. The emulator can look for short loops that include no store instructions, detect what address the loop is spinning on, skip running the CPU until an interrupt occurs, and then adjust the CPU timing based on where in the loop the interrupt occurred.
PocketNES gets a lot of mileage out of its speed hacks because it delegates most of the work of drawing tiles to the GBA's PPU and most of the work of playing audio to the GBC's APU. This leaves the CPU as by far the biggest item on the profile. On a platform with a dumb frame buffer, such as your PCs and Android devices, your mileage may vary.
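For a Java core, the two loop shapes described above could be recognized along these lines. This is only a sketch under assumptions, not how PocketNES does it: `SpeedHackDetector` and its methods are hypothetical names, `mem` is assumed to be a flat 64K array of byte values as the CPU sees them, only the `CMP`/`BEQ` form of the wait loop is matched, and page-crossing branch penalties and exact interrupt-polling latency are ignored.

```java
final class SpeedHackDetector {
    static final int JMP_ABS = 0x4C, CMP_ZP = 0xC5, CMP_ABS = 0xCD, BEQ = 0xF0;

    /** "Jump" hack: a JMP absolute whose target is its own address. */
    static boolean isJumpToSelf(int[] mem, int pc) {
        if (mem[pc & 0xFFFF] != JMP_ABS) return false;
        int target = mem[(pc + 1) & 0xFFFF] | (mem[(pc + 2) & 0xFFFF] << 8);
        return target == (pc & 0xFFFF);
    }

    /**
     * "Branch" hack: CMP zp/abs followed by BEQ back to that CMP.
     * Returns the cycle counts {cmp, beq} of the loop, or null if no match.
     */
    static int[] spinLoopCycles(int[] mem, int pc) {
        int op = mem[pc & 0xFFFF];
        int len, cmpCycles;
        if (op == CMP_ZP) { len = 2; cmpCycles = 3; }
        else if (op == CMP_ABS) { len = 3; cmpCycles = 4; }
        else return null;
        int branchAt = pc + len;
        if (mem[branchAt & 0xFFFF] != BEQ) return null;
        int offset = (byte) mem[(branchAt + 1) & 0xFFFF]; // signed displacement
        if (offset != -(len + 2)) return null;            // must land on the CMP
        return new int[] { cmpCycles, 3 };                // taken branch: 3 cycles
    }

    /**
     * First instruction boundary at or after the interrupt, assuming the
     * CPU is at the top of the loop at cycle `now`. A real emulator would
     * also set PC to whichever instruction of the loop comes next.
     */
    static long resumeCycle(long now, long interruptCycle, int[] instrCycles) {
        int loop = 0;
        for (int c : instrCycles) loop += c;
        long idle = interruptCycle - now;
        if (idle <= 0) return now;
        long t = now + (idle / loop) * loop; // whole iterations skipped at once
        int i = 0;
        while (t < interruptCycle) {         // finish the instruction in flight
            t += instrCycles[i];
            i = (i + 1) % instrCycles.length;
        }
        return t;
    }
}
```

For the jump hack, `resumeCycle` would be called with `new int[] { 3 }`, since `JMP` absolute takes three cycles.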
foobaz wrote: How can you delegate tile drawing to the hardware? Isn't there too much state that can change per scanline to make that possible?
PocketNES sets up three of the GBA's four DMA channels for HDMA, pointing at the GBA's equivalents of PPUSCROLL, PPUCTRL, and PPUMASK (BG0HOFS/BG0VOFS, BG0CNT, and DISPCNT respectively). The fourth is used to stream decoded DPCM (but not $4011 PCM, unfortunately for Big Bird fans).
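On a frame-buffer target there is no HDMA, but the same "one table entry per scanline" idea can be mimicked in software. The sketch below is an assumption-laden Java illustration, not PocketNES code: `ScanlineRegisterLog` and all of its members are hypothetical. The CPU side records register writes as they happen, the values are latched once per scanline, and a renderer can later draw each line from the table without running in lockstep with the CPU.

```java
final class ScanlineRegisterLog {
    static final int VISIBLE_LINES = 240;

    // One entry per visible scanline, like the tables HDMA would stream from.
    private final int[] scrollX = new int[VISIBLE_LINES];
    private final int[] scrollY = new int[VISIBLE_LINES];
    private final int[] ctrl = new int[VISIBLE_LINES];
    private final int[] mask = new int[VISIBLE_LINES];

    // Most recent values written by the emulated CPU.
    private int curScrollX, curScrollY, curCtrl, curMask;

    void setScroll(int x, int y) { curScrollX = x & 0xFF; curScrollY = y & 0xFF; }
    void setCtrl(int value) { curCtrl = value & 0xFF; }
    void setMask(int value) { curMask = value & 0xFF; }

    /** Call once at the start of each visible scanline. */
    void latchLine(int line) {
        scrollX[line] = curScrollX;
        scrollY[line] = curScrollY;
        ctrl[line] = curCtrl;
        mask[line] = curMask;
    }

    int scrollXAt(int line) { return scrollX[line]; }
    int scrollYAt(int line) { return scrollY[line]; }
    int ctrlAt(int line) { return ctrl[line]; }
    int maskAt(int line) { return mask[line]; }
}
```

This only captures changes at scanline granularity, so mid-scanline raster effects would be lost. That is one of the compatibility trade-offs the original question asks about.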