cycle for cycle stuff

Discuss emulation of the Nintendo Entertainment System and Famicom.
augnober
Posts: 23
Joined: Sun Jan 08, 2006 12:22 pm

Post by augnober »

-_pentium5.1_- wrote:
augnober wrote:Thinking about it more, I think a different language could help a lot.
What language in particular do you have in mind?
I'm not thinking of any particular language; I'm just being optimistic for the sake of not ruling it out. I don't want to go too far off the topic of emulation, but I'll explain the sort of thing I think could be of use, since I didn't explain it before and haven't seen this subject mentioned before (although, since I haven't seen it mentioned, I also unfortunately don't know much about it and don't feel right going on about it at length.. hmm..)

If a high-level language were designed to facilitate saving and loading execution state, then this could be abstracted across platforms (saving a developer from the impractically hacky, platform-specific work). If the state weren't required to be transferable between platforms, then in theory, with a good optimizing compiler, I don't see why it would necessarily be too slow. At its simplest, a facility to track pointers that will be relocated on load, and to register heap regions which must be saved (I'm not being thorough here), may be enough to put a language on its way toward the goal (things such as keeping the stack organized are common to many compilers anyway, and may just need to be handled in more depth). To avoid giving the language too much burden/overhead in an attempt to make it automatically work in every situation, perhaps the application programmer would be expected to keep restoration in mind and act accordingly. Code modification after a state is saved could cause trouble, but not so much as to disallow all changes.
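To make the pointer-fixup idea concrete, here is a minimal sketch in C++ (every name here is invented for illustration, not a real language feature): the heap region to be saved is registered with the runtime, and pointers inside it are serialized as offsets, so a reloaded image can live at a different base address.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: a "registered" heap whose internal pointers are
// saved as pool indices, so a snapshot can be reloaded at a new address.
struct Node { int value; Node* next; };

struct SaveableHeap {
  std::vector<Node> pool;  // the registered heap region

  // Save: convert each next-pointer into an index into the pool (-1 = null).
  std::vector<std::pair<int,int>> save() const {
    std::vector<std::pair<int,int>> image;
    for (const Node& n : pool) {
      int link = n.next ? int(n.next - pool.data()) : -1;
      image.push_back({n.value, link});
    }
    return image;
  }

  // Load: rebuild the pool, then fix pointers up against the new base.
  void load(const std::vector<std::pair<int,int>>& image) {
    pool.clear();
    pool.reserve(image.size());  // reserve so data() stays stable
    for (const auto& e : image) pool.push_back({e.first, nullptr});
    for (std::size_t i = 0; i < image.size(); i++)
      if (image[i].second >= 0) pool[i].next = &pool[image[i].second];
  }
};
```

A real design would also have to handle the stack and code addresses, which is exactly where compiler support would come in.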

I bet some computer scientists have gotten obsessed with the idea of being able to stop and restore any execution state and have gone ahead and developed a high level language to support it across different platforms, even if only as a proof of concept (or to secure a grant, to graduate, whatever). I'm not aware of any languages advertised as such though. Strangely, I haven't felt any need for such a language since I'm generally satisfied with being able to save and load my own data.. but this project of byuu's is interesting :) In a way, interpreted languages make it easy enough to restore state by nature.. so it's surely been done before even if by accident, but the question is whether or not someone's gone for restorable execution state and also good performance.

If someone has made something like I'm describing with compilers which can generate code not much slower than the best C++ compilers, then I'd be interested in hearing about it :) Or if someone has tried and there's a reason why performance is necessarily poor, I'd be interested in finding out why.
mozz
Posts: 94
Joined: Mon Mar 06, 2006 3:42 pm
Location: Montreal, canada

Post by mozz »

blargg wrote:
mozz wrote:I'm kind of more interested in how much I can optimize code size anyway...
If you can get under around 4K of machine code + 1K for the jump table, you have my C++-based core beat. :)
For interest's sake: Last weekend, I wrote by hand about 90-95% of a 6502 core (in x86 assembly code) which uses two handlers to implement each instruction. The dispatch table will be 1K (two 16-bit fields in each entry) and so far the code size is about 750 bytes, however I am missing some code:
- need more dispatch code, probably about 50 bytes worth
- the memory read and write routines are not present. They will probably weigh in around 200 bytes.
- BRK and a few other instructions (one of the complex undocumented insns--ARR?--is not implemented yet) will probably add another 50 bytes or maybe even more..

Anyway, it's currently looking like the entire core will weigh about 2K including the dispatch table. I still haven't tried to assemble or run it, and inevitably it has bugs. I don't know how well it will perform (and frankly, having a core that small is so cool that even if it turns out to be slower, I will probably keep it that way! hehe)
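For concreteness, the two-handlers-per-instruction scheme can be sketched in C++ (the real core above is hand-written x86 assembly; the tiny CPU state, handler names, and opcodes below are a toy illustration, with function pointers standing in for the two 16-bit table fields):

```cpp
#include <cstdint>

// Sketch of two-handler dispatch: each opcode pairs an addressing-mode
// handler (computes the effective address) with an operation handler.
struct Cpu {
  uint8_t a = 0;          // accumulator
  uint8_t mem[256] = {0}; // toy 256-byte address space
  uint16_t addr = 0;      // effective address set by the mode handler
  uint16_t pc = 0;        // program counter
};

using Handler = void(*)(Cpu&);

void mode_immediate(Cpu& c) { c.addr = c.pc++; }       // operand is inline
void mode_zeropage(Cpu& c)  { c.addr = c.mem[c.pc++]; } // operand byte = address
void op_lda(Cpu& c)         { c.a = c.mem[c.addr]; }    // load accumulator

// One table entry = two handlers; a size-optimized core could pack these
// as two 16-bit offsets to fit the table in 1K.
struct Entry { Handler mode; Handler op; };
Entry dispatch[256];

void step(Cpu& c) {
  uint8_t opcode = c.mem[c.pc++];
  dispatch[opcode].mode(c);  // first handler: addressing mode
  dispatch[opcode].op(c);    // second handler: the operation itself
}
```

The appeal is that the handler bodies are shared heavily: every zero-page instruction reuses `mode_zeropage`, so total code size grows with (modes + operations), not their product.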
mattmatteh
Posts: 345
Joined: Fri Jul 29, 2005 3:40 pm
Location: near chicago

Post by mattmatteh »

I don't think blargg uses any asm. Mine doesn't, as it's portable. I will have some asm, but with C code as an option. Using all asm limits portability.

matt
User avatar
blargg
Posts: 3715
Joined: Mon Sep 27, 2004 8:33 am
Location: Central Texas, USA

Post by blargg »

My 6502 emulator doesn't use any assembly, but it also doesn't emulate any unofficial instructions nor some of the more subtle aspects of read-modify-write instructions yet (dummy reads). So mozz's likely 2-3K core with unofficial instruction support will be quite an achievement.
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Post by Near »

Bah, pack that baby down to 256 bytes and release it as a demoscene .com app :P

Or more realistically, it'd be kind of cool to try and design the smallest possible NES emulator in pure assembly as a .com DOS file.

For those who use Macs, a .com file is basically an executable with no header. You can't have one bigger than 64k, and it runs in 16-bit mode but you can use 32-bit registers in it.

TNES had one that was ~30kb, but it went up in size and eventually became a .exe file anyway.
tepples
Posts: 23006
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Post by tepples »

Under Windows, the .com suffix is also useful for fooling n00bs into executing your rootkit, as they think it's a shortcut to some web site.
mozz
Posts: 94
Joined: Mon Mar 06, 2006 3:42 pm
Location: Montreal, canada

Post by mozz »

byuu wrote:Or more realistically, it'd be kind of cool to try and design the smallest possible NES emulator in pure assembly as a .com DOS file.
It's a cool idea, but I don't know realistically how small you can make some of the stuff. The PPU, for example: perhaps you could make one that is small or fast, but not both, and I have no idea how small either way. And then think of the mappers... yikes! I bet even a clever size-optimized implementation of 100+ mappers would be bigger than my asm core. :P

At the end of the day, if all the important code+tables fits in an L1 cache then that will be good enough.

One thing I hope to do with my code generator is experiment a bit. If my core were 10x bigger but used only one handler per instruction instead of two, would it be faster or slower? (I suspect it would be a tiny bit faster, but truly I am just guessing!) But I have lots to do before I get to that point (start compiling and debugging these cores, generate a suite of automated instruction tests, get the cores running those tests, etc.).

Edit: Also, my code targets 32-bit code and a flat memory model, while DOS used 16-bit code and segmented memory. Instructions using a 32-bit register would be one byte bigger; however, near call instructions would be 2 bytes smaller! =)
User avatar
blargg
Posts: 3715
Joined: Mon Sep 27, 2004 8:33 am
Location: Central Texas, USA

Post by blargg »

You don't need to worry about your entire code set fitting in the cache, just the often-used portion. For the CPU emulator, for example, a small set of instructions dominate. The same probably goes for what hardware is most accessed, like the PPU status register.
dvdmth
Posts: 354
Joined: Wed Mar 22, 2006 8:00 am

Post by dvdmth »

Going back a little ways here...

byuu - You're looking for a way to do savestates in a co-threaded execution model? I used co-threads once before, so I'm used to how they work. I might have a suggestion here, although I don't know how feasible it is, as I don't know how your code is set up.

Each of the three emulation threads (CPU, PPU, APU) should have one "safe yield" point where a savestate can be made. The PPU would most likely have this point during its V-Blank (or forced blank, possibly) handler, since that would be an easy time to capture the current state, as the PPU isn't really doing anything other than counting cycles. The CPU and APU would have a "safe yield" between instructions, where execution would return to the main loop within their respective cores. Upon resuming from a "safe yield," all three threads should receive a message stating whether the thread should archive its state (to a section of memory, pending write to disk) or not. This way, each thread is responsible for archiving its own state.

All threads should, upon creation, receive an "unarchive" message, telling them to load an existing state and jump to the code immediately following the "safe yield" location. If a savestate is loaded, currently executing threads would be destroyed and new threads created, so that a thread would not have to test for "unarchiving" after every yield.

You should have no trouble synchronizing the PPU and CPU to "safe yield" points, since the CPU should (in most cases) execute numerous instructions during V-Blank. The tricky part, however, is finding a "safe yield" point for the APU. My suggestion is this: When the user hits the freeze-state button/key, continue execution until the CPU and PPU are both at safe yield points, then tell those two threads to archive. Next, continue execution until the APU is at a safe yield point, then archive its state and write to disk (making sure to include the amount of time elapsed between the CPU/PPU states and the APU state). When resuming from a savestate, be sure to block APU execution until the appropriate amount of time passes.

The only time this will fail is if the CPU reads from an APU register (writes to an APU register won't be a problem, as their effects would be remembered in the APU state). The way to solve this would be to archive the state of these registers on every CPU read cycle between the CPU state and the APU state, then restore those values after each cycle when loading the savestate. Thus, if the CPU ever accesses these registers during this window, it would receive the value returned by the APU at that time.

The only remaining problem I can think of would be if DMA takes up the entire V-Blank time, but I don't see how that could happen during gameplay (is it possible?).
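A rough sketch of this archive protocol in C++ (the units and their state are invented placeholders, and "safe point" is reduced to a fixed clock interval; in a real emulator the safe points would be opcode or blanking boundaries):

```cpp
#include <vector>

// Toy emulated unit: clock stands in for real chip state, and a safe
// yield point occurs every safe_interval clocks.
struct Unit {
  long clock = 0;
  int safe_interval;
  explicit Unit(int n) : safe_interval(n) {}
  bool at_safe_point() const { return clock % safe_interval == 0; }
  void step() { clock++; }
};

struct Snapshot { long cpu, ppu, apu, apu_lag; };

// CPU and PPU archive at a common safe point first; the APU keeps
// running until its own safe point, and the snapshot records the
// extra time so the APU can be held back on load.
Snapshot save_state(Unit& cpu, Unit& ppu, Unit& apu) {
  while (!cpu.at_safe_point() || !ppu.at_safe_point()) {
    cpu.step(); ppu.step(); apu.step();
  }
  long mark = apu.clock;            // CPU/PPU archived here
  while (!apu.at_safe_point()) apu.step();
  return {cpu.clock, ppu.clock, apu.clock, apu.clock - mark};
}
```

The `apu_lag` field is the "amount of time elapsed" mentioned above: on restore, the APU would be blocked for that many clocks before resuming.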
Near
Founder of higan project
Posts: 1553
Joined: Mon Mar 27, 2006 5:23 pm

Post by Near »

Actually, a single DMA transfer can span over eleven fields (think of fields as frames in non-interlaced mode).

8 channels * 65536 bytes/channel * 8 clocks/byte transferred = 4194304 clocks
1364 clocks/scanline * 262 lines/field = 357368 clocks/field
4194304 / 357368 = 11.7 fields
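The arithmetic above can be double-checked directly (a trivial sketch, using the constants as given):

```cpp
// Worst-case SNES DMA length versus the length of one NTSC field.
long dma_clocks()   { return 8L * 65536 * 8; }  // channels * bytes/channel * clocks/byte
long field_clocks() { return 1364L * 262; }     // clocks/scanline * lines/field

double fields_spanned() {
  return double(dma_clocks()) / double(field_clocks());
}
```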

Anyway, I'm aware that if I can align the CPU and APU on opcode boundaries or "cheat" and execute instructions out of order (so long as they aren't communicating with each other it won't matter), I can save states at "safe" points. But I really wanted to avoid that sort of code trickery if possible. Especially a time buffer. If I have a time buffer between CPU and APU communications, I may as well not use cothreading to begin with. The goal was quite honestly to simplify the code.

We're really only talking about a single opcode of desynchronization anyway, and even then it can't be detected unless the two are communicating with each other (e.g. the CPU is accessing an APU port and the APU is accessing a CPU port; one or the other alone would not cause a problem). It's not like other emulators (except probably SS) have more accurate savestates anyway ... and this allows savestate conversion between emulators, since they cannot resume mid-opcode.

So then :

Code: Select all

void save_state() {
  //run until the CPU is inside vblank (scanlines 240-260)
  while(r_cpu->scanline() < 240 || r_cpu->scanline() > 260) snes->run();
  //finish any partially executed CPU and APU opcodes
  while(r_cpu->in_opcode()) r_cpu->run();
  while(r_apu->in_opcode()) r_apu->run();
  capture_savestate();
}
The only uncertain part left is the PPU. Right now, the PPU does not have its own thread, and runs one scanline at a time. Therefore, I have no re-entry problems with it.
However, with a dot-based renderer, things might get trickier. Taking a snapshot every field may work, but isn't as point-specific as I'm sure some would like. Specifically, romhackers would suffer if the savestate leaped forward while they were testing a small block of code. I would need to make the PPU renderer re-entrant, or force it to "run ahead" to the next "safe point", and I have no idea where that would be.

Lastly, I'm still planning on creating interpretive opcode emulators that can be combined with the current scanline PPU renderer. It should hopefully be as accurate as SNES9x ever will be (but nowhere near as fast) at least, and won't require cothreading at all.

And as always, I'm still interested in a transparent way to create recursive re-entrant functions that get at least 80% of the speed of cothreading, without actually requiring cothreading.