LLVM-MOS NES targets
-
- Posts: 6
- Joined: Sat Apr 23, 2022 12:34 pm
Hi nesdev,
I'm the primary codegen author for the LLVM-MOS 6502 backend for Clang/LLVM. I finally got around to playing around with developing some quick example routines for my NES, and I ended up with the basis for a real port for LLVM-MOS to the NES.
I've added skeletal targets to the LLVM-MOS SDK for the NES-NROM-128, NES-NROM-256, and NES-SLROM (MMC1) boards. I'm planning to target at least one board for each Nintendo-produced mapper, to make sure that various banking schemes are all reasonably supportable.
Right now, only basic NES power-on functionality is provided, as well as the usual C runtime initialization and finalization routines. I've also added a small PPU support library and a simple color-cycling example. The compiler outputs both ELF binaries (useful for command-line manipulation) and iNES 2.0 files. The contents of various fields in the iNES header can be controlled by setting the values of corresponding linker symbols. All the math to construct the eventual iNES 2.0 header is handled automatically.
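To give a feel for the header math mentioned above, here is a minimal host-side sketch in C of how the first ten bytes of an NES 2.0 header are packed from a few cartridge parameters. It follows the publicly documented NES 2.0 layout; the struct, field names and function name are hypothetical and are not the SDK's linker symbols or its actual implementation.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical cartridge description; these names are illustrative only. */
struct rom_config {
    uint32_t prg_rom_bytes;      /* multiple of 16 KiB */
    uint32_t chr_rom_bytes;      /* multiple of 8 KiB */
    uint16_t mapper;             /* 0..4095 */
    uint8_t  submapper;          /* 0..15 */
    uint8_t  vertical_mirroring; /* 1 = header bit 0 set, 0 = clear */
};

/* Fill a 16-byte NES 2.0 header; bytes 10..15 are left as zero. */
void build_nes2_header(const struct rom_config *cfg, uint8_t out[16]) {
    uint16_t prg_units = (uint16_t)(cfg->prg_rom_bytes / 16384u);
    uint16_t chr_units = (uint16_t)(cfg->chr_rom_bytes / 8192u);

    memset(out, 0, 16);
    out[0] = 'N'; out[1] = 'E'; out[2] = 'S'; out[3] = 0x1A;
    out[4] = (uint8_t)(prg_units & 0xFF);              /* PRG size, low byte */
    out[5] = (uint8_t)(chr_units & 0xFF);              /* CHR size, low byte */
    out[6] = (uint8_t)(((cfg->mapper & 0x0F) << 4)     /* mapper bits 0-3 */
                       | (cfg->vertical_mirroring & 1));
    out[7] = (uint8_t)((cfg->mapper & 0xF0) | 0x08);   /* mapper bits 4-7, NES 2.0 id */
    out[8] = (uint8_t)(((cfg->submapper & 0x0F) << 4)
                       | ((cfg->mapper >> 8) & 0x0F)); /* mapper bits 8-11 */
    out[9] = (uint8_t)((((chr_units >> 8) & 0x0F) << 4)
                       | ((prg_units >> 8) & 0x0F));   /* size high nibbles */
}
```

For an NROM-256 style image (32 KiB PRG-ROM, 8 KiB CHR-ROM, mapper 0) this yields size bytes 2 and 1 and an identifier byte of 0x08.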
Our targets are set up in a hierarchical fashion, so it can be as little as 50 lines to add a new one to the SDK. For example, both the NES-NROM-128 and NES-NROM-256 targets inherit most of their code/config from an incomplete NES-NROM target. Accordingly, I'd eventually like to collect pretty much all of the production boards (well, that anyone's interested in developing on, at least) into the SDK. If we set things up right, it shouldn't take too much maintenance overhead per board to keep them around, which should provide a nice out-of-the-box experience when developing for them.
Let me know if you have any questions about the NES targets, our plans, comments, critiques, etc. I'll add that we're perennially committed to improving the quality of generated code, although this competes with other concerns like correctness, portability, and maintainability. Still, at the end of the day, a compiler is only useful if it generates "good enough" code, and we want llvm-mos to be good enough for all but the tightest inner loops of a game or application (and maybe even those, someday).
Re: LLVM-MOS NES targets
Any support for structure rearrangement? (e.g. an array of 256 16-bit numbers being striped into two separate 8-bit arrays to use the faster instructions)
Re: LLVM-MOS NES targets
Not yet, but it's something we'd definitely like to build. Making the analysis safe is a bit tricky: you have to prove that no wide pointers to the interior of the array can escape the analysis, and you have to convert all pointer uses that don't escape.
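Until the compiler can do this itself, the striping from the question can be written by hand. The sketch below is only an illustration of the layout; the array and helper names are made up. With separate low-byte and high-byte arrays, element i sits at the same 8-bit index in both, so a 6502 index register can reach all 256 entries directly. An interleaved uint16_t array would need the index doubled, which no longer fits in 8 bits.

```c
#include <stdint.h>

#define COUNT 256

/* Striped layout: low bytes and high bytes live in separate arrays. */
static uint8_t values_lo[COUNT];
static uint8_t values_hi[COUNT];

/* Read element i by recombining the two halves. */
static inline uint16_t striped_get(uint8_t i) {
    return (uint16_t)(values_lo[i] | ((uint16_t)values_hi[i] << 8));
}

/* Write element i by splitting it into its two halves. */
static inline void striped_set(uint8_t i, uint16_t v) {
    values_lo[i] = (uint8_t)(v & 0xFF);
    values_hi[i] = (uint8_t)(v >> 8);
}

/* Sum all elements modulo 65536; the loop counter fits in one byte. */
uint16_t striped_sum(void) {
    uint16_t sum = 0;
    uint8_t i = 0;
    do {
        sum = (uint16_t)(sum + striped_get(i));
    } while (++i != 0); /* wraps to 0 after 255, ending the loop */
    return sum;
}
```

Note that nothing here takes a uint16_t pointer into the data, which is exactly the property the automatic analysis would have to prove.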
Re: LLVM-MOS NES targets
Would it be possible to add a special __attribute__ instead or in addition? A promise by the programmer that it will only be used in the optimized way?
Re: LLVM-MOS NES targets
Maybe; it's not something I've spent much time thinking about yet. We'd need a really precise definition of what the optimized way actually is, as it'd be undefined behavior if the programmer stepped out of line. Ideally, this would also be difficult to do accidentally.
That's why we've tended to shy away from hand annotation whenever possible; it adds to the number of things the programmer needs to keep in mind, and it decreases the compiler's flexibility.
For example, you probably wouldn't want to do this optimization if the array was of length 257; if it were automatic, then there's no risk of the programmer forgetting to remove the annotation if they change the array size. That's why modern compilers almost completely ignore the register keyword, for example. Such hints end up decreasing performance in practice, since their performance implications are complex and incompletely understood by programmers.
Still, we do use some hand annotation; there will always be things that the compiler won't ever reasonably be able to figure out. I'd wager for this one, doing it automatically may only be around 1.5x or 2x harder than via annotation, but I'll know more once I get around to it.
Re: LLVM-MOS NES targets
One optimization I'd really like to see is elimination of variables on the stack.
This would be for code that isn't recursive (either directly or indirectly).
I could describe it better later if there's any interest.
Re: LLVM-MOS NES targets
We actually do that one; we call it "static stack optimization". We analyze the call graph of each translation unit, and the stack frame of each function we can prove non-recursive is replaced with a global variable. We default to generating code at link time (i.e., link-time optimization), so a "translation unit" is typically the whole program.
Eventually, we'd like to allow static stack regions for functions that cannot be simultaneously active to overlap. There's not much technical obstacle to doing so; I just haven't gotten around to it yet.
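As a rough picture of what replacing a stack frame with a global means, here is a hand-written C sketch. The function names are invented for illustration, and the second version is only a conceptual equivalent, not the compiler's output. The recursive variant shows the case that cannot be lowered this way, since every active call needs its own copy of its locals.

```c
#include <stdint.h>

/* As written: `digits` is an automatic variable in the function's frame. */
uint8_t digit_sum_auto(uint16_t n) {
    uint8_t digits[5]; /* a 16-bit value has at most 5 decimal digits */
    uint8_t count = 0, sum = 0;
    do {
        digits[count++] = (uint8_t)(n % 10);
        n /= 10;
    } while (n != 0);
    while (count != 0)
        sum = (uint8_t)(sum + digits[--count]);
    return sum;
}

/* Conceptual result for a provably non-recursive function: the frame
   becomes a fixed global region, so accesses use absolute addresses. */
static uint8_t digit_sum_frame[5];

uint8_t digit_sum_static(uint16_t n) {
    uint8_t count = 0, sum = 0;
    do {
        digit_sum_frame[count++] = (uint8_t)(n % 10);
        n /= 10;
    } while (n != 0);
    while (count != 0)
        sum = (uint8_t)(sum + digit_sum_frame[--count]);
    return sum;
}

/* Recursive: several calls are active at once, so a single global frame
   would be overwritten. This one has to keep a real stack. */
uint8_t digit_sum_recursive(uint16_t n) {
    if (n < 10)
        return (uint8_t)n;
    return (uint8_t)(n % 10 + digit_sum_recursive((uint16_t)(n / 10)));
}
```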
Re: LLVM-MOS NES targets
I briefly mentioned this on the discord, but there's been two fairly big additions to LLVM-MOS's code generator since the last time I checked in.
First, static stack frames of different functions can now overlap if the functions can be proven never to be active simultaneously.
Second, the compiler now allocates the zero page! It scans the whole-program call graph, estimates how often each instruction in each function is called, estimates the cycle/byte savings of moving each possible global/local/constant to the zero page, then greedily allocates candidates best-first until the available zero page is consumed. Each target defaults to using all available zero page, but the amount the compiler can use can be capped with a compiler flag. As with static stack, zero page frames can overlap.
There are a few things that the current approach can't do, but overall it works pretty well. (The compiler doesn't have a notion of an 8-bit pointer yet, so it can't see any benefit to rewriting general pointer loops over arrays lifted to the zero page. This only applies if absolute indexed addressing mode wasn't selected, though.)
Here's an example. Note that only 25 bytes of zero page are used; foo does not conflict with bar, so they share the same region of the zero page. The large array in main is placed in a static stack in main memory, as usual. Sections that begin with '.zp' are automatically placed in the zero page by the SDK's linker scripts. In this example there's no real savings, but it does show off the semantics.
Code:
/* Storing each local's address in a volatile global keeps the arrays alive. */
static char * volatile global;

__attribute__((noinline)) void foo(void) {
    char foo_local[5];
    global = foo_local;
}

__attribute__((noinline)) void bar(void) {
    char bar_local[10];
    global = bar_local;
}

int main(void) {
    char main_local[15];
    char big_local[512]; /* too large for the zero page */
    global = main_local;
    global = big_local;
    foo();
    bar();
    return 0;
}
Code:
foo:
    ldx #mos8(.Lfoo_zp_stk)
    ldy #mos8(0)
    stx global
    sty global+1
    rts
bar:
    ldx #mos8(.Lbar_zp_stk)
    ldy #mos8(0)
    stx global
    sty global+1
    rts
main:
    ldx #mos8(.Lmain_zp_stk)
    ldy #mos8(0)
    stx global
    sty global+1
    ldx #mos16lo(.Lmain_sstk)
    ldy #mos16hi(.Lmain_sstk)
    stx global
    sty global+1
    jsr foo
    jsr bar
    ldx #0
    txa
    rts

    .section .bss.global,"aw",@nobits
global:
    .short 0

    .section .zp.noinit..Lzp_stack,"aw",@nobits
.Lzp_stack:
    .zero 25

    .section .noinit..Lstatic_stack,"aw",@nobits
.Lstatic_stack:
    .zero 512

    .set .Lfoo_zp_stk, .Lzp_stack+15
    .size .Lfoo_zp_stk, 5
    .set .Lbar_zp_stk, .Lzp_stack+15
    .size .Lbar_zp_stk, 10
    .set .Lmain_zp_stk, .Lzp_stack
    .size .Lmain_zp_stk, 15
    .set .Lmain_sstk, .Lstatic_stack
    .size .Lmain_sstk, 512
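The 25-byte figure in the listing can be reproduced with a small host-side model. This is a simplified sketch of one way to lay out overlapping frames on a non-recursive call graph, not a description of the compiler's actual algorithm; the struct and function names are hypothetical. Each function's frame is placed just past the end of every direct caller's frame, so functions that can be active together never share bytes, while siblings such as foo and bar do.

```c
#include <stddef.h>

#define MAX_CALLERS 8

/* Hypothetical call-graph node for the layout model. */
struct func {
    const char *name;
    unsigned frame_size;      /* bytes of frame the function needs */
    int callers[MAX_CALLERS]; /* indices of direct callers */
    size_t num_callers;
    unsigned offset;          /* computed: start of the frame */
};

/* Assign offsets and return the total region size. `funcs` must list
   every caller before its callees, which is only possible when the
   call graph has no recursion. */
unsigned layout_frames(struct func *funcs, size_t n) {
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned offset = 0;
        for (size_t c = 0; c < funcs[i].num_callers; c++) {
            const struct func *caller = &funcs[funcs[i].callers[c]];
            unsigned end = caller->offset + caller->frame_size;
            if (end > offset)
                offset = end; /* start after the deepest caller frame */
        }
        funcs[i].offset = offset;
        if (offset + funcs[i].frame_size > total)
            total = offset + funcs[i].frame_size;
    }
    return total;
}
```

Fed with main (15 bytes) calling foo (5 bytes) and bar (10 bytes), the model puts main at offset 0 and both callees at offset 15, for a total of 25 bytes, matching the .set directives in the listing.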
The next big "humans do this but compilers just don't" optimization is converting arrays of structs to structs of arrays. But I think it's important to pause at this point and start improving the SDK's libraries; a full suite of hardware registers for the NES is near the top of my list. I'll try to port over cc65's headers wherever appropriate so there's a degree of compatibility between the compilers.
Take care!
Re: LLVM-MOS NES targets
I've joined the LLVM-MOS project in some capacity and would like to bump the thread to bring an update on NES support in LLVM-MOS, and its code generation in general, since July 2022. The full changelog is available here, as usual.
Code generation
- Whole-program automatic zero page allocation - LLVM-MOS now automatically allocates global variables/constants, function local variables, and callee-saved registers to function-specific zero page locations whenever possible. There are also heuristics implemented to estimate and try to maximize benefit during selection of variables to be promoted in such a way.
- Marking sections and variables as zero-page; right now this relies on __attribute__((section)), but work on proper C-side support (__zeropage address space) is ongoing.
- Small memory copy/set operations are now properly inlined, instead of emitting an expensive library call.
- Many minor and major code generation optimizations have been added.
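The selection step described for zero page allocation is a greedy, best-first pass. Below is a deliberately simplified sketch of that idea. The savings numbers would come from the compiler's own estimates, which are not modelled here, and the names, the ranking by raw savings and the choice to skip candidates that do not fit are all assumptions made for illustration.

```c
#include <stdlib.h>

/* Hypothetical allocation candidate: a global, local or constant. */
struct candidate {
    const char *name;
    unsigned size;    /* bytes needed in the zero page */
    unsigned savings; /* estimated benefit of moving it there */
    int in_zero_page; /* output: 1 if selected */
};

/* Order candidates by descending estimated savings. */
static int by_savings_desc(const void *a, const void *b) {
    const struct candidate *ca = a;
    const struct candidate *cb = b;
    return (cb->savings > ca->savings) - (cb->savings < ca->savings);
}

/* Take the best candidates first until the budget is used up.
   Returns the number of zero page bytes consumed. */
unsigned allocate_zero_page(struct candidate *c, size_t n, unsigned budget) {
    unsigned used = 0;
    qsort(c, n, sizeof *c, by_savings_desc);
    for (size_t i = 0; i < n; i++) {
        c[i].in_zero_page = c[i].size <= budget - used;
        if (c[i].in_zero_page)
            used += c[i].size;
    }
    return used;
}
```

The budget parameter plays the role of the cap mentioned earlier in the thread: lowering it leaves part of the zero page free for hand-written code.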
The NES targets have been completely reworked:
- Many additional mappers are supported, and existing ones have been reworked. The list is now: CNROM, NROM, MMC1, MMC3, Action 53 (thanks to jroweboy) and UNROM; homebrew scene favorites UNROM-512 and GTROM are scheduled for the upcoming release. Most mappers now have test suites, which also serve as examples on how to use their banking functionality in code. Suggestions for additional mappers are welcome!
- The iNES header information is now fully configurable using either an assembly-language file or C macros.
- The neslib, nesdoug and FamiTone2 libraries have been ported over and are now available in LLVM-MOS.
- The .dpcm section has been added for correctly allocating DPCM sample data without hassle; in addition, a __dpcm_offset symbol is defined with the correct value for APU usage. (Note that on 32K mappers, like GTROM, each bank has its own .dpcm_N section.)
- Many cc65 headers (nes.h, peekpoke.h) have been ported over to LLVM-MOS for easier code porting.
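For context on why DPCM placement needs care: the APU's sample address register ($4012) can only express addresses of the form $C000 + 64 * A, and the length register ($4013) only lengths of the form 16 * L + 1 bytes. The sketch below shows just that arithmetic on the host. The helper names are hypothetical and it does not use or model the SDK's __dpcm_offset symbol.

```c
#include <stdint.h>

/* Value for $4012: sample address = $C000 + 64 * value. */
static inline uint8_t dpcm_address_byte(uint16_t cpu_address) {
    return (uint8_t)((cpu_address - 0xC000u) >> 6);
}

/* Value for $4013: sample length = 16 * value + 1 bytes. */
static inline uint8_t dpcm_length_byte(uint16_t length_bytes) {
    return (uint8_t)((length_bytes - 1u) >> 4);
}

/* A sample start is representable only at or above $C000 and
   aligned to 64 bytes. */
static inline int dpcm_address_valid(uint16_t cpu_address) {
    return cpu_address >= 0xC000u && (cpu_address & 0x3Fu) == 0;
}
```

A sample at $C040 that is 17 bytes long would therefore be played with $4012 = 1 and $4013 = 1.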
Re: LLVM-MOS NES targets
mysterymath, asie, anyone else who has worked on this project, I just wanted to say thanks for bringing this to the 6502, and including so many NES resources in the SDK. The results from the compiler are excellent.
I've attached a couple of simple demos. They show how to build from a batch file in Windows, and how to include CHR-ROM data with an NROM program.
example11, by Shiru. Originally for cc65, where the worst-case frames take up the entire frame. With LLVM-MOS, it takes a little over 1/3rd of the frame. Note that it builds with the wrong mirroring; thankfully this is much easier to configure with the newest SDK version, but I left it as it was.
ballsc, the same benchmark test of naively-written array of structs code that I've run on cc65, vbcc6502, and now here. I think vbcc was getting 62 of 64 objects; LLVM-MOS is the first one to handle all 64, and it has significant idle time left: about 60 scanlines, more than 6K CPU cycles.
It's really cool to see support might be added for the unofficial DCP instruction. When I manually optimized neslib's vram_write and vram_read functions, I used DCP in there. AXS is also a nice one for incrementing X, inc by four is common when dealing with OAM.
I was wondering if I could help the project by adding optimizing info into the compiler, but looking through the LLVM docs, there's a lot to take in. I'd like to help out where I can, though.
I was wondering also, if it's worth considering including an "identity table" to extend the instruction set.
They are all 3-byte, 4-cycle instructions, but (for example) allowing something like SBC ident,y is 2 cycles faster than doing STY $00, SBC $00. See https://www.nesdev.org/wiki/Identity_table
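For readers who haven't met the trick, an identity table is a 256-byte table where entry i holds the value i. Indexing it with X or Y turns a register into a memory operand, which is what makes forms like SBC ident,y possible. The C model below only illustrates the semantics on the host; the names are made up, and in a real ROM the table would be constant data rather than filled in at run time.

```c
#include <stdint.h>

/* identity[i] == i for every byte value. */
static uint8_t identity[256];

/* Fill the table; on the NES this would be const data in ROM. */
void identity_init(void) {
    for (unsigned i = 0; i < 256; i++)
        identity[i] = (uint8_t)i;
}

/* Models `SEC` followed by `SBC identity,y`: A minus Y with 8-bit
   wraparound, without first storing Y to a temporary. */
uint8_t a_minus_y(uint8_t a, uint8_t y) {
    return (uint8_t)(a - identity[y]);
}
```

The cycle counts in the post line up with this: STY zero page plus SBC zero page cost 3 + 3 cycles, while SBC absolute,Y costs 4 when no page boundary is crossed, which a page-aligned table guarantees.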
Attachments:
- example11-llvm-mos.zip (7.91 KiB)
- ballsc.zip (4.6 KiB)
Re: LLVM-MOS NES targets
Thank you for your long-time work for the NES scene, and I'm happy to have been able to add support for your mapper.
You might also find mysterymath's ports of the nesdoug examples to LLVM-MOS interesting.
Quote: It's really cool to see support might be added for the unofficial DCP instruction. When I manually optimized neslib's vram_write and vram_read functions, I used DCP in there. AXS is also a nice one for incrementing X, inc by four is common when dealing with OAM.
DCP for multiple-byte decrements has now been merged, and should be available in the next release. It will, however, require the -mcpu=mos6502x switch while compiling; we don't enable unofficial instructions by default. (As a side note, we've also optimized multi-byte decrements for official 6502 opcode users.)
AXS is a little trickier to add support for.
(As another side-note, as a newcomer to NES development, I'm pretty sure it was this tweet which initially set me on the path of adding support for them. Thank you!)
Quote: I was wondering if I could help the project by adding optimizing info into the compiler, but looking through the LLVM docs, there's a lot to take in. I'd like to help out where I can, though.
The LLVM-MOS Discord server is where most of the chat happens. However, even without contributing to the code itself, identifying places which require optimization and documenting them (providing good test cases as issues on the GitHub repository; there are already some open) is a great help in itself. If you'd like to tackle LLVM itself nonetheless, I highly recommend following the video resources linked at the LLVM-MOS wiki as a start - that's what I did, and they really helped provide a "bird's eye" view of the target backend architecture. I also recommend playing around with the godbolt Compiler Explorer - it has support for many LLVM-MOS targets, and "Add New... -> LLVM Opt Pipeline" can be used to explore all of the compiler passes performed on the code in a (relatively) user-friendly manner.
Alternatively, contributing to the SDK side of things - libraries, non-LLVM tooling - would be of great help too. While llvm-mos-sdk mostly serves as a set of low-level implementations for many targets, a kind of starting point, there's almost certainly demand for wrapping the LLVM-MOS tooling around a more opinionated set of libraries and tools, providing an actual ready-to-use NES developer workflow. Cogwheel over on our Discord has been looking into this as an option, after working on some neslib optimizations - you may find that of particular interest.
Quote: I was wondering also, if it's worth considering including an "identity table" to extend the instruction set.
I don't see a reason not to support it, especially as we already implement an "identity table" (of dynamic size) for mappers with bus conflicts. I have opened a relevant issue, but I cannot promise an ETA.