Just wanted to share that I've formalized my project to write some scripts for automating aspects of setting up disassembly projects.
https://gitlab.com/segaloco/disnes
After the link is "disnes", a script that takes an iNES ROM and attempts to stand up a make(1)-based build package with PRG banks disassembled and CHR broken up into individual tiles. It is built on top of da65 for the disassembly and creates a package targeting ca65 and ld65. I stuck these bits in variables in the hope that porting this to a different toolset is possible, but I haven't flexed any of those abstractions yet.
Anywho, documentation is included in the form of manpages as usual, and licensing information is included as well. At present this only supports NROM, as that was the simplest to build up the basics around, but I intend to add more mappers as I go.
The resulting package should build back into an identical ROM. A checksum mechanism is included that verifies this. One caveat is every generated PRG file, at least for now, will have a hard-coded ".org" directive emitted at the top. This is to get around the most common da65 hiccup that prevents immediate reassembly: it is common to see branch targets emitted as absolute addresses ($xxxx) rather than labels, even for addresses that fall within the range of the disassembly. I thought this only happened when a target wound up "inside" an operation, but I've even seen it with the expected label being right where you'd expect it to be. In any case, setting the origin allows these absolute addresses to resolve properly when encountered. Typically when this happens the branch is actually nonsense being interpreted as code anyway, but to support stock da65 and ca65 execution without a bunch of options or multiple passes, this was the quickest way.
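For illustration only, here is a minimal sketch of what such a round-trip check can look like in sh. This is not the mechanism disnes itself ships; the function name verify_rom and the file names in the usage line are hypothetical, and the sketch simply compares the output of POSIX cksum(1) for the two images.

```shell
#!/bin/sh
# verify_rom ORIGINAL REBUILT
# Hypothetical helper: succeeds only when both files report the same
# CRC and byte count under POSIX cksum(1).
verify_rom() {
	orig_sum=`cksum < "$1"` || return 2
	new_sum=`cksum < "$2"` || return 2
	if test "$orig_sum" = "$new_sum"
	then
		printf "%s: matches %s\n" "$2" "$1"
	else
		printf "%s: differs from %s\n" "$2" "$1" >&2
		return 1
	fi
}
```

Something like `verify_rom original.nes rebuilt.nes` could then be hooked up as the last step of a build target.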
Note there is one crucial thing missing from the package generation at present: it does not generate an extraction script for pulling assets back out of a ROM. All of the necessary data to do so is present during this process; I just forgot to add that until I had already packed it all up for the first push. That's something I'll probably add soon. Just know that while this produces a package that will build, it's not one that could be pushed up to a git repo and be useful to others without manually adding an extract script.
If you use this to seed a project, giving a shoutout would be cool. I don't care about recognition myself, but I would like to see the tooling spread if it is found useful. Also consider it WIP for the foreseeable future; big chunks may change if I get down the road with mappers and things get hairy. Also know this'll only ever be a "break it into a package" tool. I'm considering some stuff layered on top of this for things like automatic hardware register labeling, generation and application of memory labels, that sort of thing, but those would exist separately so as not to widen the scope of this script.
Happy hunting!
Scripted Disassembly/Analysis Tools
-
segaloco
- Posts: 959
- Joined: Fri Aug 25, 2023 11:56 am
Re: Scripted Disassembly/Analysis Tools
Haven't done much with the above tool, but threw together something I've probably built several versions of already and it may be a factor in further disassembly pipeline stuff.
It started as just a script to plug in a start and end offset and get bytes out, but I realized I could generalize it a bit. The da65 disassembler has a scheme by which you can use a configuration file with memory offsets to get it to emit particular data types, but it's a bit unwieldy for more incremental stuff. Like I said, I've probably written something like this a few times now, possibly even shared it here before, but this one feels a little more elegant inside.
https://gitlab.com/segaloco/misc/-/blob ... pts/dump65
After the link is dump65, a simple script to dump binary objects in the format ca65 expects for inclusion as data. That means you get .byte, .word, .addr, and .dbyt representations of whatever you pass in. There are also options to control the start and stop offset of extraction. Be forewarned that it is not foolproofed against things like picking an end offset that comes before the start; when in doubt, consult the license. Included in the same directory is the following manpage:
Code: Select all
DUMP65(1) General Commands Manual DUMP65(1)
NAME
dump65 - dump binary data as ca65-compatible text
SYNOPSIS
dump65 [ -s start ] [ -e end ] [ -t type_string ] [ file ]
DESCRIPTION
Dump65 prints an input file according to a set of coordinates and type
information. If file is omitted, the standard input is used.
The -s option indicates the offset to start dumping at. The -e option
indicates the offset to end dumping at. Each coordinate may be
hexadecimal, octal, binary, or decimal, which are prefixed with 0x, 0,
0b, and no prefix respectively. Either may be omitted, with the
default start and end being the start and end of the file.
The -t option indicates the data representation to use in the output.
The allowable values include byte, addr, word, and dbyt. When omitted,
byte is assumed. These values are the ca65(1) type directives to
prefix the output with.
SEE ALSO
ca65(1)
BUGS
Bees are cool.
DUMP65(1)
Now the trick would be to connect this with some ld65 map interpreter to allow seeking to symbols based on their position in the binary...
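To make the idea concrete, here is a rough sketch of the byte-dumping core using nothing but od(1). It is not the dump65 source: the function name bytes_as_ca65 is made up, offsets are decimal only, the end offset is assumed to be exclusive, and only the .byte representation is produced.

```shell
#!/bin/sh
# bytes_as_ca65 START END
# Reads binary data on stdin and prints the bytes in [START, END) as
# ca65 .byte directives, one per line. END is treated as exclusive,
# which is an assumption of this sketch.
bytes_as_ca65() {
	start=$1
	count=$(( $2 - $1 ))
	# -v stops od(1) from collapsing repeated lines into "*".
	od -An -v -tx1 -j "$start" -N "$count" |
	tr -s ' \t' '\n\n' |
	while read -r byte
	do
		if test -n "$byte"
		then
			printf '\t.byte\t$%s\n' "$byte"
		fi
	done
}
```

Used as a filter, that would look like `bytes_as_ca65 16 32 < some.bin`, which is roughly the shape of what the -s and -e options above describe.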
-
segaloco
Re: Scripted Disassembly/Analysis Tools
Sharing that I've done some significant work on my ddnes(1) utility, manual given below:
The prior version was pretty tightly coupled to how I'm doing my disnes(1) disassembler project, but I decided I needed the general concern of iNES ROM chunking in other places. Furthermore, the more flexible interface is to have it mash things together on stdout, only requesting the splitting behavior with implementation-specific filenames when actually needed.
Code: Select all
DDNES(1) General Commands Manual DDNES(1)
NAME
ddnes - dd-like utility for iNES ROMs
SYNOPSIS
ddnes [option=value] ...
DESCRIPTION
Ddnes, by default, copies the components of the provided iNES ROM to
the standard output. The following options may be used with ddnes:
option values
if=file input file name; standard input is default
of=file output file name; standard output is default; if this
option is a directory, the conv option split is assumed
pbs=n copy PRG in n byte chunks; default is 16384
cbs=n copy CHR in n byte chunks; default is 8192
skipp=n skip n PRG banks before starting PRG copy
countp=n copy only n PRG banks
skipc=n skip n CHR banks before starting CHR copy
countc=n copy only n CHR banks
conv=... special conversion options
Where sizes are specified, a number of banks is expected.
During execution, ddnes reports the statistics of each dd(1) operation
on the standard error.
If the conv option split is given, the PRG banks are emitted as
sequential files named with the printf(3) specification prg%.2X.bin in
the output directory and the CHR banks are emitted as sequential
directories named with the printf(3) specification chr%.2X in the
output directory then containing tiles named with the printf(3)
specification tile_%.2X.bin. In both cases, the numbering of banks
reflects their positions in the ROM image. In addition to the PRG and
CHR banks, the iNES header is emitted in the output directory as
header.bin.
If the conv option trainer is given, then the ROM is assumed to contain
a 512-byte trainer. If the conv option split is given, the trainer is
emitted in the output directory as trainer.bin.
If the conv option mmc3 is given, the bank sizes are taken as those
swapped by the Nintendo Multi-Memory Controller #3. In this case, any
pbs and cbs options are silently ignored. Otherwise, the sizes
employed by the NROM mapper are used as defaults. This behavior may
also be explicitly requested with the conv option nrom.
SEE ALSO
dd(1).
DIAGNOSTICS
f+p records in(out) numbers of full and partial records read(written)
by dd(1).
BUGS
DDNES(1)
The result is an improved version which handles a number of new cases:
- Trainers are now supported; supply conv=trainer to get this behavior.
- Splitting behavior is now behind conv=split, with the default behavior being to mash everything that isn't a header or trainer onto stdout. Currently I can't think of scenarios where I need a yes/no on these being sent down the pipe; maybe later.
- A general block sizing mechanism is given for PRG and CHR banks so that the skip and count units can be adjusted. This is in line with dd(1) itself offering block sizing options. I found this useful since different mappers express the logical block sizes differently than the iNES header, so this allows minute overrides of what units the logical banks are in while still allowing extraction from the header instead. Be forewarned, this expects you to get it right: if you supply bogus bank sizes, you are liable for the over/underrun.
- An "mmc3" conversion is given that performs the block sizing based on MMC3 banking. This was my initial use-case for retooling the block sizing behavior, then I generalized it.
- The "of" option is now really just any file. If it is a directory, the split option will be assumed and the requested data will be spit out there as a number of files. If it is a file, the split option is not assumed and instead whatever bank data is generated is spit out on that file, overwriting it if anything is already there. The default output is now the stdout for pipelines.
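As a rough illustration of the arithmetic behind those bank units, the sketch below reads the standard iNES header fields (byte 4 is the PRG size in 16384-byte units, byte 5 is the CHR size in 8192-byte units, and bit 2 of byte 6 flags a 512-byte trainer) and prints where each region starts. It is not taken from ddnes; the function name ines_layout and its output format are made up for the example.

```shell
#!/bin/sh
# ines_layout FILE
# Prints where the PRG and CHR data sit in an iNES image, using the
# header's own units (16384-byte PRG banks, 8192-byte CHR banks).
ines_layout() {
	# Header bytes 4, 5 and 6: PRG bank count, CHR bank count, flags 6.
	set -- `od -An -v -tu1 -j 4 -N 3 "$1"`
	prg_banks=$1
	chr_banks=$2
	flags6=$3
	# Bit 2 of flags 6 marks a 512-byte trainer between header and PRG.
	trainer=$(( (flags6 / 4 % 2) * 512 ))
	prg_off=$(( 16 + trainer ))
	prg_len=$(( prg_banks * 16384 ))
	chr_off=$(( prg_off + prg_len ))
	chr_len=$(( chr_banks * 8192 ))
	printf 'prg_off=%s prg_len=%s chr_off=%s chr_len=%s\n' \
		"$prg_off" "$prg_len" "$chr_off" "$chr_len"
}
```

With those numbers in hand, a plain dd(1) call such as `dd if=rom.nes of=prg.bin bs=16 skip=1 count=1024` pulls the first 16 KiB PRG bank out of an image with no trainer, since the header is exactly one 16-byte block.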
I'm leaving ddnes(1) in the larger disnes project repository for now but it's certainly gained value of its own outside of that project. What I really need to do is start considering pulling all my various NES-related development tools together and making a singular, more consistent package of it all....
Edit: As the prior lines proposed, this is now packaged with several other NES-related tools here: https://gitlab.com/segaloco/misc/-/blob ... /fc_tools/
Last edited by segaloco on Sun Oct 26, 2025 10:21 pm, edited 2 times in total.
-
segaloco
Re: Scripted Disassembly/Analysis Tools
Here's a simple one, ca65data: https://gitlab.com/segaloco/misc/-/blob ... ca65data.c
Takes things like .byte, .addr, .word, etc. and spits out a binary. Yes, ca65 can already do this, but my tool does it as a filter, meaning this can act as my interface to a number of data processing tools without having to go find the binary offsets of data to supply the slices via dd(1) anymore.
From the (quite curt) manpage:
Code: Select all
CA65DATA(1) General Commands Manual CA65DATA(1)
NAME
ca65data - ca65-compatible data assembler
SYNOPSIS
ca65data
DESCRIPTION
Ca65data filters a series of data directives, those supported by
ca65(1), into binary data.
SEE ALSO
ca65(1)
BUGS
CA65DATA(1)
This has been helpful for quickly providing data to other tools, rather than having to modify the tools to accept a text format. This means they could still be fed by other processors as well, for instance, something that takes a JSON description of OAM or audio data and transforms it into a proper BLOB.
Anywho, one such tool is trackdump from my SMB3 disassembly: https://gitlab.com/segaloco/smb3/-/blob ... rackdump.c
This tool takes a BLOB of length-run Nintendo music tracker format data and generates the necessary .byte directives using my header defines to describe the music track. In other words, the output becomes something like:
Code: Select all
.byte apu_patch_vals::patch_01|4
.byte sound_notes::Fx4
.byte sound_notes::G4
.byte sound_notes::G4
.byte sound_notes::Fx4
.byte sound_notes::G4
.byte sound_notes::G4
.byte sound_notes::Fx4
.byte sound_notes::G4
.byte sound_notes::A4
.byte sound_notes::Gx4
.byte sound_notes::A4
.byte apu_patch_vals::patch_01|8
.byte sound_notes::Ax4
.byte APU_TRACK_NOP
.byte sound_notes::B4
.byte apu_patch_vals::patch_01|4
.byte sound_notes::A4
.byte sound_notes::G4
.byte sound_notes::F4
.byte APU_TRACK_END
Well, tie these two tools together in a pipeline of:
Code: Select all
ca65data | trackdump
Then I can literally hand type .byte directives at it and it spits out the corresponding symbolic representation. Of course, like any stdio filter, I can simply then hook up data to process via < and >, or I can use the tool interactively on a terminal.
This speaks to a general benefit. I use fflush(3) in both tools to ensure the data is sent down the pipe ASAP, meaning for instance if I have the results of a pipeline piped into a viewer that refreshes upon receiving new data, I can have a terminal input surface that renders its results in, well, whatever format the output is. In practice I've built tools to do this via text with both decoding the sound format and the string format in SMB3. This means I can simply copy series of .byte directives out, redirect to some file in the filesystem, and have that file open in a file editor that refreshes. I drop data to process into the terminal, and the result pops up in my editor. If my editor happens to be something like vi(1) where I am then calling the script on a selection of data, the ! shell escape feature then allows me to spot-replace byte directives without even leaving the editor.
In other words, by having a tool like ca65data, the text format of data directives from ca65 now becomes the input to an endless array of data analysis and processing tools from any environment capable of requesting a service from the shell. One of my only big gripes with ca65 is that it does not have a stdio filter mode; you can't simply use ca65 as an assembly step for processing arbitrary assembler text into binary form. I get it though, there are things like dealing with labels, but even just an option to emit with zeroed out immediates/operands when operating in filter mode would be so helpful. That or even just buffering until EOF and letting one hook up stdin/stdout. It's not super efficient but it does allow one to use it in pipelines. No shade though, I wouldn't be using ca65 if it wasn't the best assembler for my workflow. This is just meant to fit a very focused need, and it meets it.
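To show the filter idea in miniature, here is a toy sh version of the concept. It is emphatically not ca65data (which is written in C and handles the other directives): the function name byte_filter is hypothetical, and the sketch only understands .byte lines whose operands are hexadecimal literals written with a leading $.

```shell
#!/bin/sh
# byte_filter
# Reads ".byte $xx, $yy, ..." lines on stdin and writes the raw bytes
# to stdout. Lines with any other directive are skipped.
byte_filter() {
	while read -r directive operands
	do
		test "$directive" = ".byte" || continue
		# Split the operand list on commas and blanks.
		for value in `printf '%s\n' "$operands" | tr ',' ' '`
		do
			# Drop the leading "$" (assumed present), then turn the hex
			# digits into an octal escape that printf emits as one byte.
			hex=${value#?}
			code=$(( 0x$hex ))
			printf "\\$(printf '%03o' "$code")"
		done
	done
}
```

Because it reads stdin and writes stdout, it slots into a pipeline the same way: `printf '.byte $a9, $00\n' | byte_filter | od -An -tx1` round-trips the two bytes back to hex.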
-
segaloco
Re: Scripted Disassembly/Analysis Tools
This one is a prototype, the final version of this tool will support multiple bases/data representations (anything POSIX od(1) can spit out), optionally searching for overlapping instances of the same pattern, and case-insensitivity, but it's a start. I'm calling this preliminary version "hexgrep":
The script below accepts a data stream on stdin and searches it for a series of hexadecimal byte values provided as an argument (lowercase and space-separated, the way od(1) prints them). The offset of each (non-overlapping) instance of that string of bytes is printed in hexadecimal on the standard output.
Code: Select all
#!/bin/sh
#
# hexgrep - grep(1) for hexadecimal strings in binary files
#
# Copyright 2025 Matthew Gilmore
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its contributors
# may be used to endorse or promote products derived from this software without
# specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS AS IS AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
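BIN=`basename "$0"`
# Flatten stdin into one line of space-separated hex bytes ("4e 45 53 1a ...").
# The -v flag keeps od(1) from collapsing repeated lines into "*".
DATA=`od -An -v -txC | tr '\n\t' '  ' | sed -e 's/[ ]\{2,\}/ /g' -e 's/^ //' -e 's/ $//'`
PATTERN="$1"
if test -z "$DATA"
then
	printf "%s: unexpected EOF\n" "$BIN" >&2
	exit 1
fi
if test -z "$PATTERN"
then
	printf "%s: empty pattern\n" "$BIN" >&2
	exit 1
fi
DATA_WRK="$DATA"
while test -n "$DATA_WRK"
do
	# Drop everything up to and including the next match.
	TAIL=${DATA_WRK#*$PATTERN}
	# An unchanged string means no match remains, so stop.
	if test "$TAIL" = "$DATA_WRK"
	then
		break
	fi
	# Each byte occupies three characters ("xx "), hence the division.
	POS=`expr \( ${#DATA} - \( ${#PATTERN} + ${#TAIL} \) \) / 3`
	printf "0x%X\n" "$POS"
	DATA_WRK="$TAIL"
done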
If no pattern and/or payload are supplied, no processing is done. Otherwise the string length and substring search features of sh(1) are used to determine the position of an instance. If the position is valid and was pulled from a meaningful string (i.e. the cursor has not reached the end) that value is printed. Otherwise, it assumes the data has been consumed and/or no match could be found and exits.
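The position arithmetic is easier to see in isolation. The sketch below handles only the first match; the function name first_offset is hypothetical, and the explicit check for an unchanged string is an addition of this sketch rather than something lifted from the script.

```shell
#!/bin/sh
# first_offset HEXSTRING PATTERN
# HEXSTRING is a space-separated byte dump such as "4e 45 53 1a".
# Prints the byte offset of the first occurrence of PATTERN, if any.
first_offset() {
	data=$1
	pattern=$2
	# Strip the shortest prefix that ends with the pattern.
	rest=${data#*$pattern}
	# An unchanged string means the pattern never matched.
	test "$rest" != "$data" || return 1
	# Characters before the match, at three per byte ("xx ").
	chars=$(( ${#data} - ${#pattern} - ${#rest} ))
	printf '0x%X\n' $(( chars / 3 ))
}
```

For the dump "4e 45 53 1a" and the pattern "45 53", the leftover is " 1a" (3 characters), so 11 - 5 - 3 = 3 characters precede the match, which is byte offset 1.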
As mentioned, this is quick and dirty. I'm going to put together an enhanced version that supports more options, but this meets my immediate case, which is finding the offset of specific subroutines in disparate iNES ROMs. This should speed up things like identifying code reuse across different titles, at least for subroutines with very recognizable, consistent series of bytes in them. This replaces one more concern of a monolithic hex editor. Good for me too, because I haven't touched a graphical hex editor in years and frankly don't intend to ever again.
Edit: Legal CMA: the name hexgrep is tentative, subject to change, and does not imply affiliation with other projects or products called hexgrep. Any trademarks are retained by their respective owners; this tool will certainly be named something different in its fully realized state. I'll do better research on names, because this one is taken in a few places.