Scripted Disassembly/Analysis Tools

segaloco
Posts: 959
Joined: Fri Aug 25, 2023 11:56 am

Scripted Disassembly/Analysis Tools

Post by segaloco »

Just wanted to share that I've formalized my project to write some scripts for automating aspects of setting up disassembly projects.

https://gitlab.com/segaloco/disnes

Behind the link is "disnes", a script that takes an iNES ROM and attempts to stand up a make(1)-based build package with PRG banks disassembled and CHR broken up into individual tiles. It is built on top of da65 for the disassembly and creates a package targeting ca65 and ld65. I stuck these bits in variables in the hope that porting this to a different toolset is possible, but I haven't flexed any of those abstractions yet.

Anywho, documentation is included in the form of manpages as usual and licensing information is included as well. As of present this only supports NROM, as that was the simplest to build up the basics around, but I intend to add more mappers as I go.

The resulting package should build back into an identical ROM. A checksum mechanism is included that verifies this. One caveat is every generated PRG file, at least for now, will have a hard-coded ".org" directive emitted at the top. This is to get around the most common da65 hiccup that prevents immediate reassembly: it is common to see branch targets emitted as absolute addresses ($xxxx) rather than labels, even for addresses that fall within the range of the disassembly. I thought this only happened when a target wound up "inside" an operation, but I've even seen it with the expected label being right where you'd expect it to be. In any case, setting the origin allows these absolute addresses to resolve properly when encountered. Typically when this happens the branch is actually nonsense being interpreted as code anyway, but to support stock da65 and ca65 execution without a bunch of options or multiple passes, this was the quickest way.
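For anyone curious what an identical-ROM check can look like, here is a minimal sketch. It is not the mechanism disnes actually ships; the function name same_rom and the use of cksum(1) are assumptions made purely for illustration.

```shell
#!/bin/sh
# same_rom ORIGINAL REBUILT
# Hypothetical helper: succeeds when both files have the same CRC and size
# according to POSIX cksum(1). Not taken from the disnes sources.
same_rom() {
    # Reading via stdin keeps the file name out of the cksum output,
    # so only "checksum size" is compared.
    sum_a=$(cksum < "$1")
    sum_b=$(cksum < "$2")
    [ "$sum_a" = "$sum_b" ]
}
```

A make(1) rule could then run something like `same_rom original.nes rebuilt.nes || echo "rebuild differs" >&2` after linking, where both file names are placeholders.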

Note there is one crucial thing missing from the package generation at present: it does not generate an extraction script for pulling assets back out of a ROM. All of the necessary data to do so is present during this process, I just forgot to add it until I had already packed everything up for the first push. That's something I'll probably add soon. Just know that while this produces a package that will build, it's not yet one that could be pushed up to a git repo and be useful to others without manually adding an extract script.
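Until that lands, a hand-written extract script for the NROM case does not need much. The sketch below is an assumption about what such a script could look like, not what disnes will generate: the function name extract_nrom and the output names prg.bin and chr.bin are made up, and it assumes a plain iNES file with a 16-byte header and no trainer.

```shell
#!/bin/sh
# extract_nrom ROM OUTDIR
# Hypothetical extract step for a trainer-less iNES ROM.
# Header byte 4 is the PRG size in 16 KiB units, byte 5 the CHR size in
# 8 KiB units; the data follows the 16-byte header in that order.
extract_nrom() {
    rom=$1
    outdir=$2
    # Read header bytes 4 and 5 as unsigned decimal numbers.
    set -- $(od -An -v -tu1 -j 4 -N 2 "$rom")
    prg_banks=$1
    chr_banks=$2
    # Work in 16-byte blocks so skip=1 steps over exactly the header.
    # 16384 / 16 = 1024 blocks per PRG bank, 8192 / 16 = 512 per CHR bank.
    dd if="$rom" of="$outdir/prg.bin" bs=16 skip=1 \
        count=$(( prg_banks * 1024 )) 2>/dev/null
    dd if="$rom" of="$outdir/chr.bin" bs=16 \
        skip=$(( 1 + prg_banks * 1024 )) \
        count=$(( chr_banks * 512 )) 2>/dev/null
}
```

Splitting CHR further into individual 16-byte tiles would be one more loop over chr.bin; that part is left out here.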

If you use this to seed a project, giving a shoutout would be cool. I don't care about recognition myself but I would like to see the tooling spread if it is found useful. Also consider it WIP for the foreseeable future; big chunks may change if I get down the road with mappers and things get hairy. Also know this'll only ever be a "break it into a package" tool. I'm considering some stuff layered on top of this for things like automatic hardware register labeling, generation and application of memory labels, that sort of thing, but those would exist separately so as not to widen the scope of this script.

Happy hunting!

Re: Scripted Disassembly/Analysis Tools

Post by segaloco »

Haven't done much with the above tool, but threw together something I've probably built several versions of already and it may be a factor in further disassembly pipeline stuff.

https://gitlab.com/segaloco/misc/-/blob ... pts/dump65

Behind the link is dump65, a simple script to dump binary objects in the format ca65 expects for inclusion as data. That means you get .byte, .word, .addr, and .dbyt representations of whatever you pass in. There are also options to control the start and stop offset of extraction. Be forewarned that it is not foolproofed against things like picking an end offset that comes before the start offset. When in doubt, consult the license. Included in the same directory is the following manpage:

Code: Select all

DUMP65(1)                   General Commands Manual                  DUMP65(1)

NAME
       dump65 - dump binary data as ca65-compatible text

SYNOPSIS
       dump65 [ -s start ] [ -e end ] [ -t type_string ] [ file ]

DESCRIPTION
       Dump65 prints an input file according to a set of coordinates and type
       information.  If file is omitted, the standard input is used.

       The -s option indicates the offset to start dumping at.  The -e option
       indicates the offset to end dumping at.  Each coordinate may be
       hexadecimal, octal, binary, or decimal, which are prefixed with 0x, 0,
       0b, and no prefix respectively.  Either may be omitted, with the
       default start and end being the start and end of the file.

       The -t option indicates the data representation to use in the output.
       The allowable values include byte, addr, word, and dbyt.  When omitted,
       byte is assumed.  These values are the ca65(1) type directives to
       prefix the output with.

SEE ALSO
       ca65(1)

BUGS
       Bees are cool.

                                                                     DUMP65(1)
Started as just writing a script to plug in a start and end offset and get bytes out, but realized I could generalize it a bit. The da65 disassembler has some scheme by which you can use a configuration file with memory offsets to get it to emit particular datatypes, but it's a bit unwieldy for more incremental stuff. Like I said, I've probably written something like this a few times now, possibly even shared it here before, but this one feels a little more elegant inside.
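To give a feel for the core idea without opening the repo, here is a rough sketch of the byte case only. It is not the dump65 source: the function name dump_bytes and the argument order are assumptions, it handles only the .byte type, and it leans on shell arithmetic, so offsets may be decimal or 0x-prefixed hex but not the 0b form the real tool documents.

```shell
#!/bin/sh
# dump_bytes FILE START END
# Hypothetical stand-in for the .byte case: print the bytes in
# [START, END) as ca65 .byte directives, 16 values per line.
dump_bytes() {
    file=$1
    start=$(( $2 ))
    end=$(( $3 ))
    # od(1) prints lowercase hex bytes; -v keeps repeated lines from
    # being collapsed into "*".
    od -An -v -tx1 -j "$start" -N "$(( end - start ))" "$file" |
    awk 'NF {
        line = "\t.byte\t$" $1
        for (i = 2; i <= NF; i++)
            line = line ", $" $i
        print line
    }'
}
```

For example, `dump_bytes rom.nes 0 0x10` would print the 16-byte iNES header of a file named rom.nes as a single .byte line.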

Now the trick would be to connect this with some ld65 map interpreter to allow seeking to symbols based on their position in the binary...

Re: Scripted Disassembly/Analysis Tools

Post by segaloco »

Sharing that I've done some significant work on my ddnes(1) utility, manual given below:

Code: Select all

DDNES(1)                    General Commands Manual                   DDNES(1)

NAME
       ddnes - dd-like utility for iNES ROMs

SYNOPSIS
       ddnes [option=value] ...

DESCRIPTION
       Ddnes, by default, copies the components of the provided iNES ROM to
       the standard output.  The following options may be used with ddnes:

       option          values

       if=file         input file name; standard input is default

       of=file         output file name; standard output is default; if this
                       option is a directory, the conv option split is assumed

       pbs=n           copy PRG in n byte chunks; default is 16384

       cbs=n           copy CHR in n byte chunks; default is 8192

       skipp=n         skip n PRG banks before starting PRG copy

       countp=n        copy only n PRG banks

       skipc=n         skip n CHR banks before starting CHR copy

       countc=n        copy only n CHR banks

       conv=...        special conversion options

       Where sizes are specified, a number of banks is expected.

       During execution, ddnes reports the statistics of each dd(1) operation
       on the standard error.

       If the conv option split is given, the PRG banks are emitted as
       sequential files named with the printf(3) specification prg%.2X.bin in
       the output directory and the CHR banks are emitted as sequential
       directories named with the printf(3) specification chr%.2X in the
       output directory then containing tiles named with the printf(3)
       specification tile_%.2X.bin.  In both cases, the numbering of banks
       reflects their positions in the ROM image.  In addition to the PRG and
       CHR banks, the iNES header is emitted in the output directory as
       header.bin.

       If the conv option trainer is given, then the ROM is assumed to contain
       a 512-byte trainer.  If the conv option split is given, the trainer is
       emitted in the output directory as trainer.bin.

       If the conv option mmc3 is given, the bank sizes are taken as those
       swapped by the Nintendo Multi-Memory Controller #3.  In this case, any
       pbs and cbs options are silently ignored.  Otherwise, the sizes
       employed by the NROM mapper are used as defaults.  This behavior may
       also be explicitly requested with the conv option nrom.

SEE ALSO
       dd(1).

DIAGNOSTICS
       f+p records in(out) numbers of full and partial records read(written)
       by dd(1).

BUGS

                                                                      DDNES(1)
The prior version was pretty tightly coupled to how I'm doing my disnes(1) disassembler project, but I decided I needed the general concern of iNES ROM chunking in other places. Furthermore, the more flexible interface is to have it mash things together on stdout by default, and to request the splitting behavior with implementation-specific filenames only when actually needed.

The result is an improved version which handles a number of new cases:
  • Trainers are now supported, supply conv=trainer to get this behavior
  • Splitting behavior is now behind conv=split, with default behavior being mash everything that isn't a header or trainer into the stdout. Currently I can't think of scenarios where I need to yes/no on these being sent down the pipe, maybe later.
  • A general block sizing mechanism is given for PRG and CHR banks so that the skip and count units can be adjusted. This is in line with dd(1) itself offering block sizing options. I found this useful since different mappers express the logical block sizes differently than the iNES header, so this allows minute overrides of what units the logical banks are in while still allowing extraction from the header instead. Be forewarned, this expects you to get it right: if you supply bogus bank sizes, you are liable for the over/underrun.
  • An "mmc3" conversion is given that performs the block sizing based on MMC3 banking. This was my initial use-case for retooling the block sizing behavior, then I generalized it.
  • The "of" option is now really just any file. If it is a directory, the split option will be assumed and the requested data will be spit out there as a number of files. If it is a file, the split option is not assumed and instead whatever bank data is generated is spit out on that file, overwriting it if anything is already there. The default output is now the stdout for pipelines.
This solves a need in one of my other projects to have a simple MMC3-bank-based lookup of CHR groups in a ROM, which is going to be part of a tile "hot-refresh" display mechanism I'm working on by which you simply supply the IDs of a series of CHR banks to monitor in a ROM as well as the palette to apply to the monitor view, and then that can be rerun and present an updated visual map on a recompile of the ROM. This will allow entering the data in the values the MMC3 mapper speaks in. In turn, this is the tile view "control" of a graphics toolkit.
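For reference, the arithmetic that the header-derived defaults and the trainer handling boil down to can be sketched as below. This is not lifted from ddnes: the function name ines_layout and its output format are assumptions, and it only uses the well-known iNES facts that byte 4 counts 16 KiB PRG units, byte 5 counts 8 KiB CHR units, and bit 2 of byte 6 flags a 512-byte trainer.

```shell
#!/bin/sh
# ines_layout ROM
# Hypothetical helper: print where PRG and CHR data start and how long
# they are, based only on the iNES header.
ines_layout() {
    rom=$1
    # Header bytes 4 (PRG count), 5 (CHR count), and 6 (flags) in decimal.
    set -- $(od -An -v -tu1 -j 4 -N 3 "$rom")
    prg_banks=$1
    chr_banks=$2
    flags6=$3
    trainer=0
    # Bit 2 of flags 6 means a 512-byte trainer sits before the PRG data.
    if [ $(( flags6 & 4 )) -ne 0 ]; then
        trainer=512
    fi
    prg_off=$(( 16 + trainer ))
    prg_len=$(( prg_banks * 16384 ))
    chr_off=$(( prg_off + prg_len ))
    chr_len=$(( chr_banks * 8192 ))
    printf 'prg_offset=%d prg_size=%d chr_offset=%d chr_size=%d\n' \
        "$prg_off" "$prg_len" "$chr_off" "$chr_len"
}
```

Dividing prg_size or chr_size by whatever bank size a given mapper swaps in gives the number of logical banks, which is the unit the skip and count options then work in.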

I'm leaving ddnes(1) in the larger disnes project repository for now but it's certainly gained value of its own outside of that project. What I really need to do is start considering pulling all my various NES-related development tools together and making a singular, more consistent package of it all....

Edit: As the prior lines proposed, this is now packaged with several other NES-related tools here: https://gitlab.com/segaloco/misc/-/blob ... /fc_tools/
Last edited by segaloco on Sun Oct 26, 2025 10:21 pm, edited 2 times in total.

Re: Scripted Disassembly/Analysis Tools

Post by segaloco »

Here's a simple one, ca65data: https://gitlab.com/segaloco/misc/-/blob ... ca65data.c

Takes things like .byte, .addr, .word, etc. and spits out a binary. Yes, ca65 can already do this, but my tool does it as a filter, meaning this can act as my interface to a number of data processing tools without having to go find the binary offsets of data to supply the slices via dd(1) anymore.

From the (quite curt) manpage:

Code: Select all

CA65DATA(1)                 General Commands Manual                CA65DATA(1)

NAME
       ca65data - ca65-compatible data assembler

SYNOPSIS
       ca65data

DESCRIPTION
       Ca65data filters a series of data directives, those supported by
       ca65(1), into binary data.

SEE ALSO
       ca65(1)

BUGS

                                                                   CA65DATA(1)
This has been helpful for quickly providing data to other tools, rather than having to modify the tools to accept a text format. This means they could still be fed by other processors as well, for instance, something that takes a JSON description of OAM or audio data and transforms it into a proper BLOB.
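The real tool is written in C, but the filter idea itself fits in a few lines of shell. The sketch below is only an illustration under heavy assumptions: the function name bytes_to_bin is made up, and it understands nothing beyond .byte lines whose operands are $-prefixed hex values separated by commas.

```shell
#!/bin/sh
# bytes_to_bin
# Hypothetical miniature of the filter concept: read ".byte $xx, $yy"
# lines on stdin and write the raw bytes on stdout. Anything that is
# not a .byte directive is skipped.
bytes_to_bin() {
    while read -r directive operands
    do
        [ "$directive" = ".byte" ] || continue
        # Turn the comma-separated operand list into separate words.
        for value in $(printf '%s\n' "$operands" | tr ',' ' ')
        do
            hex=${value#\$}
            # Convert the hex value to a three-digit octal escape and
            # let printf(1) emit the corresponding byte.
            printf "\\$(printf '%03o' "0x$hex")"
        done
    done
}
```

Piping a couple of hand-typed directives through it and into od(1) is an easy way to see the round trip.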

Anywho, one such tool is trackdump from my SMB3 disassembly: https://gitlab.com/segaloco/smb3/-/blob ... rackdump.c

This tool takes a BLOB of length-run Nintendo music tracker format data and generates the necessary .byte directives using my header defines to describe the music track. In other words, the output becomes something like:

Code: Select all

	.byte	apu_patch_vals::patch_01|4
		.byte	sound_notes::Fx4
		.byte	sound_notes::G4
		.byte	sound_notes::G4
		.byte	sound_notes::Fx4
		.byte	sound_notes::G4
		.byte	sound_notes::G4
		.byte	sound_notes::Fx4
		.byte	sound_notes::G4
		.byte	sound_notes::A4
		.byte	sound_notes::Gx4
		.byte	sound_notes::A4

	.byte	apu_patch_vals::patch_01|8
		.byte	sound_notes::Ax4
		.byte	APU_TRACK_NOP
		.byte	sound_notes::B4

	.byte	apu_patch_vals::patch_01|4
		.byte	sound_notes::A4
		.byte	sound_notes::G4
		.byte	sound_notes::F4

	.byte	APU_TRACK_END
Well, tying these two tools together in a pipeline of

Code: Select all

ca65data | trackdump
Then I can literally hand type .byte directives at it and it spits out the corresponding symbolic representation. Of course, like any stdio filter, I can simply then hook up data to process via < and >, or I can use the tool interactively on a terminal.

This speaks to a general benefit. I use fflush(3) in both tools to ensure the data is sent down the pipe ASAP, meaning for instance if I have the results of a pipeline piped into a viewer that refreshes upon receiving new data, I can have a terminal input surface that renders its results in, well, whatever format the output is. In practice I've built tools to do this via text with both decoding the sound format and the string format in SMB3. This means I can simply copy series of .byte directives out, redirect to some file in the filesystem, and have that file open in a file editor that refreshes. I drop data to process into the terminal, and the result pops up in my editor. If my editor happens to be something like vi(1) where I am then calling the script on a selection of data, the ! shell escape feature then allows me to spot-replace byte directives without even leaving the editor.

In other words, by having a tool like ca65data, the text format of data directives from ca65 now becomes the input to an infinite array of data analysis and processing tools from any environment capable of requesting a service from the shell. One of my only big gripes with ca65 is that it does not have a stdio filter mode, so you can't simply use ca65 as an assembly step for processing arbitrary assembler text into binary form. I get it though, there are things like dealing with labels, but even just an option to emit zeroed-out immediates/operands when operating in filter mode would be so helpful. That, or even just buffering until EOF and letting one hook up stdin/stdout. It's not super efficient but it would allow one to use it in pipelines. No shade though, I wouldn't be using ca65 if it wasn't the best assembler for my workflow. ca65data is just meant to fit a very focused need, and it meets it.

Re: Scripted Disassembly/Analysis Tools

Post by segaloco »

This one is a prototype, the final version of this tool will support multiple bases/data representations (anything POSIX od(1) can spit out), optionally searching for overlapping instances of the same pattern, and case-insensitivity, but it's a start. I'm calling this preliminary version "hexgrep":

Code: Select all

#!/bin/sh
#
# hexgrep - grep(1) for hexadecimal strings in binary files
#
# Copyright 2025 Matthew Gilmore
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
# 
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its contributors
# may be used to endorse or promote products derived from this software without
# specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS AS IS AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

BIN=`basename $0`
DATA=`od -An -txC | tr '\n\t' ' ' | sed -e 's/[ ]\{2,\}/ /g' -e 's/^ //g' -e 's/ $/\n/g'`
PATTERN="$1"

if test -z "$DATA"
then
    printf "%s: unexpected EOF\n" "$BIN" >&2
    exit 1
fi

if test -z "$PATTERN"
then
    printf "%s: empty pattern\n" "$BIN" >&2
    exit 1
fi

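DATA_WRK="$DATA"
while test -n "$DATA_WRK"
do
    # Strip everything up to and including the next match.
    TAIL=${DATA_WRK#*"$PATTERN"}

    # No match left: the expansion hands back its input unchanged.
    if test "$TAIL" = "$DATA_WRK"
    then
        break
    fi

    # Each byte takes three characters ("xx ") in the od(1) text.
    POS=`expr \( ${#DATA} - \( ${#PATTERN} + ${#TAIL} \) \) / 3`
    printf "0x%X\n" "$POS"

    DATA_WRK="$TAIL"
done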
The above script accepts a data stream on stdin and searches it for a series of hexadecimal byte values provided as an argument. The offset of each (non-overlapping) instance of that string of bytes is printed in hexadecimal on the standard output. Since the match runs against od(1) output, the pattern needs to be lowercase hex bytes separated by single spaces, e.g. "a9 00 8d".

If no pattern and/or payload are supplied, no processing is done. Otherwise the string length and substring search features of sh(1) are used to determine the position of an instance. If the position is valid and was pulled from a meaningful string (i.e. the cursor has not reached the end) that value is printed. Otherwise, it assumes the data has been consumed and/or no match could be found and exits.
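The offset arithmetic is easier to see on a tiny string. The sketch below isolates it in a hypothetical function called first_offset (not part of the script above) that takes the od-style text and the pattern as arguments and reports only the first match.

```shell
#!/bin/sh
# first_offset DATA PATTERN
# Hypothetical helper: DATA is od-style text ("4e 45 53 1a"), PATTERN
# is in the same form. Prints the byte offset of the first match in
# decimal, or fails if there is none.
first_offset() {
    data=$1
    pattern=$2
    # Remove the shortest prefix ending in the pattern.
    tail=${data#*"$pattern"}
    # Unchanged text means the pattern never matched.
    [ "$tail" != "$data" ] || return 1
    # Characters before the match, divided by three per "xx " byte.
    echo $(( (${#data} - ${#pattern} - ${#tail}) / 3 ))
}
```

With data "4e 45 53 1a" (11 characters) and pattern "53 1a" (5 characters), the tail is empty, so the result is (11 - 5 - 0) / 3 = 2, the byte offset of 0x53.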

As mentioned, this is a quick and dirty version. I'm going to put together an enhanced one that supports more options, but this meets my immediate use case, which is finding the offset of specific subroutines in disparate iNES ROMs. This should speed up things like identifying code reuse across different titles, at least for subroutines with very recognizable, consistent series of bytes in them. This replaces one more concern of a monolithic hex editor. Good for me too because I haven't touched a graphical hex editor in years and frankly don't intend to ever again.
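Until case-insensitivity exists in the tool itself, the pattern can be cleaned up before it is handed over. This is just a sketch of one way to do that; the function name normalize_hex is an assumption and nothing like it is in the script above.

```shell
#!/bin/sh
# normalize_hex PATTERN
# Hypothetical pre-processing step: lowercase the hex digits and
# collapse whitespace so the pattern matches od(1)-style text.
normalize_hex() {
    printf '%s\n' "$1" |
    tr 'ABCDEF' 'abcdef' |
    tr '\t' ' ' |
    tr -s ' ' |
    sed -e 's/^ //' -e 's/ $//'
}
```

Something like `hexgrep "$(normalize_hex 'A9 00  8D')" < rom.nes` would then search for "a9 00 8d", with rom.nes standing in for whatever ROM is being examined.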

Edit: Legal cma, the name hexgrep is a tentative name subject to change and does not imply affiliation with other projects or products called hexgrep. Any trademarks are retained by their respective owners, this tool will certainly be named something different in its fully realized state. I'll do better research on names, because this one is taken in a few places.