Since I now know how many bits of precision I need for the intermediate values, I went back and rewrote it in assembly with unrolled multiplications: two 8x4-bit and four 16x3-bit. The worst case is ~1000 cycles per collision, and that only happens a small portion of the time, so I think I now have time to process more collisions per frame than I have sprites to draw.
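For anyone curious what "unrolled" means here, this is a rough C sketch of the idea, not the actual assembly. Because the multiplier is known to be only 3 or 4 bits wide, the usual shift-and-add loop can be written out as a fixed number of conditional adds with no loop counter. The function names `mul_8x4` and `mul_16x3` are made up for illustration, and I'm assuming unsigned operands.

```c
#include <stdint.h>

/* Hypothetical helper: 8-bit value times a 4-bit multiplier.
 * Unrolled shift-and-add: one conditional add per multiplier bit.
 * Only the low 4 bits of b4 are used. Result fits in 12 bits. */
static uint16_t mul_8x4(uint8_t a, uint8_t b4)
{
    uint16_t acc = 0;
    uint16_t shifted = a;            /* a << 0 */

    if (b4 & 0x01) acc += shifted;
    shifted <<= 1;                   /* a << 1 */
    if (b4 & 0x02) acc += shifted;
    shifted <<= 1;                   /* a << 2 */
    if (b4 & 0x04) acc += shifted;
    shifted <<= 1;                   /* a << 3 */
    if (b4 & 0x08) acc += shifted;

    return acc;
}

/* Hypothetical helper: 16-bit value times a 3-bit multiplier.
 * Same technique with three steps. Result fits in 19 bits. */
static uint32_t mul_16x3(uint16_t a, uint8_t b3)
{
    uint32_t acc = 0;
    uint32_t shifted = a;            /* a << 0 */

    if (b3 & 0x01) acc += shifted;
    shifted <<= 1;                   /* a << 1 */
    if (b3 & 0x02) acc += shifted;
    shifted <<= 1;                   /* a << 2 */
    if (b3 & 0x04) acc += shifted;

    return acc;
}
```

On a 6502 each step would turn into a shift, a branch, and an add with carry, so the cost scales with the number of multiplier bits. That is why knowing the required precision up front helps so much.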
I threw a quick demo up on my webpage of a circle moving around a tilemap. My plan is to make something with a two-wheeled vehicle or maybe a one-wheeled robot. It should hopefully get more interesting once I re-implement circle-to-line collisions so I can have slopes and such.
http://files.slembcke.net/temp/nes-embe ... ysics.html