It doesn't seem to me that there's any way to do it that wouldn't take longer than just using long addresses.
Note that long addresses only add one cycle over a 16-bit address, so they're not as costly as you might expect. You would need to set the bank to $00 and restore it in less than 9 cycles to make your copy routine faster than what you've already got.
EDIT: Just noticed that you're doing txa; sta ^DMAMODE because stx and sty don't have a long addressing mode, so it's 9 cycles rather than 5 cycles. I'm not sure if it can be done even in 9 cycles though.
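For reference, here is a rough sketch of what switching the data bank to $00 and back costs. It assumes native mode with an 8-bit accumulator, and it is just one straightforward sequence, not necessarily the shortest possible:

```asm
; Sketch only: assumes native mode, 8-bit accumulator (M=1).
; Clobbers A, so a real routine would have to account for that too.
    phb             ; 3 cycles: save the current data bank
    lda #$00        ; 2 cycles
    pha             ; 3 cycles
    plb             ; 4 cycles: data bank is now $00
    ; ... stx/sty/sta with plain 16-bit addresses would go here ...
    plb             ; 4 cycles: restore the saved data bank
                    ; overhead: 3 + 2 + 3 + 4 + 4 = 16 cycles
```

The two plb instructions alone already take 8 cycles, so fitting the whole switch into a 9-cycle budget looks out of reach with this approach.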
Last edited by Nicole on Mon Mar 19, 2018 4:05 pm, edited 1 time in total.
I've seen it recommended to just keep the data bank in the LoROM areas ($00-$3F, $80-$BF), and put any small data tables you need direct code access to in those areas (for 4 MB or less, I guess that means putting such stuff in the top halves of banks). The program bank can be in a HiROM region so you get 64 kB of uninterrupted code, and bulk data can be put anywhere because specifying a bank for DMA is not very onerous.
You might also want to move the direct page to point to a set of registers if you need to do a bunch of accesses (since direct page is always in bank 0), but make sure it's a net win cycle-wise first.
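As a rough illustration of that trade-off, here is a sketch that points the direct page at the DMA channel 0 registers. It assumes native mode with 8-bit A, X and Y; the offsets are the usual channel 0 register addresses relative to $4300, and the choice of which registers to write is only for illustration:

```asm
; Sketch only: assumes native mode, 8-bit A/X/Y.
    phd             ; 4 cycles: save the current direct page
    pea $4300       ; 5 cycles
    pld             ; 5 cycles: D = $4300, low byte is zero so no extra penalty
    stx $00         ; 3 cycles: writes $00:4300 (txa + sta long would be 2 + 5)
    sty $01         ; 3 cycles: writes $00:4301
    sta $04         ; 3 cycles: writes $00:4304 (sta long would be 5)
    pld             ; 5 cycles: restore the saved direct page
                    ; setup/restore overhead: 4 + 5 + 5 + 5 = 19 cycles
```

Under these assumptions each stx or sty saves 4 cycles over txa plus a long store, and each sta saves 2 over a long store, so it takes about five stx/sty writes just to pay back the 19 cycles of overhead.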
When direct page is pointed at PPU ports ($2100) or memory controller ports ($4200), you can't easily use it for (dd),Y addressing. This means you need to fall back to (dd,S),Y, which has a 1-cycle penalty like [dd],Y.