Re: Need some tips for a demo
Posted: Sun Feb 24, 2019 5:28 pm
by tepples
vivi168 wrote:The problem is, I should also increment the source address by 32, to skip the current row remaining tiles. Is there a way to do this during DMA?
Not directly from ROM to the PPU, no. As I explained above, you have to bounce it off WRAM.
- At any time, copy the tilemap entries from ROM into a buffer in WRAM. For the ROM data format that you have described, you'll need to increment by 64 bytes (32 tilemap entries) when reading and 2 bytes when writing.
- During vblank, DMA that buffer to the PPU.
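The gather step can be modelled off-target. This is only a rough Python sketch (not 65816 code), assuming the ROM tilemap is row-major with 32 two-byte entries per row as described above; `gather_column`, `ROW_BYTES` and the fake tilemap are hypothetical names made up for illustration.

```python
ROW_BYTES = 64     # 32 tilemap entries * 2 bytes: source stride when reading
ENTRY_BYTES = 2    # one tilemap entry: destination stride when writing
ROWS = 32

def gather_column(rom_map: bytes, column: int) -> bytes:
    """Copy one column of a row-major 32x32 tilemap into a linear buffer."""
    buf = bytearray(ROWS * ENTRY_BYTES)
    src = column * ENTRY_BYTES
    for row in range(ROWS):
        dst = row * ENTRY_BYTES
        buf[dst:dst + ENTRY_BYTES] = rom_map[src:src + ENTRY_BYTES]
        src += ROW_BYTES          # skip the rest of the current row
    return bytes(buf)

# Fake tilemap for testing: entry (row, col) holds row*32 + col, little-endian.
rom_map = b"".join(
    (r * 32 + c).to_bytes(2, "little") for r in range(32) for c in range(32)
)
column_buf = gather_column(rom_map, 5)
```

The resulting 64-byte linear buffer stands in for the WRAM buffer that would then be sent to the PPU in a single DMA during vblank.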
Re: Need some tips for a demo
Posted: Sun Feb 24, 2019 9:54 pm
by Oziphantom
An added complication is that the 2x2 is stored as 4 separate screens and not 1 large screen; this means you have to do some "fancy" maths to work out the address.
Base = (X & 40) << 8 + X & 3F
This means you also need to be careful with your horizontal DMA if it crosses the "32x32" barrier.
Re: Need some tips for a demo
Posted: Mon Feb 25, 2019 1:16 am
by calima
koitsu wrote:The answer is no, you can't control the "source increment" while DMAing to PPU RAM; it always reads in increments of 1, because that's the nature of DMA (I don't know of any DMA systems that let you control that, but I suspect the one on the PS2 probably does -- its DMA implementation is crazy).
N64's RSP DMA can. It's intended for copying sub-textures.
Re: Need some tips for a demo
Posted: Mon Feb 25, 2019 10:04 am
by ndiddy
Oziphantom wrote:An added complication is that the 2x2 is stored as 4 separate screens and not 1 large screen; this means you have to do some "fancy" maths to work out the address.
Base = (X & 40) << 8 + X & 3F
This means you also need to be careful with your horizontal DMA if it crosses the "32x32" barrier.
That's why I like 16x16 tile mode (which is what the numbers I provided earlier were for): you can stick with a 32x32 tilemap and still get seamless scrolling in both directions with a linear layout.
Re: Need some tips for a demo
Posted: Mon Feb 25, 2019 2:37 pm
by vivi168
You don't *have* to use DMA for these updates/writes, of course! It may make more sense to use DMA just for horizontal panning situations, and to do the $2118/2119 writes yourself natively for vertical panning situations
Oh ok, I was under the assumption that one should never copy data manually because it would be too slow.
But in this case it makes sense, because there isn't much data to transfer after all ($80 bytes for one column).
Thank you all for your replies, they're very instructive.
Re: Need some tips for a demo
Posted: Mon Feb 25, 2019 3:53 pm
by 93143
If you're loading the system heavily, such that there's a serious risk of running out of VBlank time at some point, you should probably use DMA for everything unless it's so small that you can show manual copy isn't wasting time.
But if you're not, there may be no reason to go to the trouble. Since you're coding on bare metal, the only real "rule" in SNES development is that what you do has to work. (It also helps if you can still read your code when you come back to it after a week...)
Re: Need some tips for a demo
Posted: Mon Feb 25, 2019 4:18 pm
by koitsu
You should use DMA if/when technically possible, but not all situations can be done using DMA. DMA is substantially faster than code, by around a factor of... 8? 10? I don't know (refs: #1, #2, #3, #4, but all of these talk about clocks, which is not the same thing as CPU cycles). The reason you can't use DMA directly in your particular case is that you want a way to increment the source address by something other than 1 or 2 -- SNES DMA can't do that. What most people end up doing is putting into WRAM, linearly, the bytes they want to be written to PPU RAM (by whatever increment), then using DMA for that.
Here are cycle counts for a 128 byte transfer into PPU RAM (with a 32x32 increment in PPU RAM, as well as from WRAM). Don't just use this, please read everything I've written. I haven't done SNES code in ~20 years so I may have parts of this wrong (ex. bits of $2115, PPU RAM layout for tilemap, etc.). Cut me some slack please.
Code:
sep #$20 ; 3 cycles
; $2115 bit 7 = %1 = increment PPU RAM address on write to $2119 (low byte @ $2118, high byte @ $2119)
; $2115 bit 1,0 = %01 = increment PPU RAM address 32x32, e.g. one column at a time
;
lda #%10000001 ; 2 cycles
sta.l $002115 ; 5 cycles
rep #$30 ; 3 cycles
; XXXX = PPU RAM address of tilemap start; fill in yourself
;
lda #$xxxx ; 3 cycles
sta.l $002116 ; 6 cycles
ldx #0 ; 3 cycles
loop:
lda.l $7f0000,x ; 6 cycles
sta.l $002118 ; 6 cycles
txa ; 2 cycles
clc ; 2 cycles
adc #$40 ; 3 cycles
tax ; 2 cycles
cpx #$800 ; 3 cycles
bne loop ; 3 cycles if branching, 2 cycles if not
The initial setup (everything up to and including ldx #0) takes 25 cycles.
Each loop iteration (of writing 2 bytes to PPU RAM) takes 27 cycles, including the cost of the branch being taken. 27*63 = 1701 cycles. The final transfer, where the branch isn't taken, takes 26 cycles. So 1701+26 = 1727 cycles total for the loop, or 1727+25 = 1752 cycles for everything you see above. (Edit: I suspect I may be off by 1 somewhere, as I had to edit my code due to forgetting you can't do stx long).
This is a "slow but safe" routine. It can be optimised in several different ways -- examples: not using long addressing when writing to $2118 (only will work in mode 20/LoROM), setting DB=$7F and then using absolute addressing for WRAM reads, switching DB=$00 and using absolute addressing for $2118/2119 writes, doing something like lda #$2100 / tcd / sta $18 (to write to $2118), unrolling the loop entirely + not using X indexing at all since the $7fxxxx addresses can be pre-calculated (this has most savings but at cost of ROM space), etc...
I forget how much time there is in NMI/VBlank on the SNES, but I imagine it's only a bit more than this.
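For anyone who wants to re-check the tally, here is a small Python sketch that just adds up the per-instruction counts from the code comments above. It takes those counts at face value and is not a measurement; `routine_cycles` is a hypothetical helper name. The prose assumes 64 iterations (128 bytes), while the loop bounds as posted (step $40 up to $800) would give 32, so both are computed.

```python
# Cycle counts copied from the comments in the routine above.
SETUP = [3, 2, 5, 3, 3, 6, 3]          # sep, lda #imm8, sta.l, rep, lda #imm16, sta.l, ldx
LOOP_TAKEN = [6, 6, 2, 2, 3, 2, 3, 3]  # lda.l,x sta.l txa clc adc tax cpx bne (branch taken)

setup = sum(SETUP)
per_iter = sum(LOOP_TAKEN)

def routine_cycles(iterations: int) -> int:
    """Total cycles: setup + taken iterations + a final one where bne falls through."""
    last_iter = per_iter - 1           # bne costs 2 instead of 3 when not taken
    return setup + per_iter * (iterations - 1) + last_iter

print(setup, per_iter)                 # 25 27
print(routine_cycles(64))              # 1752, matching the figure in the text
print(routine_cycles(32))              # 888, if the loop really runs 32 times
```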
Re: Need some tips for a demo
Posted: Mon Feb 25, 2019 6:58 pm
by tepples
Vblank on NTSC Super NES in 224-line mode is 262 - 224 = 38 lines or thereabouts, and each line is (1364 - 40) / 8 = 165.5 slow cycles. (DMA and WRAM access use slow cycles.) I haven't tested this in detail, but I imagine the usable vblank time may be reduced by up to 1 line to allow for retrieving the first scanline's sprite data. So for now, I'll say 37 * 165.5 = 6123 cycles.
EDIT: Teaches me to trust mental math
Re: Need some tips for a demo
Posted: Mon Feb 25, 2019 9:08 pm
by koitsu
tepples wrote:Vblank on NTSC Super NES in 224-line mode is 262 - 224 = 38 lines or thereabouts, and each line is (1364 - 40) / 8 = 161.5 slow cycles. (DMA and WRAM access use slow cycles.) I haven't tested this in detail, but I imagine the usable vblank time may be reduced by up to 1 line to allow for retrieving the first scanline's sprite data. So for now, I'll say 37 * 161.5 = 5975 cycles.
Here's something that might help you know if your estimate is correct or not: the official developers manual has this little footnote on the bottom of page 2-17-2 (general description/introduction to general DMA) that says:
In case of 224 lines, general purpose DMA can transfer 6K byte of data maximum during V-Blank period.
Re: Need some tips for a demo
Posted: Mon Feb 25, 2019 9:47 pm
by 93143
(1364-40)/8 is 165.5, not 161.5. Multiply by 37 and you get 6123.5. Multiply by 38 (assuming that VRAM or at least CGRAM is safely writable while OAM is being scanned for the first time) and you get 6289.
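The same arithmetic as a quick Python check. This is only a sketch of the numbers quoted in this thread (NTSC, 224-line mode, no interlace); the constant and function names are made up for illustration.

```python
MASTER_CLOCKS_PER_LINE = 1364
SUBTRACTED_CLOCKS = 40       # the "- 40" per line in the formula above
CLOCKS_PER_SLOW_CYCLE = 8

def slow_cycles_per_line() -> float:
    return (MASTER_CLOCKS_PER_LINE - SUBTRACTED_CLOCKS) / CLOCKS_PER_SLOW_CYCLE

def vblank_budget(lines: int) -> float:
    """Upper bound on slow cycles in the given number of vblank lines."""
    return lines * slow_cycles_per_line()

print(slow_cycles_per_line())   # 165.5
print(vblank_budget(37))        # 6123.5
print(vblank_budget(38))        # 6289.0
```

Both results sit close to the "6K byte" figure from the manual footnote, given that general DMA moves one byte per slow cycle.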
Best not to try to pack it 100.000% full. NMI doesn't start instantly at the end of the last active line, and there's wobble in the timing due to CPU instruction handling. Also, it's not clear what the full exact intervals are in which OAM and VRAM are writable. There's even a short scanline during VBlank if interlace is off, although it's only by half a byte cycle (4 master clocks) and the rest of the timing slop dwarfs that.
Which reminds me - these line counts do not apply to PAL. A PAL console has 72/73 lines of VBlank in 239-line mode, or 87/88 lines in 224-line mode. And if interlace is on, there's a scanline at the end of VBlank that's half a byte cycle longer than usual.
Oh yeah, and interlace adds an extra line every other frame, on any console.
Lots of information.
Re: Need some tips for a demo
Posted: Mon Feb 25, 2019 9:58 pm
by psycopathicteen
tepples wrote:Vblank on NTSC Super NES in 224-line mode is 262 - 224 = 38 lines or thereabouts, and each line is (1364 - 40) / 8 = 161.5 slow cycles. (DMA and WRAM access use slow cycles.) I haven't tested this in detail, but I imagine the usable vblank time may be reduced by up to 1 line to allow for retrieving the first scanline's sprite data. So for now, I'll say 37 * 161.5 = 5975 cycles.
Isn't that 165.5 cycles, not 161.5 cycles?
@koitsu
You would also need to do some math with the VRAM address and X index. If you're using 16x16 sized tiles:
lda camera_x
lsr #4
and #$001f
sta temp
ora #map_address
sta $2116
If you're storing the level as 32x32 tile maps laid out horizontally, then you would do this to calculate the X index:
lda camera_x
and #$fe00
asl
ora temp
asl
tax
Re: Need some tips for a demo
Posted: Mon Feb 25, 2019 11:33 pm
by koitsu
@psycopathicteen I don't think this is relevant to what the fellow is doing in his first pass, but we need clarification from him on what he's going with for now. I get the impression, as a first-pass-attempt, he's going to keep a raw copy of the tilemap (all $800 bytes) in WRAM, and wanted to update bits/pieces of it (ex. a column) and then write that to the related/appropriate part of PPU RAM -- i.e. $7F0000-01 ends up in (assuming SC base address = PPU RAM $2000) $2000-01, $7F0040-41 ends up in $2040-41, etc.. -- for the far left column.
You're just giving out magical variables that have no description, nor were those variables in my code, so... I don't think this helps someone learn, respectfully. I actually understand the first set of code (for dynamically figuring out where in PPU RAM you want to start at), and I know it relates to the whole "camera/view into PPU RAM/WRAM" concept, but it's not easily explained from just 6 lines of code.
I don't know about the tile size, but I'm operating off of 8x8 under modes 0 through 4. 16x16 I think is a different situation (everything doubles, including the BG scroll ranges).
Also, I think I might have some of the values wrong -- it might need to be adc #$20 and cpx #$400, but I can't remember. This is one of the areas of the official documentation that sucks: they often talk about things in words (2 bytes) but then refer to indices etc. as +0, +1, +2, +3 (i.e. +0, +2, +4, +6). This compounded by it being 20+ years for me doesn't help. I'm trying my best off of memory though.
Re: Need some tips for a demo
Posted: Tue Feb 26, 2019 11:28 am
by psycopathicteen
What if I add some notes?
Code:
lda camera_x //--MMMMMTTTTTPPPP M = map, T = tile, P = pixel
lsr #4 //------MMMMMTTTTT
and #$001f //-----------TTTTT
sta temp
ora #map_address
sta $2116
lda camera_x //--MMMMMTTTTTPPPP
and #$fe00 //--MMMMM---------
asl //-MMMMM----------
ora temp //-MMMMM-----TTTTT
asl //MMMMM-----TTTTT-
tax
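The same bit shuffling can be tried out in Python. This is only a sketch that mirrors the annotated code above step by step, assuming 16-bit registers and 16x16 tiles; `column_addresses` and the example `map_address` of $1000 are hypothetical, not something from the code above.

```python
def column_addresses(camera_x: int, map_address: int) -> tuple[int, int]:
    """Return (VRAM word address for $2116, X index into the level data)."""
    tile = (camera_x >> 4) & 0x001F          # lsr #4 / and #$001f  -> temp
    vram_addr = tile | map_address           # ora #map_address     -> $2116

    x = camera_x & 0xFE00                    # and #$fe00: keep the map bits
    x = (x << 1) & 0xFFFF                    # asl (16-bit register)
    x |= tile                                # ora temp
    x = (x << 1) & 0xFFFF                    # asl: tile column * 2 bytes
    return vram_addr, x

# camera_x = $0230: map 1, tile column 3, pixel 0
vram_addr, x_index = column_addresses(0x0230, 0x1000)
print(hex(vram_addr), hex(x_index))          # 0x1003 0x806
```

The X index comes out as map * $800 + column * 2, i.e. the byte offset of that column's first entry inside the right 32x32 map.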
Re: Need some tips for a demo
Posted: Tue Feb 26, 2019 3:36 pm
by vivi168
This is a "slow but safe" routine. It can be optimised in several different ways -- examples: not using long addressing when writing to $2118 (only will work in mode 20/LoROM), setting DB=$7F and then using absolute addressing for WRAM reads, switching DB=$00 and using absolute addressing for $2118/2119 writes, doing something like lda #$2100 / tcd / sta $18 (to write to $2118), unrolling the loop entirely + not using X indexing at all since the $7fxxxx addresses can be pre-calculated (this has most savings but at cost of ROM space), etc...
I must admit I was a little lost here.
Why does long addressing only work in mode 20/LoROM?
As for TCD, I've done some research: I think it transfers the 16 bits of the accumulator to the Direct Page Register, but I'm not exactly sure what that means.
Does it affect which page/bank you access? For example instead of accessing Zero Page, you access Page 02?
@psycopathicteen I don't think this is relevant to what the fellow is doing in his first pass, but we need clarification from him on what he's going with for now. I get the impression, as a first-pass-attempt, he's going to keep a raw copy of the tilemap (all $800 bytes) in WRAM, and wanted to update bits/pieces of it (ex. a column) and then write that to the related/appropriate part of PPU RAM -- i.e. $7F0000-01 ends up in (assuming SC base address = PPU RAM $2000) $2000-01, $7F0040-41 ends up in $2040-41, etc.. -- for the far left column.
So to clarify, first of all, I've set $2105 (BGMODE) to $01 (bg mode 1, tile size 8), and $2107 (BG1SC) is set to $10 (only one 32x32 tilemap).
At the beginning, I first load the tilemap from the ROM to PPU via DMA. Then when going left and right, I scroll the background. For now, it only wraps back.
What I'm looking to do, is fetch a column of another tilemap in the ROM and write it over a column of the tilemap in the PPU. (So it gives the illusion of a longer level, like the GIF tepples posted). From what I've read in this thread, it might be a good idea to first transfer it to the WRAM, and then copy the column I need from the WRAM to the PPU.
I now have a precise idea of what I need to write to achieve this. I'm gonna try it and hopefully get back to you with something working.
Thanks a bunch
Re: Need some tips for a demo
Posted: Tue Feb 26, 2019 4:28 pm
by tepples
Using 24-bit addressing works in both LoROM and HiROM. Not using long addressing, that is, using 16-bit absolute addressing without regard for the contents of the data bank register (B), is most practical in LoROM. The difference is that in HiROM, the data bank register will usually be in $C0-$FF, which is ROM, or $7E-$7F, which is WRAM. Neither of these banks allows access to MMIO. In LoROM, by contrast, the data bank register is more likely to be in $80-$BF, which contains the MMIO areas $2100-$21FF and $4200-$437F. There are more advanced ways to use MMIO with absolute addressing in HiROM, and they rely on the fact that the second half of each bank of $C00000-$FFFFFF ($C08000-$C0FFFF, $C18000-$C1FFFF, etc.) is mirrored down to banks $80-$BF. This requires bit manipulation either at assembly time or at runtime when determining which values to push onto the stack before plb, and it may require the linker script to place certain data in the second half of a bank.
When you load $0000 into the direct page base register (D), lda $3F reads $00003F.
When you load $0200 into the direct page base register, lda $3F instead reads $00023F.
When you load $0210 into the direct page base register, lda $3F reads $00024F and costs an extra cycle because bits 7-0 of the register are not 0.
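Those three cases can be modelled in a few lines of Python. This is a rough sketch of the address calculation only (native mode, direct page addressing, no indexing); `direct_page_access` is a hypothetical helper name.

```python
def direct_page_access(d: int, offset: int) -> tuple[int, bool]:
    """Return (effective address in bank $00, whether the extra cycle applies)."""
    address = (d + offset) & 0xFFFF      # D + 8-bit operand, always in bank $00
    extra_cycle = (d & 0x00FF) != 0      # penalty when the low byte of D is nonzero
    return address, extra_cycle

for d in (0x0000, 0x0200, 0x0210):
    address, extra = direct_page_access(d, 0x3F)
    print(f"D=${d:04X}: lda $3F reads $00{address:04X}, extra cycle: {extra}")

# The lda #$2100 / tcd / sta $18 trick mentioned earlier lands on $2118:
print(hex(direct_page_access(0x2100, 0x18)[0]))   # 0x2118
```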