Compiled stack proposal

Discuss technical or other issues relating to programming the Nintendo Entertainment System, Famicom, or compatible systems. See the NESdev wiki for more information.

Moderator: Moderators

Garth
Posts: 246
Joined: Wed Nov 30, 2016 4:45 pm
Location: Southern California
Contact:

Re: Compiled stack proposal

Post by Garth »

Bregalad wrote: Wed May 13, 2020 2:24 am
Garth wrote: Wed May 13, 2020 2:10 am It seems like some key characteristics of stacks are perhaps not being understood correctly. There's a treatise on 6502 stacks (plural, not just the page-1 hardware stack) at http://wilsonminesco.com/stacks/ . Later chapters do get into local variables and environments, and recursion. You may not need recursion; but once you have local variables and environments, there's nothing keeping you from doing recursion. It's simply not a problem, as long as you don't run out of stack space.
Of course it is a problem, why would you sacrifice a register for being your stack pointer, and use longer, slower instructions, and possibly waste a lot of ROM space, when you're not going to use recursion anyway ?!
True, absolute addressing will take one more cycle than ZP addressing, and indexing will take one cycle more than non-indexed; but if you already have local variables and environments, recursion does not increase the requirements any. That was my point. I thought the context here was an HLL though, since C has been mentioned; and in that case you're making a big performance sacrifice anyway. Actually, I don't recall ever using recursion in my own programming; but I often use re-entrancy, particularly where an interrupt-service routine for an alarm may call a routine that got interrupted.
http://WilsonMinesCo.com/ lots of 6502 resources
User avatar
aa-dav
Posts: 220
Joined: Tue Apr 14, 2020 9:45 pm
Location: Russia

Re: Compiled stack proposal

Post by aa-dav »

Hisoft C is C compiler with 24Kb core within target machine (ZX Spectrum 48).

And it's source code for stdlib library is really cool:

Code: Select all

/*****************************/
/*         Hisoft C          */
/* Standard Function Library */
/*          HEADER           */
/*                           */
/* Copyright (C) 1984 Hisoft */
/* Last changed  15 Aug 1985 */
/*****************************/
Local variables were slow and you don't believe this trick:

Code: Select all

#define  FAST  static
TADAMMM!

There were no type void in this age, so:

Code: Select all

#define  void  int
Simple and efficient...

First generation of C language (Kernighan and Ritchie first books!) was very different to modern realities, so functions with non void and int return types requre some declarations:

Code: Select all

/*  Forward declarations for non-int library functions  */
extern char *strcat(), *strncat(), *strcpy(), *strncpy(), *strchr(), *strrchr(),
            *strpbrk(), *calloc(), *malloc(), *sbrk(),    *fgets(),  *gets();
var_args was implemented in non standard way:

Code: Select all

/*  Two variadic arithmetic functions (see manual for details)  */
int max(param_byte_count)  auto
{
  static int argc, *argv, max;

  argc = param_byte_count/2 - 1;
  argv = &param_byte_count  + argc;
  max  = -32767;

  while (argc--)
    {
      if (*argv > max) max = *argv;
      --argv;
    }

  return max;
}
Type casts were made in non-standard way too:

Code: Select all

typedef  char * __char_ptr;
int peek(address)
{
  return  * cast(__char_ptr) address;
}
void poke(address, value)
{
  * cast(__char_ptr) address = value;
}
There was support of 32-bit:

Code: Select all

void long_add(c, a, b)
  char *a, *b, *c;
{
  static unsigned u, i;

  u = 0;
  for (i = 0; i < 4; ++i)
    {
      u   +=  *a++  +  *b++;
      *c++ =  u & 0xff;
      u  >>=  8;
    }
}
Really! In this way...

And meet this true horror of programming in 1980:

Code: Select all

#define  FALSE   0 /* for Boolean operations */
#define  TRUE    1
Thew...
But isn't #define FAST static cool?
coto
Posts: 102
Joined: Wed Mar 06, 2019 6:00 pm
Location: Chile

Re: Compiled stack proposal

Post by coto »

aa-dav wrote: Wed May 13, 2020 11:35 am Hisoft C is C compiler with 24Kb core within target machine (ZX Spectrum 48).

And it's source code for stdlib library is really cool:

Code: Select all

/*****************************/
/*         Hisoft C          */
/* Standard Function Library */
/*          HEADER           */
/*                           */
/* Copyright (C) 1984 Hisoft */
/* Last changed  15 Aug 1985 */
/*****************************/
Local variables were slow and you don't believe this trick:

Code: Select all

#define  FAST  static
TADAMMM!

There were no type void in this age, so:

Code: Select all

#define  void  int
Simple and efficient...

First generation of C language (Kernighan and Ritchie first books!) was very different to modern realities, so functions with non void and int return types requre some declarations:

Code: Select all

/*  Forward declarations for non-int library functions  */
extern char *strcat(), *strncat(), *strcpy(), *strncpy(), *strchr(), *strrchr(),
            *strpbrk(), *calloc(), *malloc(), *sbrk(),    *fgets(),  *gets();
var_args was implemented in non standard way:

Code: Select all

/*  Two variadic arithmetic functions (see manual for details)  */
int max(param_byte_count)  auto
{
  static int argc, *argv, max;

  argc = param_byte_count/2 - 1;
  argv = &param_byte_count  + argc;
  max  = -32767;

  while (argc--)
    {
      if (*argv > max) max = *argv;
      --argv;
    }

  return max;
}
Type casts were made in non-standard way too:

Code: Select all

typedef  char * __char_ptr;
int peek(address)
{
  return  * cast(__char_ptr) address;
}
void poke(address, value)
{
  * cast(__char_ptr) address = value;
}
There was support of 32-bit:

Code: Select all

void long_add(c, a, b)
  char *a, *b, *c;
{
  static unsigned u, i;

  u = 0;
  for (i = 0; i < 4; ++i)
    {
      u   +=  *a++  +  *b++;
      *c++ =  u & 0xff;
      u  >>=  8;
    }
}
Really! In this way...

And meet this true horror of programming in 1980:

Code: Select all

#define  FALSE   0 /* for Boolean operations */
#define  TRUE    1
Thew...
But isn't #define FAST static cool?
Damn son.

-

I agree with tepples but I suspect it may be much more work than intended since the paging NES method must make relative calls within a desired scope (from the processor) and the global program scope (when reentrant calls sharing variables). Now, I hope you guys manage to do it and you've got my support!!

Oh, when done, please, add a lot of test cases when compiling different code. That will literally save a ton of wasted hours.
User avatar
NOOPr
Posts: 75
Joined: Tue Feb 27, 2018 10:41 am
Location: Brazil
Contact:

Re: Compiled stack proposal

Post by NOOPr »

I made something like this once, it's called pa65.
I have not yet publicly exposed it because it lacks a good documentation.
If you want to try, here it is https://github.com/Parisoft/pa65
User avatar
Bregalad
Posts: 8056
Joined: Fri Nov 12, 2004 2:49 pm
Location: Divonne-les-bains, France

Re: Compiled stack proposal

Post by Bregalad »

Bananmos wrote: Wed May 13, 2020 7:04 am Well, I was thinking it'd be nice if the system could just deduce it from parsing jsr instructions, without a need for the pseudo-instruction for most functions.
Exactly. Ideally you would need special key instructions hidden in the instructions only at relevant places, such as start/stop the tree system (around interupt routines for example; or special routines doing trickery beyond the normall call/return system such as my boss routines).

Maybe the program would output a list of .define statements which could in turn be included in the assembly process ?

For example when using semi-auto allocated variables it would be like:

Code: Select all

Myroutine:
   lda myroutine_temp1      ; this variable is auto-allocated
   clc
   adc player_HP                ; This one is definitely not
   sta myroutine_temp2      ; etc...
   jsr something                 ; the parser understands automatically we call something without any extra info needed
   bcc _next:
   rts                                 ; The parser understands automatically that's a possible end of myroutine
_next:
   lda myroutine_temp3    ; but figures there's still another end ahead
   ...
   jmp something_else       ; the parser should understand this is a tail-call
Oziphantom
Posts: 1565
Joined: Tue Feb 07, 2017 2:03 am

Re: Compiled stack proposal

Post by Oziphantom »

Oh yeah Trampolines, didn't even think of those.. they will basically become the nexus of your project and make everything dependent on everything. A "unlink" this function would be needed.
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: Compiled stack proposal

Post by tepples »

Oziphantom wrote: Wed May 13, 2020 10:46 am SelfMod, Dispatch via RTS, Double Interrupt stabilization, Main, IRQ, NMI and potentially BRK threads to consider, stack manipulation, rts chaining and just well I want to return a couple up so lets just pop a bunch of stuff off the stack and go for it.
Self-modifying code can be handled the same way so long as variable addresses aren't modified to point at a caller's variables. I haven't put much thought into it seeing as I need as much of the 2048 bytes of internal work RAM for game state as I can get.

RTS dispatch from a jump table is straightforward using tailcalls, which treats every entry in the jump table as a callee. I haven't considered RTS dispatch in the context of jsr coroutine_yield; I imagine it to be a lot trickier.

I'm not sure what "Double Interrupt stabilization" is, but main, IRQ, and NMI threads could each get their own static stack. Ideally, IRQ and NMI wouldn't be very deep anyway unless you're trying to (say) run the game in NMI and low-priority background tasks that can block (like art decompression) in main. And I hadn't considered syntax to add to one stack or the other because my IRQ and NMI routines have never been that deep.

rts chaining and just well I want to return a couple up so lets just pop a bunch of stuff off the stack Unless I'm misunderstanding something, multi-layer return wouldn't need to do anything, as the intermediate callee's local variable lifetimes are ending too. Could you explain what you mean by "RTS chaining"?
Oziphantom wrote: Wed May 13, 2020 10:46 am Having function jump into the middle of other functions
If subroutine X jumps into the middle of subroutine Y, this can be done one of two ways:
  1. Subroutine Y is actually two subroutines: Y1 and Y2, with X jumping to the start of Y2, and Y1 falling through to Y2. Fallthrough is marked as a tail call.
  2. Even simpler: Mark that X calls Y when it jumps to the middle of Y. This makes all of Y1's variables and all of Y2's variables live.
Bregalad's sample routine as marked up:

Code: Select all

.proc Myroutine
var temp1
var temp2
   lda temp1                ; this variable is auto-allocated
   clc
   adc player_HP            ; This one is definitely not
   sta temp2                ; etc...
   call something           ; tell the parser that something participates in this static stack in case something is in another source code file
   bcc _next
      rts                   ; Variable lifetime ends when a routine returns
   _next:
   lda myroutine_temp3      ; but figures there's still another end ahead
   ...
   tailcall something_else  ; even if something_else is in another file
.endproc
The primary reason for a distinction between call and jsr and between tailcall and jmp is that a preprocessor cannot tell whether or not a subroutine defined in another source code file participates in the static stack allocation. If call is used on a subroutine that is not a "callee", the other subroutine will not have defined and exported a symbol representing the end of its local variables. This will cause linking to fail with an unresolved external symbol. The other way is to run the preprocessor on everything and arrange for all subroutines to be marked as callees, even if neither var nor leaf is used in that subroutine, so that the preprocessor can emit allocation metadata for all subroutines. This fails when calling third-party libraries, which do not emit allocation metadata.
Oziphantom
Posts: 1565
Joined: Tue Feb 07, 2017 2:03 am

Re: Compiled stack proposal

Post by Oziphantom »

making this NES 6502 rather than 6502 seems a bit of a waste. However you do also have 8K expansion RAM for the NES to consider.

Double Interrupt stabilization:
When you interupt you have a 0-8 clock window of jitter, so to defeat this you can do
Make Interupt fire on Line 10

Code: Select all

;setup irq to fire on next line
;save the current stack pointer
;cli
; waste clocks until near the end of the line if needed
nop
nop
nop
nop ; somewhere along here the next interrupt fires 
nop

;next interrupt
; restore stack to pop off the extra interrupts data
since you interrupted on a nop, a jitter is now 0-1
; if you need dead 0 jitter. 
; waste almost a line then set it up so the fetch of the cmp in 
; lda rasterline
; cmp rasterline 
; fall either on the last clock ( so it is the same) or first clock of the next line ( so it is different )
; beq *
; * ; now if the where equal the branch is taken 3 clock, it the differ not taken 2 clocks
; you are now on clock 3 of the line 100% 
so basically there are 2 interrupts but only 1 rti
The other trick is "executing I/O registers" so you start a timer that counts down then you set the byte before the timer, to be JMP or JSR as you need. and the set the byte after the timer to be 00.
thus as the clocks go you get
JMP $0800
JMP $0700
JMP $0600
....
in a loop, so you set the IRQ handler to the I/O registers so the IRQ fires, then it executes the JMP with the timer byte high given you a routine you enter knowing the jitter on.

rts chaining
Given

Code: Select all

#DrawDebugStringStackNumber kDebugStrings.playerx,0,0
	#DrawDebugStringStackNumber kDebugStrings.playery,0,1
	#DrawDebugStringStackNumber kDebugStrings.playerTileX,0,2
	#DrawDebugStringStackNumber kDebugStrings.playerTileY,0,3
	#DrawDebugStringStackNumber kDebugStrings.playerTileNum,0,4
	#DrawDebugStringStackNumber kDebugStrings.playerColFlags,0,5
	#DrawDebugStringStackNumber kDebugStrings.Map,35,0
	#DrawDebugStringStackNumber kDebugStrings.MapTCP,40,0
	#DrawDebugStringStackNumber kDebugStrings.MapTCP,51,1
	.rta TrampolineCallStackDrawExit 
where the macros expand as

Code: Select all

DrawDebugStringStackNumber .macro 
	.rta jDrawStringAtPrecalc
	.byte \1 
	.word $3000+(80*\3)+\2	
.endm
btw .rta inserts the return address of an object, so basically .word thing-1.
Then you set the stack to be where this data is(or copy it onto the stack), then you RTS to which the RTS then pops off the first address and thus calls it, pops its params and when done, just RTS which calls the next, and the next until it finally pops the last exit function. This way you can change the call order, don't have to put any logic in to work out what is next, no loops and its not statically linked.
multi layer return
Yes the scopes end, but you need to detect that when you do a

Code: Select all

pla
pla
pla
pla
rts
this ends the scope of it and 2 parents, and this function will not continue into any code after said two jsrs

Having function jump into the middle of other functions
I think a flaw with the current example is calling is not only done with a JSR, calling is also done with a JMP or Branch

so given the following structure

Code: Select all

entry 
set temp2 to something
some code
branch A
some code
branch B
rts

A
some code
set temp1 to something
jmp C

B some code
set temp1 t something
jmp C

D some code
set temp1 to something
set temp2 to something
branch C

C
some code
use temp1
jmp E

E
use temp2
some code rts
So the paths are
Entry -> A -> C -> E
Entry -> B -> C -> E
D -> C -> E
thus you need to treat
Entry + A + B as a " function"
C + E as a "function"
D as a "function"

but they don't all "call" each other because they are seen as one "group" i.e logically Entry + A + B + C + E make one function and the variable scope needs to be shared, i.e A sets up something to be used in E, and like wise D is part of the chain but a different branch, so it needs to set up things that are also used in C + E, so its variables must be allocated the same way as Entry + A + B
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: Compiled stack proposal

Post by tepples »

2022 EuroLLVM Dev Mtg “LLVM-MOS 6502 Backend: Having a Blast in the Past” by Daniel Thornburgh (10 minute video) shows how LLVM is being ported to target the 6502 CPU. This includes static stack allocation and reservation of 32 bytes of zero page as virtual registers.
Oziphantom
Posts: 1565
Joined: Tue Feb 07, 2017 2:03 am

Re: Compiled stack proposal

Post by Oziphantom »

sure but that is a C compiler which has the benefit of full control over the 6502 it generates, and hence knows all the cases it needs before hand. Not something trying to pass 6502 code and work out what the overlap is.
mysterymath
Posts: 6
Joined: Sat Apr 23, 2022 12:34 pm

Re: Compiled stack proposal

Post by mysterymath »

tepples wrote: Wed Jun 15, 2022 12:54 pm 2022 EuroLLVM Dev Mtg “LLVM-MOS 6502 Backend: Having a Blast in the Past” by Daniel Thornburgh (10 minute video) shows how LLVM is being ported to target the 6502 CPU. This includes static stack allocation and reservation of 32 bytes of zero page as virtual registers.
I'm actually working on extending this to:
1) Automatically promote static stack slots to the zero page, based on whole-program estimates of hotness, and
2) Allow ZP locations/static stack slots to overlap if the compiler can prove that their two containing functions can't be active at the same time.

You'd have to tell the compiler how many contiguous zero page locations it can expect; our SDK targets would provide maximal defaults. If you wanted to reserve zero page for application use (i.e., outside the compiler), you'd need to pass something like --mreserve-zp=64, which would cut out 64 bytes of zero page from the chunk allocated to the compiler.
Oziphantom
Posts: 1565
Joined: Tue Feb 07, 2017 2:03 am

Re: Compiled stack proposal

Post by Oziphantom »

side note for Commodore machines you will need an option to set the lowest zp address it uses, as the 00 and 01 are "registers" internal to the CPU. Even in the 8502 when you invoke the MMU's ZP relocation the lower 2 ZP remain internal register.
Post Reply