Some mumblage about possible function call and register conventions for a new Lisp compiler on the RT.

FUNCTION TYPES

The old notion of a "function object" is now broken down into three different parts:

Function entry
    A function entry is a structure that holds the information that we need to call a function.

Environment
    The environment is stuff that a function needs when it runs. This includes constants computed at load time and variables closed over at run time. Environment information may be allocated in the function entry structure after the required linkage information.

Debug information
    This is information about a function which isn't normally needed at run time. Debug information will be either allocated in or pointed to by the environment area.

Function entries:

We support two function calling conventions: named and anonymous. A named call is a call to a known global function, and is done through the link-table. An anonymous call is done when the function being called is not known at compile time.

The format of a function entry looks like this:

    header word {function entry subtype}
    pointer to code-vector
    arg-count info or pointer to env that has this info
    environment info...

There are a fixed number of entry points, one for each possible number of register args, and one for when there are more. When there are stack args, the total number of args is passed in the NARGS register. The entry points are direct pointers to the PC where the code starts. If it is not legal to call the function with a particular number of args, then the entry point points to a standard error routine.

It would be possible to have additional entry points for specific numbers of arguments >3, which could be used by named call, but it probably isn't worth the effort: if there are stack arguments, then you are losing pretty badly already.

When we do an anonymous call (FUNCALL), we make sure the object is a function object, grab the code vector and jump in at a fixed offset.
If it is not a function object, we jump off somewhere else and deal with LAMBDAs and illegal functions. The 0 arg entry is at byte 0, the 1 arg entry at byte 4, etc. Each of these fixed entries will be either a relative branch to the real entry code or an absolute branch to an error routine. When snapping link-table links, we look at these branches to find the real entry point, allowing us to skip this extra branch during named call.

This representation makes no distinction between "closures" and functions which have no closure variables. For that matter, there is no distinction between "constants" and closure variables.

Purify will move all environments and their dependent structure into read-only space. If this will be bad for you, then you can throw a switch that causes environments to be moved into static space.

In some cases, we want to store environment info without any directly associated function entry. Such an environment is indicated by the subtype in the header. Presumably the environment can somehow be reached from the environment area of some entry point.

DEBUG INFO

Since the compiler takes substantial liberties with where it allocates values and how it does function call, we need at least some special debug info before we can do any kind of reasonable debugging. There are three interesting levels of debug info:

1] Minimal info which allows for a backtrace with arguments. This requires information about the locations of the basic context pointers such as the environment ptr, the previous environment and PC, and the arguments themselves. Values which are allocated in a register may also have save locations which must be used when the routine is not the current one. This info could be represented by a vector of alternating register numbers and stack offsets, with either one being null if the value isn't allocated there.
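A minimal sketch of how such an alternating location vector might be consulted (Python used purely for illustration; the function name and the tuple encoding of locations are assumptions, not part of the design):

```python
# Illustrative sketch: a per-value location record as a vector of
# alternating register numbers and stack offsets, with None meaning
# "not allocated there".  Slot 2*i is the register for value i,
# slot 2*i+1 is its stack save location.

def value_location(locs, i, current_frame):
    """Return where value i lives: its register if this frame is the
    currently executing one, else its stack save slot."""
    reg, offset = locs[2 * i], locs[2 * i + 1]
    if current_frame and reg is not None:
        return ('register', reg)
    if offset is not None:
        return ('stack', offset)
    return ('register', reg)  # register-only value; no save slot exists

# Example: value 0 lives in register 5 with a save slot at offset 12;
# value 1 is stack-only at offset 16.
locs = [5, 12, None, 16]
```

The point of the save-slot column is visible here: a register value of a suspended frame must be read from its save location, not the register.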
The debug info for each environment will also need the ranges of PCs in which it is current, so that we can find the correct environment given the current PC.

2] Info which allows location of user variables given their name. A reasonably compact representation for this info would be to have a vector for each kind of storage, with a cell for each location. Each cell would store the string name of the variable located there, or a vector of the strings if there is more than one. The compiler should uniquify repeated variable names somehow.

3] Info which allows a precise source location for some PC to be determined. For each component, we have a list of structures describing each top-level form. There is information about the source file and the location in the file where the form starts, and a sorted vector which translates from PCs to source paths within the form. The vector has alternating PC and source path entries; the code between two PC entries belongs to the source path they surround. The source path is represented as a 4-bit I-Vector, with big indices being represented by 15 followed by a 4-nybble offset. We can locate the source path for a PC by doing a binary search.

CONTINUATION REPRESENTATION

A continuation does the same thing as a conventional stack frame, but is conceptually a sort of stack closure. It contains two areas of stuff: saved registers and temporary data. The saved registers are blasted out immediately before a call using STM. Temporary data is anything which didn't make it into registers, and is accessed word by word.

CONT remains unchanged during the execution of a function. SP may be changed as the demand for temporary storage changes. Since the caller resets SP after a call, the amount of temporary storage may be changed for free.

    SP -->
            .
            .
    CONT -->
            temporary data
            .
            .
            saved registers

The arguments to the continuation are the values being returned. If there is other than one argument, we have a multiple-value return.
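The PC-to-source-path translation described under DEBUG INFO above amounts to a binary search over the alternating vector. A sketch (Python for illustration; representing source paths as strings here is a stand-in for the 4-bit I-Vector encoding):

```python
# Sketch of the PC -> source-path lookup over a sorted vector of
# alternating entries: [pc0, path0, pc1, path1, ...].  Code at
# addresses in [pc_i, pc_{i+1}) belongs to path_i.
import bisect

def source_path_for_pc(vec, pc):
    pcs = vec[0::2]                       # just the PC entries
    i = bisect.bisect_right(pcs, pc) - 1  # last entry with pc_i <= pc
    if i < 0:
        return None                       # PC before the form's code
    return vec[2 * i + 1]

# Example vector: code at [0,20) is path 'a', [20,36) is 'b', 36+ is 'c'.
vec = [0, 'a', 20, 'b', 36, 'c']
```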
There are two interesting problems in the continuation calling convention: argument dispatching and argument passing.

The continuation argument dispatching problem is similar to the anonymous call argument dispatching problem. We could solve it in the same way, but we can do better if we recognize the supreme importance of calling a continuation with a single argument (returning a single value). The solution is twofold:

1] If other than one value is being returned, this fact is flagged to the continued code. This flag should be easy to set in the one-value position and easy to ignore. A good possibility is to return a NEQ condition code if there is only one value. Normal functions should find it easy to set this condition without doing any extra work, and it is trivial to ignore.

2] If the continued code only wants one value, it may ignore extra values at no cost. This imposes a constraint on the calling convention: the continuation must restore the SP, since it cannot be decremented until any return values are safe in temporary storage allocated by the continuation. Normally this will only take a single instruction which adds a constant to the continuation and stores it in SP.

The argument passing convention may be the same as for normal functions. The continuation being invoked will be passed in the continuation register, since a continuation has no continuation. If a continuation returns other than one value, all the unused arg registers are set to NIL.

We may have to close over a continuation due to a return or go out of a function. In this case, we treat the continuation like a lexical variable, storing the stack pointer to the continuation in the variable. When the dynamic extent of the continuation is exited, we zero the closure variable so that we can tell it is invalid. For this non-local continuation case, we fetch the return PC out of a known place in the continuation.
We actually know this PC at the place where the continuation is invoked, but we may have to process unwind-protects between here and there. The stack pointer for the closed continuation actually points at the PC, which is somewhere within the continuation for the function being continued.

Things which unwind work by first finding the place to unwind to and then invoking an unwind routine with this continuation. The unwind routine looks at the current unwind-protect and invokes it if it is dynamically below the continuation. The unwind-protect entry stashes the continuation being unwound to, does the cleanup code, and then calls the unwind routine again.

An unwind-protect block is just a pair of cells in the continuation of some function on the control stack:

    pc for entry point
    previous unwind-protect

A catch is pretty similar, except that catch blocks are linked separately from unwind-protects and the tag is in the block:

    pc for catch entry point
    previous catch
    tag

A throw works by scanning through the linked catch blocks to find the continuation to unwind to, and then calling the unwind routine. At any of these unwind entry points, the continuation pointer is fixed to point to the start of the whole continuation, and any other necessary cleanup such as special unbinding is done. Of course, the stack pointers need to be closed over somewhere in the continuation, but that may only have to be done once at function entry.

It should also be fairly easy for the compiler to break up catches and unwind-protects by storing into the current catch or unwind-protect, rather than doing a throw. These changes would make unused catches quite a bit faster, which is probably worthwhile since most catches and unwind-protects usually fall through.

REGISTER ALLOCATION

Ultimately the registers will be allocated differently so as to give the compiler as much freedom as possible. The main idea is to globally allocate only those registers with global significance.
There is only one: the control stack pointer. Scratch registers may be trashed by miscop calls, whereas temporary registers may not.

    NL0:   unboxed temporary
    NL1:   unboxed temporary  A.K.A. CONT, the continuation for a call.
    SP:    control stack pointer
    NL2:   unboxed scratch
    NL3:   unboxed scratch
    A0:    first function argument  A.K.A. MA0, first miscop argument.
    A1:    second function argument and boxed scratch.
    A2:    third function argument  A.K.A. MA1, second miscop argument.
    L0:    boxed scratch  A.K.A. NARGS, number of args in call (if not implicit.)
    L1:    boxed scratch  A.K.A. MA2, third miscop argument.
    L2-L5: boxed temporaries
    L6:    boxed temporary  A.K.A. ENV, the environment in a function call.
    L7:    boxed temporary  A.K.A. PC, function return address.

Rationale:

Boxed temporaries go last, since we want to be able to use LM and STM for reading and writing continuations (stack frames), and the contents will be these registers.

We want the miscop argument registers to alternate with scratch registers so that shift-paired instructions can be used to extract the type of miscop arguments. These can be boxed temporaries since the extracted type code is a valid fixnum.

CONT, NARGS, ENV and PC are preallocated only to the extent that a certain useful thing is passed into a function in them. The function may do whatever it wants with the registers and values. NL1 will probably usually have the continuation in it. We don't pass the continuation in a scratch register since we will want it later and may not have to save it.

Non-Lisp temporaries are more useful than you might suppose, since any immediate object can be kept in them (i.e. fixnums.) Unboxed registers become quite important when you start open-coding arithmetic, since if you have an actual unboxed thing (say a single-float), it is more expensive to spill than a boxed quantity.

This scheme separates the function argument (A) and miscop argument (MA) registers.
This allows the block L0-L7 to be freely used to allocate locals which will need to be closed over in a continuation, as long as no miscop calls are done during the evaluation of arguments to the call.

NARGS is passed in a scratch register (if at all), since it is useless after the arguments are processed. We don't have to pass NARGS when three or fewer args are being passed. This requires a little cleverness, since one of the register args might be part of a rest arg. What we do is always have an entry point for each legal arg count less than four, rather than always jumping in at a rest entry when there are rest args.

ENV and PC come into a function where they do partly because that's where LM leaves them. This is a pretty good place anyway, since if we have a continuation these things will want to go in it.

Probably we will initially keep the old register allocations for registers that miscops care about. This will eliminate the need to fix up all the assembler sources. The only big problem with the old allocations is that there aren't any non-scratch unboxed registers. This won't be a problem until we start generating code that hacks unboxed numbers. We can free up BS, FP, AF and PC, allocating on an as-needed basis. The revised allocations will look something like this:

    NL0: A.K.A. NARGS
    A0
    NL1
    A1
    A3:  fourth miscop argument
    A2
    CS:  control stack pointer
    L0
    L1
    L2
    L3
    L4
    L5:  (was BS, binding stack pointer)
    L6:  A.K.A. CONT (was FP, active frame pointer)
    L7:  A.K.A. ENV (was AF, active function)
    L8:  A.K.A. PC

SAMPLE CODE SEQUENCES

Named function call with no stack args. We assume CONT has already been saved and that CONT already points to our context; this will have been done at function entry.

    stm  CONT, start-reg, -size    ; Save registers
    cau  NL3, link-tab-high
    lm   NL3, ENV, link-tab-low    ; Get entry and env
    balr PC, PC                    ; Go

The entry for a single-value continuation:
    cal  SP, CONT, size            ; Reset SP
    lm   CONT, start-reg, -size    ; Restore registers

Single value return from a function which makes no function calls:

    brx  PC                        ; Return
    <something that sets the NE condition>

Single value return from a function which made calls but was able to save PC and CONT in registers. ??? is an instruction that sets the CC and moves the result (AND immediate -1?):

    brx  save-pc                   ; Return and
    ???  CONT, save-cont           ; restore CONT while returning NEQ

Single value return from a function which had to spill PC and CONT into memory:

    lm   CONT, R14, save-offset    ; Load CONT and PC into R14 and R15.
    brx  PC                        ; Return and
    ???  CONT, R14                 ; restore CONT while returning NEQ

Multiple value calling is vastly more efficient than it was, if there are three or fewer values being received. The entry for a multiple-value continuation which wants two or three values:

    bex  skip                      ; If a MV return, do nothing.
    cal  SP, CONT, size            ; Reset SP
    cau  A1, nil-h                 ; Default unsupplied values to NIL.
    [lr  A2, A1]
skip
    lm   CONT, start-reg, -size    ; Restore registers

The return of zero, two or three values is identical to a one-value return, except that the unused arg registers must be NIL'ed and the EQ condition returned.

In the case where a continuation is used by more than one function call, we must put in a jump to the join point either before or after the balr. We could put the jump after the balr and then let cross-jumping optimization combine as much of the calling sequence code as possible.

SYSTEM ISSUES

GC is going to have to work some to deal with all these raw PCs lying around. Too bad... The scenario for GC is this: first find all functions on the stack and in boxed registers, then find all code pointers on the stack and in boxed registers, and for each one check to see if it points into the code vector of one of the functions. If so, adjust it if necessary.
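A sketch of that GC scan, with code vectors modeled as address ranges (Python purely for illustration; the function name and the range/delta representation are assumptions):

```python
# Sketch of the GC scenario above: given the set of live functions
# (each with its code vector's old address range and relocation delta),
# decide what to do with each raw PC found on the stack or in a
# boxed register.

def adjust_pc(pc, functions):
    """functions: list of (code_start, code_end, delta) triples.
    Returns the adjusted PC, or the PC unchanged if it points into
    no live code vector (stack garbage -- safe to ignore)."""
    for start, end, delta in functions:
        if start <= pc < end:
            return pc + delta      # code vector moved by delta
    return pc                      # garbage PC: leave it alone

funcs = [(0x1000, 0x1400, 0x200)]  # one function, moved up by 0x200
```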
A code pointer must be distinct from an I-Vector pointer because old PCs may be left lying in temporary storage on the stack after their corresponding function object is gone. This is due to temporary storage allocation keeping whatever stack garbage was there. If we find a PC without any corresponding function then we can just ignore it. If a garbage PC happens to match a function, it doesn't hurt anything, although the function won't get GC'd even if it is true garbage.

Life is real tough for the debugger. Without stack garbage, it is not too hard to get a backtrace without arguments of non-tail-recursive calls. It should be possible to detect "calls" which are stack garbage by tracing up the stack and making sure that the return addresses link together properly.

More elaborate debugging information could be enabled by a compiler switch. At function entry, we can STM all the boxed registers (A0-L7) into a known location in temporary storage. This would hold on to the arguments and make it much easier to find the calling function and return PC. Tail recursion can be inhibited by wrapping a mv-prog1 around the body of every function. This stuff has a significant time penalty, but would allow reasonable debugging of compiled code.

We could emit a storage map which tells where every variable is located. By adding spurious variable references, we could keep lexically visible variables from disappearing when they die. We could use the same technique for arguments instead of saving them separately, but it is useful to have the actual argument values rather than the possibly modified argument variables. It would also be possible to save the arguments without building a storage map, which would save a great deal of space.

RANDOM IDEAS

Adding the notion of a constant function would allow a super-fast calling sequence. A constant function is a function which may not be redefined. We put the function object and entry PC in our constants vector.
We know which registers it uses, so we might not need to save any. We know how many values it returns. We know if it changes SP. Calling a simple constant function may be nearly as fast as a miscop call.

Call of a constant function known not to trash any of our registers. Assume ENV and PC were saved at function entry. We don't have to do anything on return.

    lm   env-save, ENV, offset     ; Get new ENV and PC
    balr PC, PC

This would hair up GC some, but not too much: all you have to do is look out for function objects followed by PCs while scavenging function space.

MULTIPLE THREAD SUPPORT

When we support multiple threads under Mach, we will have to scrounge a THREAD register. This will point to a structure which contains the thread-specific context which doesn't fit in registers:

    Current special binding block
    Active catch pointer
    Any number stacks
    Information for management of the resources for this thread:
        Base and size of stacks.
        Descriptive info (name...)
        Policy info (priority...)
        OS level handle on the thread
        Synchronization info such as what lock the thread is waiting on, if any.

The THREAD pointer could actually be the base of the thread's control stack, with the frequently used info being stored before any real stack stuff. This would allow fast checks for continuation validity, since both the top and bottom of the cstack will be in registers. This is useful, since a non-local GO into another thread's stack would cause flaming death.

We will also have to modify code which manipulates resources shared between threads so that it does synchronization. The big issue performance-wise is the alloc-table: when there is no lock on the table, we want to get at it real fast. Contention for the alloc-table could be reduced by having multiple allocation spaces (and tables) for important types. You could just grab a table off of a queue and put it back when you are done.
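The grab-a-table-off-a-queue idea can be sketched as follows (Python purely for illustration; the class and function names, and the use of a thread-safe queue as the synchronization mechanism, are assumptions):

```python
# Sketch of reducing alloc-table contention with a pool of allocation
# tables: a thread grabs a whole table off a queue, allocates from it
# without further locking, and puts it back when done.
import queue

class AllocTable:
    def __init__(self, start):
        self.free_pointer = start          # next free address in this space

    def allocate(self, nbytes):
        addr = self.free_pointer
        self.free_pointer += nbytes
        return addr

table_pool = queue.Queue()
for base in (0x10000, 0x20000):            # two spaces for this type
    table_pool.put(AllocTable(base))

def allocate_object(nbytes):
    table = table_pool.get()               # blocks only under contention
    try:
        return table.allocate(nbytes)
    finally:
        table_pool.put(table)              # return the table to the pool
```

With two tables in the pool, two threads allocating simultaneously never touch the same free pointer.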
It will be necessary to make all miscops "interruptible", since they will have to be able to run concurrently. Probably the only resource shared at the miscop level is the alloc-table. Note that software interrupts are going away, so strictly speaking interrupts won't happen, but a scheduling break can happen at any time.

The only big problem I see is with miscops which manipulate raw pointers into objects. The system must either avoid GC while such a pointer is in use or adjust the raw pointer if the object is moved. Probably the fix is to allow raw pointers in boxed registers as long as the real object is also in a boxed register. This would allow random pointer hacking to be open-coded as well. Being preempted during object allocation would still be a problem, but the allocator will have to get a lock on the alloc-table before it can do anything, and GC can't run without the alloc-table.

Of course, we will have to go to deep binding when we support threads. We can eliminate the binding stack by threading the BIND blocks together on the control stack. This will simplify the problem of allocating stack for each thread. Deciding on the exact implementation of deep binding is non-trivial, since there are many different ways it could be done. The scheme used on the S1 seems like a reasonable start, though. A BIND block looks like this:

    .
    .
    .

The value cells are just allocated on the stack like a lexical variable. The BIND block is delimited by the prev pointer. The reason for keeping a pointer to the cell instead of just using the stack location directly is that it lets us cache binding lookups. Whenever we look up a variable, we link a LOOKUP frame into the bindings. This lookup frame contains the result of our search and looks just like a BIND frame as far as the search is concerned. The theory is that this will reduce the average distance searched, since commonly used variables will commonly have been used somewhere near you on the stack.
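The BIND/LOOKUP caching scheme can be sketched like this (Python for illustration; the class layout and one-element-list value cells are stand-ins for stack-allocated frames and cells):

```python
# Sketch of S1-style deep binding: BIND and LOOKUP frames threaded
# together on the control stack.  Looking up a symbol walks the chain
# for the most recent value cell; the result is cached by linking in
# a LOOKUP frame, which looks just like a BIND frame to later searches.

class Frame:
    def __init__(self, prev, entries):
        self.prev = prev         # next frame down the stack
        self.entries = entries   # {symbol: value_cell}; a value cell
                                 # is modeled as a one-element list

def lookup(frame, symbol, global_cells):
    """Find the current value cell for symbol, then cache the result
    by linking in a LOOKUP frame.  Returns (cell, new_innermost_frame)."""
    f = frame
    while f is not None:
        if symbol in f.entries:
            cell = f.entries[symbol]
            break
        f = f.prev
    else:
        cell = global_cells[symbol]      # never bound: use the global cell
    return cell, Frame(frame, {symbol: cell})

# Example: global *x* is 1, rebound to 2 by the innermost BIND frame.
globals_ = {'*x*': [1]}
env = Frame(None, {'*x*': [2]})
```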
Since we look up the address of the value cell rather than the value proper, we only have to do the lookup once per function. Note that there is no obligation to link in a LOOKUP block; if it is unlikely or impossible that any function we call will reference any of the variables that we reference, then not making a LOOKUP block is a good idea.

The S1 has an important optimization of variable lookup. It is based on the observation that many special variables are never actually bound. If a variable is known not to be bound, we can just go directly for the value cell. One solution to the problem would be to introduce a new GLOBAL variable kind which cannot be bound, but we can do nearly as well by noticing the lack of bindings at run time. The S1 keeps a count of bindings in a cell in each symbol; if the count is zero at lookup time, we don't have to do any lookup. I think that it would work just as well to have an EVER-BOUND flag which is set and never reset. One advantage of this approach is that there is no need to do anything when a binding is removed. On the S1, THROW needs to scan up the bindings, decrementing the count, which requires a way to distinguish BIND frames from LOOKUP frames.

There is a trick which might be a win if BIND and LOOKUP blocks have a large enough average size. The idea is to use a hash filter to determine if a given symbol might be in a particular block. Each symbol name is hashed to a fixnum with a few (3 or 4) random bits on. When a block is built, the OR of these hash values is placed at the beginning. When looking for a symbol, we can skip a block if any of the bits for the symbol are zero in the hash for the block. All these hash codes are computed at compile time, of course. When binding an unknown symbol, we just use all 1's.

We can't delimit the block with the prev link, since that has to be next to the hash so that we can skip the block without going to the end. A block would look more like this:

    .
    .
    .
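The hash-filter trick is what would nowadays be called a one-word Bloom filter. A sketch (Python for illustration; the particular hash function and word width are assumptions, and on the real system the hashes would be computed at compile time):

```python
# Sketch of the hash-filter trick: each symbol hashes to a fixnum with
# a few random bits set; a block's filter word is the OR of the hashes
# of the symbols it binds.  A block can be skipped whenever the
# symbol's bits are not all present in the block's filter.
import hashlib

def symbol_hash(name, nbits=3, width=30):
    """Deterministically pick nbits bit positions from the name."""
    h = 0
    digest = hashlib.md5(name.encode()).digest()
    for i in range(nbits):
        h |= 1 << (digest[i] % width)
    return h

def block_filter(names):
    f = 0
    for name in names:
        f |= symbol_hash(name)
    return f

def might_contain(block_hash, name):
    h = symbol_hash(name)
    return (block_hash & h) == h    # all of the symbol's bits present?

f = block_filter(['*print-base*', '*package*'])
```

False positives merely cost a wasted scan of one block; a zero bit guarantees the symbol is absent, so skipping is always safe. Binding an unknown symbol with an all-1's hash makes its block match every probe, as it must.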
In a block of less than six or so entries, we can reject it with about 90% probability. The probability improves as the block size decreases, but the advantage decreases as well. The actual advantage also depends on the dynamic characteristics of the lookup process, since if it turns out that a lookup is usually satisfied in the previous block, we will never reject any blocks.

Bizarre optimization on deep binding: if we notice that a function which we call in a local call references a variable we bind, we can pass in the value cell pointer as an argument so that it doesn't have to do the lookup. It is questionable whether this would help real code even with massive block compilation, but it sure would win for STAK.

We could actually generalize this optimization to be less obviously bogus. What we do is shallow-bind the *location* of the current value cell for each special variable which is both referenced and bound. At each entry point to the block, we build a specbind frame for the variables bound in the block (it doesn't need to be all of them). Initially all the value-cell pointers are to the looked-up current values. Within the block, we keep track of the current value cell by shallow-binding it in some known location. This could be in a register or in the specbind frame itself. Internally to the block, we can do a special reference by indirecting through the magic location, and can do a binding by saving the old value in our frame and then storing the location of the new cell in our frame. When we do a full function call, we move any value cell pointers which are in registers into the correct place in the specbind frame. This could make STAK *faster* than normal shallow-bound implementations, and might well help real programs also.

FASL FORMAT

At a minimum we will need new FASLOPs to deal with multiple entry points into functions. Our "no such thing as top-level" approach also has some ramifications. DEFUN et al work by expanding into real code that does stuff.
It is difficult to back-translate this real code into FOP-FSET and whatnot. Having DEFUN be implemented by code in top-level forms is no problem for normal load, but it creates problems for cold load, since we need to be able to resolve function references at cold-load time. What we could do is have a new VOP which notes the presence of an entry point for cold-load purposes. This would be a no-op in normal load.

It is interesting to note that the FASLOP language has been deprived of almost all significance to normal load. All it does is create constants and then make functions which use these constants, possibly calling them. This suggests a radically different approach to the FASL format: punt the FASLOP byte-code and express all of the load-time semantics as native code. To load a file, you just suck in the load-time executed code and jump into it. The code in the file does the rest.

Some minimal amount of semantics probably wants to be declaratively specified in the file. We would use the Core File format, and define new entries that throw information into the Table, which is a G-Vector that is passed to the top-level code as its environment. Ideas for entries:

Link-table fixup, miscop fixup:
    (op, start-offset, fixup offset1, fixup offset2...)
    The offsets are places in the data pages that are to be patched.

Intern:
    (package, len1 name1, len2 name2...)
    The resulting symbols are pushed into the table.

Unboxed constant:
    (type, start-offset, data-size, object1-offset, object2-offset...)
    Used to load strings and code vectors. The objects are represented in their actual in-core format so that they can potentially be mapped in directly. The objects are stored in the data part of the core file. Pointers to the loaded objects are pushed into the table.

Top-level form:
    (start-offset, size)
    Just start running at some code in the data pages. Size is there so that Genesis can pick up the code and move it.
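To make the entry idea concrete, here is a sketch of decoding an Intern entry of the shape described above (Python for illustration; the one-byte length prefixes, the ASCII encoding, and all names here are assumptions about a format that is only sketched in the text):

```python
# Sketch of decoding a hypothetical Intern entry:
# (package, len1 name1, len2 name2 ...), pushing the resulting
# symbols into the table passed to the top-level code.

def read_intern_entry(data, table, intern):
    """data: a length-prefixed package name, then length-prefixed
    symbol names until the entry is exhausted."""
    pos = 0
    plen = data[pos]; pos += 1
    package = data[pos:pos + plen].decode('ascii'); pos += plen
    while pos < len(data):
        nlen = data[pos]; pos += 1
        name = data[pos:pos + nlen].decode('ascii'); pos += nlen
        table.append(intern(name, package))   # push symbol into the table

# A trivial stand-in for INTERN:
def intern(name, package):
    return (package, name)

entry = bytes([4]) + b'LISP' + bytes([3]) + b'CAR' + bytes([3]) + b'CDR'
table = []
```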
We would probably want to relax the current page-alignment requirements so that files wouldn't be mostly padding. If a chunk looks big enough to be worth mapping, then we could align it. Boxed constants would be created with the normal consing operations.

[Possible problems with interactions between mass interning and package-hacking top-level forms? Seems like the problem isn't fundamentally worse than it currently is, but people might bum out even worse than they do now if we interned all symbols at the beginning of the file. I guess we can force package hacks into their own top-level forms and do them separately.]

Changes along these lines could clearly wildly speed up fasloading, and would also make dumping circular structures more reasonable. The downside is:

1] FASL files would probably be bigger. The bulk-up is not as big as you might suppose, since the actual code and string constants will dominate the load-time overhead. A large part of the load-time action is already in native code in top-level forms anyway. With a dense byte-code instruction set like on the PERQ, native code may actually be more compact than FASLOPs.

2] All pretense of portable FASL files would be abandoned, since basically nothing could be done with a FASL file without running machine code. This pretense never seems to have done us much good. Due to the use of top-level forms, it is currently impossible to generate a portable FASL file. Theoretically this capability might be useful someday, but then again maybe not. [Sending arbitrary data structures in messages?]

3] Fasdumping would be slower. Once again, this might not be as big a problem as you might suppose, for the same reasons mentioned for 1 above.

4] Fasdumping would be more complicated and less portable. This is probably the biggest counter-indication. Dumping code could be made mostly portable by emitting it as Lisp rather than assembler, but this would slow down compilation.
To be fair, you would have to consider the simplification of the fasloader and cold loader, and the reduction in duplication of functionality. Cold load would be similar to the way it currently is, but would be simpler, since it wouldn't need to know about as many different operations and data types. It would only need to handle the fixup/intern/unboxed-constant ops described above. Running of the top-level forms would generate the function objects and the other constants. Cold load would compute the table for each top-level form and dump it as a G-Vector in the core. Mostly it would just mash the little files together into a big one.

Another way to look at this is the following: there is no such thing as a boxed constant. The "constants" for a function are simply variables in the top-level form which have been closed over. The top-level form initializes these variables to the appropriate values. If we actually implemented things this way in the compiler, then it would have the desirable effect that the variables in the standard "DEFUN in LET" hack would be dealt with just as efficiently as "constants". There is no real difference between a top-level var which is not set and a constant. LOAD-TIME-EVAL just magically binds a variable in the current top-level form and then references it.
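The "constants are just closed-over top-level variables" view can be illustrated with an ordinary closure (Python for illustration; the DEFUN-in-LET idiom is from the text, every name below is a made-up stand-in):

```python
# Sketch of "constants are closed-over top-level variables": the
# top-level form computes a value once at load time, and the function
# simply closes over it -- exactly the "DEFUN in LET" idiom.

def top_level_form():
    table = {'red': 0, 'green': 1, 'blue': 2}  # "constant", built at load time

    def color_code(name):                      # the DEFUN inside the LET
        return table[name]

    return color_code

color_code = top_level_form()   # "loading" the form defines the function
```

Since `table` is never set after initialization, nothing distinguishes it from a constant, which is exactly the point being made.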