-*- Dictionary: design; Package: C -*-

Contents:
    Todo:
    Design:
    Glossary:
    Phases:
        IR1 CONVERSION
            Canonical forms:
            Inline functions:
            Tail sets:
            Hairy function representation:
            IR1 representation of Catch and Unwind-Protect:
            Block compilation:
        LOCAL CALL ANALYSIS
            Entry points:
        FLOW GRAPH CANONICALIZATION
        FLOW GRAPH SIMPLIFICATION
        IR1 OPTIMIZE
            Bottom-up IR1 optimizations:
            Top-down IR1 optimizations:
        TYPE CHECK
        TYPE CONSTRAINT PROPAGATION
        ENVIRONMENT ANALYSIS
        CALL/LOOP ANALYSIS
        GLOBAL TN ASSIGNMENT
        LOCAL TN ASSIGNMENT
        IR2 CONVERSION
            Stack analysis:
            Cleanup generation:
        REACHING DEFINITIONS
        LOOP INVARIANT OPTIMIZATION
        COPY GENERATION
        LOCAL COMMON SUBEXPRESSION ELIMINATION
        PRELOAD GENERATION
        LIFETIME ANALYSIS
            Flow analysis:
            Conflict detection:
        PACKING
            Scarce SB packing:
            Load TN packing:
            Unbounded SB packing:
            Ranking:
        CONTROL OPTIMIZATION
        BRANCH DELAY
        CODE GENERATION
        ASSEMBLY
        IR1 FINALIZE
    RETARGETING
        Storage bases and classes:
        Type system parameterization:
        VOP Definition:
        Lifetime model:
        VOP Cost model:
        Implementation parameterizations:
        Special-case IR2 conversion:
    INTERPRETER INTERFACE
    IR2 CONSISTENCY CHECKING
    OBSERVED BUGS AND POSSIBLE FIXES


Todo:

Change miscop generators to emit fixups, and add fixup dumping.

Make top-level functions be XEPs, and make the dumper call them.

Eliminate named call, and add a special "symbol-function for call" operation
that is used when calling a symbol.

Do type checking on the combination function continuation.  Make the derived
type of all function refs be plain FUNCTION?

Rationalize the constants used by generators: make our own database.

Add lifetime/pack support for environment-live TNs and pre-packed save TNs.

Change GTN/IR2 conversion to make fixed-format frames.

Dump full debug info.  [Just component name at first.]

Figure out a source-map representation that can handle multiple files.

Fix up the top-level loop so that block compilation is optional, the right
package is current, etc.  Change so that each form is initially converted in
a different env?

Implement unknown values: figure out the representation, and do stack
analysis.

Implement non-local exits.  [And UWP.]

Implement mv-call.

Implement progv.

Implement closures in functional reference and function entry.

Implement type checking and testing.  [Ptype VOP annotation macro.]

Implement generators for lots of random stuff.

Assembly code changes: mostly existing code is either flushed or unchanged.
Eventually internal errors and interrupts will need to be bashed into the
"escape frame" model.  We will also need a little support for throw, unwind,
and possibly MV-Call.

Change the system so that the definition cell always holds a funcallable
function.  Initialize function cells to the undefined function.

Fix up genesis (ripping out the link-table).  Fix up the loader.

Implement compile-to-core.

Fix up GC so that it understands return PCs.

Write the new interpreter (possibly w/o interpreted function support at
first).

Write the new debugger based on an abstract interface to the low-level stuff.

Add support for unpacking to pack load TNs.

Fix up type-intersection, types-intersect.

Merge in Alien code.

Add "Arg-Documentation" or some such field to the Functional.  Have
IR1-Convert-Lambda-Body set this to some appropriate string, and have
%DEFMACRO override this.

Dump a function type for each XEP in the entry-info structure?
We probably want some way to do run-time machine-sensible query of a
function's protocol.  Since we don't need the min/max arg info at run time
anymore, we can flush that.  Probably we want to dump a list-style specifier,
since this is more tractable than dumping function-type structure (efficiency
of access isn't terribly important here).  It makes sense to dump this as a
list, since these should be meaningfully machine-grovelable (this isn't
nearly as inefficient as for arglists, since the type names will all be
present as symbols anyway; shareability is also high).

So far as dumping is concerned, we want the top-level lambda to be an entry
point, i.e. have an XEP.  There is a possible complication here, since the
top-level lambda would then have functions in it (the XEPs), which was
previously never the case.  If this is a problem, we could make the top-level
lambda be its own XEP: there is no problem, since it has no arguments.

Change the initial component hackery to handle many top-level and initial
components.  We then convert each top-level form separately, and if we are
block compiling, we make cross-component references and combine.  This has
the advantage of interleaving eval-when processing and macroexpansion with
reading.  It also makes the process of separating unconnected parts of the
flow graph less important/simpler, since usually the code within a top-level
form is connected, and separate code will start out separate.  [But consider
eval-when, progn...]  This would require some funny business with top-level
components: we would either have to merge top-level components when we merge
components under them, or we would have to allow a single non-top-level
component to have multiple top-level components.  [I think this would also
simplify the top-level loop, since the block and non-block cases would be
more similar.]

When recovering from a read error, return a proxy error form rather than the
next form or EOF?

Fix up the IR1 convert methods for catch, unwind-protect.  Cons up forms and
IR1-convert them, so that funny functions are known, etc.

Let-convert XEP calls when there aren't any stray references to the XEP that
might be converted.

Do functions really ever have no return PC or old-cont?  If so, tail call
needs to be fixed.  If not, some conditionals can be ripped out elsewhere.

Emission order consistency checking.

Sometime, jam together the lifetime post-pass and the pack pre-passes into
one loop over ir2-blocks, with multiple loops over each block.

Somehow want to realize that a wild result type is o.k. when we have a type
assertion of T.  Is this a special hack for T?  Note that our interpretation
of a non-values type assertion is that the first value (or NIL if none) must
match the assertion.  Anything is subtypep T, so we don't need to check T
assertions.

Handling of named constants is odd.  It seems that we would like to be able
to fold together named and anonymous uses of the same constant.  Does Common
Lisp allow this?  What would be the significance of entering the same
Constant structure under multiple names in *free-variables*?

With restricted temps, make the attempted pack order be the order specified
in the restriction?  Do this by making the costs 0, 1, 2...  Otherwise, there
is no point in having costs for temps, since costs are only used for
representation selection, and all usage is known to the VOP definer, who can
specify the best SCs.

Deal with joining components on local call conversion during IR1
optimization.
Probably we can't just set Component-Reanalyze, since the function(s) need to
be moved to the new component, which FIND-DFO doesn't do.  Moving functions
into a component doesn't actually require DFO recomputation anyway; only let
conversion actually fucks with the flow graph.

Can promote-to-live lose when doing full call?  (Passing around lots of wired
TNs...)

Either make the node for advanced returns be the call, or give ir2-block a
back pointer.

Should VOP have a back-pointer to the IR2 block?  Useful for consistency
checks, and might also be useful in load-tn pack.

Have variants of multiple-value return that take passing locations wired to
the beginning of the frame?  Then we wouldn't need to squeeze out intervening
crap, supposing that this was a thing we needed to do.  It would also allow
us to target stack values to useful locations, instead of having push-values
move them onto the stack top.

Currently we never emit type checks for the values of unused continuations.
Do we believe in this?

Compute closure must propagate closure from sets as well as refs?  [18 August
88: seems to be a genuine bug.]

Recompute DFO more often so that we are sure all unreachable code is flushed?
Perhaps on policy?  It would be useful to know if DFO deleted any blocks (but
I guess delete-block will be setting component-reoptimize).

When a top-level form is broken off into a lambda, the form is in a for-value
context even though the value is discarded.

Named VOP temporary mechanism.  Only wired temps?

Mechanism in VOP for automatically emitting different VOPs depending on
policy?

If defining a :conditional VOP, ensure that the appropriate codegen-info args
are defined.

In a :Conditional VOP, check that the result type is Boolean?  We are
assuming this now that we set the Predicate attribute.  Previously we could
have :Conditional templates for non-predicates: these templates just wouldn't
be used when not used as a predicate.  But this probably isn't very useful.

Macro for defining new template annotations and primitive-type attributes:
coerce-to/from-t, move, type-check, type-test, ...

Macro for defining non-VOP (composite) templates (define-template?).

In VOP*, do run-time checking that a legal number of arguments has been
supplied.

Make primitive-type return the appropriate primitive-types for all possible
types.  The main thing currently missing is float types.

Special-case Defstruct accessors and predicates in IR1 conversion, making the
calls known functions and figuring out type info.  Eventually we will
probably want to make the setf expansion for a slot accessor be something
like (%set-slot ' ) so that we don't have to actually create named setter
functions.

Fix the function type database.

Basic type inference methods.

Make sure everyone that should be marking blocks as needing to be optimized
is doing so.  This primarily concerns control optimizations, although there
may also be missing places in IR1 optimization itself.

How do we get the #' in the defun expansion to access the actual object?
Perhaps have a special special form?

DEFTYPE type stuff?

Fix type-union and type-intersection to handle numeric types correctly.
Array types are also broken:
    (types-intersect string (simple-array * (*)))    => nil
    (type-intersection string (simple-array * (*)))  => nil

Missing source transforms/canonicalization?

Change function type hackery to reflect the weak interpretation of function
declarations.  Basically we ignore function type declarations, and we revert
to plain FUNCTION when we union or intersect function types.
We infer assumed function types from declarations in the actual definition.
[### This probably isn't exactly right.  Our notion of getting argument type
assertions from a "function type" just doesn't correspond to Common Lisp
function types.  What we do is keep our function type, but have no Common
Lisp way of getting at our "function types".  The complex Common Lisp
FUNCTION type specifier will always turn into just plain FUNCTION.  We can
continue to special-case calls to functions that have a "function type" on
the call continuation, since these types can only be created by magic.]

Make the asserted type for all function continuations be FUNCTION?  Make the
derived type for functional/global-function references be FUNCTION?  The
purpose of this would be to cause type checking on the function in funcall
(or any other call where the function type isn't known a priori).  This
allows us to unbundle the type checking, so that it can be optimized away or
omitted according to policy.

Fix up the named constant handling code to always evaluate the expression at
compile time, rather than only evaluating when it is known to be constant.

Fix things up to correspond to the cleanup proposal.  The main change is
replacing the inlinep values with a general-purpose integration level.  The
globaldb compiler environment support also needs to be rethought.

Change constant stuff to eval the value expression at compile time and flush
the "unknown constant" support.

Ultimate read-loop: expand macros looking for package-frobbing forms.

Whizzy read-error recovery: we remember the starting position of the form
that we are reading.  If we get a read error, then we back up and read again
with *read-suppress* on.  We display some context around the place where the
error happened.  If we hit end of file, then display the start of the form
being read.

Make IR1 optimize cleverly use the Call attribute.  [That is, get a worst
case by combining the attributes of the actual functional args, rather than
totally punting when Call is specified.]

Change TAGBODY (and BLOCK?) to preserve the drop-thrus in the original code.

Remaining special forms: UNWIND-PROTECT, PROGV.

Handle recursive types.

Add the IGNORABLE declaration.

Make definition macros totally real by having the load-time functions deal
with clearing compiler info and similar stuff.

Make IR1 conversion of special optionals less pessimal.

User-level &more support.

IR1 values type hackery, especially in mv-bind.  Probably want a derive-type
method for Values too...

Substitute non-set let variables bound to effectless and unaffected calls of
non-set lexical variables or constants, when the variable is referenced only
once (and not inside a loop).  We need only move the combination node, since
the evaluation of such arguments is always delayed until the value is needed.
This optimization should be useful for macros and inline function calls (such
as transforms).  (A before/after sketch appears at the end of this list.)

PSETQ isn't propagating type assertions to the new-value forms.  We either
need an IR1 optimization that can discover type assertions on local call
args, or we need a special-case IR1 convert method for PSETQ.

Check that the SCs specified for a restricted temp are a subset of the SCs
allowed by the primitive type (requires meta-compile-time primitive-type
information).  Give a warning if an unbounded SC is allowed?

Factor out the non-Common-Lisp file-position hackery somehow.
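
A hypothetical before/after for the let-substitution item above: X is not
set, is bound to an effectless and unaffected call whose argument is a
non-set lexical variable, and is referenced exactly once outside any loop, so
the combination node can simply be moved to the point of use.

(let ((x (car l)))   ; L is a non-set lexical; CAR is effectless and unaffected
  (when p
    (foo x)))        ; the sole reference to X

becomes, in effect:

(when p
  (foo (car l)))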
Design:

Variable maps:

There are about five things that the debugger might want to know about a
variable:

Name
    Although a lexical variable's name is "really" a symbol (package and
    all), in practice it doesn't seem worthwhile to require all the symbols
    for local variable names to be retained.  There is much less VM and GC
    overhead for a constant string than for a symbol.  (Also it is useful to
    be able to access gensyms in the debugger, even though they are
    theoretically ineffable.)

ID
    Which variable with the specified name is this?  It is possible to have
    multiple variables with the same name in a given function.  The ID is
    something that makes Name unique, probably a small integer.  When
    variables aren't unique, we could make this be part of the name, e.g.
    "FOO#1", "FOO#2".  But there are advantages to keeping this separate,
    since in many cases lifetime information can be used to disambiguate,
    making qualification unnecessary.

Type
    When unboxed representations are in use, we must have type information to
    properly read and write a location.  We only need to know the
    primitive-type for this, which would be amenable to a space-saving
    numeric encoding.  But if we allow user modification of locations, it
    would be nice for the debugger to be able to type-check user
    modifications.  It is also a useful sanity check to be able to check that
    variables hold values of the correct type: this could help to find bugs
    in code that wasn't compiled safely.  For example, checking the type
    could be a side effect of printing the value.
    [### Or no...  What we really need to recover the representation is the
    SC.  This also already has a convenient numeric encoding.  Neither the
    primitive-type nor the actual type is enough to recover the
    representation, since they don't include representation decisions made by
    pack.  There is little point in dumping the primitive-type, since it
    contains less information than the actual type.  So we must dump the SC,
    and we can also dump the type if we believe the above argument about its
    utility.]

Location
    Simple: the SB and offset.  [Actually, we need the save location too.]

Lifetime
    In what parts of the program does this variable hold a meaningful value?
    It seems prohibitive to record precise lifetime information, both in
    space and in compiler effort, so we will have to settle for some sort of
    approximation.  The finest granularity at which it is easy to determine
    liveness is the block: we can regard the variable's lifetime as the set
    of blocks that the variable is live in.  Of course, the variable may be
    dead (and thus contain meaningless garbage) during arbitrarily large
    portions of the block.
    Note that this subsumes the notion of which function a variable belongs
    to.  A given block is only in one function, so the function is implicit.

The variable map should represent this information space-efficiently and with
adequate computational efficiency.

The SC and ID can be represented as small integers.  Although the ID can in
principle be arbitrarily large, it should be <100 in practice.  The location
can be represented by just the offset (a moderately small integer), since the
SB is implicit in the SC.

The lifetime info can be represented either as a bit-vector indexed by block
numbers, or as a list of block numbers.  Which is more compact depends both
on the size of the component and on the number of blocks the variable is live
in.  In the limit of large component size, the sparse representation will be
more compact, but it isn't clear where this crossover occurs.
Of course, it would be possible to use both representations, choosing the
more compact one on a per-variable basis.  Another interesting special case
is when the variable is live in only one block: this may be common enough to
be worth picking off, although it is probably rarer for named variables than
for TNs in general.

If we dump the type, then a normal list-style type descriptor is fine: the
space overhead is small, since the shareability is high.

We could probably save some space by cleverly representing the var-info as
parallel vectors of different types, but this would be more painful to use.
It seems better to just use a structure, encoding the unboxed fields in a
fixnum.  This way, we can pass around the structure in the debugger, perhaps
even exporting it from the low-level debugger interface.

[### We need the save location too.  This probably means that we need two
slots of bits, since we need the save offset and save SC.  Actually, we could
let the save SC be implied by the normal SC, since at least currently we
always choose the same save SC for a given SC.  But even so, we probably
can't fit all that stuff in one fixnum without squeezing a lot, so we might
as well split and record both SCs.

In a localized packing scheme, we would have to dump a different var-info
whenever either the main location or the save location changes.  As a
practical matter, the save location is less likely to change than the main
location, and should never change without the main location changing.

One can conceive of localized packing schemes that do saving as a special
case of localized packing.  If we did this, then the concept of a save
location might be eliminated, but this would require major changes in the IR2
representation for call and/or lifetime info.  Probably we will want saving
to continue to be somewhat magical.]

How about:

(defstruct var-info
  ;;
  ;; This variable's name.  (Symbol-name of the symbol.)
  (name nil :type simple-string)
  ;;
  ;; The SC, ID and offset, encoded as bit-fields.
  (bits nil :type fixnum)
  ;;
  ;; The set of blocks this variable is live in.  If a bit-vector, then it
  ;; has a 1 when indexed by the number of a block that it is live in.  If an
  ;; I-vector, then it lists the live block numbers.  If a fixnum, then that
  ;; is the number of the sole live block.
  (lifetime nil :type (or vector fixnum))
  ;;
  ;; The variable's type, represented as a list-style type descriptor.
  type)

Then the debug-info holds a simple-vector of all the var-info structures for
that component.  We might as well make it sorted alphabetically by name, so
that we can binary-search to find the variable corresponding to a particular
name.

We need to be able to translate PCs to block numbers.  This can be done by an
I-Vector in the component that contains the start location of each block.
The block number is the index at which we find the correct PC range.  This
requires that we use an emit-order block numbering distinct from the
IR2-Block-Number, but that isn't any big deal.  This seems space-expensive,
but it isn't too bad, since it would only be a fraction of the code size if
the average block length is a few words or more.

An advantage of our per-block lifetime representation is that it directly
supports keeping a variable in different locations when in different blocks,
i.e. multi-location packing.  We use a different var-info for each different
packing, since the SC and offset are potentially different.  The Name and ID
are the same, representing the fact that it is the same variable.
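
As a concreteness check, here is a minimal sketch of the bit-field encoding
suggested for the Bits slot.  The field widths (5 bits of SC, 7 bits of ID,
the rest offset) are made-up placeholders; the real widths would be chosen
once the SC numbering and maximum frame size are pinned down.

(defconstant var-info-sc-byte (byte 5 0))
(defconstant var-info-id-byte (byte 7 5))
(defconstant var-info-offset-byte (byte 14 12))

(defun encode-var-info-bits (sc id offset)
  (dpb offset var-info-offset-byte
       (dpb id var-info-id-byte
            (dpb sc var-info-sc-byte 0))))

(defun var-info-sc (info)
  (ldb var-info-sc-byte (var-info-bits info)))
(defun var-info-id (info)
  (ldb var-info-id-byte (var-info-bits info)))
(defun var-info-offset (info)
  (ldb var-info-offset-byte (var-info-bits info)))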
It is here (multi-location packing) that the ID is most significant, since
the debugger could otherwise make same-name variables unique all by itself.

Stack parsing:

There are currently three relevant context pointers:
 -- The PC.  The current PC is wired (implicit in the machine).  A saved PC
    (RETURN-PC) may be anywhere in the current frame.
 -- The current stack context (CONT).  The current CONT is wired.  A saved
    CONT (OLD-CONT) may be anywhere in the current frame.
 -- The current code object (ENV).  The current ENV is wired.  When saved,
    this is extra-difficult to locate, since it is saved by the caller, and
    is thus at an unknown offset in OLD-CONT, rather than anywhere in the
    current frame.

We must have all of these to parse the stack.

With the proposed Debug-Function, we parse the stack (starting at the top)
like this:
 1] Use ENV to locate the current Debug-Info.
 2] Use the Debug-Info and PC to determine the current Debug-Function.
 3] Use the Debug-Function to find the OLD-CONT and RETURN-PC.
 4] Find the old ENV by searching up the stack for a saved code object
    containing the RETURN-PC.
 5] Assign old ENV to ENV, OLD-CONT to CONT, RETURN-PC to PC and go to 1.

If we changed the function representation so that the code and environment
were a single object, then the location of the old ENV would be simplified.
But we still need to represent ENV as separate from PC, since interrupts and
errors can happen when the current PC isn't positioned at a valid return PC.

[### We may need to be able to tell whether a call is local or not, since a
local call doesn't have to save ENV.  I guess we can look at the ENV, and see
if the code object contains the PC: if so, we win; if not (perhaps not a code
object at all), then look farther down the stack.  Note that there wouldn't
be any problem if we had a single-object function representation, since ENV
is implicit in the RETURN-PC.]

How much do we really gain by allowing the context pointers to be in
arbitrary locations?  It seems worthwhile allowing OLD-CONT and RETURN-PC to
be in arbitrary locations in the current function, since we can then save
them in registers if we don't do any calls.  This can significantly speed up
calls to "trivial" functions, which seems worthwhile.  But when we do save
these things on the stack, there is no real advantage in using arbitrary
locations.

It seems like it might be a good idea to save OLD-CONT, RETURN-PC and ENV at
the beginning of the frame (before any stack arguments).  Then we wouldn't
have to search to locate ENV, and we would also have a hope of parsing the
stack even if it is damaged.  As long as we can locate the start of some
frame, we can trace the stack above that frame.  We can recognize a probable
frame start by scanning the stack for a code object (presumably a saved ENV).

It would also be possible to parse the stack from the bottom up, given this
information and also some special consideration in the escape frame format.
This is because the caller is responsible for SP after the call, so the
caller has to know how big its frame is.  If we are guaranteed that all stuff
on the stack is "inside" a frame, we can parse the stack from the bottom up
by starting at the stack bottom and skipping over frames using the frame size
information.  We augment each Debug-Function with either a constant frame
size (for a fixed-size frame) or a saved SP location (for frames that receive
unknown MVs).
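
A minimal sketch of that bottom-up skip, ignoring for now the problem (taken
up below) of finding each frame's Debug-Function without a return PC.  All of
the accessor names here are hypothetical, and frames are treated as stack
indices.

(defun map-frames-bottom-up (function stack-bottom stack-top)
  (do ((frame stack-bottom))
      ((>= frame stack-top))
    (funcall function frame)
    (let ((size (debug-function-frame-size (frame-debug-function frame))))
      (setq frame
            (if (integerp size)
                (+ frame size)                   ; constant-size frame
                (frame-saved-sp frame size)))))) ; frame receiving unknown MVs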
[Note that in a given component, all constant-size frames are the same size,
so were it not for variable-size frames, this information could be stored in
the Debug-Info structure.]

[### But not really, since we can't tell what function we are running in
without the return PC, and thus we can't tell the frame size.  I guess one
possibility would be to make the fixed/variable frame decision on a
per-component rather than a per-function basis.  Then in a variable-frame
component, we would always store the frame end at a fixed location at the
frame beginning.  This is a little unpleasant, though, since any use of
unknown MVs in a component would have an efficiency penalty for all calls in
that component.  This would hurt large block compilations.

Alternately, we could guarantee things about the format of the variable-size
stuff so that we could recognize and skip it.  For example, if what would be
the OLD-CONT in a real frame is guaranteed in the variable part to never be
the current frame, then we can verify that we have found the beginning of the
next frame by checking that the frame's OLD-CONT is the current frame.  If
assuming a fixed-size frame doesn't check out, then we must be in a
variable-sized frame, so we access the saved frame end to find the next
frame.

A related idea would be to make the variable part of the frame look like a
special "values" frame.  A values frame would directly incorporate the values
count, allowing the values glob to be skipped.  We could indicate a values
frame by putting some distinctive non-code-object thing in the ENV save
location.

There are probably all kinds of nasty problems with parsing the stack in the
presence of interrupts, since we could be stopped while a function call is in
progress.  [Maybe even an NLX, although we would probably want to make that
uninterruptable.]  Hopefully the debugger can handle these things by some
case analysis.

We can definitely be interrupted during UWP cleanup code, so the stack must
be left in some sort of sensible state when processing an unwind-protect.
This seems like an argument in favor of squeezing out frames as we unwind,
leaving only the state needed to continue the unwind on the stack.  Except
that it is pretty hard at run time to determine the end of the "current
frame" so as to leave the return values on top of it.  So maybe unwind should
use the unwinder's frame to keep stuff in until the values receiver gets
around to grabbing the stuff.  But then there will be all these o.k.-looking
frames on the stack that have really been unwound: the only cue that they
aren't real is that the current CONT points to the UWP frame, and the current
PC is in the function associated with that frame.  This means that parsing
the stack from the bottom must use CONT to determine when it has hit the top
of the stack.

Note that we currently have a bad problem: when the compiler can prove that a
function never returns normally, then it doesn't save the OLD-CONT and
RETURN-PC.  If something bad happened in such a function, then we wouldn't be
able to parse the stack.  This can happen fairly easily in system code such
as the top-level R-E-P loop.  There isn't any efficiency reason for not
saving the context, since such calls are dynamically rare (and the function
must eventually do a relatively expensive NLX).  The problem is that the
compiler isn't currently very good at retaining "useless" information.
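
For concreteness, here is a minimal sketch of the top-down walk given by the
numbered parse sequence above.  The accessors (code-debug-info,
debug-info-function-at-pc, the frame slot readers, and so on) are
hypothetical stand-ins for the eventual debugger internals.

(defun map-frames-top-down (function pc cont env)
  (loop
    (let* ((info (code-debug-info env))                ; 1] current Debug-Info
           (dfun (debug-info-function-at-pc info pc))  ; 2] current Debug-Function
           (old-cont (frame-old-cont cont dfun))       ; 3] OLD-CONT and
           (return-pc (frame-return-pc cont dfun)))    ;    RETURN-PC
      (funcall function dfun cont)
      (when (stack-bottom-p old-cont)
        (return))
      ;; 4] Search up the stack for a saved code object containing the
      ;;    RETURN-PC; that is the old ENV.
      (setq env (find-saved-code old-cont return-pc)
            ;; 5] Step to the caller's frame and go around again.
            cont old-cont
            pc return-pc))))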
Probably we want some fairly general mechanism for specifying that a TN
should be considered to be live for the duration of a specified environment.
It would be somewhat easier to specify that the TN is live for all time, but
this would become very space-inefficient in large block compilations.

This mechanism could be quite useful for other debugger-related things.  For
example, when debuggability is important, we could make the TNs holding
arguments live for the entire environment.  This would guarantee that a
backtrace would always get the right value (modulo setqs).

Note that in this context, "environment" means the Environment structure (one
per non-let function).  At least according to current plans, even when we do
inter-routine register allocation, the different functions will have
different environments: we just "equate" the environments.  So the number of
live per-environment TNs is bounded by the size of a "function", and doesn't
blow up in block compilation.

The implementation is simple: per-environment TNs are flagged by the
:Environment kind.  :Environment TNs are treated the same as :Normal TNs by
everyone except for lifetime/conflict analysis.  An environment's TNs are
also stashed in a list in the IR2-Environment structure.  During the conflict
analysis post-pass, we look at each block's environment, and make all the
environment's TNs always-live in that block.

We can implement the "fixed save location" concept needed for lazy frame
creation by allocating the save TNs as wired TNs at IR2 conversion time.  We
would use the new "environment lifetime" concept to specify the lifetimes of
the save locations.  There isn't any run-time overhead if we never get around
to using the save TNs.  [Pack would also have to notice TNs with
pre-allocated save TNs, packing the original TN in the stack location if its
FSC is the stack.]

We want a standard (recognizable) format for an "escape" frame.  We must make
an escape frame whenever we start running another function without the
current function getting a chance to save its registers.  This may be due
either to a truly asynchronous event such as a software interrupt, or to an
"escape" from a miscop.  An escape frame marks a brief conversion to a
callee-saves convention.

Whenever a miscop saves registers, it should make an escape frame.  This
ensures that the "current" register contents can always be located by the
debugger.  In this case, it may be desirable to be able to indicate that only
partial saving has been done.  For example, we don't want to have to save all
the FP registers just so that we can use a couple of extra general registers.

When the debugger sees an escape frame, it knows that register values are
located in the escape frame's "register save" area, rather than in the normal
save locations.

We can mark an escape frame by having the ENV save location be some
distinctive value (as proposed for values frames).  The problem with this
marking mechanism is that ENV is not in general initialized until someone
does a call out of the frame, which means that arbitrary garbage may be in
this slot in frames that are escaped from.  This means that in a bottom-up
parse, we can't tell whether a frame is truly a special frame or just an open
frame, unless we know whether the next frame is an escape frame.  We also
can't locate the next frame (to see if it is an escape frame), since ENV may
not have been saved yet, and we need ENV to compute the frame size.
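
Returning to the :Environment TN mechanism above, a minimal sketch of the
conflict-analysis post-pass step.  Do-Blocks, Block-Environment,
Environment-Info, IR2-Environment-Live-TNs and Make-TN-Always-Live are all
hypothetical names for the relevant iterator and accessors.

(defun note-environment-tns (component)
  (do-blocks (block component)
    (let ((2env (environment-info (block-environment block))))
      ;; Every TN stashed in the block's IR2-Environment is considered
      ;; always-live in that block, regardless of references.
      (dolist (tn (ir2-environment-live-tns 2env))
        (make-tn-always-live tn block)))))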
The solution to this escape-frame marking problem seems to be to require the
escape process to save bottom-up linkage information in the open frame.  In
particular, if escape saves ENV in the standard ENV save location, we can
skip over fixed-size frames.  This also eliminates the problem of open frames
possibly looking like escape or values frames, since we use the ENV location
to flag special frames.

If we allow arbitrary variable garbage at the end of the frame, we still have
a problem, since even if the SP location can be determined from the debug
info, it may not have been properly initialized.  This seems to be an
argument for requiring the variable stuff to be self-describing, i.e. the
values-frame idea.  A related possibility would be to require the MV-returner
to store the end of the values glob into the returnee's frame at a standard
offset (using OLD-CONT as a base).  This would be a standardized SP save
location that is guaranteed to be initialized.

When the Lisp-level escape routine is called, it is passed the escape frame
as OLD-CONT, and a special return routine as RETURN-PC.  Different return
routines are used, depending on the nature of the escape.  For example, in an
interrupt, a return must restore all registers, whereas in a miscop bugout we
don't want to damage the argument registers, since they may have a return
value in them.  It would also be possible for a miscop to bug out passing a
return PC that is inside the miscop, so a bugout can happen in the middle of
a miscop, rather than being required to replace the miscop.

Have a feature for allowing templates to take their operands in the standard
argument passing locations.  This would be used primarily for miscop linkage.
It would easily allow arbitrary (and variable) argument miscops.  Fixed-arg
miscops could use this new mechanism, or could continue to work as now.

We should definitely bring the system up without the link-table at first,
since we can just rip out and ignore the link-table hair.  The efficiency of
symbol-function + funcall should be good enough that the possible win of a
link-table would be <10% of total system performance.  Other optimizations
should come first, and these could also help function call.  The main example
would be greater load-time smarts in referencing the global environment.
Global constants (including functions) can be referenced at load time and
directly incorporated into the constant pool.  This could potentially be
generalized to a Scheme-like "global environment", which is a somewhat
link-table-like idea.

An easy optimization for function call is to guarantee that the symbol
definition cell always contains a callable function.  We use a special
(recognizable) "undefined function" when the symbol is undefined.  In a call
context, we can reference the cell and call the result without doing any
boundp or type check.  This would require that we keep macro and special-form
definitions in a hashtable somewhere, but that is no big deal.  We might
store a different "illegal function" function in the definition cell of such
things.  This would mean that in the case of an undefined function error, the
name wouldn't be readily available.  This should be livable in the presence
of good source-map information.  [Loop-invariant and common-subexpression
would also help repeated calls to the same function.]

The area of function-object representation needs much thought when we do our
redesign.
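
A minimal sketch of that always-callable definition cell convention.
%Undefined-Function and %Set-Definition are hypothetical names for the
trampoline and the raw cell writer; the point is just that a call site never
needs a boundp or type check.

(defun %undefined-function (&rest args)
  (declare (ignore args))
  (error "Undefined function called."))

(defun ensure-callable-definition (name)
  ;; If NAME has no real definition, install the trampoline so that a call
  ;; site can fetch the definition cell and jump through it unconditionally.
  (unless (fboundp name)
    (%set-definition name #'%undefined-function)))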
We want to be able to represent a function by a single "code pointer" to an
object that combines the constant pool and the code:

    [code object layout sketch]

fasl-file
    Returns a "fasl-file" object representing all state needed by the dumper.
    We objectify the state, since the fasdumper should be reentrant (but
    could fail to be at first).

close-fasl-file fasl-file abort-p
    Close the specified fasl-file.

fasl-dump-component component code-vector length fixups fasl-file
    Dump the code, constants, etc. for component.  Code-Vector is a vector
    holding the assembled code.  Length is the number of elements of Vector
    that are actually in use.  Fixups is a list of conses (offset . fixup)
    describing the locations and things that need to be fixed up at load
    time.  If the component is a top-level component, then the top-level
    lambda will be called after the component is loaded.

load-component component code-vector length fixups
    Like Fasl-Dump-Component, but directly installs the code in core, running
    any top-level code immediately.  (???)  But we need some way to glue
    together the components, since we don't have a fasl table.

More args need some thought.  Probably we don't need more-arg cleanups, since
%more-args will be implemented by moving the more args into TNs wired in our
frame, thus there is no stack garbage.  A more XEP will still require some
song-and-dance, but not using the cleanup mechanism.  Instead, we immediately
explicitly insert cleanup code as an MV-prog1.  Also, when there are more
args, we must spit out some funny function before the call of the more EP.
This will turn into code that saves CONT somewhere (in a wired register) so
that it can be used as the more-arg context, then moves CONT above the args
and sets SP accordingly.  [But this is a special case of function-entry magic
that we have to do at any XEP.  Also, since the header block in an
optional-dispatch XEP is never actually executed, this code must be
replicated before each EP call.  Probably we want to spit out some sort of
%function-entry marker before each EP call, rather than just one in the
header block.  And we want to suppress all IR2 conversion of the XEP's bind
node (other than maybe to emit some arg-count dispatching OP.  Even in
fixed-arg functions, we still need to set up the entry vector).]  Also, if we
require stack values to be left immediately on top of the caller's frame,
then the more-arg-entry cleanup code will be required to BLT the return
values down over the more args.  But this should be an automatic consequence
of the implicit MV-Prog1/Return.  But the actual EP may not be called for
unknown values, but, but...  Glag...  Also, we need to somehow bind a
variable to the original CONT within the XEP so that the cleanup code can
restore it.  This could be a LET.

Figure out when it is actually an optimization to move register saving to
writers.  The current heuristic of doing it whenever there is a single writer
loses in some cases (such as TAK) where the writes may be executed without
the call happening.  Probably the cleverest thing to do is to be more
conservative about the motion, trying to move to some intermediate place that
is provably good.  This could be done using dominator info.  If there is some
save that dominates other saves and is dominated by the only write, then
flush the unnecessary saves.  Note that a restore can also be flushed when
there is no reference to the register before a following restore.  Possibly
this could be integrated with the local packing algorithm.
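
Stepping back to the fasdump interface sketched above, here is a hypothetical
use showing the shape of the dump loop.  Open-Fasl-File, Assemble-Component
and Component-Fixups are assumed names for the fasl-file constructor and the
back-end hooks; only Fasl-Dump-Component and Close-Fasl-File come from the
interface above.

(defun dump-components (components path)
  (let ((file (open-fasl-file path)))
    (unwind-protect
        (dolist (component components)
          (multiple-value-bind (code-vector length)
              (assemble-component component)
            ;; Fixups is a list of (offset . fixup) conses, as described in
            ;; Fasl-Dump-Component above.
            (fasl-dump-component component code-vector length
                                 (component-fixups component)
                                 file)))
      (close-fasl-file file nil))))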
Continuing the register-saving thought: a TN could be packed on the stack
within an inner extent, and still be in a register outside.  We want to
identify single-entry extents with finer granularity than loops.  Traces
perhaps?  Anyway, this is all way down the road.

Whizzy error system interface for errors in compiled code.  Call a miscop as
now, but the miscop saves registers and bugs out to the out-of-line version
of the function.  If the function returns, then the miscop returns with the
function's value.  We target the operands to the miscop passing locations,
but postpone doing the moves into the error code.  We don't try to share
error code, since each distinct error can have different operands and needs
to jump back.  But type errors can't use this mechanism, since type checks
for open-coded functions are factored out.  The safe (miscop) versions could
choose to signal errors this way, but this is out of our hands.

Dumping:

Dump code for each component after compiling that component, but defer
dumping of other stuff.  We do the fixups on the code vectors, and accumulate
them in the table.

We have to grovel the constants for each component after compiling that
component so that we can fix up load-time constants.  Load-time constants are
values needed by the code that are computed after code generation/assembly
time.  Since the code is fixed at this point, load-time constants are always
represented as non-immediate constants in the constant pool.  A load-time
constant is distinguished by being a cons (Kind . What), instead of a
Constant leaf.  Kind is a keyword indicating how the constant is computed,
and What is some context.

Some interesting load-time constants:

(:label .