Contents:

    Todo
    Phases
    IR1 CONVERSION
        Hairy function representation
        IR1 representation of Catch and Unwind-Protect
        Block compilation
    LOCAL CALL ANALYSIS
    FLOW GRAPH SIMPLIFICATION
    IF OPTIMIZATION
    IR1 BOTTOM-UP OPTIMIZE
    IR1 TOP-DOWN OPTIMIZE
    IR1 FORWARD FLOW
    PREDICATE PROPAGATION
    TRANSFORMATION
    EVALUATION ORDERING
    IR1 FINALIZE
    IR2 CONVERSION
        TN Assignment
        Loop analysis
        Cleanup generation
        Environment analysis
        Stack management
        IR2 Function call
        Generating error checks
        Converting conditionals into IR2
        Converting Catch and Unwind-Protect
        VOP representation
    COPY GENERATION
    LOOP AND GLOBAL COMMON SUBEXPRESSION OPTIMIZATION
    TN LIFETIME ANALYSIS
    CONFLICT ANALYSIS
    SAVE GENERATION
    PRELOAD GENERATION
    PACKING
    CONTROL OPTIMIZATION
    CODE GENERATION
    ASSEMBLY
    TARGET DEPENDENCE
    KEYWORD ARGS
    MUMBLAGE ABOUT TYPES
    INTERPRETER INTERFACE
    IR2 CONSISTENCY CHECKING
    OBSERVED BUGS AND POSSIBLE FIXES

Todo:

    More IR1 representation tweaks:
        Add more info to Leaf which tells whether a functional reference is
        in a closure context or if it is an open call??? (refinement to
        Inlinep?)
        Add support for equated continuations so that such continuations
        don't become invalidated when they are deleted.
    Fix up IR1tran to deal with new IR1 representations.
    DEFTYPE and DEFSTRUCT type stuff?
    Function-type hacking stuff for support:
        Flame-About-Redefinition
        Approximate function types
    Add stuff which sets the Leaf-Type for functionals so that we can do arg
    checking.
    Do inline substitution.
    Change TAGBODY (and BLOCK?) to preserve the drop-thrus in the original
    code.
    &REST and &KEY.
    Write an approximation to IR1-FINALIZE which checks for
    bound-but-not-referenced, type errors and similar things.
    Remaining special forms: UNWIND-PROTECT, CATCH, PROGV. Rethink specbind.
    Simple type propagation so that function type checking does more.
    TYPE-INTERSECTION. Handle recursive types.
    Write real COMPILER-ERROR and COMPILER-WARNING. Add error recovery
    points.
    Make (FUNCTION * ...) types work.
    &AUX.

In CLC, inline substitution is only allowed when the function is defined in the null environment. This is a reasonable constraint for throwing the expansion into the global environment, since it must be represented as source. However, it would be nice to allow inline expansion of local functions, which would almost always be inhibited if we required a null environment. Since the appearances of the local function must be nested within the desired environment, it should be possible to expand local functions inline even when they use the environment. We just stash the source form and environment in the Functional for the local function. When we convert a call to it, we just reconvert the source in the saved environment.

    Make definition macros totally real by having the load-time functions
    deal with clearing compiler info and similar stuff.
    Change IR1-top-level to take a vector instead of a list so that we can
    find the top-level form which is the root of our source path more easily.

Design:

Need some way for top-level code to get a special policy so that we don't waste lots of time and space trying to make it run fast.

Unclear exactly how we should represent the return from the body of a local function.
One possibility is to represent the value semantics by having a dummy continuation like we do now, but the users of this continuation appear in the uses lists for all the actual call continuations. It seems that we don't really need a back pointer from the continuation to the function being returned from, since the function has a pointer to the continuation, and can generate a return using the appropriate return address when the time comes. Alternate schemes involve some kind of explicit return node.

How do we deal with VOPs where the cost of temporaries depends on whether operands are packed in the same location? (N/N-1 arg) Local preference will encourage packing arg/result together. If more temps may be needed, we can allocate them and preference to some operand. Predicate which determines if packing is necessary?

Do we need a separate local preference mechanism? Based on TN-Refs? Link TN-Refs that want their TNs to be in the same place together in a cycle. This local preference relation affects only location selection, not the cost-driven priority determination. At the local level, the short-lived evaluation temporaries will probably get packed into registers anyway. We just want to make sure that they get packed into registers in such a way that VOP costs are minimized.

Define-VOP macro:
    Cost info
    Local preference info
    Temporaries required
    Code generator?
    effects/affected (eventually)
    Inheritance? Same as XXX, but...

How does automagic type-checking get done in IR2 conversion?

Where and how do we figure out what stuff has to be saved/restored in local call: CONT, ENV, etc.

Need to clarify the relation between functions and environments. Is an environment always identified with a function? What about functions that don't have their own environment? Perhaps we should consider context pointers to be associated with the environment, but associate return PCs with functions.

Cleanups:

Cleaning up is implicit even on the fall-through path. The *current-cleanup* is a cleanup structure. Cleanups are linked together in a data structure which represents the syntactic nesting of messed up environments and serves as a place to keep information about how to clean up. We can flush mv-prog1, since we can now send values to a continuation and then go somewhere else by putting the continuation at a block end. Exactly how does cleanup generation work? How does state-saving code get emitted?

Phases:

The structure of the compiler may be broadly characterized by describing the compilation phases and the data structures that they manipulate. The steps in the compilation are called phases rather than passes since they don't necessarily involve a full pass over the code. The data structure used to represent the code at some point is called an IR (Intermediate Representation). Two major IRs are used in the compiler:

IR1 is used to represent the lisp-level semantics of the source code during the initial phases. Meta-evaluation and semantic analysis are done on this representation. IR1 is roughly equivalent to a subset of Common Lisp, but is represented as a flow-graph rather than a syntax tree.

IR2 is used to represent the implementation of the source code on a virtual machine. The virtual machine may vary depending on the target hardware, but the IR2 representation is sufficiently stylized that most of the phases which manipulate it are portable.

Each phase is briefly described here. The ordering below is approximate, and some phases could be moved around.
For maximum optimization, the phases from Local call analysis through Transformation should be repeated until nothing new is discovered. The name of a phase is in brackets if it may be omitted to save compilation time.

IR1 Conversion
    Convert the source into the IR1 representation, doing macroexpansion and
    simple source-to-source transformation. All names are resolved at this
    time, so we don't have to worry about name conflicts.
Local call analysis
    Find calls to local functions and splice them into the call graph so
    that we can do flow analysis.
Flow graph simplification
    Eliminate unreachable blocks and compute depth-first ordering.
[IF optimization]
    Various optimizations of control structure based on "source to source"
    IF transformation.
[IR1 bottom-up optimize]
    Do constant folding and bottom-up type propagation.
[IR1 top-down optimize]
    Eliminate code that computes unused values and do top-down type
    propagation.
[IR1 forward flow]
    Propagate type and constant information to variable references and
    eliminate unnecessary trivial variable bindings.
[Predicate propagation]
    Use a limited form of global common subexpression elimination to detect
    things that are known to be true due to having been tested at some
    earlier time.
[Transformation]
    Special-case calls to some known global functions by replacing them with
    a computed function. Some transformations can be done better on IR1 than
    on the actual source since argument type and constant information is
    known.
[Evaluation ordering]
    Reduce storage requirements by changing the order of evaluation of some
    function arguments.
IR1 finalize
    Do some error checks and maybe some other stuff to get ready for IR2
    conversion.
IR2 conversion
    Convert nodes into VOPs, variables into TNs, LETs into sets, ...
[Copy generation]
    Replace some references to TNs with references to copies which may get
    allocated in better places.
[Global common subexpression elimination and loop invariant optimization]
    Eliminate redundant computations of simple expressions.
TN lifetime analysis
    Determine the sets of TNs live at block boundaries.
Save generation
    Generate explicit saving and restoring code for TNs used in recursive
    local functions.
Conflict analysis
    For each TN, find the set of TNs that have lifetimes which overlap with
    its own lifetimes.
[Preload generation]
    On some machines, move memory references backward in the code so that
    they can overlap with computation.
Packing
    Assign each TN to a storage location.
[Control optimization]
    Linearize the flow graph in a way that minimizes the number of branches.
Code generation
    Convert the VOPs into assembly code.
Assembly
    Assemble the code and deal with dumping of constants and entry points
    and stuff.

IR1 CONVERSION

The simplified lisp code is converted into the first intermediate representation: nodes and continuations. The set of special forms recognized is exactly that specified in the Common Lisp manual. Other things implemented as special forms in the interpreter are implemented as macros in the compiler.

Large amounts of syntactic information are thrown away by the conversion to an anonymous flow graph representation. The elimination of names eliminates the need to represent most environment manipulation special forms. The explicit representation of control eliminates the need to represent BLOCK and GO, and makes flow analysis easy. The full Common Lisp LAMBDA is implemented with a simple fixed-arg lambda, which greatly simplifies later code.
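For example (a hand-worked illustration, not compiler output), in

    (block foo
      (if p
          (return-from foo 1)
          2))

both 1 and 2 simply become uses of the single continuation that receives the block's value. Once control and value flow are explicit, BLOCK and RETURN-FROM leave no trace of their own in IR1.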
The elimination of syntactic information eliminates the need for most of the "beta transformation" optimizations in Rabbit. There are no progns, no tagbodys and no returns. There are no "close parens" which get in the way of determining which node receives a given value.

In IR1, computation is represented by Nodes. These are the node types:

If
    Represents all conditionals.
Set
    Represents a SETQ.
Leaf
    Represents a constant or variable reference.
Combination
    Represents a normal function call.
MV-Combination
    Represents a MULTIPLE-VALUE-CALL. This is used to implement all multiple
    value receiving forms except for MULTIPLE-VALUE-PROG1, which is
    implicit.
Bind
    This represents the allocation and initialization of the variables in a
    lambda.

Some slots are shared between all node types. This information held in common between all nodes often makes it possible to avoid special-casing nodes on the basis of type. This shared information is primarily concerned with the order of evaluation and the destinations and properties of results. These things are represented by continuations.

The Continuation structure represents stuff that is sufficiently related to the normal notion of a continuation that naming it so seems sensible. Basically, a continuation represents a place in the code, or alternatively the destination of an expression result and a transfer of control. These two notions are bound together for the same reasons that they are related in the normal functional continuation interpretation.

A Continuation may be deprived of either or both of its value and control significance. If the value of a continuation is unused due to evaluation for effect, then the continuation will have a null DEST. If the NEXT node for a continuation is deleted by some optimization, then NEXT will be :NONE.

The Block structure represents a basic block, in the normal sense. Control transfers other than simple sequencing are represented by information in the Block structure. The continuation for the last node in a block represents only the destination for the result.

It is very difficult to reconstruct anything resembling the original source from IR1, so we record the original source form in each node. The location of the source form within the input is also recorded, allowing for interfaces such as "Edit Compiler Warnings".

Forms such as special-bind and catch need to have cleanup code executed at all exit points from the form. We represent this constraint in IR1 by annotating the code syntactically within the form with the node that must be cleaned up. If the cleanup for a node's continuation is different from that for the node, then some cleanup code will probably have to be emitted during IR2 conversion.

Someday we may want to change IR1 conversion so that it changes the current component occasionally so that top-level forms don't get too bloated. We would then need a way to keep track of the multiple top-level components and make sure that they get dumped in the right order.

Hairy function representation:

Non-fixed-arg functions are represented using Optional-Dispatch. An Optional-Dispatch has an entry-point function for each legal number of optionals, and one if rest args are present. Each entry point function is a simple lambda. The entry point function for an optional is passed the arguments which were actually supplied; the entry point function is expected to default any remaining parameters and evaluate the actual function body.

If no supplied-p arg is present, then we can do this fairly easily by having each entry point supply its default and call the next entry point, with the last entry point containing the body. If there are supplied-p args, then the entry point function is replaced with a function that calls the original entry function with T's inserted at the positions of all the supplied args that have supplied-p parameters.
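As a hand-worked illustration of the no-supplied-p case (the entry-N names are invented for the example):

    (lambda (a &optional (b 5) (c (+ a b))) xxx)

    entry-1: (lambda (a) (entry-2 a 5))           ; one arg supplied
    entry-2: (lambda (a b) (entry-3 a b (+ a b))) ; two args supplied
    entry-3: (lambda (a b c) xxx)                 ; all supplied; has the body

Each entry point is a simple fixed-arg lambda that supplies one default and calls the next.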
Keyword args are represented using rest args, but we want to keep around some sort of information so that we can open code keyword calls where possible.

IR1 representation of Catch and Unwind-Protect:

We represent CATCH using the lexical exit mechanism. We do a transformation something like this:

    (catch 'foo xxx) ==>
    (block #:foo
      (%catch 'foo #'(lambda () (return-from #:foo (%unknown-values))))
      xxx)

%Catch just sets up the catch frame which points to the exit function. %Catch is an ordinary function as far as IR1 is concerned. The fact that the catcher needs to be cleaned up is expressed by the Cleanup slots in the continuations in the body. %Unknown-Values is a dummy function call which represents the fact that we don't know what values will be thrown.

We use a similar hack in Unwind-Protect to represent the fact that the cleanup forms can be invoked at arbitrarily random times.

    (unwind-protect p c) ==>
    (flet ((#:cleanup () c))
      (block #:return
        (let ((#:next (block #:unwind
                        (%unwind-protect
                         #'(lambda (x) (return-from #:unwind x)))
                        (return-from #:return p))))
          (#:cleanup)
          (%unwind #:next))))

We use the block #:unwind to restore the dynamic state before doing the cleanup in the case where we are non-locally unwound. Calling of the cleanup function in the drop-through case (or any local exit) is handled by cleanup generation. #:next is some kind of state that indicates how further unwinding is to be done. We make the cleanup a function so that implicit calls to it can be created by cleanup generation. The cleanup function is a magic arg to %unwind-protect, since cleanup generation needs to find it, but we don't want it to become a closure for no reason. [How can we do this if %unwind-protect is a function? An alternate solution would be to make the cleanup function an explicit arg to %unwind-protect, and then rely on magic to get the right dynamic environment for the cleanup. Powerful magic could also exempt the cleanup function from the normal closure analysis process.]

Notice that implementing these forms using closures over continuations eliminates any need to special-case IR1 flow analysis. Obviously we don't really want to make heap-closures here. We don't necessarily need full stack closures, since the code in the functions is highly constrained.

Block compilation:

One of the properties of IR1 is that it supports "block compilation" by allowing arbitrarily large amounts of code to be converted at once, with actual compilation of the code being done at will.

In order to preserve the normal semantics we must recognize that proclamations (possibly implicit) are scoped. A proclamation is in effect only from the time of its appearance to the time it is contradicted. The current global environment at the end of a block is not necessarily the correct global environment for compilation of all the code within the block. We solve this problem by closing over the relevant information in the IR1 at the time it is converted. For example, each functional variable reference is marked as inline, notinline or don't care.
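To illustrate the scoping (a minimal example, not tied to any particular block compilation syntax):

    (declaim (inline foo))
    (defun bar (x) (foo x))     ; FOO may be expanded inline here
    (declaim (notinline foo))
    (defun baz (x) (foo x))     ; FOO must be a full call here

Even though both proclamations have been seen by the end of the block, the reference in BAR must be compiled under the first one, which is why the marking is captured at conversion time.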
Similarly, each node contains a structure known as a Cookie which contains the appropriate settings of the compiler policy switches.

The interface to block compilation will be primarily through extensions to Compile-File rather than syntax in the source file.

1] The file argument may be a list of files to be compiled together into a
   single binary.
2] The keyword :Block-Compile specifies whether function references may be
   resolved at compile time (default NIL). When enabled, full call may still
   be forced on a case-by-case basis by NOTINLINE declarations.
3] The keyword :Entry-Points determines which DEFUNs are given globally
   callable definitions. Eliminating entry points is primarily a
   space-saving measure. These values are legal:
   :ALL        All functions get entry points (the default).
   :SPECIFIED  Functions in an ENTRY-POINT proclamation get entry points.
   :EXPORTED   Functions are given entry points if they are either specified
               as an ENTRY-POINT or are exported from their name's home
               package.

LOCAL CALL ANALYSIS

All calls to local functions (known named functions and LETs) are resolved to the exact LAMBDA node which is to be called. This will detect any such calls which have an illegal number of arguments. If the call has non-constant keyword arguments or is an APPLY, then we can't tell what is going on, and will end up making a normal function call. Illegal calls are simply left alone for IR1 finalization to complain about. Although not necessary for correctness, there's no point in turning this "optimization" off, since nearly all lambdas would turn into closures, causing the compiler to run much more slowly in addition to generating abysmal code.

With local functions, we splice the called function into the flow graph for the caller so that flow analysis can be done more easily. Each call to a local function causes the block to be ended at that point, and the body of the function is made the successor to the caller's block. Similarly, the block containing the code after the call is made the successor to the last block in the function.

The combination node still has the actual continuation for the call so that we can still find the receiver of the value. We add the nodes which use the result continuation to the uses list for the actual call continuation so that the caller can tell what the actual values being returned are. When we discover that a function is only called with one continuation, we replace the dummy continuation with the actual continuation. We also equate the dummy continuation to the real continuation so that code in the function body can see exactly how it is being used.

IR1 forward flow analysis may discover that some function-valued normal variables are in fact constant. In this case, it is probably worth restarting IR1 optimization at this pass.

We also do this substitution for MV-Calls that look like MULTIPLE-VALUE-BIND, since we know that they can be open-coded. We replace the optional dispatch with a call to the last optional entry point, letting MV-Call magically default the unsupplied values to NIL. Although we may not do much flow analysis on these things, we certainly don't want them turning into closures. We could also do the same for MV-Calls with known numbers of values.

When block compilation is enabled, we may have to go to some trouble to avoid joining top-level code into the same component with real code. This could happen if a call to a known function appeared in top-level code. We would like to avoid turning such calls into local calls if this is possible. Unfortunately, it is hard to tell what code is and isn't top-level at this point. I guess we could make a pre-scan that converts calls which aren't to named entry points, and then finds all the code reachable from the entry point to the initial top-level form.
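To close with an illustration of the basic splicing (hand-worked): a LET is just a local call to a lambda,

    (let ((x (foo)))
      (bar x))
    ==  ((lambda (x) (bar x)) (foo))

so the lambda body is spliced into the caller's flow graph, and no closure is ever made for it.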
FLOW GRAPH SIMPLIFICATION

This pass computes the depth-first ordering of the blocks in each function, for use by later flow analysis. Blocks which are not reached during the walk are deleted and blocks which can be merged are merged. Although the simplification part of this phase is not necessary for correctness, it will probably more than pay for itself in later mandatory flow analysis.

In the process of walking the graph to find the DFO, we may discover that the flow graph is not connected. This is fairly unlikely when compiling single functions, but is quite probable when block-compiling entire files. Primarily for the sake of block compilation, we do the remaining part of compilation separately on separate components. This has the following advantages:

1] The size of flow analysis problems is limited to the size of the largest
   component.
2] Locality of reference is improved in IR1 passes since they will only
   consider the IR1 for a single component instead of making repeated passes
   over the entire IR1. This is important since the IR1 for a large block
   compilation can easily be larger than physical memory.
3] IR2 data structures may be recycled between the compilation of each
   component. This is significant since IR2 data structures are probably
   quite a bit larger than IR1.

It is possible that components may be split or joined by IR1 optimizations. A component may be split when a call is eliminated as dead code. Components may be joined when flow analysis discovers that the function for a call is constant. We won't bother trying to discover when components could be split, since this is hard to do and the payoff is small.

IF OPTIMIZATION

This pass is optional, but is desirable if speed or space (or safety?) is more important than compiler speed. [Could hurt space? probably rare...] This pass does high level control optimization based on transformations of IFs like those used in Rabbit. This pass alternates with flow graph simplification until nothing happens. It is not clear how best to order this pass with respect to predicate propagation for best interaction.

IFs with predicates known to be true or false are eliminated.

Eliminate IFs with identical consequent and alternative? This would most likely happen if both the consequent and alternative were optimized away. It could also be done if the consequent and alternative were different blocks, but computed the same value. This could be done by a sort of cross-jumping optimization which looked at the predecessors for a block and merged code shared between predecessors. IFs with identical branches would eventually be left with nothing in their branches.

Eliminate IF-IF constructs:

    (IF (IF A B C) D E) ==> (IF A (IF B D E) (IF C D E))

Note that we don't actually replicate the code for D and E; we use the same consequent and alternative block for both IFs. Note also that there may be forms wrapped around the inner IF as long as it is in a tail-recursive position. Due to the continuation representation, there isn't even any way to tell if this is the case. In general, it is non-trivial to find all the blocks in B and C so that we can move them. Probably what we want to do is only do the optimization if B and C are single blocks. This will deal with the important case of the expansions of AND and OR.
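For example, since (AND A B) is equivalent to (IF A B NIL), a conditional on an AND works out as follows (hand-worked):

    (if (and a b) d e)
    ==  (if (if a b nil) d e)
    ==> (if a (if b d e) (if nil d e))
    ==> (if a (if b d e) e)

Here B and NIL are both single blocks, D and E are not replicated, and the final (IF NIL ...) goes away as an IF whose predicate is known to be false.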
About the only sane way to get a hairy expression in a predicate would be to do a local call. We could special-case this. Of course, we could just go for broke, and do a graph walk. We could easily handle the general case if we had the dominators, but we may not want to compute them this early in the game.

We do IF optimizations bottom up (forward DFO). We don't need to look at all the code, since IFs are always at block ends.

We could also convert CALL-IF constructs:

    (FUNCALL (IF A B C) D E) ==> (IF A (FUNCALL B D E) (FUNCALL C D E))

Once again, we need not replicate the code for D and E. This would require changing IR1 to allow a continuation to have multiple destination nodes.

IR1 BOTTOM-UP OPTIMIZE

This pass is optional, but is desirable if anything is more important than compilation speed. We walk the code in forward depth-first order and:

    Replace calls of foldable functions with constant arguments with the
    result. We don't have to actually delete the call node, since Top-Down
    optimize will delete it now that its value is unused.

    Do bottom-up type propagation/inferencing.

IR1 TOP-DOWN OPTIMIZE

This pass is optional, but is desirable if anything is more important than compilation speed. We walk the code in reverse depth-first order and:

    Eliminate any let-bindings for unused variables.

    Eliminate any effectless nodes with unused values. In IR1 this is the
    only way that code is deleted other than the elimination of unreachable
    blocks.

    Do top-down type assertion propagation. For some functions such as + we
    will dispatch to a function for this.

Since we will have to repeat this until it stabilizes, we probably want to do some sort of scheme where we flag a block as needing to be reanalyzed if we change any continuation in the block, either by changing the asserted type or eliminating the DEST. We would then loop until no block was marked as changed, analyzing only the changed blocks.

IR1 FORWARD FLOW

This pass is optional, but is desirable if anything is more important than compilation speed. We use forward flow analysis to find which definitions of lexical variables reach each block, and then use this information to propagate constant and type information and to eliminate unnecessary trivial variable bindings.

The type of each variable reference is set to the union of the types of all the definitions reaching it. If all the definitions reaching a reference are the same constant value, then the reference may be replaced with that constant. When the only definition of a variable reaching a reference is another variable, then the other variable can be substituted if the set of definitions for the other variable is the same at the reference as it was at the trivial definition. This will cause unnecessary bindings to go away, since when all uses of the new variable are deleted, the binding will go away.

PREDICATE PROPAGATION

This pass is optional, but is desirable if anything is more important than compilation speed. We use a constrained version of global common subexpression elimination to propagate information about expressions known to be true or false due to the evaluation of an IF. Instead of propagating all available expressions, we only propagate expressions which appear as the test in an IF. If we stick to effectless&unaffected functions of lexical variables, then we can just use the available information about variable setting to determine when an expression is killed.

Probably the way to insert this information back into the code is to express it as type information. If an expression is true, then we intersect its type with (NOT NULL). If it is false, then its type is NULL. This will deal with the problem of non-T truths. The most important applications of this are optimizing hairy conditionals and eliminating redundant type checks. In the case of a type predicate of a lexical variable, we can intersect or subtract the type from the types for references, depending on whether the test is true or false. This pass can be done concurrently with IR1 Forward Flow.
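The type predicate case is the payoff (a hand-worked example):

    (if (consp x)
        (car x)     ; X's type intersected with CONS; CAR needs no check
        (foo x))    ; CONS subtracted from X's type here

Within the consequent the references to X are known to be conses, so a safe CAR can skip its type check; within the alternative X is known not to be a cons.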
TRANSFORMATION

This is the "source to source" transformation pass for function calls. Non-trivial optimizations of functions are better done on IR1 than on Lisp code because there is type and constant information available. Simple function optimizations may still be done as source transforms. This pass is optional, but should be done if speed is more important than compilation speed. Transformations may reduce space as well; if we really believe this, then we should also transform if space is important. In this case, transforms which increase space should check before doing their thing.

A transform is actually a kind of inline function call, with the additional provision for computing the function at compile time. The transform gets to peek at the continuations for the arguments, and computes a function using the information gained. The transform shouldn't do anything with the values, even if they are constants, since the compiler must preserve EQLness of named constants, and it will have a hard time if transforms go around randomly copying constants. The computed lambda replaces the original function variable reference as the function for the call. This lets the compiler worry about evaluating each argument once in the right order.

There can be any number of transforms for a function. Each transform is associated with a function type which the call must be compatible with. A transform is only invoked if the call has the right type. This lets us put off type checking of calls until IR1 finalize, when we know everything we ever will about the type of each argument. This also provides a slick way to deal with the common case of a transform which only applies when the arguments are of certain types and some arguments are not specified.

This pass works by scanning the global function vars looking for ones that have transforms, and then processing the calls that reference the var. This eliminates the need to do a full code walk.

We may need to be a bit clever about how we do things so that the transforms interact in a good way, or at least don't interact in a bad way. One problem is choosing which transform to apply when more than one is legal. This conflict will happen most often when there are specific and general transforms for the same function. We deal with this by trying the transforms in an order determined by a specified goodness factor. Another possible problem is that we may transform a call too early, resulting in a less good transformation being done. This probably isn't very significant, since it seems like it would only happen if a later transform revealed some information that couldn't be figured out by type inference. If this is a problem, then we could probably fix it by only attempting bad transforms when we can't find any other optimization to do.
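A minimal sketch of what a transform function might look like, assuming hypothetical helpers CONSTANT-CONTINUATION-P and CONTINUATION-VALUE for peeking at the argument continuations:

    ;; Transform for (EXPT X N): when N is the constant 2, compute a
    ;; lambda that squares. The returned lambda replaces the function in
    ;; the call, so X is still evaluated exactly once, in order.
    (defun expt-transform (arg-continuations)
      (let ((n-cont (second arg-continuations)))
        (when (and (constant-continuation-p n-cont)
                   (eql (continuation-value n-cont) 2))
          '(lambda (x n)
             (declare (ignore n))
             (* x x)))))

The transform only examines the constant; it does not splice the constant object into the code it builds, in keeping with the rule about preserving EQLness of constants.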
EVALUATION ORDERING

This pass is optional and should be done if either speed or space is more important than compilation speed. This pass attempts to modify the order of evaluation so that storage requirements are reduced. This generally involves evaluating complex arguments before simple ones. It's not clear how big a deal this really is, or whether it can fall out of other optimizations. Optimal expression ordering is Hard, but simple cases are probably the only thing that is really important, since hairy expressions are going to be slow anyway.

It is trivial to postpone the evaluation of constants. Evaluation of variables can be postponed using reaching-definitions information: the evaluation may be postponed if the definitions which reach the new place of evaluation are the same as those at the old one. These two optimizations will probably be enough to prevent blatantly stupid code from being emitted.

This pass might also be a good place to exploit arithmetic properties in reordering numeric expressions. Alternatively, this could be done on the fly during IR2 conversion. The advantage of doing arithmetic simplification on IR1 is that it may open opportunities for constant folding and other optimizations.

It is also possible to postpone the evaluation of easy expressions, but it is harder and not always a good idea. This is done using available expression information. If the argument expression is effectless and available at the proposed new place of evaluation, then it may be moved. If you want to get hairy, it is surely possible to move expressions with side-effects, but effectful simple expressions whose value is used are rare enough that it surely isn't worth the effort. If we want to do this kind of hairy reordering, then it should probably be done on IR2, since that is where we have available expression information.

Determining whether it is a good idea to postpone the evaluation of an expression depends primarily on whether the inputs to the expression could be thrown away if the evaluation weren't postponed. It would definitely be bad to postpone the evaluation of an expression that has several arguments which could all be in registers if the expression were evaluated immediately, since the argument TNs would be forced onto the stack instead. Liveness of the argument TNs is probably a relevant criterion. If an arg TN isn't live at the new location, then it is probably a bad idea to move the node unless the node which defines the dead TN can also be moved.

IR1 FINALIZE

This pass looks for interesting things in the IR1 so that we can forget about them. Used-but-not-defined and defined-but-not-used things are flamed about. Warnings about type errors are emitted in this pass. Since in our view number of arguments is an aspect of function type, argument count checking for functions is also done here. When this pass is done, all errors and as many warnings as possible should have been reported. We postpone these checks until now because the IR1 optimizations may discover errors that are not initially obvious.

Stuff checked:
    Type, including function type.
    Bound but not referenced.
    Referenced but unknown function and variable.

IR2 CONVERSION

IR2 preserves the block structure of IR1, but replaces the nodes with a target dependent representation. Although IR2 is target dependent, much of the code which operates on it has no machine dependency. The big thing that happens is the translation of nodes into sequences of virtual operations (VOPs). Virtual operations are things which will eventually turn into about one instruction. N-arg primitives are associated into fixed-arg versions at this point. All VOPs take a fixed number of arguments.
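For example, an N-arg addition becomes nested fixed-arg calls (left association chosen arbitrarily for the illustration):

    (+ a b c d) ==> (+ (+ (+ a b) c) d)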
This phase is where compiler policy switches have most of their effect. The speed/space/safety tradeoff can determine which of a number of coding strategies are used. When error checks are enabled, any checks not proven unnecessary are explicitly generated here.

Type information can be forgotten after IR2 conversion. All type checks have been emitted and all type-specific operation selections have been made.

Each local lexical variable and compiler temporary is assigned a TN. Variables have no semantics in IR2. Closure variables are represented by explicit closure manipulation code. References to special variables are represented by Symbol-Value and Set-Symbol-Value VOPs. [This will get a lot hairier when we go to deep-binding specials.] Sets and Refs of local lexical variables are represented implicitly by supplying the TNs as arguments or results to VOPs. References to variables that don't belong to the current function are represented by Uplevel-Ref and Uplevel-Set VOPs. This makes the need for static links explicit and allows us to build a de-facto display if it is useful. If a TN is packed into a register, then Uplevel-Reference VOPs can be deleted in favor of the TN itself. [Not true if we always save all TNs live at a local call...]

The IR1 node/continuation structure doesn't represent the evaluation of code in IR2, but parts of it live on as places to hang information about the IR2 representation. Continuations that actually end up being used by a function call in IR2 are used to collect information about the continuation. They are annotated with the TNs that are used to hold the values and with the value passing strategy.

TN Assignment:

The TN-assignment pass assigns to each continuation the TNs which receive the values. The value-passing strategy (known or unknown value count) is also determined.

In IR2 we use TNs to represent values. A TN can only represent a single value, so we bare the implementation of multiple values at this point. When we know the number of multiple values being handled, we can simply use multiple TNs to hold them. When the number of values is actually unknown, we use a convention which is compatible with full function call: we return the first K values and the number of values as distinct TNs, return any additional values on the top of the stack, and also set the values-present flag. This mechanism is used whenever the number of return values is not always the same, not only for things like MV-Prog1 where the number of values is unbounded.

Once we know the TNs which receive the values for each continuation, we can do the actual conversion at the block level by scanning through the block and converting each node in order. The actual node conversion for most nodes is simple, since they will just expand into a VOP or two that hack on the supplied TNs.

First, we find the equivalence classes of continuations which would ideally have the same value-passing strategy and locations. We examine each local function, putting all the continuations used in calls to it in the same equivalence class with the result continuation. Equating the callers' continuations implements the obviously necessary requirement that each caller of a function must expect the values to be returned in the same places. Equating the result continuation with the callers' continuations propagates the same requirement to any tail-recursive calls, ensuring that they can return values to the ultimate caller without any shuffling being necessary. This is necessary for tail-recursion to work properly, since if we have to shuffle the return values after the call, it is no longer truly tail-recursive.
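A hand-worked illustration of the equating:

    (labels ((g (y) (* y 2))
             (f (y) (if (plusp y) (g y) 0)))
      (f x))

The continuation for (f x), the result continuation of F, the continuation for the tail call (g y), and the result continuation of G all land in one equivalence class, so G can deliver its value directly to the receiver of (f x) with no shuffling.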
The result continuations for entry points and the continuations for full calls are constrained to use the standard value passing mechanism, with unknown value count and fixed result registers. If we were strict about using the same strategy and locations for all continuations in each equivalence class, then nearly all calls would end up using the standard ones, since most functions will be called tail-recursively from an entry point. This would be undesirable, since we can generate better code if we don't have to indicate the number of values and can pass them in any location.

Fortunately we can relax this strict equivalence and still preserve tail-recursion by realizing that we can shuffle the results of a full function call as long as the call isn't tail-recursive. We are only forced to use the standard value passing conventions in equivalence classes which contain both a continuation used in a full function call and the result continuation for an entry point. If a class contains only one or the other, then we can share any convention and locations between the local calls, and add code to convert to the standard convention at the call or entry point. As long as there is proper tail recursion between full calls and between local calls, it doesn't matter if there is some friction at the interface.

Now that we have figured out where we must use the standard value passing strategy, we can use a more flexible strategy to determine the return locations for local functions. We determine the possible numbers of return values from each function by examining the uses of all the result continuations in the equivalence class of the result continuation. If the number of values is not fixed, then we must use the unknown-values convention, although we are not forced to use the standard locations. If the number of values is fixed, then we return that fixed number of values from all the functions whose result continuations are equated.

We assign the result TNs at this time. A non-tail-recursive continuation must have TNs which are distinct from those of any other continuation, since it may have to hold the values for an arbitrarily long amount of time while other arbitrary code is being evaluated. This means that when we return from a call, we must move the return TNs to the actual TNs for the continuation. These TNs are preferenced, and will probably get packed in the same place.

We can now assign the TNs for all the rest of the continuations. We scan through the code in reverse depth-first order, looking at each node and determining the value passing strategy and locations for each continuation that has the node as its DEST. In most cases it is trivial to determine the number of values desired, since we want exactly one. The continuation for an MV-Call may want multiple values. In general, we will have to use the unknown values convention and do a full call, but if the function resembles an MV-Bind, we can request the obvious number of values.

Loop analysis:

The IR2 representation is annotated with information about loops. This isn't really very optional, since environment analysis uses loop information to tell when it can allocate vars for functions in the same place. Inhibiting this "optimization" would cause lots of spurious uplevel references to be detected, which would probably make the compiler run slower, in addition to generating poor code.
Loop information is really necessary for reasonable cost assignments anyway, and also allows us to do some other optimizations at little extra cost.

First we find the set of blocks which dominate each block. Note that if we keep the set sorted by DFN, then it will be in order of dominance as well. We probably want to special-case local function call, since standard flow analysis won't realize that the return point of the call is dominated by the call point.

We want to have some kind of Loop structure which is linked together with Blocks seven different ways. The Loop structure should contain a list of all the blocks in it, and also the loop depth. If we arrange things so that Loops are reasonably nested, then we may want up and down pointers as well. If this is the case, then we could omit blocks contained within inner loops from our list. We should also list the exit points from the loop.

We must deal with "strange" loops (not just a concession to TAGBODY, since mutual recursion can create non-reducible flow graphs). First we find the back edges and natural loops, and then we find the strange loops. A strange loop is a cycle which contains no back edges. It turns out that every strange loop contains an ascending edge which is not a back edge, and all such edges appear in strange loops. We can find the strange loop for such an edge by doing a graph walk forward from the edge, recursing on cross and forward edges. If we reach the start node, then we add all the nodes on our path to the strange loop as we unwind. If we reach a node already in the loop, then we do the same thing. If we dead-end, then we just punt.

Strange loops are broken into segments, where each segment is the code in the loop which is dominated by a given entry point. The representation for a segment of a strange loop is about the same as for a natural loop. As far as environment manipulation code is concerned, they are fairly similar in effect. The difference is that we can't move code out of a segment of a strange loop unless we can move it to before all the entries in the loop.

Determining the nesting of loops which have different start blocks is easy: just use the dominance relation on their start points. If there are multiple back branches to the same start block, we just consider all the code to be in the same loop. A loop is effectively defined by its head, rather than by any particular back branch.

Local function call will introduce blocks of code which can be thought of as being shared between loops. [How badly will we lose if we don't special-case local call, either here or when finding dominators? It shouldn't be a correctness issue, since other code should be assuming arbitrarily bizarre flow graphs. If we did this, then a local function with more than one call could cause the flow graph to be non-reducible. This would result in more environments being allocated than is really necessary.]
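A minimal sketch of the dominator computation mentioned above, using the standard iterative formulation. BLOCK-PREDS is an assumed accessor for a block's predecessor list, and every block other than the head is assumed reachable (unreachable blocks were already deleted), so it has at least one predecessor.

    (defun compute-dominators (blocks head)
      ;; Returns a hash-table mapping each block to the list of blocks
      ;; that dominate it. Dom(head) = {head}; for any other block B,
      ;; Dom(B) = {B} union (intersection of Dom(P) over all preds P).
      (let ((dom (make-hash-table :test #'eq)))
        (dolist (b blocks)
          (setf (gethash b dom) blocks))          ; initially everything
        (setf (gethash head dom) (list head))
        (loop
          (let ((changed nil))
            (dolist (b blocks)
              (unless (eq b head)
                (let ((new (adjoin b (reduce #'intersection
                                             (mapcar #'(lambda (p)
                                                         (gethash p dom))
                                                     (block-preds b))))))
                  (unless (= (length new) (length (gethash b dom)))
                    (setf (gethash b dom) new)
                    (setf changed t)))))
            (unless changed
              (return dom))))))

Iterating over the blocks in depth-first order makes this converge quickly. This sketch ignores the special-casing of local call discussed above.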
Cleanup generation:

We emit cleanup code during IR2 translation. If the cleanup changes between two blocks, then we may need to either emit state saving code or cleanup code, depending on whether we are entering or leaving the messed up context. We need to do some kind of pre-pass which annotates the cleanup structures with the state that needs to be saved on context entry. We then emit the saving and restoring code during the actual translation pass.

Note that since the state that we want to revert to must be lexically apparent, there is no problem finding out what the appropriate old stack pointer (or whatever) is, since even if it isn't in our context or a register, we can get at it with an uplevel reference using the static link. All the dynamic state variables (such as stack pointers) are TNs, and thus can be explicitly manipulated by moving values into or out of them. In most cases the cleanup will simply be a move from a saved value into the current value.

We also need to deal with "uplevel continuations": uses of continuations where we must unwind to some earlier state. We are concerned here with the kind of continuations that result from a non-local RETURN or GO. If we don't know the dynamic context of the return, then we cannot clean up the specific things which have messed it up; we must revert the unknown things to a saved state. ####

Note that every "uplevel continuation" will be a loop exit from a loop which represents a context. It is fairly easy to compute loop exits while finding loops, and this info is also useful for other purposes. When we exit from a context, if the current cleanup for all of the predecessors is the same and we can trace from that cleanup to the current cleanup for the continuation, then we can statically determine our dynamic context, and can use the cleanup mechanism. If we cannot statically determine our dynamic context, then we treat the continuation as a funny kind of Var, and get environment analysis to do the hard work of ensuring that the saved state information is available at the return point. It would be possible in many non-closure cases to determine that only some subset of the dynamic set can be messed up, and then to only restore those things, but it probably isn't worth the effort.

Note that when we do a full register save due to a full call, we must always save the registers in the same place so that the reentry code for a closed continuation can restore all the registers when someone does a non-local exit from a closure somewhere down the stack.

Environment analysis:

What is the env analysis problem?

Determine where to allocate variables:
    Fairly easy using loop information.

Establish a way to find variables which are not allocated in the current
environment (uplevel references):
    We don't allow arbitrary references to other environments, since the
    referenced environment must always be allocated at the time of the
    reference. The references must be truly uplevel, not crosslevel. If we
    have a structure which allows us to trace up lexical levels, then we can
    establish a unique path to find an uplevel variable.

Find closure variables and figure out how closure environments are located:
    This problem seems to be mostly identical to the standard environment
    analysis problem. With minor tweaks, the same mechanism can do both
    simultaneously.

Is there some straightforward way that we can use retained syntactic information to determine a nesting relation? What properties do we want for the nesting?

1] An environment which is referenced uplevel must be reachable by tracing
   up the nesting tree from the referencing environment. This guarantees
   that we can find any variable by scanning up a single static link.
2] An environment must always be allocated when any environment nested
   within it is allocated. This guarantees that any environments that we
   trace through to find a variable will be allocated when we reference
   them.
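A small hand-worked example of the situation these properties are about:

    (defun outer (x)
      (flet ((inner () (1+ x)))    ; X is referenced uplevel from INNER
        (inner)))

The reference to X in INNER is an uplevel reference; property 1 says OUTER's environment must be findable by tracing up from INNER's, and property 2 says it is still allocated whenever INNER's environment exists.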
A useful piece of syntactic information is the lambda that encloses the code for an environment. This can be determined if blocks are annotated with the lambda they were originally in. We can satisfy 2 if we use the environment for the enclosing lambda as the outer environment, since the environment for the lambda's variables must be allocated before any code in it can be evaluated. It is not quite so obvious that 1 will be satisfied. If lambda variables are the only things allocated, then it is pretty obviously true, since the only things semantically accessible would be variables in enclosing lambdas.

Temporaries introduced by the compiler don't necessarily have to follow this rule. I guess that we have to be careful about what environment we allocate compiler temporaries in when the temporaries must live across an environment transition. For any such temporaries, we must know the deallocation environment in addition to the allocation environment. It is not clear that there are any such temporaries. Most temporaries will be consed at the beginning of the code for the form which needs them and will be deallocated at the end. The beginning and end of a call will always be in the same environment. (???)

Kinds of temps:
    Continuation value
    Environment pointer
    Cleanup context/dynamic context save
    Copy TNs
    Common subexpressions/loop invariants
    Preload TNs
    Save TNs
    VOP temps and load TNs
    Specbind and catch frames

Environment analysis determines the time at which storage is allocated for lexical variables and where this storage is allocated. Variables are allocated either as TNs or as heap closure variables. With TNs we must determine which context is specified as the potential stack location and when this context is allocated. With closure variables, we must determine in which closure the variable is allocated and how we get to that cell. In both cases we attempt to combine environment creations by moving them backward and upward in the code. The legality of such motion is determined by using loop-dominator information. We can move the allocation of any variable to the head of the innermost loop which contains its allocation. If a stack variable is allocated and deallocated in the same loop, then we can move its allocation out as far as we please. Allocating all such variables in the outermost environment is reasonable, although this could cause unnecessary uplevel references.

Initially, we guess that each loop which has stuff that can't be moved out has its own environment, but some of these get combined eventually. If any environment references another environment, but is not nested within that environment, then we combine the two environments. This mashes together the environments for functions which are broken up into multiple loops due to recursion. [Or can each environment be associated with a Lambda so that every loop directly nested within the same lambda will have the same env?] It is an invariant that the allocation point for a variable must dominate the original cons point.

In the case of closure variables, the desirability of combining allocations is less clear. If we blindly moved all allocations to the beginning of the loop which encloses them, then we might allocate the variables even when the body may conditionally not use them. Probably what we should do is scan up the dominator tree until we find either the loop head or an IF, and then cons the variables there.
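For instance (a hand-worked example), in

    (defun foo (x)
      (when (plusp x)
        (mapcar #'(lambda (y) (+ x y)) (bar x))))

the closure cell for X is only needed in the consequent of the WHEN. Hoisting its allocation all the way to the head of FOO would cons even when X is not positive; scanning up the dominator tree stops at the IF, so the allocation lands just inside the test.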
Stack variables have special significance, since all values that are not in registers are either on the stack or pointed to by the stack. A bunch of stuff on the stack which shares a common base pointer is called a context. We must have the base pointer for the context before we can access anything in it, so we must go to some trouble to make sure that the base pointer is available.

A context is represented by the Context structure. Each lambda has pointers to the contexts where its heap and stack variables are consed. Each block has a pointer to the context(s?) allocated at the beginning and deallocated at the end. [Can there be more than one? Probably not...] Each TN has a pointer to the context that it is allocated in. Pack needs to know something about contexts, since each context effectively has a different set of storage bases.

An uplevel reference is a reference to a variable which is allocated in a different context than the one for the reference. We find uplevel references by scanning all the vars for all the lambdas, looking for references which are in a different context than the cons point for the var.

An environment which is the current environment for an entry point is called a closure environment. If there is a closure environment in the nesting between an uplevel reference and the referenced environment, then the variable must be a closure variable.

#| An uplevel ref accesses the save location if the TN is packed in a register. Closure slots are allocated as TNs, but they may not be directly accessed; Closure-Ref must be used. Saving and restoring the previous context pointer is part of the environment allocation/deallocation process. [ENV also? Only for closure environments?] |#

A possible optimization is to allocate closures on the stack when they are known not to be passed upward. The main use of this would be for functional arguments to functions known to be "safe". As far as the rest of the compiler is concerned, there is little difference between heap closures and stack closures. We just use different VOPs for hacking them.

Stack management:

Stack allocation of objects is made explicit in IR2. All storage is allocated using TNs. Lexical variables and temporaries are packed individually on the stack if they don't end up in registers. Some uses such as Catch require a chunk of several cells allocated contiguously. This is done by allocating a composite TN which represents the entire chunk, and then allocating the cells inside as individual component TNs.

In the case of TNs that have no particular location, it is easy for Pack to find some place to stick the TN. TNs that must be on TOS at some point in time are somewhat more problematical. We can deal with this by restricting the SC for the appropriate TN-ref to be a special TOS SC. In the case of fixed size TNs, Pack can pack down from TOS just as easily as it packs up from the FP. Dealing with variable size TNs is more of a bitch, since there aren't any component TNs in the TN that determine its lifetime. We probably need to explicitly indicate the lifetimes of such TNs using alloc and dealloc ops which take the entire composite TN as a result or argument.

Codegen is the place where the actual stack hacking code is generated. Pack provides information about the allocation for Codegen to use. An example of such info is the amount of stuff allocated at a given point.
Pack is provided with appropriate base TNs to use in specifying stack locations, but codegen is responsible for ensuring that these registers have meaningful values before it tries to load stack TNs. The base TNs may also be explicitly frobbed by local call code emitted in IR2 translation. [How is info passed from pack to codegen? If we know the TOS TN which is on the top, then we can determine how much stuff there is on the stack. We can determine the top TN at the beginning of each block by looking at each live TN and seeing which is topmost. Pack will need to have this information anyway.]

In the case of incoming stack arguments, we just set up the base pointer to point to them and then force allocation of a composite TN which holds them.

IR2 Function call:

Local functions which are called in only one place are simply inserted inline. The variables are initialized using Move VOPs.

Various implicit arguments to functions may need to be made explicit in IR2. In local function call there is no implicit environment manipulation. If the current context needs to be restored on return, then the previous context is passed back as an explicit result. If the static link needs to be passed in, then it is passed explicitly. This determination of implicit arguments happens during environment analysis.

On both call and return some care needs to be taken to ensure that the TNs in which the values are passed are accessible in both the caller and the callee. There is no problem when the caller and the callee have the same context or when the values are passed in registers. In other cases, we pass the extra values on the top of the stack. This shouldn't happen very often if there is a reasonable number of registers. [Or we could pass them in arbitrary stack TNs (allocated where?) and use the static link to get at them.]

The full function call mechanism must effectively be a subset of the local call mechanism, since the two mechanisms must mesh at entry points and full function calls.

Local function call is done with the Call-Local VOP, which represents only the control aspects of the call. Each function used for local call has a list of TNs (determined by TN Assignment) which are the locations used for passing the arguments. The actual arguments are moved into these TNs using Move VOPs. Call-Local jumps to the function, passing the return PC for the continuation as an argument in a supplied TN. The continuation for the call is represented by the continuation we continue at. A tail-recursive local call doesn't need a VOP to represent it, since it is just a jump.

The Allocate-Context VOP deals with setting up stuff at the beginning of a local function. It allocates stack stuff and sets up the current context, returning the old current context as a result. This is basically a placeholder for Codegen, which will deal with any necessary stack-frobbing.

The Return VOP takes the return PC and the previous context as arguments. It deallocates the current context, restoring the old one, and jumps back to the location indicated by the return PC. Restoring the context must be folded into return since the return PC might be stored on the stack. Return-Values takes an additional arg which is a stack TN containing stack values that need to be returned. The values are BLT'ed down to the appropriate place. There is no need for separate local and full return VOPs.

Simple-Return is similar to Return, but doesn't do any environment deallocation. It is used for returning from local functions which don't have their own environment. There is no need for a Values variant, since there is no stuff on the stack that needs to be squeezed out.
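Putting the pieces together, the code for a one-argument local call might look like this (the notation and TN names are invented for the example):

    ;; in the caller, for (f x) continuing at C:
    (move x-tn f-arg1-tn)          ; move argument into F's passing TN
    (call-local f-entry pc-tn)     ; jump to F, return PC passed in PC-TN
    ;; at F's entry:
    (allocate-context old-cont-tn) ; set up F's context, save the old one
    ;; ... body of F ...
    (return pc-tn old-cont-tn)     ; pop F's context, jump back via PC-TN

A tail-recursive call would omit the return PC setup and simply jump, and a function with no environment of its own would skip Allocate-Context and use Simple-Return.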
There is no need for a Values variant of Simple-Return, since there is no stuff on the stack that needs to be squeezed out.

If a local function returns a known number of values which is less than the number expected by the caller, then additional code must be inserted at the return site which sets the unused values to NIL.

MV-Calls can be treated as normal calls whenever the number of values returned by each value form is known. Calling of known lambdas of the form created by Multiple-Value-Bind can also be converted as a special case.

A full call turns into some kind of Full-Call VOP. There are different VOPs for calling named functions and closures. We also have tail-recursive full call VOPs.

Whenever we call a function which may return stack values, we must do something to receive or dispose of the values. There are three options:

Default-Values
    This VOP is used when we want to receive a specific number of values. It takes the passing TNs, a stack TN and a desired value count as arguments. Values not returned are defaulted to NIL and excess values are discarded. The stack TN describes the place where the stack values are allocated.
Receive-Values
    This VOP is used when we want to receive an unknown number of values. It takes the values-present flag, Nargs and a stack TN. The TN is set up to describe the returned values. [could set Nargs in the 1 val case as a side effect]
Discard-Values
    This VOP simply discards any stack values.

Usually we will have to have one of these VOPs after every full call, but if we know there are no stack values then the VOP can be omitted.

Generating error checks:

We don't want to treat error checks like normal code: (unless win (error ...)), since this would murder the control structure of the code. What we do is pretend that an op like (assert-type x fixnum) is a normal operation which has no side-effects and returns the tested object. Filtering the object through the assertion allows it to be effectless so that it can be moved out of loops, without letting the check disappear entirely. This also provides an easy interface to correctable errors.

The exact representation of type checks in IR2 will be fairly implementation dependent, since it will depend on the specific type system for the given implementation. For example, if an implementation can test some types with a simple tag check, but other types require reading a field from the object in addition, then the two different kinds of checks should be distinct at the VOP level, since this will allow the VOP cost and storage information to be more accurate. Generation of type checks should be factored out of code which would otherwise be more portable. In most cases, type checks should be generated automatically without any intervention being needed on the part of the translator, since the appropriate type assertions should already be manifest in the IR1.

Converting conditionals into IR2:

We probably want to consider control transfer to be the fundamental result of comparison, rather than anything such as a condition code. Although most compilers with whizzy register allocation seem to explicitly allocate and manipulate the condition codes, it seems that any benefit is small in our case. This is partly because our VOPs are at a somewhat higher level, making it difficult to tell which VOPs do and don't trash the CC. Explicitly incorporating condition codes in our VM also introduces another architecture dependency.
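Returning to error checks for a moment, here is a source-level sketch of the filtering idea; this ASSERT-TYPE macro is just an illustration of the semantics, not the compiler's actual internal operation:

    (defmacro assert-type (form type)
      ;; From the flow graph's point of view this is effectless and
      ;; returns the tested object, so uses of the value downstream can
      ;; be rewritten as uses of the filtered result, and the check can
      ;; be moved out of loops without disappearing entirely.
      (let ((val (gensym)))
        `(let ((,val ,form))
           (unless (typep ,val ',type)
             (error "~S is not of type ~S." ,val ',type))
           ,val)))

    ;; A use such as (svref v i) becomes (svref v (assert-type i fixnum)).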
At the IR2 level, we have a class of IF-XXX VOPs which transfer control to one of two places on the basis of some test on the operands. When generating code for IF, we peek at the predicate and special-case it if we know how. Otherwise, we just emit an IF-TRUE and compile the predicate normally.

If a builtin predicate appears in a non-branch context, then we effectively emit an (IF pred T NIL). This probably wants to be done during IR2 conversion rather than as a transformation on IR1, since in some cases the operand type will have to be taken into consideration when determining whether an operation can be open-coded, and is thus built-in.

N-arg predicates such as < must be transformed into two-arg versions in IR1, since they will result in hairy conditionals that want to have IF optimizations done on them.

Converting Catch and Unwind-Protect:

Catch, Unwind-Protect and deep-binding specials all work in pretty much the same way. We make a stack TN to hold the catch frame or whatever, allocate TNs in it to represent the slots, and then initialize them. The frame can be explicitly linked in by TN manipulations, since the active catch and whatnot are represented by TNs. Since allocation of the frame is decoupled from linking and unlinking, some of this stuff could be moved out of loops. We will need a VOP for loading the PC for an arbitrary continuation so that we can set up the reentry PC. [Could be done with Call-Local? glag...]

VOP representation:

As far as IR2 is concerned, it doesn't matter whether a VOP is implemented as a miscop or as inline code. A miscop is just a VOP which requires its args to be in the miscop linkage registers. This allows a VOP's implementation to be changed in fairly arbitrary ways without affecting its interface.

Global functions that are special-cased by the compiler are translated into VOPs in IR2. Functions with a fixed number of args can usually be translated into a single VOP by a double dispatch determined by policy and operand types. In the actual conversion pass, the main activity is to choose some generic operation in the virtual machine such as FIXNUM-+-FIXNUM=>FIXNUM, and then to choose a specific VOP on the basis of the policy settings. The policy-driven choice of a specific VOP to implement the generic operation can easily be made table-driven. Note that it is important to make the policy choice in IR2 conversion rather than in code generation, because the cost and storage requirement information which drives TNBIND will depend strongly on which actual VOP is chosen. In the case of FIXNUM-+-FIXNUM=>FIXNUM, there might be three or more implementations, some optimized for speed, some for space, etc. Some of these VOPs might be open-coded and some not. It might also be useful to allow a generic operation to turn into a sequence of VOPs, but that would make automation of the selection process harder.

Conversion of references to constants happens in a very similar way to variable conversion. Each distinct Constant gets assigned a TN. This TN is wired down in an SC which is determined by its type. In addition to arbitrary constants kept in the literal pool, we also will have SCs for each significant kind of immediate operand in the architecture. On the Romp, 4, 8 and 20 bit integer SCs are probably worth having. Since the constant is represented by a TN, copy generation can easily attempt to cache the constant in a register. Pack will use cost information to determine whether it is in fact desirable to make the copy.
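A sketch of what the table-driven, policy-based choice of a VOP for a generic operation might look like; the table contents and VOP names here are invented for illustration:

    (defparameter *vop-choice-table*
      ;; generic operation -> alist of (policy . vop)
      '((fixnum-+-fixnum=>fixnum
         (:speed . fast-fixnum-+)        ; open-coded, no frills
         (:space . small-fixnum-+)       ; compact, perhaps a miscop call
         (:safety . checked-fixnum-+)))) ; open-coded with overflow check

    (defun choose-vop (generic-op policy)
      (let ((entry (assoc generic-op *vop-choice-table*)))
        (or (cdr (assoc policy (rest entry)))
            (error "No VOP for ~S under policy ~S." generic-op policy))))

    ;; (choose-vop 'fixnum-+-fixnum=>fixnum :speed) => FAST-FIXNUM-+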
If a VOP can easily fold an immediate constant into its code sequence, then this will be reflected in a low cost for that operand when it is in an appropriate immediate SC. This will prevent caching of constants when it isn't worth the effort. Even if a constant is cached, a given code generator can decide to use an immediate representation instead.

If the code emitted for a VOP in the non-constant case is very different from the constant case, then it may be desirable to special-case the operation in IR2 conversion by emitting different VOPs. An example would be if SVREF is only open-coded when the index is a constant, and turns into a miscop call otherwise. We wouldn't want constant references to spuriously allocate all the miscop linkage registers on the off chance that the offset might not be constant. In PQCC terminology, our VOPs don't represent templates that are very different. If we must choose between significantly different code sequences to implement an operation, then the choice should be made during IR2 conversion. The alternative is to make pessimistic assumptions about the TN requirements for the VOP.

VOPs that want to be open-coded:
    Special set and ref. Named constant ref?
    CAR, CDR, SET-CAR... We could just have a Cell-Ref VOP which grabbed stuff out of boxed objects, but that wouldn't let us specify interesting side-effects.
    Move
    Uplevel-Reference: a reference to a TN consed in a different function.
    Length for simple vector, string, ...
    Simple vector accessors and setters.
    Fixnum = < > + -. Special versions of + and - which do overflow checking. Fixnum logical operators.
    Closure-Ref, Closure-Set
    Structure-Ref, Structure-Set: so we can do better side-effect analysis.
    EQ.
    System refs and sets.
    Change type. Type predicate. Type check. Bounds check.
    Constant arg versions of some VOPs which take fixnum args.
    Call-Named Tail-Call-Named ;Call-Closure ;Tail-Call-Closure
    Call-Local Return Return-Values

VOPs not open-coded:
    Special bind and unbind.
    Allocation, hairy arithmetic, and other function-like ops implemented in assembler.
    Throw, Unwind.

COPY GENERATION

This optimization is optional, but is desirable if either space or speed is more important than compilation speed.

One problem with TNBIND is that each TN is packed in the same place for all time, whereas the best place for a TN may vary from one place in the code to another. To reduce this problem, we introduce new TNs which are copies of the original TN. Since the lifetime and uses of the copy are more restricted than those of the original, the copy may be packed in a way which better reflects the local demands of the code. Heuristics are used to guess good places to introduce copies.

Copies may either be clean or dirty. A clean copy is never written, whereas a dirty copy may be. There may be any number of clean copies for a TN, or there may be one dirty copy. References to a TN may only be replaced with references to a copy within code where the copy is valid. A dirty copy must be written back before any possible read of the original. Pack can combine some copies which we are too conservative to combine here.

If the cost of making and using a copy exceeds the cost of referencing the original TN, then the copy can be replaced with the original. For example, if the original TN is already in a register, then there is no point in making a register copy. All the copy TNs should be preferenced together, including indirect ones (does this really matter?).

A dirty copy isn't really a copy at all.
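Before getting to dirty copies, a minimal sketch of introducing a clean copy (the loop case, candidate 1] below); references are modeled as mutable cells holding a TN, and all the names are hypothetical:

    (defstruct tn name (copies '()))

    (defun introduce-clean-copy (original loop-refs emit-at-loop-head)
      ;; Make a new TN, emit a Move from the original at the loop head,
      ;; and redirect the in-loop references to the copy.  A clean copy
      ;; is never written, so no write-back is needed.
      (let ((copy (make-tn :name (gensym "COPY"))))
        (push copy (tn-copies original))
        (funcall emit-at-loop-head (list 'move original copy))
        (dolist (ref loop-refs)
          (setf (car ref) copy))
        copy))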
A dirty copy moves the real value of the TN somewhere else within a chunk of code, and then moves it back when we reach the end. Probably we should only contemplate a dirty copy in highly constrained circumstances, such as a loop that doesn't make any function calls.

Here are some good candidates for copies:
1] In a loop, references to TNs may be replaced with references to a copy for the duration of the loop. This is especially easy with TNs that represent constants.
2] When the desired SCs for references clash, it may be a win to replace some of the references with references to a copy. This is pretty vague and I'm not sure there's any non-perverse code in which this would be a win, especially if loops get copies anyway.

#| This is a possible implementation, but it would be much less efficient than a special-purpose implementation... Clean copies can be done cleanly and with a high probability of correctness by making a distinct copy for every reference and then letting global common subexpression elimination fold the copies together wherever possible. Common subexpression and loop-invariant optimization will have to be fairly careful about combining copies, since if they really combine all possible copies they may largely defeat the purpose of making copies in the first place. Probably common subexpression should only combine copies whose users have the same best SC, so we won't combine copies that should get packed into different SCs. Loop-invariant optimization should be careful not to combine copies of a TN for an inner loop with those for an outer loop, as that would make copying the TN an all-or-nothing proposition, instead of possibly being done only in an inner loop. Probably the way to represent this is to have the inner-loop copies be copies of the copy, thus making the outer-loop copies not the same expression as the inner-loop copies. |#

LOOP AND GLOBAL COMMON SUBEXPRESSION OPTIMIZATION

Any of the standard compiler optimizations can be done here: common subexpression elimination, loop invariant detection, etc. Flow analysis of all TNs. Common subexpression and loop invariant optimization can have the same effect as caching reads of special and closure variables, because these operations are expressions like any other.

An expression is a VOP. A VOP is a common subexpression if it has the same operation and argument TNs as another VOP. Only effectless VOPs are candidates for common subexpressions. An expression is "killed", disabling its reuse, when either one of the argument TNs is modified or a side effect which affects its result happens. Large expressions may be combined, but the combination will happen one VOP at a time. Probably it would be desirable to modify IR2 to make common subexpressions more apparent. A possibility would be to have a separate Operation structure which holds the basic operation and arguments, but not the results. We would then arrange to make all identical Operations EQ.

Common subexpression elimination should be aware of the cost of saving the result vs. recomputing it. This is somewhat sticky, since the cost of saving the result depends mostly on the SC in which the TN gets packed, which we don't know. The most important case is probably avoiding combining expressions over a function call when recomputing the expression is cheaper than two memory references. I guess that the general idea is that some expressions are worth saving in memory and others aren't.
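A minimal sketch of detecting VOP-level common subexpressions by keying on the operation plus argument TNs; VOPs are modeled as lists of the form (operation result arg...), killing is omitted, and the effectless test is passed in:

    (defun find-common-subexpressions (vops effectless-p)
      ;; Two VOPs are the same expression when they have the same
      ;; operation and argument TNs; the result TN is excluded from the
      ;; key.  Only effectless VOPs are candidates.  A real version
      ;; would also kill entries when an argument TN is written or a
      ;; relevant side effect happens.
      (let ((seen (make-hash-table :test #'equal))
            (matches '()))
        (dolist (vop vops (nreverse matches))
          (when (funcall effectless-p (first vop))
            (let* ((key (cons (first vop) (cddr vop)))
                   (prev (gethash key seen)))
              (if prev
                  (push (cons vop prev) matches)
                  (setf (gethash key seen) vop)))))))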
The idea that only some expressions are worth saving could be expressed by having function call kill the expressions which aren't worth saving. There is also the hidden cost that retaining a common subexpression can tie up registers that would be better used.

When combining copy TNs, we have pretty good interaction with Pack, since it can avoid doing the copy if it turns out to be a bad idea. With other common subexpressions we don't have this out. Probably PQCC has addressed this problem iff there is a reasonable solution. This may be a place where the estimating of storage requirements comes in: if we can tell there are lots of registers available, then combining expressions becomes more reasonable.

Global common subexpression elimination can be made to do most of the work of loop-invariant optimization. Once you determine that it is legal to evaluate an expression before the loop starts, you just insert a copy there. Subexpression elimination will replace copies within the loop when possible. If it turns out that the expression can't be used in the loop, then the copy at the beginning will get deleted since it is unused. There are two noticeable problems with this solution. One is that we have to be careful that we only try each possible invariant once, since we could get in a loop otherwise. The other is that speculatively moving every expression to the beginning of the loop would be pretty inefficient. Probably we want the code which checks for the move being legal to check for the move being likely to succeed before doing anything else.

[A loop-invariant optimizer not based on global common subexpression would be much more efficient and would probably get most of the speed win of the two combined without having to implement GCS at all. This is because the common subexpressions would be moved out of loops even though they aren't folded together.]

TN LIFETIME ANALYSIS

This phase uses flow analysis to find the live TNs at each block boundary. We determine this by backward flow analysis on the entire flow graph. Although somewhat counter-intuitive, it seems to be possible to do live TN analysis on the entire flow graph without worrying about the fact that on some paths the same TN will refer to different variables due to recursion.

SAVE GENERATION

We use a caller-saves convention. Whenever we transfer control to code in a different environment without deallocating the current environment, we save all the TNs allocated in the current environment that are live at the return point. The control transfer will always be a call, since any other transfer (i.e. a GO) would deallocate the current environment. The saving and restoring is emitted explicitly using Move VOPs. Note that the saves don't have to be done if the TN gets packed on the stack. We can get Pack to do the Right Thing for us if we require the save TN to be on the stack, and then preference the original TN to it. Save TNs probably want to be magical, in that we don't want to have to dummy up the lifetime and conflicts information for them; it is both hard and pointless, since the lifetime of the save TN lasts the full time that the context is allocated.

If we do any saves, then we need to redo part of lifetime analysis, since the lifetimes for TNs live across calls will be wildly inaccurate. Once the TNs are saved, the locations that they occupy are free for allocation of other TNs within the called functions.
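A sketch of the backward flow computation in TN LIFETIME ANALYSIS above, in the standard iterative form; the block structure here (LBLOCK) and its read/write sets are simplified stand-ins for the real IR2 blocks:

    (defstruct lblock succs reads writes (live-in '()) (live-out '()))

    (defun compute-live-tns (blocks)
      ;; Iterate to a fixpoint:
      ;;   live-out(B) = union over successors S of live-in(S)
      ;;   live-in(B)  = reads(B) union (live-out(B) - writes(B))
      (loop with changed = t
            while changed
            do (setf changed nil)
               (dolist (b blocks)
                 (let* ((out (reduce #'union (lblock-succs b)
                                     :key #'lblock-live-in
                                     :initial-value '()))
                        (in (union (lblock-reads b)
                                   (set-difference out (lblock-writes b)))))
                   (unless (and (subsetp in (lblock-live-in b))
                                (subsetp (lblock-live-in b) in))
                     (setf changed t))
                   (setf (lblock-live-out b) out
                         (lblock-live-in b) in)))))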
Additional lifetime bogosity is due to standard flow analysis not realizing that the function must return to the exact call site, rather than arbitrarily returning to any of the call sites. This causes paths which cannot be traversed to be considered, making it appear as though all TNs which are live at any call are live at all calls. Furthermore, on the spurious paths, the bogusly live TNs will probably never be initialized, and thus will appear to be live all the way from the return site to the root of the graph. What we do is change the blocks where restores are done to indicate that the restored TNs are killed by the block. We then subtract all TNs which we generated saves for from the live TNs at the end of all blocks. We can now redo the propagation step of live TN analysis.

A useful invariant to remember is that every TN consed in a context is dead at context entry, since every TN is always initialized before it is referenced. Since we have the lifetime information already, this is an easy but powerful consistency check.

CONFLICT ANALYSIS

For each TN, find the set of TNs that have lifetimes which overlap with its own lifetimes.

This may be a good place to estimate the number of times each TN will have to be saved and restored due to full function calls. Each live TN at a full call has the saving cost (adjusted by loop depth) added to the register allocation cost.

For purposes of conflict analysis, there are five kinds of TNs:
1] Argument TNs
2] Load TNs for arguments
3] VOP temporaries
4] Load TNs for results
5] Result TNs

Argument and result TNs are not considered to conflict, allowing the locations for argument TNs to be recycled for results. This requires that the writer of the code generator guarantee that all of the arguments are read before any results are written. In any perverse cases where this is not true, explicitly allocated temporaries must be used to avoid trashing arguments. Temporaries are considered to conflict with both arguments and results. Argument load TNs conflict with argument TNs and temporaries. Similarly, result load TNs conflict with results and temporaries. This allows argument load TNs to be recycled for result load TNs. We need to annotate each VOP with the conflicts for load TNs that Pack might want to allocate.

PRELOAD GENERATION

On machines which pipeline memory fetches, it may be desirable to move memory reading operations backward in the code when it is possible to do so. This will allow the memory fetch to proceed in parallel with other computation. We use a scheme reminiscent of copy generation to do this. What we do is move the read back an appropriate distance, and then specify a Preload TN as the result TN. A Preload TN has a single writer and a single reader; if the original result TN is more complex, then an explicit move is placed between the Preload TN and the original result TN. Pack is allowed to move the writer VOP as far forward as the reader VOP. The understanding is that Pack will move the writer VOP as far forward as necessary for the Preload TN to be packed in a good location. The preloading could be done simply on a go/no-go basis, with the VOP being either left in place or moved immediately before the reader. The preload TN then gets assigned as its cost the amount saved by doing the preload.

The actual generation of the preload must take into consideration both the legality of the backward move and the desirability of a given move distance.
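Going back to CONFLICT ANALYSIS for a moment, a sketch of the cross-kind conflict rules as a predicate; the kind keywords are invented, and conflicts among TNs of the same kind are left to ordinary lifetime analysis:

    (defun vop-tns-conflict-p (kind1 kind2)
      ;; Temporaries conflict with everything.  Arguments and results
      ;; don't conflict, so argument locations can be recycled for
      ;; results.  Argument load TNs conflict with arguments; result
      ;; load TNs conflict with results; the two kinds of load TNs
      ;; don't conflict with each other, so they can also be recycled.
      (flet ((pair-p (a b)
               (or (and (eq kind1 a) (eq kind2 b))
                   (and (eq kind1 b) (eq kind2 a)))))
        (or (eq kind1 :temporary)
            (eq kind2 :temporary)
            (pair-p :argument :arg-load)
            (pair-p :result :result-load))))

    ;; (vop-tns-conflict-p :argument :result)      => NIL (recycled)
    ;; (vop-tns-conflict-p :arg-load :result-load) => NIL (recycled)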
As long as the writer VOP has no effects, the legality of the backward move is easily determined by checking for VOPs which have effects on the writer VOP between the original location and the possible new location. Determining the distance to move the writer is somewhat problematical, since on one hand we want to move it as far as possible up to some fixed number of cycles, but on the other hand we don't want to make the lifetime of the Preload TN too long, since that would make it harder to pack. This is further complicated by our having to guess how many cycles a VOP will take. If a go/no-go pack strategy is used, then it might be desirable to attempt to look at the lifetime data to anticipate the largest move distance that Pack would be able to deal with. This is probably a bad idea, though. If dealing with shorter than optimal preload times is important, then it would be better to make Pack attempt progressive shortening of Preload TN lifetimes.

This takes care of preload generation for operations known to do memory reads. Since we haven't run Pack yet, it is possible that operand loading may cause additional implicit reads. We can probably make a pretty good guess of which TNs are likely not to be packed into registers if we use the lifetime information. If at some point the conflict set is larger than the number of registers, then someone is going to lose. The losers will be the ones with long lifetimes and low costs. We can generate explicit preloads for uses of likely loser TNs. If the TN gets packed in a register, then the preloads will all go away. If we fail to guess an eventual loser, then we just lose the benefit of the preload. But... preloading memory TNs is probably not a very important optimization, due to all the other optimizations that put important TNs in registers. It is probably only worth implementing preloading for explicit reads, in which case we probably don't need the lifetime information, and this phase could be moved earlier in the compiler if there was any reason to do so. [We could get Pack to help more here, since Pack now knows about load TNs and loading.]

To have maximally tense array hacking, we should also consider how to use preloading in interesting ways for arrays, but this would interact with things like loop induction variable determination and other loop optimizations, since we can do a lot better if we know what the next index is going to be ahead of time. This would allow us to fetch the n+1'th element while processing the n'th. Note that it is harmless to fetch ahead an element which ends up being unused due to loop exit. Tense loop/array optimization is something which I haven't considered much in this design. In any case, these optimizations should happen somewhere before here.

PACKING

Assign TNs to SCs using cost and lifetime info.

In a full function call, we probably want to make the saving and restoring of registers at least semi-explicit so that Pack can tell whether it is worth packing a TN in a register. It may be that the cost of saving and restoring around calls outweighs the benefit of being in a register in other places. Probably a reasonable solution is to assign to each TN a save cost based on the number of full function calls at which it is live, modified by the appropriate loop-depth factor.

There is an argument for being extremely reluctant to put a TN on the stack when it can go in a register.
If we actually do a full function call, the overhead of saving and restoring will probably be inconsequential compared to the cost of the function itself, so in this case the choice is unimportant. If we don't do the function call, then we will want the TNs in registers. Min-max reasoning suggests that we have little to lose and something to gain by always putting things in registers. The argument breaks down when:
1] You have calls to simple functions in your inner loops.
2] You are more worried about space than speed.
This suggests that the time cost of saving and restoring should be understated, so that TNs are only packed on the stack to avoid saving when it is pretty clearly a win. There are probably a fair number of such clear-cut cases, such as a variable which is bound outside a loop, but used only as the argument to a function within the loop.

The "assumption of infinite registers" used by PQCC seems like it may be a lose on the Romp. The assumption that they make is that they can get away without allocating registers for operand loading, under the optimistic assumption that they won't be necessary. It looks to me like they get away with it primarily because many architectures don't require register operands in most cases, so even if the arg TN doesn't make it into a register, they won't have to allocate a load register. This is definitely not the case on a load-store architecture.

A possible solution would be to get Pack to allocate operand loading TNs. What we do is have an optional SC requirement associated with TN-refs. If we pack the TN in an SC which is different from the required SC for the reference, then we create a TN for each such reference, and pack it into the required SC. In many cases we will be able to pack the load TN with no hassle, but in general we may need to unpack a TN that has already been packed. [How do we preference load TNs? We could just give them the preferences of the original TN, but we could do better if we gave them only the preferences associated with the specific use. What we could do is only count preferences to TNs which are referenced by the VOP that is getting the load TN.]

The PQCC papers make a big show of the efficiency implicit in a Pack algorithm that never backtracks, but it seems in this case that it is an efficiency non-issue, since unpacking should be relatively rare, and is not very expensive in any case. The advantage of backtracking is that it will let us have arbitrarily restricted SCs for load TNs, yet have a reasonable expectation that they will be packed in a good way. It is not necessary to introduce arbitrary copy TNs as in PQCC. This isn't really heavy-duty backtracking anyway, since we don't undo all packings made between the time of the mistake and the time we found out it was wrong --- we only undo the packings that are actually causing problems.

The only real trick in unpacking is choosing a TN to unpack which can be unpacked with minimum disruption and pessimization. We only need to consider unpacking TNs which conflict with the one we need to pack, and obviously we would prefer to unpack the least important such TN. We must avoid unpacking TNs which we may not be able to repack: TNs which are restricted to finite SCs. Once we unpack our victim and force packing of our TN, the victim will presumably be packed again soon afterward. It is possible that the victim will be repacked in the same SC, but more likely that it will get packed elsewhere and we will have to pack its load TNs.
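A minimal sketch of the unpack-and-retry idea; the TN structure, the cost model, and the treatment of every packed TN in the SC as a potential victim are all simplifications:

    (defstruct ptn name required-sc location cost)

    (defun pack-in-sc (tn sc-locations packed)
      ;; Use a free location if there is one; otherwise unpack the
      ;; cheapest victim that has no required SC, so that forced
      ;; packings can never themselves be undone.
      (let ((free (set-difference sc-locations
                                  (mapcar #'ptn-location packed))))
        (cond (free
               (setf (ptn-location tn) (first free)))
              (t
               (let ((victim (first (sort (remove-if #'ptn-required-sc packed)
                                          #'< :key #'ptn-cost))))
                 (unless victim
                   (error "SC is full of unpackable TNs."))
                 (setf (ptn-location tn) (ptn-location victim)
                       (ptn-location victim) nil))))))  ; victim repacked later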
There is no possibility of endlessly undoing and redoing a packing, since the only TNs we will force packing of have required SCs, and thus cannot be unpacked. It is possible that the load TNs for the victim will force unpacking of yet another TN, but this process should be rare and limited, since each unpacking increases the odds of successful load TN packing by replacing TNs that have long lifetimes with short-lived load TNs.

The above method for allocating operand loading TNs works just as well for result evaluation TNs, since they all work through TN-refs. We could initially use a wimpy implementation which avoids unpacking by preallocating any loading registers that might be needed.

Storage classes:

We have lots of SCs because we try to get Pack to do our representation analysis for us. TNs are annotated with the primitive type of the object that they hold:
    T: random boxed object with only one representation.
    Fixnum, Integer, XXX-Float: object is always of the specified numeric type.
When a TN is packed, it is annotated with the SC it was packed into. The code generator for a VOP must be able to uniquely determine the representation of its operands from the primitive type and the SC.

Some SCs:
    Reg: any register (immediate objects)
    Save-Reg: a boxed register near r15 (registers easily saved in a call)
    Boxed-Reg: any boxed register (any boxed object)
    Unboxed-Reg: any unboxed register (any unboxed object)
    Boxed-Cont: boxed object on the stack (on cstack)
    Immediate constant SCs

Some more SCs: (when we add a number stack)
    Word: any 32-bit unboxed object on nstack.
    Double: any 64-bit unboxed object on nstack.
    Float-Reg: float in FP register.

Even more SCs:
    Unboxed-Char: character without type code, for maximally tense character/string hacking. Char-Code and Code-Char are no-ops if arg and result are both unboxed chars. Schar doesn't need to put on the type code if the destination is an unboxed char.
    Alien-Value: some kind of hack to get Pack to determine when to cons Alien values. This would allow Alien values to be first-class Lisp objects, yet we would still compile them tensely. This would require some sort of "record" TN which would be a TN with component TNs, but the component TNs would be packed independently, rather than being required to be consecutive. [or something?]

CONTROL OPTIMIZATION

In this phase we annotate blocks with drop-throughs. This controls how code generation linearizes code so that drop-throughs are used most effectively.

One question is how much of the original sequencing in the code we should attempt to preserve in IR1 conversion. It seems that this depends on both how smart the compiler is and how dumb the programmer is. Probably we want to preserve the drop-thrus in a TAGBODY at least until we can do loop rotation and similar things.

There are basically two aspects to this optimization:
1] Dynamically reducing the number of branches taken vs. branches not taken, under the assumption that branches not taken are cheaper.
2] Statically minimizing the number of unconditional branches, saving space and presumably time.
These two goals can conflict, but if they do it seems pretty clear that the dynamic optimization should get preference.

The main dynamic optimization is changing the sense of a conditional test so that the more commonly taken branch is the fall-through case. The problem is determining which branch is more commonly taken. The most clear-cut case is where one branch leads out of a loop and the other is within.
In this case, clearly the branch within the loop should be preferred. The only added complication is that at some point in the loop there has to be a backward branch, and it is preferable for this branch to be conditional, since an unconditional branch is just a waste of time.

In the absence of such good information, we can attempt to guess which branch is more popular on the basis of the difference in cost between the two cases. Min-max strategy suggests that we should choose the cheaper alternative, since the percentagewise improvement is greater when the branch overhead is significant with respect to the cost of the code branched to. Probably the most tractable implementation of this would be to compare only the costs of the two blocks immediately branched to, since this would avoid having to do any hairy graph walking to find all the code for the consequent and the alternative. Determining branch sense on this basis is probably only called for when one of the options is quite inexpensive. [Any reason?] It would also be reasonable to discriminate against ultra-expensive functions such as ERROR.

One problem with this approach to dynamic IF optimization is that it loses when one of the options is empty, causing the target of one branch to also be a successor of the other branch, which makes the cost comparison meaningless. This probably isn't a big problem. Actually, we can detect this using dominator information: when a branch is empty, one of the predecessors of the first block in the empty branch will be dominated by the first block in the other branch. In such a case we would favor the empty branch, since that's about as cheap as you can get. When you decide to favor a branch of an IF, you probably want to favor the join by making it a drop-thru as well.

Statically minimizing branches is really a much more tractable problem, but what literature there is makes it look hard. Clearly the thing to do is to use a non-optimal heuristic algorithm. Given a spanning tree, we can get a legal set of drop-thrus by deleting all but one ingoing and one outgoing arc of each vertex. It makes no difference which one we delete as far as the number of branches is concerned. If we could somehow assign weights to the arcs which would encourage building a good set of drop-thrus, then we could use a lightest-spanning-tree algorithm. Clearly a major determinant in the weight of an arc is the number of remaining paths out of the block it comes from: the fewer, the more important it is. It makes sense to prefer the larger component when choosing among equally weighted arcs to add. Longer components are better, since each component represents two branches but consumes a bigger portion of the program as it gets longer.

Another possibility is to use an algorithm based on depth-first spanning trees. We can modify the basic depth-first ordering algorithm so that it chooses an ordering which favors any drop-thrus that we may choose for dynamic reasons. When we are walking the graph, we walk the desired drop-thru arc last, which will place it immediately after us in the DFO unless the arc is a retreating arc. We scan through the DFO, and whenever we find a block that hasn't been done yet, we build a straight-line segment by setting the drop-thru to the unreached successor block which has the lowest DFN greater than that for the block. We move to the drop-thru block and repeat the process until there is no such block. We then go back to our original scan through the DFO, looking for the head of another straight-line segment.
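A sketch of that straight-line segment construction; DBLOCK and its slots are hypothetical, and DFO is assumed to be the list of blocks in depth-first order:

    (defstruct dblock dfn succs done drop-thru)

    (defun assign-drop-thrus (dfo)
      ;; At each block not yet placed, grow a segment by repeatedly
      ;; dropping through to the unreached successor with the lowest
      ;; DFN greater than the current block's DFN.
      (dolist (head dfo)
        (unless (dblock-done head)
          (loop for b = head then next
                for next = (first (sort (remove-if
                                          (lambda (s)
                                            (or (dblock-done s)
                                                (<= (dblock-dfn s)
                                                    (dblock-dfn b))))
                                          (dblock-succs b))
                                        #'< :key #'dblock-dfn))
                do (setf (dblock-done b) t)
                while next
                do (setf (dblock-drop-thru b) next)))))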
This process will automagically implement all of the dynamic optimizations described above as long as we favor the appropriate IF branch when creating the DFO. Using the DFO will prevent us from making the back branch in a loop the drop-thru, but we need to be clever about favoring IF branches within loops while computing the DFO. The IF join will be favored without any special effort, since we follow through the most favored path until we reach the end.

This stuff will undoubtedly have some effect on the plausibility of using delayed branches. It remains to be seen what effects are good and bad and what may be done to maximize good effects. This needs some knowledge about the target machine, since on most machines non-tail-recursive calls will use some sort of call instruction, and thus should not be considered as drop-throughs.

CODE GENERATION

This should be fairly straightforward. We translate VOPs into instruction sequences on a per-block basis. Probably the only real work is dealing with the remaining semantics of functions, such as saving and restoring around full function calls and generating function entry points. The result is a list of the exact instruction sequence for a block, exclusive of final branch instructions. After code generation, all control flow is explicit.

VOPs for error checking and value receiving may expand into code which contains branches. In such cases, Codegen introduces new blocks which are linked in as appropriate. Implicit conditional branches are expressed in exactly the same way as those which implement IFs. Codegen can hack the flow graph fairly easily, since it doesn't need to worry about invalidating the flow analysis information, which is not used after this point.

Each component in the flow graph is dumped as a distinct function. If there are multiple entry points, we choose one as the main function and store all the constants in it. The other entry points will have this function object as a constant, and will grab it out when they are called.

There are a few optimizations which can really only be done on the assembly code. These would most often be done by peephole optimization. We may be able to do them by cooperation between code generation and the assembler. The most important optimization is turning branches into branch-with-execute. The assembler can do this without a full peephole optimization stage because the blocks are passed in separately.

ASSEMBLY

Lay out the blocks somehow or other, taking into account the drop-thrus. This process should attempt to minimize branch lengths and place related code together. Once this is done, the exact branch code can be generated, taking into account possible use of delayed branches. After that, all that needs to be done is instruction translation and fasl-dumping.

TARGET DEPENDENCE

There are basically two levels of target dependence. Parts of code generation will be totally target dependent, since they will deal with actual machine instructions. Other code will depend on the execution model, but should map onto many machines with little difficulty. This includes IR2 conversion and transforms which expand into sub-primitives. Some IR2 passes emit VOPs, but these VOPs will be in every implementation's IR2.

IR2 optimization has some problems, since it needs to know about the side-effects and dependencies of VOPs so that it can tell when an expression should be killed. This isn't very serious, though, since without determining what aliasing exists we can't do much better than the boolean effects used by Rabbit.
Determining aliasing relations in Lisp is pretty impossible. We can do better for lexical variables, but we have already mapped them to TNs which are independent of the particular IR2. For special variables we can just have a Special effect which all special references depend on. We could do better if we treated specials specially, but it probably isn't worth it. All we need, then, is the set of effects and dependencies for each VOP; this can easily be tabular.

KEYWORD ARGS

Someday we can have keyword entry points that allow us to do the keyword parsing at compile time for calls of known functions with constant keywords. This is pretty hard in the general case, since defaults for keywords can depend on the values of previous keywords. If there are no ugly dependencies, then we can just have a keyword entry which takes all the keyword args as normal args, and let the caller worry about defaulting. For ugly args, there can be an additional supplied-p parameter which the keyword entry looks at to decide whether to default the argument.

This also opens the way for semi-automatically eliminating keyword calls to well-known functions such as sequence functions. All we need is something like a KEYWORD-ENTRY proclamation which indicates that the specified function should have a special entry for calling with pre-parsed keywords. This will only work in conjunction with an assertion that the function interface is the same in the runtime environment as it is in the compiler. This can be provided by a FUNCTION proclamation. Actually, we could do this whenever we have a proclaimed function interface, KEYWORD-ENTRY notwithstanding. The only problem with this is that we might feel some compulsion to provide better inconsistency checking and facilities for redefinition if we automatically made keyword entries. For this external keyword entry we should just pass a supplied-p arg for every keyword, rather than having the callers have to know about and worry about the exact defaulting mechanism and the presence of supplied-p args.

MUMBLAGE ABOUT TYPES

The current definition of types in Common Lisp sucks. Let's start thinking about ways to clarify the mess.

It makes some kind of sense to base the type system on the notion of objects belonging to types, rather than on direct inclusion relationships between types. What then becomes the primary concern is the mapping between Common Lisp types and the possible implementations for those types. This notion subsumes the idea of specialization of types in the manual. It seems that the best definition of an object belonging to a type is: an object belongs to a type if it could have been created by the system when told to create an object of that type. "Telling the system to make an object of type X" ultimately involves calling some Common Lisp function which is constrained by the language definition to return an object "of type X" when used in this way.

Incomplete type specifiers cause some problems. Note that there is no way to tell the system to create an object of type SIMPLE-STRING, since the size is not specified. You must specify the length, getting something like (SIMPLE-STRING 3). We get around this problem by considering an incomplete specifier to stand for all the possible complete specifiers when we are testing an object for type membership. This mapping is from type specifiers to sets of type specifiers, and is entirely defined by the language. This mapping is quite different from the implementation mapping, which maps programs to objects.
The implementation mapping is not a mapping from type specifiers to objects, since the system may consider things not describable by a type specifier when choosing implementations. If the system can prove that your program uses an object in a constrained way, then it may choose a different representation than it would under other circumstances. This freedom to choose representations is constrained only by the places in the Common Lisp definition which require "types" to be disjoint. These constraints really mean that the representations of the indicated types must be disjoint. In the absence of a meaningful TYPE-OF, this qualification is unnecessary, since there is no way that the user can directly discern the implementation. Typep already guarantees that it will hide the implementation mapping.

Note that this definition is more restrictive than the current one in the case of "specialized arrays", since it would require that the result of make-array always belong to the specified type. There is no loss of power, since it simply makes operations well defined which weren't before.

One would think that defining types in terms of object membership would make the SATISFIES specifier less problematical. With this formulation, at least, this is not the case. There is no way to tell the system to create an object of type SATISFIES, so no object would be of type SATISFIES. One way around this problem is to say: the user will implement this type in terms of primitive types. We can assume nothing about the implementation, and will ask the user-supplied function whether he might have made it.

(subtypep x y)
    All operations defined on each possible object of type X are defined on some possible object of type Y. Alternatively, all possible implementations for an object of type X are also possible implementations for an object of type Y. The implementations of X are a subset of the implementations of Y.

Important aspects of subtypep are portable. If X is just a more fully specified version of Y, then X is a subtype of Y in all implementations. The implementation-dependent aspects enter when we involve the implementation mapping. SUBTYPEP is defined in terms of the real implementation mapping, rather than in terms of an idealized Common Lisp implementation. The only real difference is that in a given implementation, potentially distinct types may be mapped to the same implementation. Suppose an implementation had only one integer representation, causing INTEGER, FIXNUM and BIGNUM to be interchangeable. In this implementation, INTEGER is a subtype of FIXNUM.

At least for the purpose of the compiler, it makes more sense to preserve all possible type information in the program so that more checking can be done. In an implementation where SHORT-FLOAT is the same as SINGLE-FLOAT, calling a SINGLE-FLOAT function with a SHORT-FLOAT will work fine, but the program won't port. The compiler should be able to detect erroneous programs even when they will work in a given implementation. This aspect of the SUBTYPEP definition is suboptimal for some purposes. The main example is statically determining type membership for an object, given that we know some type that the object belongs to. At least in the case of floats, it is pretty clearly incorrect to say that two types are disjoint when in that implementation they are identical. The SIMPLE-ARRAY <=> ARRAY case is also interesting.
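For example, in an implementation where SHORT-FLOAT and SINGLE-FLOAT are the same, the type-folding DEFTYPE idea mentioned below might look like this (a sketch of the shape of the definition, which would be the implementation's to make, not the user's):

    (deftype short-float (&optional (low '*) (high '*))
      ;; Fold SHORT-FLOAT into SINGLE-FLOAT, preserving any bounds.
      `(single-float ,low ,high))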
Note that a program which does (TYPEP 1f0 'SHORT-FLOAT) is ill-defined unless the program has called SUBTYPEP to find out the float type relationships, or is just trying to determine the relationship by doing the test. It looks like the thing to do is to make the actual subtypep implementation as picky as the language allows, and then have each implementation define DEFTYPEs which fold together equivalent types. This type-folding could be turned off for purposes of type-checking in the compiler.

And what about TYPE-OF? Do we want to attempt this identity?
    (SUBTYPEP (TYPE-OF x) 'foo) <=> (TYPEP x 'foo)
It's certainly true that:
    (SUBTYPEP (TYPE-OF x) 'foo) => (TYPEP x 'foo)
This will definitely require changing lots of people's TYPE-OFs. (TYPE-OF 3) isn't FIXNUM, it's (INTEGER 3 3). This would definitely make TYPE-OF a bitch, since it would have to find a type as specific as the most specific one that the user could meaningfully pass to SUBTYPEP. I think that this would end up unnecessarily constraining the implementation mapping, since it would be very hard to have types whose implementations partially intersect. In any case, it's probably good for TYPE-OF to try real hard to maintain this equivalence.

INTERPRETER INTERFACE

We have to figure out how to make special bindings in the interpreter and PROGV. We want a special frob like:
    (WITH-RANDOM-SPECIALS
      ...
      (%SPECIAL-BIND sym value)
      ...)
Each %SPECIAL-BIND will add a special binding to the environment of code evaluated after it. Exiting the form will undo all special bindings made within the form. This can be implemented fairly easily as a macro when we are using shallow binding. It will just stash the initial BSP in a variable, and then do the body in an unwind-protect, calling some frob in the cleanup which unbinds to the saved BSP. This is more of a bitch if we have deep binding. We might want to do this some other way, possibly splitting the binding frobbing off from the code affected by the bindings.

Cute idea: (PROGV vars vals body) => (%PROGV vars vals #'(LAMBDA () body)). %PROGV is a totally magical function which establishes the bindings somehow, and then calls the body function. This would only be reasonable if we have stack closures.

The only residual problem is dealing with the places in the interpreter where people use the stack for temporary storage, pushing and popping it randomly. The only places that do this are the binding forms and PSETQ. In the case of the binding forms, we could use the conses which we are going to cons anyway to hold the results. This would require rearranging the code a bit, but would probably improve performance. PSETQ is more problematical, since it currently doesn't cons. Probably we could get 90%+ of the performance by special-casing fewer than N vars for some small N and consing for more. A little consing in PSETQ isn't going to kill anyone anyway, considering all the consing caused by variable bindings. If we ever reduce that consing, we can apply the same technology to PSETQ.

IR2 CONSISTENCY CHECKING

Block connectivity is similar to IR1. Verify that TN-REF linkages are correct. Check that the VOP list within each block is well-formed. Check VOP arg counts? Check for valid TN assignments w.r.t. function call. Check loop analysis?

Check cleanup generation: Could look for uplevel continuations that might have been missed.

Check environment analysis: Do basic sanity checks, such as making sure that directly referenced TNs are directly accessible in the current context.
Check for references to uninitialized TNs: We could detect some lossage by looking for TNs which are referenced before they can possibly have been written. This could easily fall out of lifetime analysis. If a TN is uninitialized at the beginning of a block, and is read before it is written, then we have a problem.

Check packing legality: To the extent that it is possible, this is a good idea. We can certainly check for things like SC restrictions being satisfied. We could also check on the legality w.r.t. TN lifetimes, but it would be extra work. We could redo the live var analysis using simpler data structures and algorithms which are optimized for legality checking. After checking each block, we check that the new analysis agrees with the old, which will guarantee that the lifetimes are globally correct. Our checking algorithm would probably be based on data structures representing locations rather than TNs. Each location would have a flag indicating which live TN is in it, if any. If someone attempts to read a different TN in that location, then we know we have a problem. The validity checking might not be much less buggy than the real version, but hopefully the bugs won't intersect.

Check accuracy of effects information: Incorrect tables of information about effects and interfaces are a major source of lossage. We could check up on this by mumbling when a supposedly effectless operation is used for effect. We could also complain if a supposedly effectless operation is explicitly replicated in user code.

Could have a grab-bag of other checks: whenever we find a problem, we try to devise a check which would have detected it.

Cost assignment checking: Probably a good idea, since bad costs would cause subtle bugs: code would work, but slowly. Can we write something to check for efficiency bugs? We can check for things like:
    TNs which aren't preferenced that we make moves between.
    TNs that are preferenced, have disjoint lives, and get packed into different LOCs in the same SC.
    Specific VOPs whose average cost is out of line with initial estimates.
    Code sequences containing many VOPs with above-average costs.
We could have a smarter, slower pack algorithm which tries to find better packings. If the better packing is enough better, then we complain.

OBSERVED BUGS AND POSSIBLE FIXES

Some bugs gleaned from the change logs of existing compilers, with possible ways to prevent them from occurring in this compiler:

Generators emitting bad assembler format: Make assembly emission be done with a macro that checks the instruction for validity at compile time.

Random address arithmetic being done wrong with magic numbers: Use only named constants and well-defined named functions in such expressions. Examples of functions are:
    words=>bytes (* x 4)
    byte-index (+ x y)
    skip-header (+ x )
No magic numbers in generators anywhere.

Bugs with stack simulation: Let Pack worry about it.

Generators putting unboxed things in boxed registers and vice versa: Mainly a problem with temporaries used within the code for the VOP. This is most likely to happen in code which deals with boxing and unboxing numbers. Factor this code out and get it right. We can have a frob that will tell us all VOP generators that request unboxed TNs, so that these can be given an extra-careful going over.

General usage of wrong registers (magic register numbers): Make generators get their temporaries by requesting them as TNs. The declarative VOP info will cause them to be allocated and passed in.
The requirements are specified in terms of SCs rather than magic registers. This should give us some freedom to move things around without having to change all the generators.

Problems with inadequate restrictions being specified on TNs: Have conservative defaults?

Functions that pass around pieces of LAP, such as operand locations, cause problems with callers not being up on the syntax of what is going on: Don't pass around syntax; pass around structures that represent semantics and can be checked.

Broken generators fail to return the results and similar stuff: Check that all the args and results are actually used as operands by the generated code. We could have temporaries be classified by whether they are always used or only sometimes used, and then could check that the always-used ones are always used.

Representation selection seems to be a huge source of problems: Does having lots of SCs and getting Pack to work for us really eliminate all these problems?

Discrepancies between the interface to a transform and the real definition: This is largely eliminated by specifying a type that the call must satisfy. If we forget to handle a keyword in a transform, then it just won't get transformed. We could have a frob that compares the types for transforms to the types for the definitions.

Other confusions between different implementations of the same thing: Macros vs. special forms. Functions vs. source transforms vs. transforms vs. open coding. Do they do the same thing? Are the differences gratuitous? Are we using the one we want? Do we have all the versions that we need?

If the same information is kept in more than one place, then it is surely wrong somewhere: Don't duplicate information. Check that duplicates are consistent.

When decorating the representation, we forget to fill in some slots, and then garbage is interpreted in some random way: Choose defaults that are obviously broken: *empty*

Failure to realize that some stuff must be recomputed, due to failing to realize that something has been invalidated or failing to propagate the invalidation.
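Finally, going back to the uninitialized-TN check under IR2 CONSISTENCY CHECKING: a sketch of how it could fall out of the lifetime information. The block representation and the (operation result arg...) VOP shape here are assumptions:

    (defstruct cblock vops live-at-head)

    (defun check-uninitialized-reads (block)
      ;; A TN read in this block must either be live at the block head
      ;; or have been written earlier in the block; anything else is a
      ;; read of an uninitialized TN.
      (let ((written '()))
        (dolist (vop (cblock-vops block))
          (destructuring-bind (op result &rest args) vop
            (declare (ignore op))
            (dolist (arg args)
              (unless (or (member arg (cblock-live-at-head block))
                          (member arg written))
                (error "Read of possibly uninitialized TN ~S." arg)))
            (push result written)))))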