Contents:

    Todo
    Phases
    IR1 CONVERSION
        Hairy function representation
        IR1 representation of Catch and Unwind-Protect
        Block compilation
    LOCAL CALL ANALYSIS
    FLOW GRAPH SIMPLIFICATION
    IF OPTIMIZATION
    IR1 BOTTOM-UP OPTIMIZE
    IR1 TOP-DOWN OPTIMIZE
    IR1 FORWARD FLOW
    PREDICATE PROPAGATION
    TRANSFORMATION
    EVALUATION ORDERING
    IR1 FINALIZE
    IR2 CONVERSION
        TN Assignment
        Loop analysis
        Cleanup generation
        Environment analysis
        Stack management
        IR2 Function call
        Generating error checks
        Converting conditionals into IR2
        Converting Catch and Unwind-Protect
        VOP representation
    COPY GENERATION
    LOOP AND GLOBAL COMMON SUBEXPRESSION OPTIMIZATION
    TN LIFETIME ANALYSIS
    CONFLICT ANALYSIS
    SAVE GENERATION
    PRELOAD GENERATION
    PACKING
    CONTROL OPTIMIZATION
    CODE GENERATION
    ASSEMBLY
    TARGET DEPENDENCE
    KEYWORD ARGS
    MUMBLAGE ABOUT TYPES
    INTERPRETER INTERFACE
    IR2 CONSISTENCY CHECKING
    OBSERVED BUGS AND POSSIBLE FIXES

Todo:

    More IR1 representation tweaks:
        Add more info to Leaf which tells whether a functional reference is
        in a closure context or if it is an open call??? (refinement to
        Inlinep?)
        Add support for equated continuations so that such continuations
        don't become invalidated when they are deleted.
    Fix up IR1tran to deal with new IR1 representations.
    DEFTYPE and DEFSTRUCT type stuff?
    Function-type hacking stuff for support:
        Flame-About-Redefinition
        Approximate function types
    Add stuff which sets the Leaf-Type for functionals so that we can do arg
    checking.
    Do inline substitution.
    Change TAGBODY (and BLOCK?) to preserve the drop-thrus in the original
    code.
    &REST and &KEY.
    Write an approximation to IR1-FINALIZE which checks for
    bound-but-not-referenced, type errors and similar things.
    Remaining special forms: UNWIND-PROTECT, CATCH, PROGV. Rethink specbind.
    Simple type propagation so that function type checking does more.
    TYPE-INTERSECTION. Handle recursive types.
    Write real COMPILER-ERROR and COMPILER-WARNING. Add error recovery
    points.
    Make (FUNCTION * ...) types work.
    &AUX.

In CLC, inline substitution is only allowed when the function is defined in the null environment. This is a reasonable constraint for throwing the expansion into the global environment, since it must be represented as source. However, it would be nice to allow inline expansion of local functions, which would almost always be inhibited if we required a null environment. Since the appearances of the local function must be nested within the desired environment, it should be possible to expand local functions inline even when they use the environment. We just stash the source form and environment in the Functional for the local function. When we convert a call to it, we just reconvert the source in the saved environment.

    Make definition macros totally real by having the load-time functions
    deal with clearing compiler info and similar stuff.
    Change IR1-top-level to take a vector instead of a list so that we can
    find the top-level form which is the root of our source path more easily.

Design:

Need some way for top-level code to get a special policy so that we don't waste lots of time and space trying to make it run fast.

Unclear exactly how we should represent the return from the body of a local function.
One possibility is to represent the value semantics by having a dummy continuation like we do now, but the users of this continuation appear in the uses lists for all the actual call continuations. It seems that we don't really need a back pointer from the continuation to the function being returned from, since the function has a pointer to the continuation, and can generate a return using the appropriate return address when the time comes. Alternate schemes involve some kind of explicit return node.

How do we deal with VOPs where the cost of temporaries depends on whether operands are packed in the same location? (N/N-1 arg) Local preference will encourage packing arg/result together. If more temps may be needed, we can allocate them and preference to some operand. Predicate which determines if packing is necessary?

Do we need a separate local preference mechanism? Based on TN-Refs? Link TN-Refs that want their TNs to be in the same place together in a cycle. This local preference relation affects only location selection, not the cost-driven priority determination. At the local level, the short-lived evaluation temporaries will probably get packed into registers anyway. We just want to make sure that they get packed into registers in such a way that VOP costs are minimized.

Define-VOP macro:
    Cost info
    Local preference info
    Temporaries required
    Code generator?
    effects/affected (eventually)
    Inheritance? Same as XXX, but...

How does automagic type-checking get done in IR2 conversion?

Where and how do we figure out what stuff has to be saved/restored in local call: CONT, ENV, etc.

Need to clarify the relation between functions and environments. Is an environment always identified with a function? What about functions that don't have their own environment? Perhaps we should consider context pointers to be associated with the environment, but associate return PCs with functions.

Cleanups:

Cleaning up is implicit even on the fall-through path. The *current-cleanup* is a cleanup structure. Cleanups are linked together in a data structure which represents the syntactic nesting of messed up environments and serves as a place to keep information about how to clean up. We can flush mv-prog1, since we can now send values to a continuation and then go somewhere else by putting the continuation at a block end. Exactly how does cleanup generation work? How does state-saving code get emitted?

Phases:

The structure of the compiler may be broadly characterized by describing the compilation phases and the data structures that they manipulate. The steps in the compilation are called phases rather than passes since they don't necessarily involve a full pass over the code. The data structure used to represent the code at some point is called an IR (Intermediate Representation). Two major IRs are used in the compiler:

IR1 is used to represent the lisp-level semantics of the source code during the initial phases. Meta-evaluation and semantic analysis are done on this representation. IR1 is roughly equivalent to a subset of Common Lisp, but is represented as a flow-graph rather than a syntax tree.

IR2 is used to represent the implementation of the source code on a virtual machine. The virtual machine may vary depending on the target hardware, but the IR2 representation is sufficiently stylized that most of the phases which manipulate it are portable.

Each phase is briefly described here. The ordering below is approximate, and some phases could be moved around.
For maximum optimization, the phases from Local call analysis through Transformation should be repeated until nothing new is discovered. The name of a phase is in brackets if it may be omitted to save compilation time.

IR1 Conversion
    Convert the source into the IR1 representation, doing macroexpansion and
    simple source-to-source transformation. All names are resolved at this
    time, so we don't have to worry about name conflicts.
Local call analysis
    Find calls to local functions and splice them into the call graph so
    that we can do flow analysis.
Flow graph simplification
    Eliminate unreachable blocks and compute depth-first ordering.
[IF optimization]
    Various optimizations of control structure based on "source to source"
    IF transformation.
[IR1 bottom-up optimize]
    Do constant folding and bottom-up type propagation.
[IR1 top-down optimize]
    Eliminate code that computes unused values and do top-down type
    propagation.
[IR1 forward flow]
    Propagate type and constant information to variable references and
    eliminate unnecessary trivial variable bindings.
[Predicate propagation]
    Use a limited form of global common subexpression elimination to detect
    things that are known to be true due to having been tested at some
    earlier time.
[Transformation]
    Special-case calls to some known global functions by replacing them with
    a computed function. Some transformations can be done better on IR1 than
    on the actual source since argument type and constant information is
    known.
[Evaluation ordering]
    Reduce storage requirements by changing the order of evaluation of some
    function arguments.
IR1 finalize
    Do some error checks and maybe some other stuff to get ready for IR2
    conversion.
IR2 conversion
    Convert nodes into VOPs, variables into TNs, LETs into sets, ...
[Copy generation]
    Replace some references to TNs with references to copies which may get
    allocated in better places.
[Global common subexpression elimination and loop invariant optimization]
    Eliminate redundant computations of simple expressions.
TN lifetime analysis
    Determine the sets of TNs live at block boundaries.
Save generation
    Generate explicit saving and restoring code for TNs used in recursive
    local functions.
Conflict analysis
    For each TN, find the set of TNs that have lifetimes which overlap with
    its own lifetimes.
[Preload generation]
    On some machines, move memory references backward in the code so that
    they can overlap with computation.
Packing
    Assign each TN to a storage location.
[Control optimization]
    Linearize the flow graph in a way that minimizes the number of branches.
Code generation
    Convert the VOPs into assembly code.
Assembly
    Assemble the code and deal with dumping of constants and entry points
    and stuff.

IR1 CONVERSION

The simplified lisp code is converted into the first intermediate representation: nodes and continuations. The set of special forms recognized is exactly that specified in the Common Lisp manual. Other things implemented as special forms in the interpreter are implemented as macros in the compiler.

Large amounts of syntactic information are thrown away by the conversion to an anonymous flow graph representation. The elimination of names eliminates the need to represent most environment manipulation special forms. The explicit representation of control eliminates the need to represent BLOCK and GO, and makes flow analysis easy. The full Common Lisp LAMBDA is implemented with a simple fixed-arg lambda, which greatly simplifies later code.
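For example (a hand-worked illustration, not compiler output), in

    (block foo
      (if p
          (return-from foo 1)
          2))

both 1 and 2 simply become uses of the single continuation that receives the block's value. Once control and value flow are explicit, BLOCK and RETURN-FROM leave no trace of their own in IR1.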
The elimination of syntactic information eliminates the need for most of the "beta transformation" optimizations in Rabbit. There are no progns, no tagbodys and no returns. There are no "close parens" which get in the way of determining which node receives a given value.

In IR1, computation is represented by Nodes. These are the node types:

If
    Represents all conditionals.
Set
    Represents a SETQ.
Leaf
    Represents a constant or variable reference.
Combination
    Represents a normal function call.
MV-Combination
    Represents a MULTIPLE-VALUE-CALL. This is used to implement all multiple
    value receiving forms except for MULTIPLE-VALUE-PROG1, which is
    implicit.
Bind
    This represents the allocation and initialization of the variables in a
    lambda.

Some slots are shared between all node types. This information held in common between all nodes often makes it possible to avoid special-casing nodes on the basis of type. This shared information is primarily concerned with the order of evaluation and the destinations and properties of results. These things are represented by continuations.

The Continuation structure represents stuff that is sufficiently related to the normal notion of a continuation that naming it so seems sensible. Basically, a continuation represents a place in the code, or alternatively the destination of an expression result and a transfer of control. These two notions are bound together for the same reasons that they are related in the normal functional continuation interpretation.

A Continuation may be deprived of either or both of its value and control significance. If the value of a continuation is unused due to evaluation for effect, then the continuation will have a null DEST. If the NEXT node for a continuation is deleted by some optimization, then NEXT will be :NONE.

The Block structure represents a basic block, in the normal sense. Control transfers other than simple sequencing are represented by information in the Block structure. The continuation for the last node in a block represents only the destination for the result.

It is very difficult to reconstruct anything resembling the original source from IR1, so we record the original source form in each node. The location of the source form within the input is also recorded, allowing for interfaces such as "Edit Compiler Warnings".

Forms such as special-bind and catch need to have cleanup code executed at all exit points from the form. We represent this constraint in IR1 by annotating the code syntactically within the form with the node that must be cleaned up. If the cleanup for a node's continuation is different from that for the node, then some cleanup code will probably have to be emitted during IR2 conversion.

Someday we may want to change IR1 conversion so that it changes the current component occasionally so that top-level forms don't get too bloated. We would then need a way to keep track of the multiple top-level components and make sure that they get dumped in the right order.

Hairy function representation:

Non-fixed-arg functions are represented using Optional-Dispatch. An Optional-Dispatch has an entry-point function for each legal number of optionals, and one if rest args are present. Each entry point function is a simple lambda. The entry point function for an optional is passed the arguments which were actually supplied; the entry point function is expected to default any remaining parameters and evaluate the actual function body.

If no supplied-p arg is present, then we can do this fairly easily by having each entry point supply its default and call the next entry point, with the last entry point containing the body. If there are supplied-p args, then the entry point function is replaced with a function that calls the original entry function with T's inserted at the positions of all the supplied args that have supplied-p parameters.
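As a hand-worked illustration of the no-supplied-p case (the entry-N names are invented for the example):

    (lambda (a &optional (b 5) (c (+ a b))) xxx)

    entry-1: (lambda (a) (entry-2 a 5))           ; one arg supplied
    entry-2: (lambda (a b) (entry-3 a b (+ a b))) ; two args supplied
    entry-3: (lambda (a b c) xxx)                 ; all supplied; has the body

Each entry point is a simple fixed-arg lambda that supplies one default and calls the next.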
Keyword args are represented using rest args, but we want to keep around some sort of information so that we can open code keyword calls where possible.

IR1 representation of Catch and Unwind-Protect:

We represent CATCH using the lexical exit mechanism. We do a transformation something like this:

    (catch 'foo xxx) ==>
    (block #:foo
      (%catch 'foo #'(lambda () (return-from #:foo (%unknown-values))))
      xxx)

%Catch just sets up the catch frame which points to the exit function. %Catch is an ordinary function as far as IR1 is concerned. The fact that the catcher needs to be cleaned up is expressed by the Cleanup slots in the continuations in the body. %Unknown-Values is a dummy function call which represents the fact that we don't know what values will be thrown.

We use a similar hack in Unwind-Protect to represent the fact that the cleanup forms can be invoked at arbitrarily random times.

    (unwind-protect p c) ==>
    (flet ((#:cleanup () c))
      (block #:return
        (let ((#:next (block #:unwind
                        (%unwind-protect
                         #'(lambda (x) (return-from #:unwind x)))
                        (return-from #:return p))))
          (#:cleanup)
          (%unwind #:next))))

We use the block #:unwind to restore the dynamic state before doing the cleanup in the case where we are non-locally unwound. Calling of the cleanup function in the drop-through case (or any local exit) is handled by cleanup generation. #:next is some kind of state that indicates how further unwinding is to be done. We make the cleanup a function so that implicit calls to it can be created by cleanup generation. The cleanup function is a magic arg to %unwind-protect, since cleanup generation needs to find it, but we don't want it to become a closure for no reason. [How can we do this if %unwind-protect is a function? An alternate solution would be to make the cleanup function an explicit arg to %unwind-protect, and then rely on magic to get the right dynamic environment for the cleanup. Powerful magic could also exempt the cleanup function from the normal closure analysis process.]

Notice that implementing these forms using closures over continuations eliminates any need to special-case IR1 flow analysis. Obviously we don't really want to make heap-closures here. We don't necessarily need full stack closures, since the code in the functions is highly constrained.

Block compilation:

One of the properties of IR1 is that it supports "block compilation" by allowing arbitrarily large amounts of code to be converted at once, with actual compilation of the code being done at will.

In order to preserve the normal semantics we must recognize that proclamations (possibly implicit) are scoped. A proclamation is in effect only from the time of its appearance to the time it is contradicted. The current global environment at the end of a block is not necessarily the correct global environment for compilation of all the code within the block. We solve this problem by closing over the relevant information in the IR1 at the time it is converted. For example, each functional variable reference is marked as inline, notinline or don't care.
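To illustrate the scoping (a minimal example, not tied to any particular block compilation syntax):

    (declaim (inline foo))
    (defun bar (x) (foo x))     ; FOO may be expanded inline here
    (declaim (notinline foo))
    (defun baz (x) (foo x))     ; FOO must be a full call here

Even though both proclamations have been seen by the end of the block, the reference in BAR must be compiled under the first one, which is why the marking is captured at conversion time.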
Similarly, each node contains a structure known as a Cookie which contains the appropriate settings of the compiler policy switches.

The interface to block compilation will be primarily through extensions to Compile-File rather than syntax in the source file.

1] The file argument may be a list of files to be compiled together into a
   single binary.
2] The keyword :Block-Compile specifies whether function references may be
   resolved at compile time (default NIL). When enabled, full call may still
   be forced on a case-by-case basis by NOTINLINE declarations.
3] The keyword :Entry-Points determines which DEFUNs are given globally
   callable definitions. Eliminating entry points is primarily a
   space-saving measure. These values are legal:
   :ALL        All functions get entry points (the default).
   :SPECIFIED  Functions in an ENTRY-POINT proclamation get entry points.
   :EXPORTED   Functions are given entry points if they are either specified
               as an ENTRY-POINT or are exported from their name's home
               package.

LOCAL CALL ANALYSIS

All calls to local functions (known named functions and LETs) are resolved to the exact LAMBDA node which is to be called. This will detect any such calls which have an illegal number of arguments. If the call has non-constant keyword arguments or is an APPLY, then we can't tell what is going on, and will end up making a normal function call. Illegal calls are simply left alone for IR1 finalization to complain about. Although not necessary for correctness, there's no point in turning this "optimization" off, since nearly all lambdas would turn into closures, causing the compiler to run much more slowly in addition to generating abysmal code.

With local functions, we splice the called function into the flow graph for the caller so that flow analysis can be done more easily. Each call to a local function causes the block to be ended at that point, and the body of the function is made the successor to the caller's block. Similarly, the block containing the code after the call is made the successor to the last block in the function.

The combination node still has the actual continuation for the call so that we can still find the receiver of the value. We add the nodes which use the result continuation to the uses list for the actual call continuation so that the caller can tell what the actual values being returned are. When we discover that a function is only called with one continuation, we replace the dummy continuation with the actual continuation. We also equate the dummy continuation to the real continuation so that code in the function body can see exactly how it is being used.

IR1 forward flow analysis may discover that some function-valued normal variables are in fact constant. In this case, it is probably worth restarting IR1 optimization at this pass.

We also do this substitution for MV-Calls that look like MULTIPLE-VALUE-BIND, since we know that they can be open-coded. We replace the optional dispatch with a call to the last optional entry point, letting MV-Call magically default the unsupplied values to NIL. Although we may not do much flow analysis on these things, we certainly don't want them turning into closures. We could also do the same for MV-Calls with known numbers of values.

When block compilation is enabled, we may have to go to some trouble to avoid joining top-level code into the same component with real code. This could happen if a call to a known function appeared in top-level code. We would like to avoid turning such calls into local calls if this is possible. Unfortunately, it is hard to tell what code is and isn't top-level at this point. I guess we could make a pre-scan that converts calls which aren't to named entry points, and then finds all the code reachable from the entry point to the initial top-level form.
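To close with an illustration of the basic splicing (hand-worked): a LET is just a local call to a lambda,

    (let ((x (foo)))
      (bar x))
    ==  ((lambda (x) (bar x)) (foo))

so the lambda body is spliced into the caller's flow graph, and no closure is ever made for it.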
FLOW GRAPH SIMPLIFICATION

This pass computes the depth-first ordering of the blocks in each function, for use by later flow analysis. Blocks which are not reached during the walk are deleted and blocks which can be merged are merged. Although the simplification part of this phase is not necessary for correctness, it will probably more than pay for itself in later mandatory flow analysis.

In the process of walking the graph to find the DFO, we may discover that the flow graph is not connected. This is fairly unlikely when compiling single functions, but is quite probable when block-compiling entire files. Primarily for the sake of block compilation, we do the remaining part of compilation separately on separate components. This has the following advantages:

1] The size of flow analysis problems is limited to the size of the largest
   component.
2] Locality of reference is improved in IR1 passes since they will only
   consider the IR1 for a single component instead of making repeated passes
   over the entire IR1. This is important since the IR1 for a large block
   compilation can easily be larger than physical memory.
3] IR2 data structures may be recycled between the compilation of each
   component. This is significant since IR2 data structures are probably
   quite a bit larger than IR1.

It is possible that components may be split or joined by IR1 optimizations. A component may be split when a call is eliminated as dead code. Components may be joined when flow analysis discovers that the function for a call is constant. We won't bother trying to discover when components could be split, since this is hard to do and the payoff is small.

IF OPTIMIZATION

This pass is optional, but is desirable if speed or space (or safety?) is more important than compiler speed. [Could hurt space? probably rare...] This pass does high level control optimization based on transformations of IFs like those used in Rabbit. This pass alternates with flow graph simplification until nothing happens. It is not clear how best to order this pass with respect to predicate propagation for best interaction.

IFs with predicates known to be true or false are eliminated.

Eliminate IFs with identical consequent and alternative? This would most likely happen if both the consequent and alternative were optimized away. It could also be done if the consequent and alternative were different blocks, but computed the same value. This could be done by a sort of cross-jumping optimization which looked at the predecessors for a block and merged code shared between predecessors. IFs with identical branches would eventually be left with nothing in their branches.

Eliminate IF-IF constructs:

    (IF (IF A B C) D E) ==> (IF A (IF B D E) (IF C D E))

Note that we don't actually replicate the code for D and E; we use the same consequent and alternative block for both IFs. Note also that there may be forms wrapped around the inner IF as long as it is in a tail-recursive position. Due to the continuation representation, there isn't even any way to tell if this is the case. In general, it is non-trivial to find all the blocks in B and C so that we can move them. Probably what we want to do is only do the optimization if B and C are single blocks. This will deal with the important case of the expansions of AND and OR.
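For example, since (AND A B) is equivalent to (IF A B NIL), a conditional on an AND works out as follows (hand-worked):

    (if (and a b) d e)
    ==  (if (if a b nil) d e)
    ==> (if a (if b d e) (if nil d e))
    ==> (if a (if b d e) e)

Here B and NIL are both single blocks, D and E are not replicated, and the final (IF NIL ...) goes away as an IF whose predicate is known to be false.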
About the only sane way to get a hairy expression in a predicate would be to do a local call. We could special-case this. Of course, we could just go for broke, and do a graph walk. We could easily handle the general case if we had the dominators, but we may not want to compute them this early in the game.

We do IF optimizations bottom up (forward DFO). We don't need to look at all the code, since IFs are always at block ends.

We could also convert CALL-IF constructs:

    (FUNCALL (IF A B C) D E) ==> (IF A (FUNCALL B D E) (FUNCALL C D E))

Once again, we need not replicate the code for D and E. This would require changing IR1 to allow a continuation to have multiple destination nodes.

IR1 BOTTOM-UP OPTIMIZE

This pass is optional, but is desirable if anything is more important than compilation speed. We walk the code in forward depth-first order and:

    Replace calls of foldable functions with constant arguments with the
    result. We don't have to actually delete the call node, since Top-Down
    optimize will delete it now that its value is unused.

    Do bottom-up type propagation/inferencing.

IR1 TOP-DOWN OPTIMIZE

This pass is optional, but is desirable if anything is more important than compilation speed. We walk the code in reverse depth-first order and:

    Eliminate any let-bindings for unused variables.

    Eliminate any effectless nodes with unused values. In IR1 this is the
    only way that code is deleted other than the elimination of unreachable
    blocks.

    Do top-down type assertion propagation. For some functions such as + we
    will dispatch to a function for this.

Since we will have to repeat this until it stabilizes, we probably want to do some sort of scheme where we flag a block as needing to be reanalyzed if we change any continuation in the block, either by changing the asserted type or eliminating the DEST. We would then loop until no block was marked as changed, analyzing only the changed blocks.

IR1 FORWARD FLOW

This pass is optional, but is desirable if anything is more important than compilation speed. We use forward flow analysis to find which definitions of lexical variables reach each block, and then use this information to propagate constant and type information and to eliminate unnecessary trivial variable bindings.

The type of each variable reference is set to the union of the types of all the definitions reaching it. If all the definitions reaching a reference are the same constant value, then the reference may be replaced with that constant. When the only definition of a variable reaching a reference is another variable, then the other variable can be substituted if the set of definitions for the other variable is the same at the reference as it was at the trivial definition. This will cause unnecessary bindings to go away, since when all uses of the new variable are deleted, the binding will go away.

PREDICATE PROPAGATION

This pass is optional, but is desirable if anything is more important than compilation speed. We use a constrained version of global common subexpression elimination to propagate information about expressions known to be true or false due to the evaluation of an IF. Instead of propagating all available expressions, we only propagate expressions which appear as the test in an IF. If we stick to effectless&unaffected functions of lexical variables, then we can just use the available information about variable setting to determine when an expression is killed.

Probably the way to insert this information back into the code is to express it as type information. If an expression is true, then we intersect its type with (NOT NULL). If it is false, then its type is NULL. This will deal with the problem of non-T truths. The most important applications of this are optimizing hairy conditionals and eliminating redundant type checks. In the case of a type predicate of a lexical variable, we can intersect or subtract the type from the types for references, depending on whether the test is true or false. This pass can be done concurrently with IR1 Forward Flow.
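The type predicate case is the payoff (a hand-worked example):

    (if (consp x)
        (car x)     ; X's type intersected with CONS; CAR needs no check
        (foo x))    ; CONS subtracted from X's type here

Within the consequent the references to X are known to be conses, so a safe CAR can skip its type check; within the alternative X is known not to be a cons.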
TRANSFORMATION

This is the "source to source" transformation pass for function calls. Non-trivial optimizations of functions are better done on IR1 than on Lisp code because there is type and constant information available. Simple function optimizations may still be done as source transforms. This pass is optional, but should be done if speed is more important than compilation speed. Transformations may reduce space as well; if we really believe this, then we should also transform if space is important. In this case, transforms which increase space should check before doing their thing.

A transform is actually a kind of inline function call, with the additional provision for computing the function at compile time. The transform gets to peek at the continuations for the arguments, and computes a function using the information gained. The transform shouldn't do anything with the values, even if they are constants, since the compiler must preserve EQLness of named constants, and it will have a hard time if transforms go around randomly copying constants. The computed lambda replaces the original function variable reference as the function for the call. This lets the compiler worry about evaluating each argument once in the right order.

There can be any number of transforms for a function. Each transform is associated with a function type which the call must be compatible with. A transform is only invoked if the call has the right type. This lets us put off type checking of calls until IR1 finalize, when we know everything we ever will about the type of each argument. This also provides a slick way to deal with the common case of a transform which only applies when the arguments are of certain types and some arguments are not specified.

This pass works by scanning the global function vars looking for ones that have transforms, and then processing the calls that reference the var. This eliminates the need to do a full code walk.

We may need to be a bit clever about how we do things so that the transforms interact in a good way, or at least don't interact in a bad way. One problem is choosing which transform to apply when more than one is legal. This conflict will happen most often when there are specific and general transforms for the same function. We deal with this by trying the transforms in an order determined by a specified goodness factor. Another possible problem is that we may transform a call too early, resulting in a less good transformation being done. This probably isn't very significant, since it seems like it would only happen if a later transform revealed some information that couldn't be figured out by type inference. If this is a problem, then we could probably fix it by only attempting bad transforms when we can't find any other optimization to do.
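A minimal sketch of what a transform function might look like, assuming hypothetical helpers CONSTANT-CONTINUATION-P and CONTINUATION-VALUE for peeking at the argument continuations:

    ;; Transform for (EXPT X N): when N is the constant 2, compute a
    ;; lambda that squares. The returned lambda replaces the function in
    ;; the call, so X is still evaluated exactly once, in order.
    (defun expt-transform (arg-continuations)
      (let ((n-cont (second arg-continuations)))
        (when (and (constant-continuation-p n-cont)
                   (eql (continuation-value n-cont) 2))
          '(lambda (x n)
             (declare (ignore n))
             (* x x)))))

The transform only examines the constant; it does not splice the constant object into the code it builds, in keeping with the rule about preserving EQLness of constants.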
EVALUATION ORDERING

This pass is optional and should be done if either speed or space is more important than compilation speed. This pass attempts to modify the order of evaluation so that storage requirements are reduced. This generally involves evaluating complex arguments before simple ones. It's not clear how big a deal this really is, or whether it can fall out of other optimizations. Optimal expression ordering is Hard, but simple cases are probably the only thing that is really important, since hairy expressions are going to be slow anyway.

It is trivial to postpone the evaluation of constants. Evaluation of variables can be postponed using reaching-definitions information: the evaluation may be postponed if the definitions which reach the new place of evaluation are the same as those at the old one. These two optimizations will probably be enough to prevent blatantly stupid code from being emitted.

This pass might also be a good place to exploit arithmetic properties in reordering numeric expressions. Alternatively, this could be done on the fly during IR2 conversion. The advantage of doing arithmetic simplification on IR1 is that it may open opportunities for constant folding and other optimizations.

It is also possible to postpone the evaluation of easy expressions, but it is harder and not always a good idea. This is done using available expression information. If the argument expression is effectless and available at the proposed new place of evaluation, then it may be moved. If you want to get hairy, it is surely possible to move expressions with side-effects, but effectful simple expressions whose value is used are rare enough that it surely isn't worth the effort. If we want to do this kind of hairy reordering, then it should probably be done on IR2, since that is where we have available expression information.

Determining whether it is a good idea to postpone the evaluation of an expression depends primarily on whether the inputs to the expression could be thrown away if the evaluation weren't postponed. It would definitely be bad to postpone the evaluation of an expression that has several arguments which could all be in registers if the expression were evaluated immediately, since the argument TNs would be forced onto the stack instead. Liveness of the argument TNs is probably a relevant criterion. If an arg TN isn't live at the new location, then it is probably a bad idea to move the node unless the node which defines the dead TN can also be moved.

IR1 FINALIZE

This pass looks for interesting things in the IR1 so that we can forget about them. Used-but-not-defined and defined-but-not-used things are flamed about. Warnings about type errors are emitted in this pass. Since in our view number of arguments is an aspect of function type, argument count checking for functions is also done here. When this pass is done, all errors and as many warnings as possible should have been reported. We postpone these checks until now because the IR1 optimizations may discover errors that are not initially obvious.

Stuff checked:
    Type, including function type.
    Bound but not referenced.
    Referenced but unknown function and variable.

IR2 CONVERSION

IR2 preserves the block structure of IR1, but replaces the nodes with a target dependent representation. Although IR2 is target dependent, much of the code which operates on it has no machine dependency. The big thing that happens is the translation of nodes into sequences of virtual operations (VOPs). Virtual operations are things which will eventually turn into about one instruction. N-arg primitives are associated into fixed-arg versions at this point. All VOPs take a fixed number of arguments.
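For example, an N-arg addition becomes nested fixed-arg calls (left association chosen arbitrarily for the illustration):

    (+ a b c d) ==> (+ (+ (+ a b) c) d)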
This phase is where compiler policy switches have most of their effect. The speed/space/safety tradeoff can determine which of a number of coding strategies are used. When error checks are enabled, any checks not proven unnecessary are explicitly generated here.

Type information can be forgotten after IR2 conversion. All type checks have been emitted and all type-specific operation selections have been made.

Each local lexical variable and compiler temporary is assigned a TN. Variables have no semantics in IR2. Closure variables are represented by explicit closure manipulation code. References to special variables are represented by Symbol-Value and Set-Symbol-Value VOPs. [This will get a lot hairier when we go to deep-binding specials.] Sets and Refs of local lexical variables are represented implicitly by supplying the TNs as arguments or results to VOPs. References to variables that don't belong to the current function are represented by Uplevel-Ref and Uplevel-Set VOPs. This makes the need for static links explicit and allows us to build a de-facto display if it is useful. If a TN is packed into a register, then Uplevel-Reference VOPs can be deleted in favor of the TN itself. [Not true if we always save all TNs live at a local call...]

The IR1 node/continuation structure doesn't represent the evaluation of code in IR2, but parts of it live on as places to hang information about the IR2 representation. Continuations that actually end up being used by a function call in IR2 are used to collect information about the continuation. They are annotated with the TNs that are used to hold the values and with the value passing strategy.

TN Assignment:

The TN-assignment pass assigns to each continuation the TNs which receive the values. The value-passing strategy (known or unknown value count) is also determined.

In IR2 we use TNs to represent values. A TN can only represent a single value, so we bare the implementation of multiple values at this point. When we know the number of multiple values being handled, we can simply use multiple TNs to hold them. When the number of values is actually unknown, we use a convention which is compatible with full function call: we return the first K values and the number of values as distinct TNs, return any additional values on the top of the stack, and also set the values-present flag. This mechanism is used whenever the number of return values is not always the same, not only for things like MV-Prog1 where the number of values is unbounded.

Once we know the TNs which receive the values for each continuation, we can do the actual conversion at the block level by scanning through the block and converting each node in order. The actual node conversion for most nodes is simple, since they will just expand into a VOP or two that hack on the supplied TNs.

First, we find the equivalence classes of continuations which would ideally have the same value-passing strategy and locations. We examine each local function, putting all the continuations used in calls to it in the same equivalence class with the result continuation. Equating the callers' continuations implements the obviously necessary requirement that each caller of a function must expect the values to be returned in the same places. Equating the result continuation with the callers' continuations propagates the same requirement to any tail-recursive calls, ensuring that they can return values to the ultimate caller without any shuffling being necessary. This is necessary for tail-recursion to work properly, since if we have to shuffle the return values after the call, it is no longer truly tail-recursive.
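A hand-worked illustration of the equating:

    (labels ((g (y) (* y 2))
             (f (y) (if (plusp y) (g y) 0)))
      (f x))

The continuation for (f x), the result continuation of F, the continuation for the tail call (g y), and the result continuation of G all land in one equivalence class, so G can deliver its value directly to the receiver of (f x) with no shuffling.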
The result continuations for entry points and the continuations for full calls are constrained to use the standard value passing mechanism, with unknown value count and fixed result registers. If we were strict about using the same strategy and locations for all continuations in each equivalence class, then nearly all calls would end up using the standard ones, since most functions will be called tail-recursively from an entry point. This would be undesirable, since we can generate better code if we don't have to indicate the number of values and can pass them in any location.

Fortunately we can relax this strict equivalence and still preserve tail-recursion by realizing that we can shuffle the results of a full function call as long as the call isn't tail-recursive. We are only forced to use the standard value passing conventions in equivalence classes which contain both a continuation used in a full function call and the result continuation for an entry point. If a class contains only one or the other, then we can share any convention and locations between the local calls, and add code to convert to the standard convention at the call or entry point. As long as there is proper tail recursion between full calls and between local calls, it doesn't matter if there is some friction at the interface.

Now that we have figured out where we must use the standard value passing strategy, we can use a more flexible strategy to determine the return locations for local functions. We determine the possible numbers of return values from each function by examining the uses of all the result continuations in the equivalence class of the result continuation. If the number of values is not fixed, then we must use the unknown-values convention, although we are not forced to use the standard locations. If the number of values is fixed, then we return that fixed number of values from all the functions whose result continuations are equated.

We assign the result TNs at this time. A non-tail-recursive continuation must have TNs which are distinct from those of any other continuation, since it may have to hold the values for an arbitrarily long amount of time while other arbitrary code is being evaluated. This means that when we return from a call, we must move the return TNs to the actual TNs for the continuation. These TNs are preferenced, and will probably get packed in the same place.

We can now assign the TNs for all the rest of the continuations. We scan through the code in reverse depth-first order, looking at each node and determining the value passing strategy and locations for each continuation that has the node as its DEST. In most cases it is trivial to determine the number of values desired, since we want exactly one. The continuation for an MV-Call may want multiple values. In general, we will have to use the unknown values convention and do a full call, but if the function resembles an MV-Bind, we can request the obvious number of values.

Loop analysis:

The IR2 representation is annotated with information about loops. This isn't really very optional, since environment analysis uses loop information to tell when it can allocate vars for functions in the same place. Inhibiting this "optimization" would cause lots of spurious uplevel references to be detected, which would probably make the compiler run slower, in addition to generating poor code.
Loop information is really necessary for reasonable cost assignments anyway, and also allows us to do some other optimizations at little extra cost.

First we find the set of blocks which dominate each block. Note that if we keep the set sorted by DFN, then it will be in order of dominance as well. We probably want to special-case local function call, since standard flow analysis won't realize that the return point of the call is dominated by the call point.

We want to have some kind of Loop structure which is linked together with Blocks seven different ways. The Loop structure should contain a list of all the blocks in it, and also the loop depth. If we arrange things so that Loops are reasonably nested, then we may want up and down pointers as well. If this is the case, then we could omit blocks contained within inner loops from our list. We should also list the exit points from the loop.

We must deal with "strange" loops (not just a concession to TAGBODY, since mutual recursion can create non-reducible flow graphs). First we find the back edges and natural loops, and then we find the strange loops. A strange loop is a cycle which contains no back edges. It turns out that every strange loop contains an ascending edge which is not a back edge, and all such edges appear in strange loops. We can find the strange loop for such an edge by doing a graph walk forward from the edge, recursing on cross and forward edges. If we reach the start node, then we add all the nodes on our path to the strange loop as we unwind. If we reach a node already in the loop, then we do the same thing. If we dead-end, then we just punt.

Strange loops are broken into segments, where each segment is the code in the loop which is dominated by a given entry point. The representation for a segment of a strange loop is about the same as for a natural loop. As far as environment manipulation code is concerned, they are fairly similar in effect. The difference is that we can't move code out of a segment of a strange loop unless we can move it to before all the entries in the loop.

Determining the nesting of loops which have different start blocks is easy: just use the dominance relation on their start points. If there are multiple back branches to the same start block, we just consider all the code to be in the same loop. A loop is effectively defined by its head, rather than by any particular back branch.

Local function call will introduce blocks of code which can be thought of as being shared between loops. [How badly will we lose if we don't special-case local call, either here or when finding dominators? It shouldn't be a correctness issue, since other code should be assuming arbitrarily bizarre flow graphs. If we did this, then a local function with more than one call could cause the flow graph to be non-reducible. This would result in more environments being allocated than is really necessary.]
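A minimal sketch of the dominator computation mentioned above, using the standard iterative formulation. BLOCK-PREDS is an assumed accessor for a block's predecessor list, and every block other than the head is assumed reachable (unreachable blocks were already deleted), so it has at least one predecessor.

    (defun compute-dominators (blocks head)
      ;; Returns a hash-table mapping each block to the list of blocks
      ;; that dominate it. Dom(head) = {head}; for any other block B,
      ;; Dom(B) = {B} union (intersection of Dom(P) over all preds P).
      (let ((dom (make-hash-table :test #'eq)))
        (dolist (b blocks)
          (setf (gethash b dom) blocks))          ; initially everything
        (setf (gethash head dom) (list head))
        (loop
          (let ((changed nil))
            (dolist (b blocks)
              (unless (eq b head)
                (let ((new (adjoin b (reduce #'intersection
                                             (mapcar #'(lambda (p)
                                                         (gethash p dom))
                                                     (block-preds b))))))
                  (unless (= (length new) (length (gethash b dom)))
                    (setf (gethash b dom) new)
                    (setf changed t)))))
            (unless changed
              (return dom))))))

Iterating over the blocks in depth-first order makes this converge quickly. This sketch ignores the special-casing of local call discussed above.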
Cleanup generation:

We emit cleanup code during IR2 translation. If the cleanup changes between two blocks, then we may need to either emit state saving code or cleanup code, depending on whether we are entering or leaving the messed up context. We need to do some kind of pre-pass which annotates the cleanup structures with the state that needs to be saved on context entry. We then emit the saving and restoring code during the actual translation pass.

Note that since the state that we want to revert to must be lexically apparent, there is no problem finding out what the appropriate old stack pointer (or whatever) is, since even if it isn't in our context or a register, we can get at it with an uplevel reference using the static link. All the dynamic state variables (such as stack pointers) are TNs, and thus can be explicitly manipulated by moving values into or out of them. In most cases the cleanup will simply be a move from a saved value into the current value.

We also need to deal with "uplevel continuations": uses of continuations where we must unwind to some earlier state. We are concerned here with the kind of continuations that result from a non-local RETURN or GO. If we don't know the dynamic context of the return, then we cannot clean up the specific things which have messed it up; we must revert the unknown things to a saved state. ####

Note that every "uplevel continuation" will be a loop exit from a loop which represents a context. It is fairly easy to compute loop exits while finding loops, and this info is also useful for other purposes. When we exit from a context, if the current cleanup for all of the predecessors is the same and we can trace from that cleanup to the current cleanup for the continuation, then we can statically determine our dynamic context, and can use the cleanup mechanism. If we cannot statically determine our dynamic context, then we treat the continuation as a funny kind of Var, and get environment analysis to do the hard work of ensuring that the saved state information is available at the return point. It would be possible in many non-closure cases to determine that only some subset of the dynamic set can be messed up, and then to only restore those things, but it probably isn't worth the effort.

Note that when we do a full register save due to a full call, we must always save the registers in the same place so that the reentry code for a closed continuation can restore all the registers when someone does a non-local exit from a closure somewhere down the stack.

Environment analysis:

What is the env analysis problem?

Determine where to allocate variables:
    Fairly easy using loop information.

Establish a way to find variables which are not allocated in the current
environment (uplevel references):
    We don't allow arbitrary references to other environments, since the
    referenced environment must always be allocated at the time of the
    reference. The references must be truly uplevel, not crosslevel. If we
    have a structure which allows us to trace up lexical levels, then we can
    establish a unique path to find an uplevel variable.

Find closure variables and figure out how closure environments are located:
    This problem seems to be mostly identical to the standard environment
    analysis problem. With minor tweaks, the same mechanism can do both
    simultaneously.

Is there some straightforward way that we can use retained syntactic information to determine a nesting relation? What properties do we want for the nesting?

1] An environment which is referenced uplevel must be reachable by tracing
   up the nesting tree from the referencing environment. This guarantees
   that we can find any variable by scanning up a single static link.
2] An environment must always be allocated when any environment nested
   within it is allocated. This guarantees that any environments that we
   trace through to find a variable will be allocated when we reference
   them.
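A small hand-worked example of the situation these properties are about:

    (defun outer (x)
      (flet ((inner () (1+ x)))    ; X is referenced uplevel from INNER
        (inner)))

The reference to X in INNER is an uplevel reference; property 1 says OUTER's environment must be findable by tracing up from INNER's, and property 2 says it is still allocated whenever INNER's environment exists.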
A useful piece of syntactic information is the lambda that encloses the code for an environment. This can be determined if blocks are annotated with the lambda they were originally in. We can satisfy 2 if we use the environment for the enclosing lambda as the outer environment, since the environment for the lambda's variables must be allocated before any code in it can be evaluated. It is not quite so obvious that 1 will be satisfied. If lambda variables are the only things allocated, then it is pretty obviously true, since the only things semantically accessible would be variables in enclosing lambdas.

Temporaries introduced by the compiler don't necessarily have to follow this rule. I guess that we have to be careful about what environment we allocate compiler temporaries in when the temporaries must live across an environment transition. For any such temporaries, we must know the deallocation environment in addition to the allocation environment. It is not clear that there are any such temporaries. Most temporaries will be consed at the beginning of the code for the form which needs them and will be deallocated at the end. The beginning and end of a call will always be in the same environment. (???)

Kinds of temps:
    Continuation value
    Environment pointer
    Cleanup context/dynamic context save
    Copy TNs
    Common subexpressions/loop invariants
    Preload TNs
    Save TNs
    VOP temps and load TNs
    Specbind and catch frames

Environment analysis determines the time at which storage is allocated for lexical variables and where this storage is allocated. Variables are allocated either as TNs or as heap closure variables. With TNs we must determine which context is specified as the potential stack location and when this context is allocated. With closure variables, we must determine in which closure the variable is allocated and how we get to that cell. In both cases we attempt to combine environment creations by moving them backward and upward in the code. The legality of such motion is determined by using loop-dominator information. We can move the allocation of any variable to the head of the innermost loop which contains its allocation. If a stack variable is allocated and deallocated in the same loop, then we can move its allocation out as far as we please. Allocating all such variables in the outermost environment is reasonable, although this could cause unnecessary uplevel references.

Initially, we guess that each loop which has stuff that can't be moved out has its own environment, but some of these get combined eventually. If any environment references another environment, but is not nested within that environment, then we combine the two environments. This mashes together the environments for functions which are broken up into multiple loops due to recursion. [Or can each environment be associated with a Lambda so that every loop directly nested within the same lambda will have the same env?] It is an invariant that the allocation point for a variable must dominate the original cons point.

In the case of closure variables, the desirability of combining allocations is less clear. If we blindly moved all allocations to the beginning of the loop which encloses them, then we might allocate the variables even when the body may conditionally not use them. Probably what we should do is scan up the dominator tree until we find either the loop head or an IF, and then cons the variables there.
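For instance (a hand-worked example), in

    (defun foo (x)
      (when (plusp x)
        (mapcar #'(lambda (y) (+ x y)) (bar x))))

the closure cell for X is only needed in the consequent of the WHEN. Hoisting its allocation all the way to the head of FOO would cons even when X is not positive; scanning up the dominator tree stops at the IF, so the allocation lands just inside the test.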
Stack variables have special significance, since all values that are not in registers are either on the stack or pointed to by the stack. A bunch of stuff on the stack which shares a common base pointer is called a context. We must have the base pointer for the context before we can access anything in it, so we must go to some trouble to make sure that the base pointer is available.

A context is represented by the Context structure. Each lambda has pointers to the contexts where its heap and stack variables are consed. Each block has a pointer to the context(s?) allocated at the beginning and deallocated at the end. [Can there be more than one? Probably not...] Each TN has a pointer to the context that it is allocated in. Pack needs to know something about contexts, since each context effectively has a different set of storage bases.

An uplevel reference is a reference to a variable which is allocated in a different context than the one for the reference. We find uplevel references by scanning all the vars for all the lambdas, looking for references which are in a different context than the cons point for the var.

An environment which is the current environment for an entry point is called a closure environment. If there is a closure environment in the nesting between an uplevel reference and the referenced environment, then the variable must be a closure variable.

#| An uplevel ref accesses the save location if the TN is packed in a register. Closure slots are allocated as TNs, but they may not be directly accessed; Closure-Ref must be used. Saving and restoring the previous context pointer is part of the environment allocation/deallocation process. [ENV also? Only for closure environments?] |#

A possible optimization is to allocate closures on the stack when they are known not to be passed upward. The main use of this would be for functional arguments to functions known to be "safe". As far as the rest of the compiler is concerned, there is little difference between heap closures and stack closures. We just use different VOPs for hacking them.

Stack management:

Stack allocation of objects is made explicit in IR2. All storage is allocated using TNs. Lexical variables and temporaries are packed individually on the stack if they don't end up in registers. Some uses such as Catch require a chunk of several cells allocated contiguously. This is done by allocating a composite TN which represents the entire chunk, and then allocating the cells inside as individual component TNs.

In the case of TNs that have no particular location, it is easy for Pack to find some place to stick the TN. TNs that must be on TOS at some point in time are somewhat more problematical. We can deal with this by restricting the SC for the appropriate TN-ref to be a special TOS SC. In the case of fixed size TNs, Pack can pack down from TOS just as easily as it packs up from the FP. Dealing with variable size TNs is more of a bitch, since there aren't any component TNs in the TN that determine its lifetime. We probably need to explicitly indicate the lifetimes of such TNs using alloc and dealloc ops which take the entire composite TN as a result or argument.

Codegen is the place where the actual stack hacking code is generated. Pack provides information about the allocation for Codegen to use. An example of such info is the amount of stuff allocated at a given point.
Pack is provided with appropriate base TNs to use in specifying stack locations, but codegen is responsible for ensuring that these registers have meaningful values before it tries to load stack TNs. The base TNs may also be explicitly frobbed by local call code emitted in IR2 translation. [How is info passed from pack to codegen? If we know the TOS TN which is on the top, then we can determine how much stuff there is on the stack. We can determine the top TN at the beginning of each block by looking at each live TN and seeing which is topmost. Pack will need to have this information anyway.]

In the case of incoming stack arguments, we just set up the base pointer to point to them and then force allocation of a composite TN which holds them.

IR2 Function call:

Local functions which are called in only one place are simply inserted inline. The variables are initialized using Move VOPs.

Various implicit arguments to functions may need to be made explicit in IR2. In local function call there is no implicit environment manipulation. If the current context needs to be restored on return, then the previous context is passed back as an explicit result. If the static link needs to be passed in, then it is passed explicitly. This determination of implicit arguments happens during environment analysis.

On both call and return some care needs to be taken to ensure that the TNs in which the values are passed are accessible in both the caller and the callee. There is no problem when the caller and the callee have the same context or when the values are passed in registers. In other cases, we pass the extra values on the top of the stack. This shouldn't happen very often if there is a reasonable number of registers. [Or we could pass them in arbitrary stack TNs (allocated where?) and use the static link to get at them.]

The full function call mechanism must effectively be a subset of the local call mechanism, since the two mechanisms must mesh at entry points and full function calls.

Local function call is done with the Call-Local VOP, which represents only the control aspects of the call. Each function used for local call has a list of TNs (determined by TN Assignment) which are the locations used for passing the arguments. The actual arguments are moved into these TNs using Move VOPs. Call-Local jumps to the function, passing the return PC for the continuation as an argument in a supplied TN. The continuation for the call is represented by the continuation we continue at. A tail-recursive local call doesn't need a VOP to represent it, since it is just a jump.

The Allocate-Context VOP deals with setting up stuff at the beginning of a local function. It allocates stack stuff and sets up the current context, returning the old current context as a result. This is basically a placeholder for Codegen, which will deal with any necessary stack-frobbing.

The Return VOP takes the return PC and the previous context as arguments. It deallocates the current context, restoring the old one, and jumps back to the location indicated by the return PC. Restoring the context must be folded into return since the return PC might be stored on the stack. Return-Values takes an additional arg which is a stack TN containing stack values that need to be returned. The values are BLT'ed down to the appropriate place. There is no need for separate local and full return VOPs.

Simple-Return is similar to Return, but doesn't do any environment deallocation. It is used for returning from local functions which don't have their own environment. There is no need for a Values variant, since there is no stuff on the stack that needs to be squeezed out.
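Putting the pieces together, the code for a one-argument local call might look like this (the notation and TN names are invented for the example):

    ;; in the caller, for (f x) continuing at C:
    (move x-tn f-arg1-tn)          ; move argument into F's passing TN
    (call-local f-entry pc-tn)     ; jump to F, return PC passed in PC-TN
    ;; at F's entry:
    (allocate-context old-cont-tn) ; set up F's context, save the old one
    ;; ... body of F ...
    (return pc-tn old-cont-tn)     ; pop F's context, jump back via PC-TN

A tail-recursive call would omit the return PC setup and simply jump, and a function with no environment of its own would skip Allocate-Context and use Simple-Return.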
There is no need for a Values variant of Simple-Return, since there is no stuff on the stack that needs to be squeezed out.

If a local function returns a known number of values which is less than the number expected by the caller, then additional code must be inserted at the return site which sets the unused values to NIL.

MV-Calls can be treated as normal calls whenever the number of values returned by each value form is known. Calling of known lambdas of the form created by Multiple-Value-Bind can also be converted as a special case.

A full call turns into some kind of Full-Call VOP. There are different VOPs for calling named functions and closures. We also have tail-recursive full call VOPs.

Whenever we call a function which may return stack values, we must do something to receive or dispose of the values. There are three options:

Default-Values
    This VOP is used when we want to receive a specific number of values. It takes the passing TNs, a stack TN and a desired value count as arguments. Values not returned are defaulted to NIL and excess values are discarded. The stack TN describes the place where the stack values are allocated.
Receive-Values
    This VOP is used when we want to receive an unknown number of values. It takes the values-present flag, Nargs and a stack TN. The TN is set up to describe the returned values. [could set Nargs in the 1 val case as a side effect]
Discard-Values
    This VOP simply discards any stack values.

Usually we will have to have one of these VOPs after every full call, but if we know there are no stack values then the VOP can be omitted.

Generating error checks:

We don't want to treat error checks like normal code: (unless win (error ...)), since this would murder the control structure of the code. What we do is pretend that an op like (assert-type x fixnum) is a normal operation which has no side-effects and returns the tested object. Filtering the object through the assertion allows it to be effectless so that it can be moved out of loops, without letting the check disappear entirely. This also provides an easy interface to correctable errors.

The exact representation of type checks in IR2 will be fairly implementation dependent, since it will depend on the specific type system for the given implementation. For example, if an implementation can test some types with a simple tag check, but other types require reading a field from the object in addition, then the two different kinds of checks should be distinct at the VOP level, since this will allow the VOP cost and storage information to be more accurate. Generation of type checks should be factored out of code which would otherwise be more portable. In most cases, type checks should be generated automatically without any intervention being needed on the part of the translator, since the appropriate type assertions should already be manifest in the IR1.

Converting conditionals into IR2:

We probably want to consider control transfer to be the fundamental result of comparison, rather than anything such as a condition code. Although most compilers with whizzy register allocation seem to explicitly allocate and manipulate the condition codes, it seems that any benefit is small in our case. This is partly because our VOPs are at a somewhat higher level, making it difficult to tell which VOPs do and don't trash the CC. Explicitly incorporating condition codes in our VM also introduces another architecture dependency.
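Returning to error checks for a moment, here is a source-level sketch of the filtering idea; this ASSERT-TYPE macro is just an illustration of the semantics, not the compiler's actual internal operation:

    (defmacro assert-type (form type)
      ;; From the flow graph's point of view this is effectless and
      ;; returns the tested object, so uses of the value downstream can
      ;; be rewritten as uses of the filtered result, and the check can
      ;; be moved out of loops without disappearing entirely.
      (let ((val (gensym)))
        `(let ((,val ,form))
           (unless (typep ,val ',type)
             (error "~S is not of type ~S." ,val ',type))
           ,val)))

    ;; A use such as (svref v i) becomes (svref v (assert-type i fixnum)).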
At the IR2 level, we have a class of IF-XXX VOPs which transfer control to one of two places on the basis of some test on the operands. When generating code for IF, we peek at the predicate and special-case it if we know how. Otherwise, we just emit an IF-TRUE and compile the predicate normally.

If a builtin predicate appears in a non-branch context, then we effectively emit an (IF pred T NIL). This probably wants to be done during IR2 conversion rather than as a transformation on IR1, since in some cases the operand type will have to be taken into consideration when determining whether an operation can be open-coded, and is thus built-in.

N-arg predicates such as < must be transformed into two-arg versions in IR1, since they will result in hairy conditionals that want to have IF optimizations done on them.

Converting Catch and Unwind-Protect:

Catch, Unwind-Protect and deep-binding specials all work in pretty much the same way. We make a stack TN to hold the catch frame or whatever, allocate TNs in it to represent the slots, and then initialize them. The frame can be explicitly linked in by TN manipulations, since the active catch and whatnot are represented by TNs. Since allocation of the frame is decoupled from linking and unlinking, some of this stuff could be moved out of loops. We will need a VOP for loading the PC for an arbitrary continuation so that we can set up the reentry PC. [Could be done with Call-Local? glag...]

VOP representation:

As far as IR2 is concerned, it doesn't matter whether a VOP is implemented as a miscop or as inline code. A miscop is just a VOP which requires its args to be in the miscop linkage registers. This allows a VOP's implementation to be changed in fairly arbitrary ways without affecting its interface.

Global functions that are special-cased by the compiler are translated into VOPs in IR2. Functions with a fixed number of args can usually be translated into a single VOP by a double dispatch determined by policy and operand types. In the actual conversion pass, the main activity is to choose some generic operation in the virtual machine such as FIXNUM-+-FIXNUM=>FIXNUM, and then to choose a specific VOP on the basis of the policy settings. The policy-driven choice of a specific VOP to implement the generic operation can easily be made table-driven. Note that it is important to make the policy choice in IR2 conversion rather than in code generation, because the cost and storage requirement information which drives TNBIND will depend strongly on which actual VOP is chosen. In the case of FIXNUM-+-FIXNUM=>FIXNUM, there might be three or more implementations, some optimized for speed, some for space, etc. Some of these VOPs might be open-coded and some not. It might also be useful to allow a generic operation to turn into a sequence of VOPs, but that would make automation of the selection process harder.

Conversion of references to constants happens in a very similar way to variable conversion. Each distinct Constant gets assigned a TN. This TN is wired down in an SC which is determined by its type. In addition to arbitrary constants kept in the literal pool, we also will have SCs for each significant kind of immediate operand in the architecture. On the Romp, 4, 8 and 20 bit integer SCs are probably worth having. Since the constant is represented by a TN, copy generation can easily attempt to cache the constant in a register. Pack will use cost information to determine whether it is in fact desirable to make the copy.
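A sketch of what the table-driven, policy-based choice of a VOP for a generic operation might look like; the table contents and VOP names here are invented for illustration:

    (defparameter *vop-choice-table*
      ;; generic operation -> alist of (policy . vop)
      '((fixnum-+-fixnum=>fixnum
         (:speed . fast-fixnum-+)        ; open-coded, no frills
         (:space . small-fixnum-+)       ; compact, perhaps a miscop call
         (:safety . checked-fixnum-+)))) ; open-coded with overflow check

    (defun choose-vop (generic-op policy)
      (let ((entry (assoc generic-op *vop-choice-table*)))
        (or (cdr (assoc policy (rest entry)))
            (error "No VOP for ~S under policy ~S." generic-op policy))))

    ;; (choose-vop 'fixnum-+-fixnum=>fixnum :speed) => FAST-FIXNUM-+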
If a VOP can easily fold an immediate constant into its code sequence, then this will be reflected in a low cost for that operand when it is in an appropriate immediate SC. This will prevent caching of constants when it isn't worth the effort. Even if a constant is cached, a given code generator can decide to use an immediate representation instead.

If the code emitted for a VOP in the non-constant case is very different from the constant case, then it may be desirable to special-case the operation in IR2 conversion by emitting different VOPs. An example would be if SVREF is only open-coded when the index is a constant, and turns into a miscop call otherwise. We wouldn't want constant references to spuriously allocate all the miscop linkage registers on the off chance that the offset might not be constant. In PQCC terminology, our VOPs don't represent templates that are very different. If we must choose between significantly different code sequences to implement an operation, then the choice should be made during IR2 conversion. The alternative is to make pessimistic assumptions about the TN requirements for the VOP.

VOPs that want to be open-coded:
    Special set and ref. Named constant ref?
    CAR, CDR, SET-CAR... We could just have a Cell-Ref VOP which grabbed stuff out of boxed objects, but that wouldn't let us specify interesting side-effects.
    Move
    Uplevel-Reference: a reference to a TN consed in a different function.
    Length for simple vector, string, ...
    Simple vector accessors and setters.
    Fixnum = < > + -. Special versions of + and - which do overflow checking. Fixnum logical operators.
    Closure-Ref, Closure-Set
    Structure-Ref, Structure-Set: so we can do better side-effect analysis.
    EQ.
    System refs and sets.
    Change type. Type predicate. Type check. Bounds check.
    Constant arg versions of some VOPs which take fixnum args.
    Call-Named Tail-Call-Named ;Call-Closure ;Tail-Call-Closure
    Call-Local Return Return-Values

VOPs not open-coded:
    Special bind and unbind.
    Allocation, hairy arithmetic, and other function-like ops implemented in assembler.
    Throw, Unwind.

COPY GENERATION

This optimization is optional, but is desirable if either space or speed is more important than compilation speed.

One problem with TNBIND is that each TN is packed in the same place for all time, whereas the best place for a TN may vary from one place in the code to another. To reduce this problem, we introduce new TNs which are copies of the original TN. Since the lifetime and uses of the copy are more restricted than those of the original, the copy may be packed in a way which better reflects the local demands of the code. Heuristics are used to guess good places to introduce copies.

Copies may either be clean or dirty. A clean copy is never written, whereas a dirty copy may be. There may be any number of clean copies for a TN, or there may be one dirty copy. References to a TN may only be replaced with references to a copy within code where the copy is valid. A dirty copy must be written back before any possible read of the original. Pack can combine some copies which we are too conservative to combine here.

If the cost of making and using a copy exceeds the cost of referencing the original TN, then the copy can be replaced with the original. For example, if the original TN is already in a register, then there is no point in making a register copy. All the copy TNs should be preferenced together, including indirect ones (does this really matter?).

A dirty copy isn't really a copy at all.
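Before getting to dirty copies, a minimal sketch of introducing a clean copy (the loop case, candidate 1] below); references are modeled as mutable cells holding a TN, and all the names are hypothetical:

    (defstruct tn name (copies '()))

    (defun introduce-clean-copy (original loop-refs emit-at-loop-head)
      ;; Make a new TN, emit a Move from the original at the loop head,
      ;; and redirect the in-loop references to the copy.  A clean copy
      ;; is never written, so no write-back is needed.
      (let ((copy (make-tn :name (gensym "COPY"))))
        (push copy (tn-copies original))
        (funcall emit-at-loop-head (list 'move original copy))
        (dolist (ref loop-refs)
          (setf (car ref) copy))
        copy))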
A dirty copy moves the real value of the TN somewhere else within a chunk of code, and then moves it back when we reach the end. Probably we should only contemplate a dirty copy in highly constrained circumstances, such as a loop that doesn't make any function calls.

Here are some good candidates for copies:
1] In a loop, references to TNs may be replaced with references to a copy for the duration of the loop. This is especially easy with TNs that represent constants.
2] When the desired SCs for references clash, it may be a win to replace some of the references with references to a copy. This is pretty vague and I'm not sure there's any non-perverse code in which this would be a win, especially if loops get copies anyway.

#| This is a possible implementation, but it would be much less efficient than a special-purpose implementation... Clean copies can be done cleanly and with a high probability of correctness by making a distinct copy for every reference and then letting global common subexpression elimination fold the copies together wherever possible. Common subexpression and loop-invariant optimization will have to be fairly careful about combining copies, since if they really combine all possible copies they may largely defeat the purpose of making copies in the first place. Probably common subexpression should only combine copies whose users have the same best SC, so we won't combine copies that should get packed into different SCs. Loop-invariant optimization should be careful not to combine copies of a TN for an inner loop with those for an outer loop, as that would make copying the TN an all-or-nothing proposition, instead of possibly being done only in an inner loop. Probably the way to represent this is to have the inner-loop copies be copies of the copy, thus making the outer-loop copies not the same expression as the inner-loop copies. |#

LOOP AND GLOBAL COMMON SUBEXPRESSION OPTIMIZATION

Any of the standard compiler optimizations can be done here: common subexpression elimination, loop invariant detection, etc. Flow analysis of all TNs. Common subexpression and loop invariant optimization can have the same effect as caching reads of special and closure variables, because these operations are expressions like any other.

An expression is a VOP. A VOP is a common subexpression if it has the same operation and argument TNs as another VOP. Only effectless VOPs are candidates for common subexpressions. An expression is "killed", disabling its reuse, when either one of the argument TNs is modified or a side effect which affects its result happens. Large expressions may be combined, but the combination will happen one VOP at a time. Probably it would be desirable to modify IR2 to make common subexpressions more apparent. A possibility would be to have a separate Operation structure which holds the basic operation and arguments, but not the results. We would then arrange to make all identical Operations EQ.

Common subexpression elimination should be aware of the cost of saving the result vs. recomputing it. This is somewhat sticky, since the cost of saving the result depends mostly on the SC in which the TN gets packed, which we don't know. The most important case is probably avoiding combining expressions over a function call when recomputing the expression is cheaper than two memory references. I guess that the general idea is that some expressions are worth saving in memory and others aren't.
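A minimal sketch of detecting VOP-level common subexpressions by keying on the operation plus argument TNs; VOPs are modeled as lists of the form (operation result arg...), killing is omitted, and the effectless test is passed in:

    (defun find-common-subexpressions (vops effectless-p)
      ;; Two VOPs are the same expression when they have the same
      ;; operation and argument TNs; the result TN is excluded from the
      ;; key.  Only effectless VOPs are candidates.  A real version
      ;; would also kill entries when an argument TN is written or a
      ;; relevant side effect happens.
      (let ((seen (make-hash-table :test #'equal))
            (matches '()))
        (dolist (vop vops (nreverse matches))
          (when (funcall effectless-p (first vop))
            (let* ((key (cons (first vop) (cddr vop)))
                   (prev (gethash key seen)))
              (if prev
                  (push (cons vop prev) matches)
                  (setf (gethash key seen) vop)))))))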
The idea that only some expressions are worth saving could be expressed by having function call kill the expressions which aren't worth saving. There is also the hidden cost that retaining a common subexpression can tie up registers that would be better used.

When combining copy TNs, we have pretty good interaction with Pack, since it can avoid doing the copy if it turns out to be a bad idea. With other common subexpressions we don't have this out. Probably PQCC has addressed this problem iff there is a reasonable solution. This may be a place where the estimating of storage requirements comes in: if we can tell there are lots of registers available, then combining expressions becomes more reasonable.

Global common subexpression elimination can be made to do most of the work of loop-invariant optimization. Once you determine that it is legal to evaluate an expression before the loop starts, you just insert a copy there. Subexpression elimination will replace copies within the loop when possible. If it turns out that the expression can't be used in the loop, then the copy at the beginning will get deleted since it is unused. There are two noticeable problems with this solution. One is that we have to be careful that we only try each possible invariant once, since we could get in a loop otherwise. The other is that speculatively moving every expression to the beginning of the loop would be pretty inefficient. Probably we want the code which checks for the move being legal to check for the move being likely to succeed before doing anything else.

[A loop-invariant optimizer not based on global common subexpression would be much more efficient and would probably get most of the speed win of the two combined without having to implement GCS at all. This is because the common subexpressions would be moved out of loops even though they aren't folded together.]

TN LIFETIME ANALYSIS

This phase uses flow analysis to find the live TNs at each block boundary. We determine this by backward flow analysis on the entire flow graph. Although somewhat counter-intuitive, it seems to be possible to do live TN analysis on the entire flow graph without worrying about the fact that on some paths the same TN will refer to different variables due to recursion.

SAVE GENERATION

We use a caller-saves convention. Whenever we transfer control to code in a different environment without deallocating the current environment, we save all the TNs allocated in the current environment that are live at the return point. The control transfer will always be a call, since any other transfer (i.e. a GO) would deallocate the current environment. The saving and restoring is emitted explicitly using Move VOPs. Note that the saves don't have to be done if the TN gets packed on the stack. We can get Pack to do the Right Thing for us if we require the save TN to be on the stack, and then preference the original TN to it. Save TNs probably want to be magical, in that we don't want to have to dummy up the lifetime and conflicts information for them; it is both hard and pointless, since the lifetime of the save TN lasts the full time that the context is allocated.

If we do any saves, then we need to redo part of lifetime analysis, since the lifetimes for TNs live across calls will be wildly inaccurate. Once the TNs are saved, the locations that they occupy are free for allocation of other TNs within the called functions.
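A sketch of the backward flow computation in TN LIFETIME ANALYSIS above, in the standard iterative form; the block structure here (LBLOCK) and its read/write sets are simplified stand-ins for the real IR2 blocks:

    (defstruct lblock succs reads writes (live-in '()) (live-out '()))

    (defun compute-live-tns (blocks)
      ;; Iterate to a fixpoint:
      ;;   live-out(B) = union over successors S of live-in(S)
      ;;   live-in(B)  = reads(B) union (live-out(B) - writes(B))
      (loop with changed = t
            while changed
            do (setf changed nil)
               (dolist (b blocks)
                 (let* ((out (reduce #'union (lblock-succs b)
                                     :key #'lblock-live-in
                                     :initial-value '()))
                        (in (union (lblock-reads b)
                                   (set-difference out (lblock-writes b)))))
                   (unless (and (subsetp in (lblock-live-in b))
                                (subsetp (lblock-live-in b) in))
                     (setf changed t))
                   (setf (lblock-live-out b) out
                         (lblock-live-in b) in)))))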
Additional lifetime bogosity is due to standard flow analysis not realizing that the function must return to the exact call site, rather than arbitrarily returning to any of the call sites. This causes paths which cannot be traversed to be considered, making it appear as though all TNs which are live at any call are live at all calls. Furthermore, on the spurious paths, the bogusly live TNs will probably never be initialized, and thus will appear to be live all the way from the return site to the root of the graph. What we do is change the blocks where restores are done to indicate that the restored TNs are killed by the block. We then subtract all TNs which we generated saves for from the live TNs at the end of all blocks. We can now redo the propagation step of live TN analysis.

A useful invariant to remember is that every TN consed in a context is dead at context entry, since every TN is always initialized before it is referenced. Since we have the lifetime information already, this is an easy but powerful consistency check.

CONFLICT ANALYSIS

For each TN, find the set of TNs that have lifetimes which overlap with its own lifetimes.

This may be a good place to estimate the number of times each TN will have to be saved and restored due to full function calls. Each live TN at a full call has the saving cost (adjusted by loop depth) added to the register allocation cost.

For purposes of conflict analysis, there are five kinds of TNs:
1] Argument TNs
2] Load TNs for arguments
3] VOP temporaries
4] Load TNs for results
5] Result TNs

Argument and result TNs are not considered to conflict, allowing the locations for argument TNs to be recycled for results. This requires that the writer of the code generator guarantee that all of the arguments are read before any results are written. In any perverse cases where this is not true, explicitly allocated temporaries must be used to avoid trashing arguments. Temporaries are considered to conflict with both arguments and results. Argument load TNs conflict with argument TNs and temporaries. Similarly, result load TNs conflict with results and temporaries. This allows argument load TNs to be recycled for result load TNs. We need to annotate each VOP with the conflicts for load TNs that Pack might want to allocate.

PRELOAD GENERATION

On machines which pipeline memory fetches, it may be desirable to move memory reading operations backward in the code when it is possible to do so. This will allow the memory fetch to proceed in parallel with other computation. We use a scheme reminiscent of copy generation to do this. What we do is move the read back an appropriate distance, and then specify a Preload TN as the result TN. A Preload TN has a single writer and a single reader; if the original result TN is more complex, then an explicit move is placed between the Preload TN and the original result TN. Pack is allowed to move the writer VOP as far forward as the reader VOP. The understanding is that Pack will move the writer VOP as far forward as necessary for the Preload TN to be packed in a good location. The preloading could be done simply on a go/no-go basis, with the VOP being either left in place or moved immediately before the reader. The preload TN then gets assigned as its cost the amount saved by doing the preload.

The actual generation of the preload must take into consideration both the legality of the backward move and the desirability of a given move distance.
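Going back to CONFLICT ANALYSIS for a moment, a sketch of the cross-kind conflict rules as a predicate; the kind keywords are invented, and conflicts among TNs of the same kind are left to ordinary lifetime analysis:

    (defun vop-tns-conflict-p (kind1 kind2)
      ;; Temporaries conflict with everything.  Arguments and results
      ;; don't conflict, so argument locations can be recycled for
      ;; results.  Argument load TNs conflict with arguments; result
      ;; load TNs conflict with results; the two kinds of load TNs
      ;; don't conflict with each other, so they can also be recycled.
      (flet ((pair-p (a b)
               (or (and (eq kind1 a) (eq kind2 b))
                   (and (eq kind1 b) (eq kind2 a)))))
        (or (eq kind1 :temporary)
            (eq kind2 :temporary)
            (pair-p :argument :arg-load)
            (pair-p :result :result-load))))

    ;; (vop-tns-conflict-p :argument :result)      => NIL (recycled)
    ;; (vop-tns-conflict-p :arg-load :result-load) => NIL (recycled)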
As long as the writer VOP has no effects, the legality of the backward move is easily determined by checking for VOPs which have effects on the writer VOP between the original location and the possible new location. Determining the distance to move the writer is somewhat problematical, since on one hand we want to move it as far as possible up to some fixed number of cycles, but on the other hand we don't want to make the lifetime of the Preload TN too long, since that would make it harder to pack. This is further complicated by our having to guess how many cycles a VOP will take. If a go/no-go pack strategy is used, then it might be desirable to attempt to look at the lifetime data to anticipate the largest move distance that Pack would be able to deal with. This is probably a bad idea, though. If dealing with shorter than optimal preload times is important, then it would be better to make Pack attempt progressive shortening of Preload TN lifetimes.

This takes care of preload generation for operations known to do memory reads. Since we haven't run Pack yet, it is possible that operand loading may cause additional implicit reads. We can probably make a pretty good guess of which TNs are likely not to be packed into registers if we use the lifetime information. If at some point the conflict set is larger than the number of registers, then someone is going to lose. The losers will be the ones with long lifetimes and low costs. We can generate explicit preloads for uses of likely loser TNs. If the TN gets packed in a register, then the preloads will all go away. If we fail to guess an eventual loser, then we just lose the benefit of the preload. But... preloading memory TNs is probably not a very important optimization, due to all the other optimizations that put important TNs in registers. It is probably only worth implementing preloading for explicit reads, in which case we probably don't need the lifetime information, and this phase could be moved earlier in the compiler if there was any reason to do so. [We could get Pack to help more here, since Pack now knows about load TNs and loading.]

To have maximally tense array hacking, we should also consider how to use preloading in interesting ways for arrays, but this would interact with things like loop induction variable determination and other loop optimizations, since we can do a lot better if we know what the next index is going to be ahead of time. This would allow us to fetch the n+1'th element while processing the n'th. Note that it is harmless to fetch ahead an element which ends up being unused due to loop exit. Tense loop/array optimization is something which I haven't considered much in this design. In any case, these optimizations should happen somewhere before here.

PACKING

Assign TNs to SCs using cost and lifetime info.

In a full function call, we probably want to make the saving and restoring of registers at least semi-explicit so that Pack can tell whether it is worth packing a TN in a register. It may be that the cost of saving and restoring around calls outweighs the benefit of being in a register in other places. Probably a reasonable solution is to assign to each TN a save cost based on the number of full function calls at which it is live, modified by the appropriate loop-depth factor.

There is an argument for being extremely reluctant to put a TN on the stack when it can go in a register.
If we actually do a full function call, the overhead of saving and restoring will probably be inconsequential compared to the cost of the function itself, so in this case the choice is unimportant. If we don't do the function call, then we will want the TNs in registers. Min-max reasoning suggests that we have little to lose and something to gain by always putting things in registers. The argument breaks down when:
1] You have calls to simple functions in your inner loops.
2] You are more worried about space than speed.
This suggests that the time cost of saving and restoring should be understated, so that TNs are only packed on the stack to avoid saving when it is pretty clearly a win. There are probably a fair number of such clear-cut cases, such as a variable which is bound outside a loop, but used only as the argument to a function within the loop.

The "assumption of infinite registers" used by PQCC seems like it may be a lose on the Romp. The assumption that they make is that they can get away without allocating registers for operand loading, under the optimistic assumption that they won't be necessary. It looks to me like they get away with it primarily because many architectures don't require register operands in most cases, so even if the arg TN doesn't make it into a register, they won't have to allocate a load register. This is definitely not the case on a load-store architecture.

A possible solution would be to get Pack to allocate operand loading TNs. What we do is have an optional SC requirement associated with TN-refs. If we pack the TN in an SC which is different from the required SC for the reference, then we create a TN for each such reference, and pack it into the required SC. In many cases we will be able to pack the load TN with no hassle, but in general we may need to unpack a TN that has already been packed. [How do we preference load TNs? We could just give them the preferences of the original TN, but we could do better if we gave them only the preferences associated with the specific use. What we could do is only count preferences to TNs which are referenced by the VOP that is getting the load TN.]

The PQCC papers make a big show of the efficiency implicit in a Pack algorithm that never backtracks, but it seems in this case that it is an efficiency non-issue, since unpacking should be relatively rare, and is not very expensive in any case. The advantage of backtracking is that it will let us have arbitrarily restricted SCs for load TNs, yet have a reasonable expectation that they will be packed in a good way. It is not necessary to introduce arbitrary copy TNs as in PQCC. This isn't really heavy-duty backtracking anyway, since we don't undo all packings made between the time of the mistake and the time we found out it was wrong --- we only undo the packings that are actually causing problems.

The only real trick in unpacking is choosing a TN to unpack which can be unpacked with minimum disruption and pessimization. We only need to consider unpacking TNs which conflict with the one we need to pack, and obviously we would prefer to unpack the least important such TN. We must avoid unpacking TNs which we may not be able to repack: TNs which are restricted to finite SCs. Once we unpack our victim and force packing of our TN, the victim will presumably be packed again soon afterward. It is possible that the victim will be repacked in the same SC, but more likely that it will get packed elsewhere and we will have to pack its load TNs.
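A minimal sketch of the unpack-and-retry idea; the TN structure, the cost model, and the treatment of every packed TN in the SC as a potential victim are all simplifications:

    (defstruct ptn name required-sc location cost)

    (defun pack-in-sc (tn sc-locations packed)
      ;; Use a free location if there is one; otherwise unpack the
      ;; cheapest victim that has no required SC, so that forced
      ;; packings can never themselves be undone.
      (let ((free (set-difference sc-locations
                                  (mapcar #'ptn-location packed))))
        (cond (free
               (setf (ptn-location tn) (first free)))
              (t
               (let ((victim (first (sort (remove-if #'ptn-required-sc packed)
                                          #'< :key #'ptn-cost))))
                 (unless victim
                   (error "SC is full of unpackable TNs."))
                 (setf (ptn-location tn) (ptn-location victim)
                       (ptn-location victim) nil))))))  ; victim repacked later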
There is no possibility of endlessly undoing and redoing a packing, since the only TNs we will force packing of have required SCs, and thus cannot be unpacked. It is possible that the load TNs for the victim will force unpacking of yet another TN, but this process should be rare and limited, since each unpacking increases the odds of successful load TN packing by replacing TNs that have long lifetimes with short-lived load TNs.

The above method for allocating operand loading TNs works just as well for result evaluation TNs, since they all work through TN-refs. We could initially use a wimpy implementation which avoids unpacking by preallocating any loading registers that might be needed.

Storage classes:

We have lots of SCs because we try to get Pack to do our representation analysis for us. TNs are annotated with the primitive type of the object that they hold:
    T: random boxed object with only one representation.
    Fixnum, Integer, XXX-Float: object is always of the specified numeric type.
When a TN is packed, it is annotated with the SC it was packed into. The code generator for a VOP must be able to uniquely determine the representation of its operands from the primitive type and the SC.

Some SCs:
    Reg: any register (immediate objects)
    Save-Reg: a boxed register near r15 (registers easily saved in a call)
    Boxed-Reg: any boxed register (any boxed object)
    Unboxed-Reg: any unboxed register (any unboxed object)
    Boxed-Cont: boxed object on the stack (on cstack)
    Immediate constant SCs

Some more SCs: (when we add a number stack)
    Word: any 32-bit unboxed object on nstack.
    Double: any 64-bit unboxed object on nstack.
    Float-Reg: float in FP register.

Even more SCs:
    Unboxed-Char: character without type code, for maximally tense character/string hacking. Char-Code and Code-Char are no-ops if arg and result are both unboxed chars. Schar doesn't need to put on the type code if the destination is an unboxed char.
    Alien-Value: some kind of hack to get Pack to determine when to cons Alien values. This would allow Alien values to be first-class Lisp objects, yet we would still compile them tensely. This would require some sort of "record" TN which would be a TN with component TNs, but the component TNs would be packed independently, rather than being required to be consecutive. [or something?]

CONTROL OPTIMIZATION

In this phase we annotate blocks with drop-throughs. This controls how code generation linearizes code so that drop-throughs are used most effectively.

One question is how much of the original sequencing in the code we should attempt to preserve in IR1 conversion. It seems that this depends on both how smart the compiler is and how dumb the programmer is. Probably we want to preserve the drop-thrus in a TAGBODY at least until we can do loop rotation and similar things.

There are basically two aspects to this optimization:
1] Dynamically reducing the number of branches taken vs. branches not taken, under the assumption that branches not taken are cheaper.
2] Statically minimizing the number of unconditional branches, saving space and presumably time.
These two goals can conflict, but if they do it seems pretty clear that the dynamic optimization should get preference.

The main dynamic optimization is changing the sense of a conditional test so that the more commonly taken branch is the fall-through case. The problem is determining which branch is more commonly taken. The most clear-cut case is where one branch leads out of a loop and the other is within.
In this case, clearly the branch within the loop should be preferred. The only added complication is that at some point in the loop there has to be a backward branch, and it is preferable for this branch to be conditional, since an unconditional branch is just a waste of time.

In the absence of such good information, we can attempt to guess which branch is more popular on the basis of the difference in cost between the two cases. Min-max strategy suggests that we should choose the cheaper alternative, since the percentagewise improvement is greater when the branch overhead is significant with respect to the cost of the code branched to. Probably the most tractable implementation of this would be to compare only the costs of the two blocks immediately branched to, since this would avoid having to do any hairy graph walking to find all the code for the consequent and the alternative. Determining branch sense on this basis is probably only called for when one of the options is quite inexpensive. [Any reason?] It would also be reasonable to discriminate against ultra-expensive functions such as ERROR.

One problem with this approach to dynamic IF optimization is that it loses when one of the options is empty, causing the target of one branch to also be a successor of the other branch, which makes the cost comparison meaningless. This probably isn't a big problem. Actually, we can detect this using dominator information: when a branch is empty, one of the predecessors of the first block in the empty branch will be dominated by the first block in the other branch. In such a case we would favor the empty branch, since that's about as cheap as you can get. When you decide to favor a branch of an IF, you probably want to favor the join by making it a drop-thru as well.

Statically minimizing branches is really a much more tractable problem, but what literature there is makes it look hard. Clearly the thing to do is to use a non-optimal heuristic algorithm. Given a spanning tree, we can get a legal set of drop-thrus by deleting all but one ingoing and one outgoing arc of each vertex. It makes no difference which one we delete as far as the number of branches is concerned. If we could somehow assign weights to the arcs which would encourage building a good set of drop-thrus, then we could use a lightest-spanning-tree algorithm. Clearly a major determinant in the weight of an arc is the number of remaining paths out of the block it comes from: the fewer, the more important it is. It makes sense to prefer the larger component when choosing among equally weighted arcs to add. Longer components are better, since each component represents two branches but consumes a bigger portion of the program as it gets longer.

Another possibility is to use an algorithm based on depth-first spanning trees. We can modify the basic depth-first ordering algorithm so that it chooses an ordering which favors any drop-thrus that we may choose for dynamic reasons. When we are walking the graph, we walk the desired drop-thru arc last, which will place it immediately after us in the DFO unless the arc is a retreating arc. We scan through the DFO, and whenever we find a block that hasn't been done yet, we build a straight-line segment by setting the drop-thru to the unreached successor block which has the lowest DFN greater than that for the block. We move to the drop-thru block and repeat the process until there is no such block. We then go back to our original scan through the DFO, looking for the head of another straight-line segment.
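A sketch of that straight-line segment construction; DBLOCK and its slots are hypothetical, and DFO is assumed to be the list of blocks in depth-first order:

    (defstruct dblock dfn succs done drop-thru)

    (defun assign-drop-thrus (dfo)
      ;; At each block not yet placed, grow a segment by repeatedly
      ;; dropping through to the unreached successor with the lowest
      ;; DFN greater than the current block's DFN.
      (dolist (head dfo)
        (unless (dblock-done head)
          (loop for b = head then next
                for next = (first (sort (remove-if
                                          (lambda (s)
                                            (or (dblock-done s)
                                                (<= (dblock-dfn s)
                                                    (dblock-dfn b))))
                                          (dblock-succs b))
                                        #'< :key #'dblock-dfn))
                do (setf (dblock-done b) t)
                while next
                do (setf (dblock-drop-thru b) next)))))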
This process will automagically implement all of the dynamic optimizations described above as long as we favor the appropriate IF branch when creating the DFO. Using the DFO will prevent us from making the back branch in a loop the drop-thru, but we need to be clever about favoring IF branches within loops while computing the DFO. The IF join will be favored without any special effort, since we follow through the most favored path until we reach the end.

This stuff will undoubtedly have some effect on the plausibility of using delayed branches. It remains to be seen what effects are good and bad and what may be done to maximize good effects. This needs some knowledge about the target machine, since on most machines non-tail-recursive calls will use some sort of call instruction, and thus should not be considered as drop-throughs.

CODE GENERATION

This should be fairly straightforward. We translate VOPs into instruction sequences on a per-block basis. Probably the only real work is dealing with the remaining semantics of functions, such as saving and restoring around full function calls and generating function entry points. The result is a list of the exact instruction sequence for a block, exclusive of final branch instructions. After code generation, all control flow is explicit.

VOPs for error checking and value receiving may expand into code which contains branches. In such cases, Codegen introduces new blocks which are linked in as appropriate. Implicit conditional branches are expressed in exactly the same way as those which implement IFs. Codegen can hack the flow graph fairly easily, since it doesn't need to worry about invalidating the flow analysis information, which is not used after this point.

Each component in the flow graph is dumped as a distinct function. If there are multiple entry points, we choose one as the main function and store all the constants in it. The other entry points will have this function object as a constant, and will grab it out when they are called.

There are a few optimizations which can really only be done on the assembly code. These would most often be done by peephole optimization. We may be able to do them by cooperation between code generation and the assembler. The most important optimization is turning branches into branch-with-execute. The assembler can do this without a full peephole optimization stage because the blocks are passed in separately.

ASSEMBLY

Lay out the blocks somehow or other, taking into account the drop-thrus. This process should attempt to minimize branch lengths and place related code together. Once this is done, the exact branch code can be generated, taking into account possible use of delayed branches. After that, all that needs to be done is instruction translation and fasl-dumping.

TARGET DEPENDENCE

There are basically two levels of target dependence. Parts of code generation will be totally target dependent, since they will deal with actual machine instructions. Other code will depend on the execution model, but should map onto many machines with little difficulty. This includes IR2 conversion and transforms which expand into sub-primitives. Some IR2 passes emit VOPs, but these VOPs will be in every implementation's IR2.

IR2 optimization has some problems, since it needs to know about the side-effects and dependencies of VOPs so that it can tell when an expression should be killed. This isn't very serious, though, since without determining what aliasing exists we can't do much better than the boolean effects used by Rabbit.
Determining aliasing relations in Lisp is pretty impossible. We can do better for lexical variables, but we have already mapped them to TNs which are independent of the particular IR2. For special variables we can just have a Special effect which all special references depend on. We could do better if we treated specials specially, but it probably isn't worth it. All we need, then, is the set of effects and dependencies for each VOP; this can easily be tabular.

KEYWORD ARGS

Someday we can have keyword entry points that allow us to do the keyword parsing at compile time for calls of known functions with constant keywords. This is pretty hard in the general case, since defaults for keywords can depend on the values of previous keywords. If there are no ugly dependencies, then we can just have a keyword entry which takes all the keyword args as normal args, and let the caller worry about defaulting. For ugly args, there can be an additional supplied-p parameter which the keyword entry looks at to decide whether to default the argument.

This also opens the way for semi-automatically eliminating keyword calls to well-known functions such as sequence functions. All we need is something like a KEYWORD-ENTRY proclamation which indicates that the specified function should have a special entry for calling with pre-parsed keywords. This will only work in conjunction with an assertion that the function interface is the same in the runtime environment as it is in the compiler. This can be provided by a FUNCTION proclamation. Actually, we could do this whenever we have a proclaimed function interface, KEYWORD-ENTRY notwithstanding. The only problem with this is that we might feel some compulsion to provide better inconsistency checking and facilities for redefinition if we automatically made keyword entries. For this external keyword entry we should just pass a supplied-p arg for every keyword, rather than having the callers have to know about and worry about the exact defaulting mechanism and the presence of supplied-p args.

MUMBLAGE ABOUT TYPES

The current definition of types in Common Lisp sucks. Let's start thinking about ways to clarify the mess.

It makes some kind of sense to base the type system on the notion of objects belonging to types, rather than on direct inclusion relationships between types. What then becomes the primary concern is the mapping between Common Lisp types and the possible implementations for those types. This notion subsumes the idea of specialization of types in the manual. It seems that the best definition of an object belonging to a type is: an object belongs to a type if it could have been created by the system when told to create an object of that type. "Telling the system to make an object of type X" ultimately involves calling some Common Lisp function which is constrained by the language definition to return an object "of type X" when used in this way.

Incomplete type specifiers cause some problems. Note that there is no way to tell the system to create an object of type SIMPLE-STRING, since the size is not specified. You must specify the length, getting something like (SIMPLE-STRING 3). We get around this problem by considering an incomplete specifier to stand for all the possible complete specifiers when we are testing an object for type membership. This mapping is from type specifiers to sets of type specifiers, and is entirely defined by the language. This mapping is quite different from the implementation mapping, which maps programs to objects.
The implementation mapping is not a mapping from type specifiers to objects, since the system may consider things not describable by a type specifier when choosing implementations. If the system can prove that your program uses an object in a constrained way, then it may choose a different representation than it would under other circumstances. This freedom to choose representations is constrained only by the places in the Common Lisp definition which require "types" to be disjoint. These constraints really mean that the representations of the indicated types must be disjoint. In the absence of a meaningful TYPE-OF, this qualification is unnecessary, since there is no way that the user can directly discern the implementation. Typep already guarantees that it will hide the implementation mapping.

Note that this definition is more restrictive than the current one in the case of "specialized arrays", since it would require that the result of make-array always belong to the specified type. There is no loss of power, since it simply makes operations well defined which weren't before.

One would think that defining types in terms of object membership would make the SATISFIES specifier less problematical. With this formulation, at least, this is not the case. There is no way to tell the system to create an object of type SATISFIES, so no object would be of type SATISFIES. One way around this problem is to say: the user will implement this type in terms of primitive types. We can assume nothing about the implementation, and will ask the user-supplied function whether he might have made it.

(subtypep x y)
    All operations defined on each possible object of type X are defined on some possible object of type Y. Alternatively, all possible implementations for an object of type X are also possible implementations for an object of type Y. The implementations of X are a subset of the implementations of Y.

Important aspects of subtypep are portable. If X is just a more fully specified version of Y, then X is a subtype of Y in all implementations. The implementation-dependent aspects enter when we involve the implementation mapping. SUBTYPEP is defined in terms of the real implementation mapping, rather than in terms of an idealized Common Lisp implementation. The only real difference is that in a given implementation, potentially distinct types may be mapped to the same implementation. Suppose an implementation had only one integer representation, causing INTEGER, FIXNUM and BIGNUM to be interchangeable. In this implementation, INTEGER is a subtype of FIXNUM.

At least for the purpose of the compiler, it makes more sense to preserve all possible type information in the program so that more checking can be done. In an implementation where SHORT-FLOAT is the same as SINGLE-FLOAT, calling a SINGLE-FLOAT function with a SHORT-FLOAT will work fine, but the program won't port. The compiler should be able to detect erroneous programs even when they will work in a given implementation. This aspect of the SUBTYPEP definition is suboptimal for some purposes. The main example is statically determining type membership for an object, given that we know some type that the object belongs to. At least in the case of floats, it is pretty clearly incorrect to say that two types are disjoint when in that implementation they are identical. The SIMPLE-ARRAY <=> ARRAY case is also interesting.
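For example, in an implementation where SHORT-FLOAT and SINGLE-FLOAT are the same, the type-folding DEFTYPE idea mentioned below might look like this (a sketch of the shape of the definition, which would be the implementation's to make, not the user's):

    (deftype short-float (&optional (low '*) (high '*))
      ;; Fold SHORT-FLOAT into SINGLE-FLOAT, preserving any bounds.
      `(single-float ,low ,high))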
Note that a program which does (TYPEP 1f0 'SHORT-FLOAT) is ill-defined unless the program has called SUBTYPEP to find out the float type relationships, or is just trying to determine the relationship by doing the test. It looks like the thing to do is to make the actual subtypep implementation as picky as the language allows, and then have each implementation define DEFTYPEs which fold together equivalent types. This type-folding could be turned off for purposes of type-checking in the compiler.

And what about TYPE-OF? Do we want to attempt this identity?
    (SUBTYPEP (TYPE-OF x) 'foo) <=> (TYPEP x 'foo)
It's certainly true that:
    (SUBTYPEP (TYPE-OF x) 'foo) => (TYPEP x 'foo)
This will definitely require changing lots of people's TYPE-OFs. (TYPE-OF 3) isn't FIXNUM, it's (INTEGER 3 3). This would definitely make TYPE-OF a bitch, since it would have to find a type as specific as the most specific one that the user could meaningfully pass to SUBTYPEP. I think that this would end up unnecessarily constraining the implementation mapping, since it would be very hard to have types whose implementations partially intersect. In any case, it's probably good for TYPE-OF to try real hard to maintain this equivalence.

INTERPRETER INTERFACE

We have to figure out how to make special bindings in the interpreter and PROGV. We want a special frob like:
    (WITH-RANDOM-SPECIALS
      ...
      (%SPECIAL-BIND sym value)
      ...)
Each %SPECIAL-BIND will add a special binding to the environment of code evaluated after it. Exiting the form will undo all special bindings made within the form. This can be implemented fairly easily as a macro when we are using shallow binding. It will just stash the initial BSP in a variable, and then do the body in an unwind-protect, calling some frob in the cleanup which unbinds to the saved BSP. This is more of a bitch if we have deep binding. We might want to do this some other way, possibly splitting the binding frobbing off from the code affected by the bindings.

Cute idea: (PROGV vars vals body) => (%PROGV vars vals #'(LAMBDA () body)). %PROGV is a totally magical function which establishes the bindings somehow, and then calls the body function. This would only be reasonable if we have stack closures.

The only residual problem is dealing with the places in the interpreter where people use the stack for temporary storage, pushing and popping it randomly. The only places that do this are the binding forms and PSETQ. In the case of the binding forms, we could use the conses which we are going to cons anyway to hold the results. This would require rearranging the code a bit, but would probably improve performance. PSETQ is more problematical, since it currently doesn't cons. Probably we could get 90%+ of the performance by special-casing fewer than N vars for some small N and consing for more. A little consing in PSETQ isn't going to kill anyone anyway, considering all the consing caused by variable bindings. If we ever reduce that consing, we can apply the same technology to PSETQ.

IR2 CONSISTENCY CHECKING

Block connectivity is similar to IR1. Verify that TN-REF linkages are correct. Check that the VOP list within each block is well-formed. Check VOP arg counts? Check for valid TN assignments w.r.t. function call. Check loop analysis?

Check cleanup generation: Could look for uplevel continuations that might have been missed.

Check environment analysis: Do basic sanity checks, such as making sure that directly referenced TNs are directly accessible in the current context.
Check for references to uninitialized TNs: We could detect some lossage by looking for TNs which are referenced before they can possibly have been written. This could easily fall out of lifetime analysis. If a TN is uninitialized at the beginning of a block, and is read before it is written, then we have a problem.

Check packing legality: To the extent that it is possible, this is a good idea. We can certainly check for things like SC restrictions being satisfied. We could also check on the legality w.r.t. TN lifetimes, but it would be extra work. We could redo the live var analysis using simpler data structures and algorithms which are optimized for legality checking. After checking each block, we check that the new analysis agrees with the old, which will guarantee that the lifetimes are globally correct. Our checking algorithm would probably be based on data structures representing locations rather than TNs. Each location would have a flag indicating which live TN is in it, if any. If someone attempts to read a different TN in that location, then we know we have a problem. The validity checking might not be much less buggy than the real version, but hopefully the bugs won't intersect.

Check accuracy of effects information: Incorrect tables of information about effects and interfaces are a major source of lossage. We could check up on this by mumbling when a supposedly effectless operation is used for effect. We could also complain if a supposedly effectless operation is explicitly replicated in user code.

Could have a grab-bag of other checks: whenever we find a problem, we try to devise a check which would have detected it.

Cost assignment checking: Probably a good idea, since bad costs would cause subtle bugs: code would work, but slowly. Can we write something to check for efficiency bugs? We can check for things like:
    TNs which aren't preferenced that we make moves between.
    TNs that are preferenced, have disjoint lives, and get packed into different LOCs in the same SC.
    Specific VOPs whose average cost is out of line with initial estimates.
    Code sequences containing many VOPs with above-average costs.
We could have a smarter, slower pack algorithm which tries to find better packings. If the better packing is enough better, then we complain.

OBSERVED BUGS AND POSSIBLE FIXES

Some bugs gleaned from the change logs of existing compilers, with possible ways to prevent them from occurring in this compiler:

Generators emitting bad assembler format: Make assembly emission be done with a macro that checks the instruction for validity at compile time.

Random address arithmetic being done wrong with magic numbers: Use only named constants and well-defined named functions in such expressions. Examples of functions are:
    words=>bytes (* x 4)
    byte-index (+ x y)
    skip-header (+ x )
No magic numbers in generators anywhere.

Bugs with stack simulation: Let Pack worry about it.

Generators putting unboxed things in boxed registers and vice versa: Mainly a problem with temporaries used within the code for the VOP. This is most likely to happen in code which deals with boxing and unboxing numbers. Factor this code out and get it right. We can have a frob that will tell us all VOP generators that request unboxed TNs, so that these can be given an extra-careful going over.

General usage of wrong registers (magic register numbers): Make generators get their temporaries by requesting them as TNs. The declarative VOP info will cause them to be allocated and passed in.
The requirements are specified in terms of SCs rather than magic registers. This should give us some freedom to move things around without having to change all the generators.

Problems with inadequate restrictions being specified on TNs: Have conservative defaults?

Functions that pass around pieces of LAP, such as operand locations, cause problems with callers not being up on the syntax of what is going on: Don't pass around syntax; pass around structures that represent semantics and can be checked.

Broken generators fail to return the results and similar stuff: Check that all the args and results are actually used as operands by the generated code. We could have temporaries be classified by whether they are always used or only sometimes used, and then could check that the always-used ones are always used.

Representation selection seems to be a huge source of problems: Does having lots of SCs and getting Pack to work for us really eliminate all these problems?

Discrepancies between the interface to a transform and the real definition: This is largely eliminated by specifying a type that the call must satisfy. If we forget to handle a keyword in a transform, then it just won't get transformed. We could have a frob that compares the types for transforms to the types for the definitions.

Other confusions between different implementations of the same thing: Macros vs. special forms. Functions vs. source transforms vs. transforms vs. open coding. Do they do the same thing? Are the differences gratuitous? Are we using the one we want? Do we have all the versions that we need?

If the same information is kept in more than one place, then it is surely wrong somewhere: Don't duplicate information. Check that duplicates are consistent.

When decorating the representation, we forget to fill in some slots, and then garbage is interpreted in some random way: Choose defaults that are obviously broken: *empty*

Failure to realize that some stuff must be recomputed, due to failing to realize that something has been invalidated or failing to propagate the invalidation.
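Finally, going back to the uninitialized-TN check under IR2 CONSISTENCY CHECKING: a sketch of how it could fall out of the lifetime information. The block representation and the (operation result arg...) VOP shape here are assumptions:

    (defstruct cblock vops live-at-head)

    (defun check-uninitialized-reads (block)
      ;; A TN read in this block must either be live at the block head
      ;; or have been written earlier in the block; anything else is a
      ;; read of an uninitialized TN.
      (let ((written '()))
        (dolist (vop (cblock-vops block))
          (destructuring-bind (op result &rest args) vop
            (declare (ignore op))
            (dolist (arg args)
              (unless (or (member arg (cblock-live-at-head block))
                          (member arg written))
                (error "Read of possibly uninitialized TN ~S." arg)))
            (push result written)))))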