Some mumblage about possible function call and register conventions for a new Lisp compiler on the RT.

FUNCTION TYPES

The old notion of a "function object" is now broken down into three different parts:

Function entry
    A function entry is a structure that holds the information that we need to call a function.

Environment
    The environment is stuff that a function needs when it runs. This includes constants computed at load time and variables closed over at run time. Environment information may be allocated in the function entry structure after the required linkage information.

Debug information
    This is information about a function which isn't normally needed at run time. Debug information will be either allocated in or pointed to by the environment area.

Function entries:

We support two function calling conventions: named and anonymous. A named call is a call to a known global function, and is done through the link-table. An anonymous call is done when the function being called is not known at compile time.

The format of a function entry looks like this:

    header word {function entry subtype}
    pointer to code-vector
    arg-count info or pointer to env that has this info
    environment info...

There are a fixed number of entry points, one for each possible number of register args, and one for when there are more. When there are stack args, the total number of args is passed in the NARGS register. The entry points are direct pointers to the PC where the code starts. If it is not legal to call the function with a particular number of args, then the entry point points to a standard error routine.

It would be possible to have additional entry points for specific numbers of arguments >3, which could be used by named call, but it probably isn't worth the effort: if there are stack arguments, then you are losing pretty badly already.

When we do an anonymous call (FUNCALL), we make sure the object is a function object, grab the code vector and jump in at a fixed offset.
If it is not a function object, we jump off somewhere else and deal with LAMBDAs and illegal functions. The 0 arg entry is at byte 0, the 1 arg entry at byte 4, etc. Each of these fixed entries will be either a relative branch to the real entry code or an absolute branch to an error routine. When snapping link-table links, we look at these branches to find the real entry point, allowing us to skip this extra branch during named call.

This representation makes no distinction between "closures" and functions which have no closure variables. For that matter, there is no distinction between "constants" and closure variables.

Purify will move all environments and their dependent structure into read-only space. If this will be bad for you, then you can throw a switch that causes environments to be moved into static space.

In some cases, we want to store environment info without any directly associated function entry. Such an environment is indicated by the subtype in the header. Presumably the environment can somehow be reached from the environment area of some entry point.

DEBUG INFO

Since the compiler takes substantial liberties with where it allocates values and how it does function call, we need at least some special debug info before we can do any kind of reasonable debugging. There are three interesting levels of debug info:

1] Minimal info which allows for a backtrace with arguments. This requires information about the locations of the basic context pointers such as the environment ptr, the previous environment and PC, and the arguments themselves. Values which are allocated in a register may also have save locations which must be used when the routine is not the current one. This info could be represented by a vector of alternating register numbers and stack offsets, with either one being null if the value isn't allocated there.
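A minimal sketch of how such an alternating location vector might be consulted (Python used purely for illustration; the function name and the tuple encoding of locations are assumptions, not part of the design):

```python
# Illustrative sketch: a per-value location record as a vector of
# alternating register numbers and stack offsets, with None meaning
# "not allocated there".  Slot 2*i is the register for value i,
# slot 2*i+1 is its stack save location.

def value_location(locs, i, current_frame):
    """Return where value i lives: its register if this frame is the
    currently executing one, else its stack save slot."""
    reg, offset = locs[2 * i], locs[2 * i + 1]
    if current_frame and reg is not None:
        return ('register', reg)
    if offset is not None:
        return ('stack', offset)
    return ('register', reg)  # register-only value; no save slot exists

# Example: value 0 lives in register 5 with a save slot at offset 12;
# value 1 is stack-only at offset 16.
locs = [5, 12, None, 16]
```

The point of the save-slot column is visible here: a register value of a suspended frame must be read from its save location, not the register.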
The debug info for each environment will also need the ranges of PCs in which it is current, so that we can find the correct environment given the current PC.

2] Info which allows location of user variables given their name. A reasonably compact representation for this info would be to have a vector for each kind of storage, with a cell for each location. Each cell would store the string name of the variable located there, or a vector of the strings if there is more than one. The compiler should uniquify repeated variable names somehow.

3] Info which allows a precise source location for some PC to be determined. For each component, we have a list of structures describing each top-level form. There is information about the source file and the location in the file where the form starts, and a sorted vector which translates from PCs to source paths within the form. The vector has alternating PC and source path entries; the code between two PC entries belongs to the source path they surround. The source path is represented as a 4-bit I-Vector, with big indices being represented by 15 followed by a 4-nybble offset. We can locate the source path for a PC by doing a binary search.

CONTINUATION REPRESENTATION

A continuation does the same thing as a conventional stack frame, but is conceptually a sort of stack closure. It contains two areas of stuff: saved registers and temporary data. The saved registers are blasted out immediately before a call using STM. Temporary data is anything which didn't make it into registers, and is accessed word by word.

CONT remains unchanged during the execution of a function. SP may be changed as the demand for temporary storage changes. Since the caller resets SP after a call, the amount of temporary storage may be changed for free.

    SP -->
            .
            .
    CONT -->
            temporary data
            .
            .
            saved registers

The arguments to the continuation are the values being returned. If there is other than one argument, we have a multiple-value return.
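The PC-to-source-path translation described under DEBUG INFO above amounts to a binary search over the alternating vector. A sketch (Python for illustration; representing source paths as strings here is a stand-in for the 4-bit I-Vector encoding):

```python
# Sketch of the PC -> source-path lookup over a sorted vector of
# alternating entries: [pc0, path0, pc1, path1, ...].  Code at
# addresses in [pc_i, pc_{i+1}) belongs to path_i.
import bisect

def source_path_for_pc(vec, pc):
    pcs = vec[0::2]                       # just the PC entries
    i = bisect.bisect_right(pcs, pc) - 1  # last entry with pc_i <= pc
    if i < 0:
        return None                       # PC before the form's code
    return vec[2 * i + 1]

# Example vector: code at [0,20) is path 'a', [20,36) is 'b', 36+ is 'c'.
vec = [0, 'a', 20, 'b', 36, 'c']
```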
There are two interesting problems in the continuation calling convention: argument dispatching and argument passing.

The continuation argument dispatching problem is similar to the anonymous call argument dispatching problem. We could solve it in the same way, but we can do better if we recognize the supreme importance of calling a continuation with a single argument (returning a single value). The solution is twofold:

1] If other than one value is being returned, this fact is flagged to the continued code. This flag should be easy to set in the one-value position and easy to ignore. A good possibility is to return a NEQ condition code if there is only one value. Normal functions should find it easy to set this condition without doing any extra work, and it is trivial to ignore.

2] If the continued code only wants one value, it may ignore extra values at no cost. This imposes a constraint on the calling convention: the continuation must restore the SP, since it cannot be decremented until any return values are safe in temporary storage allocated by the continuation. Normally this will only take a single instruction which adds a constant to the continuation and stores it in SP.

The argument passing convention may be the same as for normal functions. The continuation being invoked will be passed in the continuation register, since a continuation has no continuation. If a continuation returns other than one value, all the unused arg registers are set to NIL.

We may have to close over a continuation due to a return or go out of a function. In this case, we treat the continuation like a lexical variable, storing the stack pointer to the continuation in the variable. When the dynamic extent of the continuation is exited, we zero the closure variable so that we can tell it is invalid. For this non-local continuation case, we fetch the return PC out of a known place in the continuation.
We actually know this PC at the place where the continuation is invoked, but we may have to process unwind-protects between here and there. The stack pointer for the closed continuation actually points at the PC, which is somewhere within the continuation for the function being continued.

Things which unwind work by first finding the place to unwind to and then invoking an unwind routine with this continuation. The unwind routine looks at the current unwind-protect and invokes it if it is dynamically below the continuation. The unwind-protect entry stashes the continuation being unwound to, does the cleanup code, and then calls the unwind routine again.

An unwind-protect block is just a pair of cells in the continuation of some function on the control stack:

    pc for entry point
    previous unwind-protect

A catch is pretty similar, except that catch blocks are linked separately from unwind-protects and the tag is in the block:

    pc for catch entry point
    previous catch
    tag

A throw works by scanning through the linked catch blocks to find the continuation to unwind to, and then calling the unwind routine. At any of these unwind entry points, the continuation pointer is fixed to point to the start of the whole continuation, and any other necessary cleanup such as special unbinding is done. Of course, the stack pointers need to be closed over somewhere in the continuation, but that may only have to be done once at function entry.

It should also be fairly easy for the compiler to break up catches and unwind-protects by storing into the current catch or unwind-protect, rather than doing a throw. These changes would make unused catches quite a bit faster, which is probably worthwhile since most catches and unwind-protects usually fall through.

REGISTER ALLOCATION

Ultimately the registers will be allocated differently so as to give the compiler as much freedom as possible. The main idea is to globally allocate only those registers with global significance.
There is only one: the control stack pointer. Scratch registers may be trashed by miscop calls, whereas temporary registers may not.

    NL0:   unboxed temporary
    NL1:   unboxed temporary  A.K.A. CONT, the continuation for a call.
    SP:    control stack pointer
    NL2:   unboxed scratch
    NL3:   unboxed scratch
    A0:    first function argument  A.K.A. MA0, first miscop argument.
    A1:    second function argument and boxed scratch.
    A2:    third function argument  A.K.A. MA1, second miscop argument.
    L0:    boxed scratch  A.K.A. NARGS, number of args in call (if not implicit.)
    L1:    boxed scratch  A.K.A. MA2, third miscop argument.
    L2-L5: boxed temporaries
    L6:    boxed temporary  A.K.A. ENV, the environment in a function call.
    L7:    boxed temporary  A.K.A. PC, function return address.

Rationale:

Boxed temporaries go last, since we want to be able to use LM and STM for reading and writing continuations (stack frames), and the contents will be these registers.

We want the miscop argument registers to alternate with scratch registers so that shift-paired instructions can be used to extract the type of miscop arguments. These can be boxed temporaries since the extracted type code is a valid fixnum.

CONT, NARGS, ENV and PC are preallocated only to the extent that a certain useful thing is passed into a function in them. The function may do whatever it wants with the registers and values. NL1 will probably usually have the continuation in it. We don't pass the continuation in a scratch register since we will want it later and may not have to save it.

Non-Lisp temporaries are more useful than you might suppose, since any immediate object can be kept in them (i.e. fixnums.) Unboxed registers become quite important when you start open-coding arithmetic, since if you have an actual unboxed thing (say a single-float), it is more expensive to spill than a boxed quantity.

This scheme separates the function argument (A) and miscop argument (MA) registers.
This allows the block L0-L7 to be freely used to allocate locals which will need to be closed over in a continuation, as long as no miscop calls are done during the evaluation of arguments to the call.

NARGS is passed in a scratch register (if at all), since it is useless after the arguments are processed. We don't have to pass NARGS when three or fewer args are being passed. This requires a little cleverness, since one of the register args might be part of a rest arg. What we do is always have an entry point for each legal arg count less than four, rather than always jumping in at a rest entry when there are rest args.

ENV and PC come into a function where they do partly because that's where LM leaves them. This is a pretty good place anyway, since if we have a continuation these things will want to go in it.

Probably we will initially keep the old register allocations for registers that miscops care about. This will eliminate the need to fix up all the assembler sources. The only big problem with the old allocations is that there aren't any non-scratch unboxed registers. This won't be a problem until we start generating code that hacks unboxed numbers. We can free up BS, FP, AF and PC, allocating on an as-needed basis. The revised allocations will look something like this:

    NL0: A.K.A. NARGS
    A0
    NL1
    A1
    A3:  fourth miscop argument
    A2
    CS:  control stack pointer
    L0
    L1
    L2
    L3
    L4
    L5:  (was BS, binding stack pointer)
    L6:  A.K.A. CONT (was FP, active frame pointer)
    L7:  A.K.A. ENV (was AF, active function)
    L8:  A.K.A. PC

SAMPLE CODE SEQUENCES

Named function call with no stack args. We assume CONT has already been saved and that CONT already points to our context; this will have been done at function entry.

    stm  CONT, start-reg, -size    ; Save registers
    cau  NL3, link-tab-high
    lm   NL3, ENV, link-tab-low    ; Get entry and env
    balr PC, PC                    ; Go

The entry for a single-value continuation:
    cal  SP, CONT, size            ; Reset SP
    lm   CONT, start-reg, -size    ; Restore registers

Single value return from a function which makes no function calls:

    brx  PC                        ; Return
    <something that sets the NE condition>

Single value return from a function which made calls but was able to save PC and CONT in registers. ??? is an instruction that sets the CC and moves the result (AND immediate -1?):

    brx  save-pc                   ; Return and
    ???  CONT, save-cont           ; restore CONT while returning NEQ

Single value return from a function which had to spill PC and CONT into memory:

    lm   CONT, R14, save-offset    ; Load CONT and PC into R14 and R15.
    brx  PC                        ; Return and
    ???  CONT, R14                 ; restore CONT while returning NEQ

Multiple value calling is vastly more efficient than it was, if there are three or fewer values being received. The entry for a multiple-value continuation which wants two or three values:

    bex  skip                      ; If a MV return, do nothing.
    cal  SP, CONT, size            ; Reset SP
    cau  A1, nil-h                 ; Default unsupplied values to NIL.
    [lr  A2, A1]
skip
    lm   CONT, start-reg, -size    ; Restore registers

The return of zero, two or three values is identical to a one-value return, except that the unused arg registers must be NIL'ed and the EQ condition returned.

In the case where a continuation is used by more than one function call, we must put in a jump to the join point either before or after the balr. We could put the jump after the balr and then let cross-jumping optimization combine as much of the calling sequence code as possible.

SYSTEM ISSUES

GC is going to have to work some to deal with all these raw PCs lying around. Too bad... The scenario for GC is this: first find all functions on the stack and in boxed registers, then find all code pointers on the stack and in boxed registers, and for each one check to see if it points into the code vector of one of the functions. If so, adjust it if necessary.
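A sketch of that GC scan, with code vectors modeled as address ranges (Python purely for illustration; the function name and the range/delta representation are assumptions):

```python
# Sketch of the GC scenario above: given the set of live functions
# (each with its code vector's old address range and relocation delta),
# decide what to do with each raw PC found on the stack or in a
# boxed register.

def adjust_pc(pc, functions):
    """functions: list of (code_start, code_end, delta) triples.
    Returns the adjusted PC, or the PC unchanged if it points into
    no live code vector (stack garbage -- safe to ignore)."""
    for start, end, delta in functions:
        if start <= pc < end:
            return pc + delta      # code vector moved by delta
    return pc                      # garbage PC: leave it alone

funcs = [(0x1000, 0x1400, 0x200)]  # one function, moved up by 0x200
```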
A code pointer must be distinct from an I-Vector pointer because old PCs may be left lying in temporary storage on the stack after their corresponding function object is gone. This is due to temporary storage allocation keeping whatever stack garbage was there. If we find a PC without any corresponding function then we can just ignore it. If a garbage PC happens to match a function, it doesn't hurt anything, although the function won't get GC'd even if it is true garbage.

Life is real tough for the debugger. Without stack garbage, it is not too hard to get a backtrace without arguments of non-tail-recursive calls. It should be possible to detect "calls" which are stack garbage by tracing up the stack and making sure that the return addresses link together properly.

More elaborate debugging information could be enabled by a compiler switch. At function entry, we can STM all the boxed registers (A0-L7) into a known location in temporary storage. This would hold on to the arguments and make it much easier to find the calling function and return PC. Tail recursion can be inhibited by wrapping a mv-prog1 around the body of every function. This stuff has a significant time penalty, but would allow reasonable debugging of compiled code.

We could emit a storage map which tells where every variable is located. By adding spurious variable references, we could keep lexically visible variables from disappearing when they die. We could use the same technique for arguments instead of saving them separately, but it is useful to have the actual argument values rather than the possibly modified argument variables. It would also be possible to save the arguments without building a storage map, which would save a great deal of space.

RANDOM IDEAS

Adding the notion of a constant function would allow a super-fast calling sequence. A constant function is a function which may not be redefined. We put the function object and entry PC in our constants vector.
We know which registers it uses, so we might not need to save any. We know how many values it returns. We know if it changes SP. Calling a simple constant function may be nearly as fast as a miscop call.

Call of a constant function known not to trash any of our registers. Assume ENV and PC were saved at function entry. We don't have to do anything on return.

    lm   env-save, ENV, offset     ; Get new ENV and PC
    balr PC, PC

This would hair up GC some, but not too much: all you have to do is look out for function objects followed by PCs while scavenging function space.

MULTIPLE THREAD SUPPORT

When we support multiple threads under Mach, we will have to scrounge a THREAD register. This will point to a structure which contains the thread-specific context which doesn't fit in registers:

    Current special binding block
    Active catch pointer
    Any number stacks
    Information for management of the resources for this thread:
        Base and size of stacks.
        Descriptive info (name...)
        Policy info (priority...)
        OS level handle on the thread
        Synchronization info such as what lock the thread is waiting on, if any.

The THREAD pointer could actually be the base of the thread's control stack, with the frequently used info being stored before any real stack stuff. This would allow fast checks for continuation validity, since both the top and bottom of the cstack will be in registers. This is useful, since a non-local GO into another thread's stack would cause flaming death.

We will also have to modify code which manipulates resources shared between threads so that it does synchronization. The big issue performance-wise is the alloc-table: when there is no lock on the table, we want to get at it real fast. Contention for the alloc-table could be reduced by having multiple allocation spaces (and tables) for important types. You could just grab a table off of a queue and put it back when you are done.
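The grab-a-table-off-a-queue idea can be sketched as follows (Python purely for illustration; the class and function names, and the use of a thread-safe queue as the synchronization mechanism, are assumptions):

```python
# Sketch of reducing alloc-table contention with a pool of allocation
# tables: a thread grabs a whole table off a queue, allocates from it
# without further locking, and puts it back when done.
import queue

class AllocTable:
    def __init__(self, start):
        self.free_pointer = start          # next free address in this space

    def allocate(self, nbytes):
        addr = self.free_pointer
        self.free_pointer += nbytes
        return addr

table_pool = queue.Queue()
for base in (0x10000, 0x20000):            # two spaces for this type
    table_pool.put(AllocTable(base))

def allocate_object(nbytes):
    table = table_pool.get()               # blocks only under contention
    try:
        return table.allocate(nbytes)
    finally:
        table_pool.put(table)              # return the table to the pool
```

With two tables in the pool, two threads allocating simultaneously never touch the same free pointer.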
It will be necessary to make all miscops "interruptible", since they will have to be able to run concurrently. Probably the only resource shared at the miscop level is the alloc-table. Note that software interrupts are going away, so strictly speaking interrupts won't happen, but a scheduling break can happen at any time.

The only big problem I see is with miscops which manipulate raw pointers into objects. The system must either avoid GC while such a pointer is in use or adjust the raw pointer if the object is moved. Probably the fix is to allow raw pointers in boxed registers as long as the real object is also in a boxed register. This would allow random pointer hacking to be open-coded as well. Being preempted during object allocation would still be a problem, but the allocator will have to get a lock on the alloc-table before it can do anything, and GC can't run without the alloc-table.

Of course, we will have to go to deep binding when we support threads. We can eliminate the binding stack by threading the BIND blocks together on the control stack. This will simplify the problem of allocating stack for each thread. Deciding on the exact implementation of deep binding is non-trivial, since there are many different ways it could be done. The scheme used on the S1 seems like a reasonable start, though. A BIND block looks like this:

    .
    .
    .

The value cells are just allocated on the stack like a lexical variable. The BIND block is delimited by the prev pointer. The reason for keeping a pointer to the cell instead of just using the stack location directly is that it lets us cache binding lookups. Whenever we look up a variable, we link a LOOKUP frame into the bindings. This lookup frame contains the result of our search and looks just like a BIND frame as far as the search is concerned. The theory is that this will reduce the average distance searched, since commonly used variables will commonly have been used somewhere near you on the stack.
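The BIND/LOOKUP caching scheme can be sketched like this (Python for illustration; the class layout and one-element-list value cells are stand-ins for stack-allocated frames and cells):

```python
# Sketch of S1-style deep binding: BIND and LOOKUP frames threaded
# together on the control stack.  Looking up a symbol walks the chain
# for the most recent value cell; the result is cached by linking in
# a LOOKUP frame, which looks just like a BIND frame to later searches.

class Frame:
    def __init__(self, prev, entries):
        self.prev = prev         # next frame down the stack
        self.entries = entries   # {symbol: value_cell}; a value cell
                                 # is modeled as a one-element list

def lookup(frame, symbol, global_cells):
    """Find the current value cell for symbol, then cache the result
    by linking in a LOOKUP frame.  Returns (cell, new_innermost_frame)."""
    f = frame
    while f is not None:
        if symbol in f.entries:
            cell = f.entries[symbol]
            break
        f = f.prev
    else:
        cell = global_cells[symbol]      # never bound: use the global cell
    return cell, Frame(frame, {symbol: cell})

# Example: global *x* is 1, rebound to 2 by the innermost BIND frame.
globals_ = {'*x*': [1]}
env = Frame(None, {'*x*': [2]})
```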
Since we look up the address of the value cell rather than the value proper, we only have to do the lookup once per function. Note that there is no obligation to link in a LOOKUP block; if it is unlikely or impossible that any function we call will reference any of the variables that we reference, then not making a LOOKUP block is a good idea.

The S1 has an important optimization of variable lookup. It is based on the observation that many special variables are never actually bound. If a variable is known not to be bound, we can just go directly for the value cell. One solution to the problem would be to introduce a new GLOBAL variable kind which cannot be bound, but we can do nearly as well by noticing the lack of bindings at run time. The S1 keeps a count of bindings in a cell in each symbol; if the count is zero at lookup time, we don't have to do any lookup. I think that it would work just as well to have an EVER-BOUND flag which is set and never reset. One advantage of this approach is that there is no need to do anything when a binding is removed. On the S1, THROW needs to scan up the bindings, decrementing the count, which requires a way to distinguish BIND frames from LOOKUP frames.

There is a trick which might be a win if BIND and LOOKUP blocks have a large enough average size. The idea is to use a hash filter to determine if a given symbol might be in a particular block. Each symbol name is hashed to a fixnum with a few (3 or 4) random bits on. When a block is built, the OR of these hash values is placed at the beginning. When looking for a symbol, we can skip a block if any of the bits for the symbol are zero in the hash for the block. All these hash codes are computed at compile time, of course. When binding an unknown symbol, we just use all 1's.

We can't delimit the block with the prev link, since that has to be next to the hash so that we can skip the block without going to the end. A block would look more like this:

    .
    .
    .
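The hash-filter trick is what would nowadays be called a one-word Bloom filter. A sketch (Python for illustration; the particular hash function and word width are assumptions, and on the real system the hashes would be computed at compile time):

```python
# Sketch of the hash-filter trick: each symbol hashes to a fixnum with
# a few random bits set; a block's filter word is the OR of the hashes
# of the symbols it binds.  A block can be skipped whenever the
# symbol's bits are not all present in the block's filter.
import hashlib

def symbol_hash(name, nbits=3, width=30):
    """Deterministically pick nbits bit positions from the name."""
    h = 0
    digest = hashlib.md5(name.encode()).digest()
    for i in range(nbits):
        h |= 1 << (digest[i] % width)
    return h

def block_filter(names):
    f = 0
    for name in names:
        f |= symbol_hash(name)
    return f

def might_contain(block_hash, name):
    h = symbol_hash(name)
    return (block_hash & h) == h    # all of the symbol's bits present?

f = block_filter(['*print-base*', '*package*'])
```

False positives merely cost a wasted scan of one block; a zero bit guarantees the symbol is absent, so skipping is always safe. Binding an unknown symbol with an all-1's hash makes its block match every probe, as it must.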
In a block of less than six or so entries, we can reject it with about 90% probability. The probability improves as the block size decreases, but the advantage decreases as well. The actual advantage also depends on the dynamic characteristics of the lookup process, since if it turns out that a lookup is usually satisfied in the previous block, we will never reject any blocks.

Bizarre optimization on deep binding: if we notice that a function which we call in a local call references a variable we bind, we can pass in the value cell pointer as an argument so that it doesn't have to do the lookup. It is questionable whether this would help real code even with massive block compilation, but it sure would win for STAK.

We could actually generalize this optimization to be less obviously bogus. What we do is shallow-bind the *location* of the current value cell for each special variable which is both referenced and bound. At each entry point to the block, we build a specbind frame for the variables bound in the block (it doesn't need to be all of them). Initially all the value-cell pointers are to the looked-up current values. Within the block, we keep track of the current value cell by shallow-binding it in some known location. This could be in a register or in the specbind frame itself. Internally to the block, we can do a special reference by indirecting through the magic location, and can do a binding by saving the old value in our frame and then storing the location of the new cell in our frame. When we do a full function call, we move any value cell pointers which are in registers into the correct place in the specbind frame. This could make STAK *faster* than normal shallow-bound implementations, and might well help real programs also.

FASL FORMAT

At a minimum we will need new FASLOPs to deal with multiple entry points into functions. Our "no such thing as top-level" approach also has some ramifications. DEFUN et al work by expanding into real code that does stuff.
It is difficult to back-translate this real code into FOP-FSET and whatnot. Having DEFUN be implemented by code in top-level forms is no problem for normal load, but it creates problems for cold load, since we need to be able to resolve function references at cold-load time. What we could do is have a new VOP which notes the presence of an entry point for cold-load purposes. This would be a no-op in normal load.

It is interesting to note that the FASLOP language has been deprived of almost all significance to normal load. All it does is create constants and then make functions which use these constants, possibly calling them. This suggests a radically different approach to the FASL format: punt the FASLOP byte-code and express all of the load-time semantics as native code. To load a file, you just suck in the load-time executed code and jump into it. The code in the file does the rest.

Some minimal amount of semantics probably wants to be declaratively specified in the file. We would use the Core File format, and define new entries that throw information into the Table, which is a G-Vector that is passed to the top-level code as its environment. Ideas for entries:

Link-table fixup, miscop fixup:
    (op, start-offset, fixup offset1, fixup offset2...)
    The offsets are places in the data pages that are to be patched.

Intern:
    (package, len1 name1, len2 name2...)
    The resulting symbols are pushed into the table.

Unboxed constant:
    (type, start-offset, data-size, object1-offset, object2-offset...)
    Used to load strings and code vectors. The objects are represented in their actual in-core format so that they can potentially be mapped in directly. The objects are stored in the data part of the core file. Pointers to the loaded objects are pushed into the table.

Top-level form:
    (start-offset, size)
    Just start running at some code in the data pages. Size is there so that Genesis can pick up the code and move it.
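To make the entry idea concrete, here is a sketch of decoding an Intern entry of the shape described above (Python for illustration; the one-byte length prefixes, the ASCII encoding, and all names here are assumptions about a format that is only sketched in the text):

```python
# Sketch of decoding a hypothetical Intern entry:
# (package, len1 name1, len2 name2 ...), pushing the resulting
# symbols into the table passed to the top-level code.

def read_intern_entry(data, table, intern):
    """data: a length-prefixed package name, then length-prefixed
    symbol names until the entry is exhausted."""
    pos = 0
    plen = data[pos]; pos += 1
    package = data[pos:pos + plen].decode('ascii'); pos += plen
    while pos < len(data):
        nlen = data[pos]; pos += 1
        name = data[pos:pos + nlen].decode('ascii'); pos += nlen
        table.append(intern(name, package))   # push symbol into the table

# A trivial stand-in for INTERN:
def intern(name, package):
    return (package, name)

entry = bytes([4]) + b'LISP' + bytes([3]) + b'CAR' + bytes([3]) + b'CDR'
table = []
```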
We would probably want to relax the current page-alignment requirements so that files wouldn't be mostly padding. If a chunk looks big enough to be worth mapping, then we could align it. Boxed constants would be created with the normal consing operations.

[Possible problems with interactions between mass interning and package-hacking top-level forms? Seems like the problem isn't fundamentally worse than it currently is, but people might bum out even worse than they do now if we interned all symbols at the beginning of the file. I guess we can force package hacks into their own top-level forms and do them separately.]

Changes along these lines could clearly wildly speed up fasloading, and would also make dumping circular structures more reasonable. The downside is:

1] FASL files would probably be bigger. The bulk-up is not as big as you might suppose, since the actual code and string constants will dominate the load-time overhead. A large part of the load-time action is already in native code in top-level forms anyway. With a dense byte-code instruction set like on the PERQ, native code may actually be more compact than FASLOPs.

2] All pretense of portable FASL files would be abandoned, since basically nothing could be done with a FASL file without running machine code. This pretense never seems to have done us much good. Due to the use of top-level forms, it is currently impossible to generate a portable FASL file. Theoretically this capability might be useful someday, but then again maybe not. [Sending arbitrary data structures in messages?]

3] Fasdumping would be slower. Once again, this might not be as big a problem as you might suppose, for the same reasons mentioned for 1 above.

4] Fasdumping would be more complicated and less portable. This is probably the biggest counter-indication. Dumping code could be made mostly portable by emitting it as Lisp rather than assembler, but this would slow down compilation.
To be fair, you would have to consider the simplification of the fasloader and cold loader, and the reduction in duplication of functionality. Cold load would be similar to the way it currently is, but would be simpler, since it wouldn't need to know about as many different operations and data types. It would only need to handle the fixup/intern/unboxed-constant ops described above. Running of the top-level forms would generate the function objects and the other constants. Cold load would compute the table for each top-level form and dump it as a G-Vector in the core. Mostly it would just mash the little files together into a big one.

Another way to look at this is the following: there is no such thing as a boxed constant. The "constants" for a function are simply variables in the top-level form which have been closed over. The top-level form initializes these variables to the appropriate values. If we actually implemented things this way in the compiler, then it would have the desirable effect that the variables in the standard "DEFUN in LET" hack would be dealt with just as efficiently as "constants". There is no real difference between a top-level var which is not set and a constant. LOAD-TIME-EVAL just magically binds a variable in the current top-level form and then references it.
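The "constants are just closed-over top-level variables" view can be illustrated with an ordinary closure (Python for illustration; the DEFUN-in-LET idiom is from the text, every name below is a made-up stand-in):

```python
# Sketch of "constants are closed-over top-level variables": the
# top-level form computes a value once at load time, and the function
# simply closes over it -- exactly the "DEFUN in LET" idiom.

def top_level_form():
    table = {'red': 0, 'green': 1, 'blue': 2}  # "constant", built at load time

    def color_code(name):                      # the DEFUN inside the LET
        return table[name]

    return color_code

color_code = top_level_form()   # "loading" the form defines the function
```

Since `table` is never set after initialization, nothing distinguishes it from a constant, which is exactly the point being made.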