System Design

My system is called nitrous [note name]. This section describes its design and how the goals influence it. The design is based on the techniques developed by the off-line (aka self-applicable) Partial Evaluation (PE) community [JoGoSe93]. The following subsections provide detailed technical information.

  1. root language
  2. interface to cogen
  3. binding time analysis
  4. generation pass
  5. variable splitting
  6. lift compiler

Currently Nitrous is implemented with scheme48 [scheme48], an efficient bytecoded Scheme. Scheme is very good for prototyping languages.

The rest of this section serves as a summary of and map to the subsections.

The easiest way it to make cogen composable is to make make cogen accept the same language as it produces. I call this language root. To facilitate fast compilers root is just an abstract machine language (see root for more).

The compiler generator itself is called cogen. It takes a procedure and a binding time pattern as arguments. The binding time pattern specifies which of the interpreter's arguments are the program (S) and which are the data (D) (see interface for more).

How does cogen work? It makes two passes over the interpreter. The first pass factors the program into the generating extension and the residual code, this pass is called binding time analysis (see bta for more). There are two main ways of implementing BTA: with type inference [Henglein91] or via abstract interpretation [Consel93]. The prototype currently is oriented towards the latter, though an inference system may eventually prove better.

The second pass (usually called `specialization', though `generation' is more accurate here since this is a direct cogen) recursively traverses the code structures of the interpreter and converts each into a compiler (generating extension). Static instructions are copied into the compiler as they are. Dynamic instructions appear as templates: they produce the output from the compiler (see gen for more).

There are three environments (envs) to keep straight: the metastatic env, the static env, and the dynamic env. The metastatic env (also called the BT env, or the analysis env, or cogen's env) exists in the abstract interpretation, it maps all the program's variables to binding times. The static env (also called the compiler env) exists in the compiler, it maps the static variables to values. The dynamic env (also called the residual env) exists in the generated code. Note that if cogen is applied to an interpreter with an environment, then there are four.

Both cogen itself and the compilers it produces are memoizing. Memoization is used to avoid infinite programs by producing loops. Cogen memoizes on binding times. Generated compilers memoize on static values. (see envs) (see gen) (see layers).

Variable splitting is a technique for eliminating unnecessary intermediate data structures in programs. While this is a generally useful optimization, it is a necessity in PE if environments are to be treated efficiently (see splitting for more).

Inlining is another generally useful optimization that is required to get decent results from PE.

It is important that variable splitting and inlining not just be handled as general post-pass optimizations; that would be too slow. Instead specialized versions of these optimizations will be integrated into the generated compilers.

Side-effects are handled simply by forcing them all to runtime, thus compilers must be purely functional. This is an important deficiency. It appears likely that analysis can overcome this [Heintze94].


In summary, the design objectives and how they are satisfied are:

fast compilers
compilers produce (and the backend accepts) abstract machine code.

semiautomatic
directly implemented cogen based on partial evaluation converts interpreters into compilers with minimal human direction.

composable
since cogen's output language is the same as its input language both multiple layers of interpreters and multiple stages of compilation are handled.

powerful
nitrous already handles higher-order jumps, polymorphism, data structures, and dynamic side-effects. At least modules, exceptions, and compile-time side-effects remain.

predictable
since I have no fear of annotation and binding time assertions, the results and termination of cogen should be understood by the programmer.