Composing Languages

How does compiler generation work when a system is composed of many languages? Organizing and specifying systems with interfaces is a well-known software engineering technique, and interfaces, languages, and procedures are all really the same thing [Lampsons-hints]. In this section we examine how multiple languages are assembled into systems and how this affects compiler generation.


Emacs is a good example of a system with many languages: C code is run through cpp and then compiled into machine code. Elisp is compiled into byte-code and then interpreted. Regular expression matching is available as an Elisp primitive. Modes implement UI languages, and some modes even have full-scale interpreters built in [note emacs-languages-all]. These relationships form a directed graph [this graph isn't quite right: should delete killfiles and header format, dotted edges; but what is Elisp's eval?].



Say interpreter int1 has another interpreter int2 as a primop; int2 is then an embedded language. In Emacs, this is the relationship of byte-code to the regexp language. Schematically, the code looks like this:
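A minimal sketch of this embedding, in Python, may help fix the picture. Only the names int1 and int2 come from the text; the tuple-encoded host language and the '?' wildcard pattern language are hypothetical stand-ins chosen to keep the example small.

```python
# Hypothetical toy languages: only the names int1 and int2 come from the text.

def int2(data1, data2):
    """Embedded language: data1 is a pattern in which '?' matches any character."""
    return len(data1) == len(data2) and all(
        p == "?" or p == c for p, c in zip(data1, data2)
    )

def int1(prog, env):
    """Host interpreter over tuple-encoded expressions."""
    op = prog[0]
    if op == "const":
        return prog[1]
    if op == "var":
        return env[prog[1]]
    if op == "if":
        _, test, then, otherwise = prog
        return int1(then if int1(test, env) else otherwise, env)
    if op == "int2":  # the embedded interpreter is just another primop
        _, e1, e2 = prog
        return int2(int1(e1, env), int1(e2, env))
    raise ValueError(f"unknown op: {op!r}")

p = ("int2", ("const", "c?t"), ("var", "word"))
print(int1(p, {"word": "cat"}))  # True
```

From int1's point of view the pattern is ordinary run-time data, so int2 re-reads it on every call.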

Now say that some program p calls int2 with the same data1 again and again, i.e., there are three stages: prog, data1, and data2. Data1 can be compiled by adding another case to int1:
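The effect of such a case can be sketched by hand. In the Python below, comp_int1 is a hand-written stand-in for the compiler that cogen would generate, not cogen's actual output. The "int2-static" case name, the tuple encoding, and the '?' wildcard pattern language are all assumptions made for illustration.

```python
def int2_staged(data1):
    """int2 staged by hand: work that depends only on data1 is done once."""
    checks = [(i, ch) for i, ch in enumerate(data1) if ch != "?"]
    length = len(data1)

    def matcher(data2):
        return len(data2) == length and all(data2[i] == ch for i, ch in checks)

    return matcher

def comp_int1(prog):
    """Hand-written stand-in for cogen(int1, (s d)): prog static, env dynamic."""
    op = prog[0]
    if op == "const":
        value = prog[1]
        return lambda env: value
    if op == "var":
        name = prog[1]
        return lambda env: env[name]
    if op == "int2":  # general case: data1 is only known at run time
        f1, f2 = comp_int1(prog[1]), comp_int1(prog[2])
        return lambda env: int2_staged(f1(env))(f2(env))
    if op == "int2-static":  # hypothetical new case: data1 is a literal in prog
        matcher = int2_staged(prog[1])  # compiled once, together with prog
        f2 = comp_int1(prog[2])
        return lambda env: matcher(f2(env))
    raise ValueError(f"unknown op: {op!r}")

obj = comp_int1(("int2-static", "c?t", ("var", "word")))
print(obj({"word": "cat"}), obj({"word": "dog"}))  # True False
```

Because data1 appears as part of the program text, it is static alongside prog, and a two-level division (s d) is enough to get the three-stage behavior.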

Here comp_int1 = cogen(int1, (s d)) works fine.

What happens if int1 and int2 are the same? This is reflection. A Lisp system's eval is a familiar example. A fixed point is required to generate the compiler: it is closed because cogen memoizes on binding times (the table is stored in eval). Note that the compiler has to be looked up in the table every time eval is called; here we see another artifact of direct cogen instead of self-application.

In general, reflective sublanguage relationships form a directed graph. This graph is lazily traversed by cogen.
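The role of the memo table can be illustrated with a deliberately abstract Python sketch. It models only the traversal: a "compiler" here is just a record, and the primops map, the record layout, and make_cogen are hypothetical. Registering an entry before visiting its callees is what lets a reflective (cyclic) edge find the compiler already in the table instead of recursing forever.

```python
def make_cogen(primops):
    """primops maps each interpreter name to the interpreters it calls."""
    table = {}  # (interpreter name, binding times) -> compiler record

    def cogen(name, binding_times):
        key = (name, binding_times)
        if key in table:  # already generated, or in progress: reuse it
            return table[key]
        compiler = {"name": name, "binding_times": binding_times, "uses": {}}
        table[key] = compiler  # register first, so cycles close on this entry
        for callee in primops[name]:  # edges are followed only on demand
            compiler["uses"][callee] = cogen(callee, binding_times)
        return compiler

    return cogen, table

# int1 calls int2 and, reflectively, itself.
cogen, table = make_cogen({"int1": ["int2", "int1"], "int2": [], "int3": []})
comp_int1 = cogen("int1", ("s", "d"))
print(comp_int1["uses"]["int1"] is comp_int1)  # True
print(sorted(name for name, _ in table))       # ['int1', 'int2']
```

Note that int3 is never visited, since nothing reachable from int1 refers to it.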


Consider a different kind of composition:
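As a hedged illustration of this arrangement, the Python sketch below layers a tiny language on top of another. The names int1, int2, and int2_1 follow the text; the arithmetic host language and the choice of affine maps (a, b), meaning x -> a*x + b, as int2's programs are assumptions made to keep the example short.

```python
def int1(prog, env):
    """Base interpreter over tuple-encoded arithmetic expressions."""
    op = prog[0]
    if op == "var":
        return env[prog[1]]
    if op == "fst":
        return int1(prog[1], env)[0]
    if op == "snd":
        return int1(prog[1], env)[1]
    if op == "add":
        return int1(prog[1], env) + int1(prog[2], env)
    if op == "mul":
        return int1(prog[1], env) * int1(prog[2], env)
    raise ValueError(f"unknown op: {op!r}")

# int2_1 is the text of int2, written in int1's language: it is data, not code.
int2_1 = ("add",
          ("mul", ("fst", ("var", "prog")), ("var", "data")),
          ("snd", ("var", "prog")))

def int2(prog, data):
    """The layered interpreter: int1 running int2_1 on int2's inputs."""
    return int1(int2_1, {"prog": prog, "data": data})

print(int2((3, 4), 5))  # 3*5 + 4 = 19
```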

That is, int2_1 is a program written in the language defined by int1. We say int2 is a layer on top of int1. Here cogen(int2, (s d)) fails because int2_1 is represented as data instead of code, so it doesn't get very far as a metastatic value. So instead write:

Now in cogen(int2, (s d)) obj is a procedure, so it is analyzed by cogen properly. Since various annotations are required for most interesting inputs to cogen, obj must in general contain annotations. These annotations must be created by cogen from int1 and int2_1.
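One way to picture the rewrite is the following Python sketch, in which obj is obtained by compiling int2_1 with int1's compiler. Here comp_int1 is a hand-written stand-in for cogen(int1, (s d)), and the arithmetic language and affine-map programs are hypothetical. The sketch does not model the annotations discussed above; it only shows obj changing from data into a procedure.

```python
def comp_int1(prog):
    """Hand-written stand-in for cogen(int1, (s d)): returns a closure."""
    op = prog[0]
    if op == "var":
        name = prog[1]
        return lambda env: env[name]
    if op in ("fst", "snd"):
        index = 0 if op == "fst" else 1
        arg = comp_int1(prog[1])
        return lambda env: arg(env)[index]
    if op in ("add", "mul"):
        left, right = comp_int1(prog[1]), comp_int1(prog[2])
        if op == "add":
            return lambda env: left(env) + right(env)
        return lambda env: left(env) * right(env)
    raise ValueError(f"unknown op: {op!r}")

# The text of int2 in int1's language (programs of int2 are pairs (a, b)).
int2_1 = ("add",
          ("mul", ("fst", ("var", "prog")), ("var", "data")),
          ("snd", ("var", "prog")))

obj = comp_int1(int2_1)  # now code, not data

def int2(prog, data):
    """int2 expressed through the compiled object rather than through int1."""
    return obj({"prog": prog, "data": data})

print(int2((3, 4), 5))  # 19
```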

The above is equivalent to using a binding time lattice with multiple stages.