Research Overview

According to my views of a foundational theory, it must explain the practice of computing. So the basic structures and concepts of real computing systems must emerge in the theory. It seems to me, then, that a critical element of a foundational research program is to force the identity of software artifacts with the formal constructs of a mathematical theory.

Robert Constable, June, 1995

My main research interest is the application of programming language theory to programming practice. With our ever-expanding reliance on computers in everyday life, it is no longer possible to rely solely on informal assurances of software quality. Moreover, as individual programs become more complex, and systems are built by combining components obtained from a variety of sources, it is no longer possible for a single programmer, or even a small team, to manage the complexities of their interaction. What is needed are means of making the properties of a program component part of the component itself, in a form that can be used by verification and integration tools to ensure the integrity of composite software systems.

Over the last two decades type theory has emerged as the central organizing framework for the design and implementation of programming languages. Type theory (the study of type systems) provides the theoretical foundation for safe component integration. In the words of John Reynolds, a type system is a syntactic discipline for enforcing levels of abstraction. By syntactic we mean that the type system is part of the program, rather than purely in the mind of the implementer. By discipline we mean that type restrictions are enforced; ill-typed combinations are ruled out by the type system. By levels of abstraction we mean the clean separation between conceptually distinct data objects that may, in a particular program or compiler, have the same or similar representations. (In the end, everything's a bunch of bits, but that doesn't mean we should regard, say, floating point numbers as such.)

Implicit in this definition is an important principle: the power of a type system lies as much in what it precludes as what it allows. The most powerful type system of all is the one that rules out all programs. However, such a type system, while surely preventing safety violations, is hardly very useful. The goal of type system design is to increase expressiveness by admitting useful programs, but without compromising safety.

Type Systems for Programming Languages

My interest in type theory began as a founding member of the PRL Project at Cornell University, which is led by my thesis advisor, Robert Constable. The goal of the PRL Project is to develop a comprehensive system for programming and implementing computational mathematics. The project developed the NuPRL interactive theorem prover for a constructive type theory that is sufficiently powerful to support both programming and mathematical reasoning.

My experience with NuPRL led me to consider the direct application of type theory to the design of practical programming languages. One outcome of this effort was the Standard ML programming language, which I helped to design together with Robin Milner, David MacQueen, and Mads Tofte in the mid- to late 1980's. The language exhibits a number of novel features that have been adopted in other languages, including O'Caml and Haskell. These include a flexible and convenient type inference mechanism, a uniquely expressive module system, a convenient exception mechanism, and the ability to define functions by pattern matching.

Standard ML is one of the few languages to be rigorously defined. This is achieved by using a single formalism (operational semantics) that defines both the static semantics (the type checking rules) and the dynamic semantics (the execution rules) of the language. The Definition has served as the basis for numerous implementations of Standard ML, including Standard ML of New Jersey, the Kit Compiler, Moscow ML, Poly/ML, MLton, MLj, and TILT.

The design of Standard ML raised many questions that have been pursued by a large, and growing, community of researchers in type system design. My own interests have included type systems for modularity, especially those that support separate compilation, higher-order, recursive, and first-class modules, and on type systems for references and continuations.

Typed-Based Certifying Compilers

Early compilers for Standard ML were derived (directly or indirectly) from compilers for Lisp. These compilers were based on a "type erasure" interpretation in which type checking was entirely a front-end issue, with no effect on code generation or execution. In essence Standard ML was conceived as a rich dialect of Lisp that happened to come equipped with a type checker for the source language. While surely useful as a pragmatic step towards building efficient ML compilers, this approach ran into a number of difficulties with both expressiveness and efficiency. When examined closely, the type-free execution model is really a type-ful model based on dynamic type checking. To support these checks, unnatural, tagged and boxed data representations are required, increasing space and time overhead. Although the static type system would, in many cases, obviate the need for run-time type checking, it proved difficult in practice to eliminate this overhead. This proved to be particularly true of polymorphic (including modular) code.

What was needed was a more refined run-time model that could express distinctions between dynamically and statically typed values so that a statically-typed integer could be represented as a machine word, without tag information. A natural choice is to move from an untyped (really, uni-typed) run-time model to a typed run-time model, in which object code is equipped with a type system that enforces invariants governing data representations. This type system may be used to guide a garbage collector (to distinguish pointers from integers, for example), and provides a framework for compiling polymorphic programs efficiently, without run-time checks. To achieve this Greg Morrisett, and I devised a new approach to compilation based on typed intermediate languages and type-directed translation that carries types through the back end of a compiler.

The first such compiler for Standard ML was TIL, which was implemented by David Tarditi, Greg Morrisett, Perry Cheng, and Chris Stone. TIL demonstrated the feasibility of using typed intermediate languages for a full-scale programming language, the core language of Standard ML. The original TIL compiler was extended to all of Standard ML by Perry Cheng, Leaf Petersen, and Chris Stone, with contributions by Andrew Bernard, Dave Swasey, and Stephanie Weirich. This compiler, called TILT, is the basis for much of my current research on type-based compilation. The type-based approach was later adopted and extended by Zhong Shao in the FLINT intermediate language, which is now in daily use as part of the Standard ML of New Jersey compiler.

To build a type-based compiler for Standard ML requires a careful re-think of conventional compiler techniques in type-theoretic terms. I, my students, and my collaborators have studied a number of issues in typed compilation. These include my work with Mark Lillibridge on typed CPS conversion for the polymorphic lambda calculus, with Greg Morrisett and Yasuhiko Minamide on typed closure conversion, with Greg Morrisett on intensional type analysis for compiling polymorphism, and with Greg Morrisett and Matthias Felleisen on typed storage management, and with Chris Stone on singleton kinds. These ideas have been extended and modified for use in practical type-based compilers such as TILT and SML/NJ.

The availability of types during code generation makes it possible to build a self-checking compiler, and, by extending types to generated object code, a certifying compiler. A self-checking compiler can check the correctness of its own work by using the type checker to ensure that the intermediate code is well-typed after each stage of compilation. This greatly facilitates team development, and helps to isolate many compiler bugs to the earliest stage at which they arise. A certifying compiler extends this idea to object code, allowing the recipient of generated code to ensure its safety, without relying on the correctness of the compiler! The TAL Project at Cornell has developed type systems for x86-like object code, and built a type-based certifying compiler, called Popcorn, for a type-safe C-like language. The TILT Project is currently engaged in extending types to object code, with the goal of implementing a certifying compiler for all of Standard ML.

Logical Frameworks

Much of my research on programming languages is based on the study of formal systems that capture one or more aspects of the language. The Definition of Standard ML, discussed above, is one such formalism, as are the numerous type systems that are studied in the literature. Whenever a new programming formalism is introduced, it is essential to check that it is sound in the sense that it properly enforces levels of abstraction. A soundness proof is usually only interesting insofar as it fails; it is expected that any well-defined formalism is sound. Unfortunately, soundness proofs often involve checking numerous cases, making them especially prone to error. It would be useful to have a means of mechanically checking (or even proving) such results.

Similarly, a proliferation of logical systems have been introduced for reasoning about programs. These include not only classical formalisms such as predicate calculus, but also less familiar ones based on modal and temporal logics. If these logics are to be used as part of a mechanical verification system, it is necessary to implement a proof checker and related tools for each such logic. Experience shows that much of the work is tantalizingly similar for many different logics. It would be useful to have a general means of implementing proof checkers for a wide variety of logics.

The purpose of a logical framework is to provide a solution to these problems. The idea of a logical framework is to provide a single, universal formalism for representing formalisms so that an implementation of a formal system may be obtained by simply representing it in the framework. Furio Honsell, Gordon Plotkin, and I introduced the LF logical framework for exactly this purpose. Drawing on experience from N. G. deBruijn's seminal AUTOMATH Project, we proposed a very simple dependently typed lambda calculus as a universal formalism for representing formalisms. A surprisingly large number of formal systems may be encoded in LF, providing evidence for its universality. Some limitations of LF are also known, which leads to variations such as Frank Pfenning and Iliano Cervesato's Linear LF.

One benefit of LF is that it provides an explicit, checkable representation of proofs in the encoded formal system. This capability was exploited by George Necula and Peter Lee in their Touchstone Compiler that generates proof-carrying code as a form of certificate for object code safety. Proofs are lambda terms in the LF lambda calculus; proof checking is simply type checking in LF.

Twelf is an industrial-strength implementation of LF that includes a flexible and natural type inference mechanism and a powerful theorem prover for automatic and semi-automatic verification of properties of encoded systems. Information on other logical frameworks may be obtained from Frank Pfenning's Logical Frameworks page.

Scientific Computing

Traditionally, scientific (numerical) computing has been carried out in low-level languages such as Fortran and C that help the programmer to maximize the use of machine resources to achieve the fastest possible implementation of a routine. While such low-level languages allow the programmer to take full advantage of machine-specific details such as cache architecture or interconnection network, they also make it more difficult to implement sophisticated data structures (such as complex tree or graph structures) and provide only weak support for modular programming.

On the other hand, symbolic programming environments such as Mathematica or Maple are based on very high-level languages that are not implemented efficiently and that are, in some cases, semantically suspect. These systems emphasize prototyping and experimentation, relying on lower-level languages for the implementation of efficient production code. Here again little support for modular programming is provided, beyond very high level mechanisms for connecting to remote servers running special-purposes packages.

Together with Guy Blelloch, Gary Miller, and Noel Walkington in the context of the PSciCo Project, I am applying advanced programming languages to scientific computing problems. The premise of our work is that it is possible to eliminate the distinction between production and prototype languages by devising a language that supports both numeric and symbolic computing in a single language that may be implemented with acceptable efficiency on stock hardware.

Our strategy for achieving this integration is two-fold. First, we claim that by programming in the purely functional fragment of Standard ML we may exploit available parallelism without explicit programmer intervention. This work is based on experience with NESL, a purely applicative, data parallel programming language, which we are integrating into Standard ML. Second, we are considering scientific applications for which sophisticated algorithms and data structures are required. Such algorithms are often difficult, or even impossible, to implement in a low-level language, but are relatively easily implemented in Standard ML.

Our current research focuses on computational geometry (including mesh generation and multi-resolution rendering), computational physics (various implementations of the n-body problem, with applications to cosmology and protein folding), and differential equations (the finite element method for solving boundary value problems).

Trustless Grid Computing

Grid computing envisions the Internet as a huge, parallel computer. Countless machines are connected to the network, most of which are either idle, or only very lightly loaded. Can we exploit these spare computing resources to perform useful work?

Numerous projects are based on the network-as-computer model. Among the best known are SETI@HOME, FOLDING@HOME, and GIMPS. These projects are typical in that they divide up a very large problem into many independent sub-problems that can be parcelled out to volunteers on the network for their solution. A central site collects the solutions to these sub-problems and integrates them into a solution to some large-scale computing problem.

Security is a significant problem for volunteer computing systems. The main issues may be divided into two categories:

  1. How can the host system be protected against malicious applications? A grid application might well be a trojan horse that compromises the integrity, privacy, or usability of a host system. Even well-intentioned programs can go wrong, wasting resources of the host machine.
  2. How can the application developer be protected against malicious hosts? Hosts may deliberately or inadvertently return incorrect results, or refuse to return results at all.
Currently these issues are largely a matter of trust. For example, I personally trust that SETI@HOME will not compromise my system, and I make no effort to interfere with its work. For one or two well-known applications, this approach is workable, but it requires far too much trust to be practical in the large.

The goal of the ConCert Project is to explore trust-free grid computing. By this we mean that we wish to develop a model of grid computing that minimizes trust relationships between host owners and application developers without compromising security, integrity, or efficiency. The corner-stone of our work is a proof-based approach to security and integrity:

  1. Application developers must prove to host owners that their code complies with the owner's safety policy. This requires some form of code certification by which applications are equipped with sufficient information that the host can check compliance with its safety policy.
  2. Hosts must prove to application developers that their results are tamper-free. This requires developers to employ some form of result certification so that answers may be checked for accuracy.
The goal of the ConCert Project is to develop the language and compiler technology required to enable trust-free grid computing.

Last modified: Thu Mar 24 14:45:18 EST 2005