Sample Code for this Chapter

Standard ML is a type-safe programming language that embodies many innovative ideas in programming language design.  It is a statically-typed language, with a user-extensible type system.  It supports polymorphic type inference, which all but eliminates the burden of specifying types of variables and greatly facilitates code re-use.  It provides efficient automatic storage management for data structures and functions.  It encourages functional (effect-free) programming where appropriate, but allows imperative (effect-ful) programming where necessary (e.g., for handling I/O or implementing mutable data structures).  It facilitates programming with recursive data structures (such as trees and lists) by encouraging the definition of functions by pattern matching.  It features an extensible exception mechanism for handling error conditions and effecting non-local transfers of control.   It provides a richly expressive and flexible module system for structuring large programs, including mechanisms for enforcing abstraction, imposing hierarchical structure, and building generic modules.  It is portable across platforms and implementations because it has a precise definition given by a formal operational semantics that defines both the static and dynamic semantics of the language.  It provides a portable standard basis library that defines a rich collection of commonly-used types and routines.

These features are supported by all implementations of Standard ML, but many go beyond the standard to provide experimental language features, more extensive libraries, and handy program development tools.  Details can be found with the documentation for your compiler, but here's a brief overview of what you might expect.  Most implementations provide an interactive system supporting on-line entry and execution of ML programs and providing access to tools for compiling, linking, and analyzing the behavior of programs.  A few compilers are "batch only", relying on the ambient operating system to manage the construction of large programs from compiled parts.   Nearly every compiler is capable of generating native machine code, even in the interactive system, but some optionally generate byte codes for a portable abstract machine.  Most implementations support separate compilation and incremental recompilation based on automatically-generated or manually-constructed component dependency specifications.  Some implementations provide interactive tools for tracing and stepping programs; many provide tools for time and space profiling.  Most implementations supplement the standard basis library with a rich collection of handy components such as dictionaries, hash tables, or interfaces to the ambient operating system.  Some implementations support experimental language extensions, notably mechanisms for concurrent programming (using message-passing or locking), richer forms of modularity constructs, and support for "lazy" data structures.

To develop a feel for the language and how it is used, let us consider a small, but non-trivial, program to implement a regular expression package for checking whether a given string matches a given regular expression.  We'll structure the implementation into two modules, an implementation of regular expressions themselves and an implementation of a matching algorithm for them.  The structure of the system is neatly expressed using signatures that describe the components of these two modules.

signature REGEXP = sig

  datatype regexp =
    Zero | One | Char of char |
    Plus of regexp * regexp | Times of regexp * regexp |
    Star of regexp

  exception SyntaxError of string
  val parse : string -> regexp

  val format : regexp -> string
 
end
 
signature MATCHER = sig

  structure RegExp : REGEXP

  val match : RegExp.regexp -> string -> bool

end

The signature REGEXP describes a module that implements regular expressions.  It consists of a description of the abstract syntax of regular expressions, together with operations for parsing and unparsing (formatting) them.   The definition of the abstract syntax takes the form of a datatype declaration that is reminiscent of a context-free grammar, but which abstracts from matters of lexical presentation (such as precedences of operators, parenthesization, conventions for naming the operators, etc.)  The abstract syntax consists of 6 clauses, corresponding to the regular expressions 0, 1, a, r1 + r2, r1 r2, and r*.  The functions parse and format specify the parser and unparser for regular expressions.  The parser takes a string as argument and yields a regular expression; if the string is ill-formed, the parser raises the exception SyntaxError with an associated string describing the source of the error.  The unparser takes a regular expression and yields a string that parses to that regular expression.  In general there are many strings that parse to the same regular expressions; the unparser generally tries to choose one that is easiest to read.

The signature MATCHER describes a module that implements a matching algorithm for regular expressions.  The matcher is a function match that takes a regular expression and yields a function that takes a string and determines whether or not that string matches that regular expression.  Obviously the matcher is dependent on the implementation of regular expressions.  This is expressed by a structure specification that specifies a hierarchical dependence of an implementation of a matcher on an implementation of regular expressions --- any implementation of the MATCHER signature must include an implementation of regular expressions as a constituent module.   This ensures that the matcher is self-contained, and does not rely on implicit conventions for determining which implementation of regular expressions it employs.

Now let's look at the high-level structure of an implementation of a regular expression matcher.  It consists of two major components: an implementation of regular expressions, and an implementation of the matcher.  Implementations of signatures are called structures in ML; the implementation of the regular expression matcher consists of two structures.  Since the implementation of the matcher depends on an implementation of regular expressions, but is independent of any particular implementation of regular expressions, we use a parameterized module, or functor, to implement it.  Here's the high-level structure we're considering:

structure RegExp :> REGEXP = ...

functor Matcher (structure RegExp : REGEXP) :> MATCHER = ...

structure Matcher :> MATCHER = Matcher (structure RegExp = RegExp)

The structure identifier RegExp is bound to an implementation of the REGEXP signature.  Conformance with the signature is ensured by the ascription of the signature REGEXP to the binding of RegExp using the ":>" notation.  Not only does this check that the implementation (elided here) conforms with the requirements of the signature REGEXP, but it also ensures that subsequent code cannot rely on any properties of the implementation other than those explicitly specified in the signature.  This helps to ensure that modules are kept separate, facilitating subsequent changes to the code.

The functor identifier Matcher is bound to a structure that takes an implementation of REGEXP as parameter.  We may think of Matcher as a kind of function mapping structures to structures.  The result signature of the functor specifies that the implementation must conform to the requirements of the signature MATCHER, and ensures that the only visible properties of any instance of this functor (obtained by applying it to an implementation of REGEXP) are precisely what is specified in that signature.  A specific matcher is provided by applying the functor Matcher to the stucture RegExp to obtain an implementation of MATCHER.

Once the system is built, we may use it by referring to its components using paths, or long identifiers.  The function Matcher.match has type Matcher.RegExp.regexp -> string -> bool, reflecting the fact that it takes a regular expression as implemented within the package itself and yields a matching function on strings.   We may build a regular expression by applying the parser, Matcher.RegExp.parse, to a string representing a regular expression, then passing this to Matcher.match.   Here's an example:

val regexp = Matcher.RegExp.parse "((a + %).(b + %))*"
val matches = Matcher.match regexp

matches "aabba"     (* matches successfully *)
matches "abac"      (* fails to match *)

We use the convention that "@" stands for the empty regular expression and "%" stands for the regular expression accepting only the null string.  Concatentation is indicated by a ".", alternation by "+", and iteration by "*".

The use of long identifiers can get tedious at times.  There are two typical methods for alleviating the burden.  One is to introduce a synonym for a long package name.  Here's an example:

structure M = Matcher
structure R = M.RegExp

val regexp = R.parse "((a + %).(b + %))*"
val matches = M.match regexp

matches "aabba"
matches "abac"

Another is to "open" the structure, incorporing its bindings into the current environment:

open Matcher Matcher.RegExp

val regexp = parse "((a + %).(b + %))*"
val matches = match regexp

matches "aabba"
matches "abac"

It is advisable to be sparing in the use of open because it is often hard to anticipate exactly which bindings are incorporated into the environment by its use.

Now let's look at the internals of these structures.  Here's an overview of the implementation of regular expressions:

structure RegExp :> REGEXP = struct

  datatype regexp =
    Zero | One | Char of char |
    Plus of regexp * regexp | Times of regexp * regexp |
    Star of regexp

  ... implementation of the tokenizer ...

 
fun tokenize s = tokenize_exp (String.explode s)

  ... implementation of the parser components ...

 
fun parse s =
      let
          val (r, s') = parse_exp (tokenize (String.explode s))
      in
          case s'
            of nil => r
             | _ => raise SyntaxError "Unexpected input.\n"
      end
      handle LexicalError => raise SyntaxError "Illegal input.\n"

  ... implementation of the formatter ...

  fun format r =
      String.implode (format_exp r)

end

The implementation is bracketed by the keywords struct and end.   The type regexp is implemented precisely as specified by a datatype declaration.  The parser works by "exploding" the string into a list of characters (making it easier to process them character-by-character), transforming the character list into a list of "tokens" (abstract symbols representing lexical atoms), and finally parsing the resulting list of tokens.  If there is remaining input after the parse, or if the tokenizer encountered an illegal token, an appropriate syntax error is signalled.  The formatter works by calling an associated function that yields a list of characters, then "imploding" that list into a string.

It is interesting to consider in more detail the structure of the parser since it exemplifies well the use of pattern matching to define functions.  Let's start with the tokenizer, which we present here in toto:

datatype token =
  AtSign | Percent | Literal of char | PlusSign | TimesSign |
  Asterisk | LParen | RParen

exception LexicalError

fun tokenize nil = nil
  | tokenize (#"+" :: cs) = (PlusSign :: tokenize cs)
  | tokenize (#"." :: cs) = (TimesSign :: tokenize cs)
  | tokenize (#"*" :: cs) = (Asterisk :: tokenize cs)
  | tokenize (#"(" :: cs) = (LParen :: tokenize cs)
  | tokenize (#")" :: cs) = (RParen :: tokenize cs)
  | tokenize (#"@" :: cs) = (AtSign :: tokenize cs)
  | tokenize (#"%" :: cs) = (Percent :: tokenize cs)
  | tokenize (#"\\" :: c :: cs) = Literal c :: tokenize cs
  | tokenize (#"\\" :: nil) = raise LexicalError
  | tokenize (#" " :: cs) = tokenize cs
  | tokenize (c :: cs) = Literal c :: tokenize cs

We use a datatype declaration to introduce the type of tokens corresponding to the symbols of the input language.  The function tokenize has type char list -> token list; it transforms a list of characters into a list of tokens.   It is defined by a series of clauses that dispatch on the first character of the list of characters given as input, yielding a list of tokens.  The correspondence between characters and tokens is relatively straightforward, the only non-trivial case being to admit the use of a backslash to "quote" a reserved symbol as a character of input.  (More sophisticated languages have more sophisticated token structures; for example, words (consecutive sequences of letters) are often regarded as a single token of input.)  Notice that it is quite natural to "look ahead" in the input stream in the case of the backslash character, using a pattern that dispatches on the first two characters (if there are such) of the input, and proceeding accordingly.   (It is a lexical error to have a backslash at the end of the input.)

Now here's an overview of the parser.  It is a simple recursive-descent parser implementing the standard precedence conventions for regular expressions (iteration binds most tightly, then concatentation, then alternation).  The parser is defined by four mutually-recursive functions, parse_exp, parse_term, parse_factor, and parse_atom.  These implement a recursive descent parser that dispatches on the head of the token list to determine how to proceed.   The code is essentially a direct transcription of the obvious LL(1) grammar for regular expressions capturing the binding conventions described earlier.

fun parse_exp ts =
    let
        val (r, ts') = parse_term ts
    in
        case ts'
          of (PlusSign :: ts'') =>
             let
                 val (r', ts''') = parse_exp ts''
             in
                 (Plus (r, r'), ts''')
             end
           | _ => (r, ts')
    end

and parse_term ts = ... (elided) ...

and parse_factor ts =
    let
        val (r, ts') = parse_atom ts
    in
        case ts'
        of (Asterisk :: ts'') => (Star r, ts'')
         | _ => (r, ts')
    end

and parse_atom nil = raise SyntaxError ("Atom expected\n")
  | parse_atom (AtSign :: ts) = (Zero, ts)
  | parse_atom (Percent :: ts) = (One, ts)
  | parse_atom ((Literal c) :: ts) = (Char c, ts)
  | parse_atom (LParen :: ts) =
    let
        val (r, ts') = parse_exp ts
    in
        case ts'
          of (RParen :: ts'') => (r, ts'')
           | _ => raise SyntaxError ("Right-parenthesis expected\n")
    end

Once again it is quite simple to implement "lookahead" using patterns that inspect the token list for specified tokens.  This parser makes no attempt to recover from syntax errors, but one could imagine doing so, using standard techniques.

This completes the implementation of regular expressions.  Now for the matcher.   The main idea is to implement the matcher by a recursive analysis of the given regular expression.  The main difficulty is to account for concatenation --- to match a string against the regular expression r1 r2 we must match some initial segment against r1, then match the corresponding final segment against r2.  This suggests that we generalize the matcher to one that checks whether some initial segment of a string matches a given regular expression, then passes the remaining final segment to a continuation, a function that determines what to do after the initial segment has been successfully matched.  This facilitates implementation of concatentation, but how do we ensure that at the outermost call the entire string has been matched?  We achieve this by using an initial continuation that checks whether the final segment is empty.   Here's the code, written as a functor parametric in the regular expression structure:

functor Matcher (structure RegExp : REGEXP) :> MATCHER = struct

  structure RegExp = RegExp

  open RegExp

  fun match_is Zero _ k = false
    | match_is One cs k = k cs
    | match_is (Char c) (d::cs) k = if c=d then k cs else false
    | match_is (Times (r1, r2)) cs k =
      match_is r1 cs (fn cs' => match_is r2 cs' k)
    | match_is (Plus (r1, r2)) cs k =
      match_is r1 cs k  orelse  match_is r2 cs k
    | match_is (Star r) cs k =
      k cs  orelse  match_is r cs (fn cs' => match_is (Star r) cs' k)

  fun match r s =
      match_is r (String.explode s) (fn nil => true | false)

end

Note that we must incorporate the parameter structure into the result structure, in accordance with the requirements of the signature.  The function match explodes the string into a list of characters (to facilitiate sequential processing of the input), then calls match_is with an initial continuation that ensures that the remaining input is empty to determine the result.  The type of match_is is

RegExp.regexp -> char list -> (char list -> bool) -> bool.

That is, match_is takes in succession a regular expression, a list of characters, and a continuation of type char list -> bool; it yields as result a value of type bool.  This is a fairly complicated type, but notice that nowhere did we have to write this type in the code!  The type inference mechanism of ML took care of determining what that type must be based on an analysis of the code itself.

Since match_is takes a function as argument, it is said to be a higher-order function.  The simplicity of the matcher is due in large measure to the ease with which we can manipulate functions in ML.  Notice that we create a new, unnamed functions, to pass as a continuation in the case of concatenation --- it is the function that matches the second part of the regular expression to the characters remaining after matching an initial segment against the first part.  We use a similar technique to implement matching against an iterated regular expression --- we attempt to match the null string, but if this fails, we match against the regular expression being iterated followed by the iteration once again.  This neatly captures the "zero or more times" interpretation of iteration of a regular expression.

(Important aside: the code given above contains a subtle error.  Can you find it?  If not, see the chapter on proof-directed debugging for further discussion!)

This completes our brief overview of Standard ML.  The remainder of these notes are structured into three parts.  The first part is a detailed introduction to the core language, the language in which we write programs in ML.   The second part is concerned with the module language, the means by which we structure large programs in ML.  The third is about programming techniques, ideas for building reliable and robust programs.  I hope you enjoy it!

Sample Code for this Chapter