Standard ML at CMU / SML/NJ

Using the SML/NJ System

  1. Back to the Introduction

  2. Interacting with SML/NJ

  3. Using Files and the Standard Basis

  4. Editing SML Programs Using Emacs

  5. Making Sense of Error Messages

  6. Exporting Heaps

  7. Tools


This is a guide to editing and executing Standard ML (SML) programs at Carnegie Mellon University, using the Standard ML of New Jersey system. This document was written by Peter Lee (petel@cs.cmu.edu), with extensive contributions by Robert Harper (rwh@cs.cmu.edu), Iliano Cervesato (iliano@cs.cmu.edu), Carsten Shurmann (carsten@cs.cmu.edu), Frank Pfenning (fp@cs.cmu.edu), and Herb Derby (derby@cs.cmu.edu).

This is not a reference manual for the Standard ML language. If you need a reference manual or a tutorial, the Introduction lists several sources of information, both on-line and in hard copy.


Interacting with SML/NJ

When you start the SML/NJ system, it loads and responds with a message giving the current version number and then a prompt for user input. The prompt is a single dash ("-").

When prompted, you can type in a top-level declaration. There are several kinds of top-level declarations in SML. For example, the following is a declaration of a function called inc that increments its integer argument. (In these examples, the dash ("-") is the SML/NJ prompt, and the text following it on the same line is the user input. The indented lines are the output from the SML/NJ system. Each line of input is ended with a carriage return on Unix-based systems or the Enter key on the PC and Macintosh systems.)

- fun inc x = x + 1;
    val inc = fn : int -> int

The text "fun inc x = x + 1" is the declaration for the inc function. The semicolon (";") is a marker that indicates to the SML/NJ system that it should perform the following actions: elaborate (that is, perform typechecking and other static analyses), compile (to obtain executable machine code), execute, and finally print the result of this declaration. After all of this, the system then prompts for new input and the whole process starts again. This is the so-called "top-level loop". To exit from the SML/NJ system, simply type an end-of-file character (Control-d) to the prompt.

In the example above, the printed result shows that inc is a function that takes an integer argument and yields an integer result. Actually, it is important for you to know that, in SML, functions are "first-class" values, fundamentally no different than other values such as integers. So, to be more precise, it is better to say that the identifier inc has been bound to a value (which happens to be a function, as denoted by the fn keyword above) of type int -> int.
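To make the "first-class" point concrete, here is a minimal sketch. The names twice and fs are hypothetical helpers introduced only for illustration; List.map comes from the standard basis.

```sml
(* inc is an ordinary value whose type happens to be int -> int. *)
fun inc x = x + 1

(* twice (a hypothetical helper) takes a function f and applies it two times. *)
fun twice f x = f (f x)

(* A function can be passed as an argument ... *)
val three = twice inc 1

(* ... stored in a data structure alongside other functions ... *)
val fs = [inc, fn x => x * 2]

(* ... and handed to library functions like any other value. *)
val incremented = List.map inc [1, 2, 3]
```

Here three is bound to 3 and incremented to [2, 3, 4]; the list fs has type (int -> int) list.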

If we had left out the semicolon, then the elaboration, compilation, execution, and printing would have been deferred and a prompt (this time, an equal sign, "=") would be given, for either a continuation of the declaration of inc or else another top-level declaration. When a semicolon is finally entered (perhaps after several more top-level declarations), all of the declarations since the last semicolon would be processed in sequence. For example:

- fun inc x = x + 1
= fun f n = (inc n) * 5;
    val inc = fn : int -> int
    val f = fn : int -> int

In this example, we have defined the inc function as well as a function f that uses inc.

In the interactive top-level loop, the simplest form of input is an expression. For example, after typing in the declarations for inc and f above, we can now call f by typing in:

- f (2+4);
    val it = 35 : int

Notice that since no identifier is given to bind to the value, the interactive system has chosen the identifier it and bound it to the result of compiling and executing the expression f (2+4).

You might have experience with other languages whose implementations support a similar kind of interactive top-level loop. For example, most implementations of the Lisp, Scheme, and Basic languages support top-level loops. If you have experience with any of these languages, then you might expect that re-defining a function will change the binding of the function name, as well as any other functions that call that function. However, in the SML/NJ system, this is not the case. For example, suppose we wish to change the definition of the inc function, so that it increments by two instead of one:

- fun inc x = x + 2;
    val inc = fn : int -> int

In typical Lisp and Scheme systems, such a re-definition would cause the function f to change as well, since f calls inc. But in the SML/NJ system, f's binding does not change, so in fact referring to f now still yields the original function:

- f (2+4);
    val it = 35 : int

To understand why the SML/NJ system behaves in this way, consider what would happen if we re-defined inc so that it had a type different than int -> int, for example:

- fun inc x = (x mod 2 = 0);
    val inc = fn : int -> bool

Here, inc has been changed to a function that returns true if and only if its integer argument is even. Now, if f should also be changed to reflect this re-definition (as it would be in Lisp and Scheme systems), it would fail to typecheck. This is not necessarily a bad thing, but at any rate the SML/NJ system does not bother to go back to earlier top-level declarations and re-elaborate them; hence, f's binding is left unchanged.

If you are already familiar with the SML language, then you can think of the sequence of top-level declarations typed into an SML/NJ interactive top-level loop as being in nested let-bindings:

let fun inc x = x + 1 in
  let fun f n = (inc n) * 5 in
    let fun inc x = x + 2 in
      ...
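
The nested view can be checked directly by closing the let-bindings and evaluating something in the innermost scope. This is only a sketch; the names fResult and incResult are introduced here for illustration.

```sml
val (fResult, incResult) =
  let fun inc x = x + 1 in
    let fun f n = (inc n) * 5 in
      let fun inc x = x + 2 in
        (* f still refers to the first inc; a direct call to inc sees the newest one. *)
        (f (2 + 4), inc 6)
      end
    end
  end
```

The pair evaluates to (35, 8): f uses the inc that was in scope when f was declared, while the direct call uses the later declaration that shadows it.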



Using Files and the Standard Basis

Instead of typing your program into the interactive top-level, it is more productive to put your program into a file (or set of files) and then load it (them) into the SML/NJ system. The simplest way to do this is to use the built-in function use. For example:

- use "myprog.sml";
    [opening myprog.sml]
    ...
    val it = () : unit

The use function takes the name of the file (of type string) to load. If the file exists, it is opened and read, with each top-level declaration in the file processed in turn (and the results printed on the standard output). The "result" of the use function is the unit value ("()").

As your programs get larger and the code becomes spread over many modules, it can become extremely difficult to remember exactly the right order in which to "use" the files. In order to alleviate this problem, the SML/NJ system has a built-in feature called the Compilation Manager, or simply CM, which I highly recommend that you use. (Actually, you might have to start the SML/NJ system by invoking the "sml-cm" binary, instead of simply "sml".) CM is a complex system with documentation available on-line at http://www.cs.princeton.edu/~blume/cm-manual.ps. For most uses the simplest interface is sufficient: simply create a file in the current directory called sources.cm which contains the names of all of your SML source files, listed one per line in any order. Once this file is created, you can use the function CM.make to load, compile, and execute your system. For example, suppose you have three source files, a.sig, b.sml, and c.sml. Then you can create a file called sources.cm with the following contents:

Group is

a.sig
b.sml
c.sml

Note that it does not matter in what order the file names occur. Once this file has been created, typing the following to the SML/NJ system will do whatever is necessary in order to load your program:

- CM.make();

The CM.make function will scan all of your source files and calculate the dependencies among them so as to compile and load them in the right order. If CM.make has already been used before to compile and load your program, then it looks to see what files have been changed since the last "make", and then compiles and loads the minimal number of files necessary in order to bring the system up-to-date. After running CM.make, you might notice a new directory in your source file directory. This new directory is used by CM to "remember" the results of the dependency calculation, as well as to store the results of compiling your files so that they don't have to be compiled again (unless, of course, they have been changed).
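
As an illustration of the kind of dependencies CM works out, here is a sketch of what the three files might contain. The contents are entirely hypothetical (the names COUNTER, Counter, and Main are invented for this example), and the three files are shown in one listing; under CM each part would live in the file named in its comment.

```sml
(* a.sig: a hypothetical interface *)
signature COUNTER =
sig
  val next : int -> int
end

(* b.sml: an implementation of that interface; depends on a.sig *)
structure Counter : COUNTER =
struct
  fun next n = n + 1
end

(* c.sml: a client; depends on b.sml and therefore indirectly on a.sig *)
structure Main =
struct
  val two = Counter.next 1
end
```

With plain use, these files would have to be loaded in exactly the order a.sig, b.sml, c.sml. The point of sources.cm is that this ordering is calculated rather than remembered.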

There is an extensive set of pre-defined values and functions in the SML/NJ system. This is referred to as the standard basis, or sometimes the pervasive environment. As with CM, there is also extensive documentation available on-line for the standard basis at http://cm.bell-labs.com/cm/cs/what/smlnj/basis/index.html. (A book on the standard basis will be published soon.) For dealing with files, the following function is often useful:

OS.FileSys.chDir : string -> unit

This function implements the standard "cd" Unix command, which changes the current working directory to the directory specified in the string argument. This is useful if you have started the SML/NJ system in a directory different from the one containing your source files.
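
A short sketch of how this might be used in a session. OS.FileSys.getDir, also from the standard basis, returns the current working directory. The target directory ".." is only a placeholder, and the names startDir and nowIn are introduced for illustration; substitute the directory that holds your source files.

```sml
(* Remember where the session started. *)
val startDir = OS.FileSys.getDir ()

(* Move to the parent directory (a placeholder for your own source directory). *)
val () = OS.FileSys.chDir ".."

(* Confirm where we are now. *)
val nowIn = OS.FileSys.getDir ()

(* Return to the starting directory. *)
val () = OS.FileSys.chDir startDir
```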

Another set of values is useful for controlling the output produced by the SML/NJ system:

Compiler.Control.Print.printDepth : int ref
Compiler.Control.Print.printLength : int ref

These variables control the maximum depth and length to which lists, tuples, and other data structures are to be printed. When a data structure is deeper than printDepth or longer than printLength, the remaining portion of the structure is printed as an ellipsis ("...").

To change the value of one of these variables, an assignment can be used. For example:

- Compiler.Control.Print.printDepth := 10;

changes the maximum print depth to ten.

The standard basis contains many modules and functions for manipulating values of all of the basic types, including booleans, integers, reals, characters, strings, arrays, and lists. Unfortunately, the SML/NJ system does not provide any kind of browser, so either you need to refer to the written documentation for the standard basis, or use a little bit of a hack in order to see the complete set of basis functions currently supplied in the SML/NJ system for these types. For example, type the following to the interactive top-level:

- signature S = INTEGER;

Each set of standard basis functions is encapsulated in an SML module, and each such module has a signature, or "interface", whose name is written entirely in uppercase and refers to the type of values for which the module provides functionality. (Note that SML is case sensitive.) For the integer functions, the signature is called INTEGER. So, the above declaration simply binds the identifier S to the signature INTEGER, which causes the SML/NJ system to respond with a listing of the entire INTEGER interface. (We could have used any name besides S.) Other useful signatures include BOOL, REAL, CHAR, STRING, ARRAY, and LIST. For functions that interface to the operating system (such as OS.FileSys.chDir above), see the signature OS (and POSIX, if provided). There are many other useful modules in the standard basis as well.
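
Once a function has been found in one of these signatures, it is called through the corresponding structure (Int for INTEGER, String for STRING, and so on). A small sampler, offered as a sketch; the value names on the left are arbitrary.

```sml
val asText   = Int.toString 42        (* integer to string *)
val larger   = Int.max (3, 7)         (* maximum of a pair of integers *)
val parsed   = Int.fromString "123"   (* an int option; NONE if the string does not parse *)
val len      = String.size "hello"    (* number of characters in a string *)
val isDigit  = Char.isDigit #"7"      (* character classification *)
val asReal   = Real.fromInt 2         (* integer to real conversion *)
val reversed = List.rev [1, 2, 3]     (* list reversal *)
```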



Editing SML Programs Using Emacs

I recommend using Emacs to edit your SML programs and also to manage interaction with the SML/NJ system. To do this, you should incorporate the "sml mode" into your emacs startup file. The relevant emacs lisp files can be found in the same directory tree as the SML/NJ system itself. For example, from Unix machines in the Computer Science Department, you can simply add the line

(load "/usr/local/lib/sml/sml-mode/sml-site")

to your .emacs file so that the next time you start Emacs, the sml mode will be present. From the Andrew network, you can find the emacs lisp files in the 15-411 course directory.

With the sml mode, a special editing mode will be invoked any time you edit a file with an appropriate extension (such as ".sml"; other extensions can be specified in the init.el file). As in other special editing modes, using the Tab key or Control-j will cause emacs to attempt to indent your code in a pleasing way. Control-c followed by Tab will indent the current region. Since SML's syntax is rather complex, the sml mode indentation can be rather haphazard at times. Still, many people find it to be quite useful. A particularly useful key combination is "Meta" along with a vertical bar ("|"); this creates a template for an arm of a case expression or clause of a function.

To run SML/NJ from Emacs, make sure that the emacs variable sml-program-name is set to "sml" (which is the default), and then type M-x sml (that is, "Meta" along with "x", followed by "sml"). This will start up the SML/NJ system as an inferior shell process. There are several useful emacs commands for interacting with the inferior sml shell. You can find documentation for them by hitting Control-h m. Some of the most basic commands are

C-c C-l   save the current buffer and then "use" the file
C-c C-r   send the current region to the sml shell
C-c `     find the next error message and position the cursor on the corresponding line in the source file
C-c C-s   split the screen and show the sml shell



Making Sense of Error Messages

As with most compilers, the SML/NJ system often produces error messages that can be hard to decipher. The problem is compounded by the fact that SML supports polymorphic type inference, which makes it very difficult for the compiler to figure out precisely the real source of a type error. On the other hand, once all of the compile-time type errors are removed, it is often the case that the bulk of the bugs have already been stamped out. In practice, SML programs often work the first time, once all of the type errors reported by the compiler have been removed!

Type mismatches

The most common kind of error is the simple type mismatch. For example, suppose we have the following code in a file called myprog.sml:

fun inc x = x + 1
fun f n = inc true

Notice that a semicolon is not needed here, since the end-of-file marker will serve the same purpose. Now, if we load this file, we get the following error message:

- use "myprog.sml";
    myprog.sml:2.11-2.18 Error: operator and operand don't agree (tycon mismatch)
    operator domain: int
    operand: bool
    in expression:
    inc true

The error message indicates that the expression inc true, on line 2, between columns 11 and 18, is guilty of a type mismatch. The function inc is being applied to an argument of type bool in this expression, but its domain (argument type) is int.

If we are using the sml mode in Emacs, then typing C-c C-l in an edit buffer containing the program would cause the SML/NJ system to load the file, and then typing C-c ` would move the edit cursor to the exact point in the program corresponding to this error message.

Unresolved overloading

Some of the arithmetic operators, such as +, *, -, <, and so on, are "overloaded", in the sense that they can be used with either integer arguments or real arguments. This overloading feature leads to a possible source of confusion for the novice SML programmer. Consider, for example, the following declaration of a function for squaring numbers:

fun square x = x * x

The following error message is given for this program:

myprog.sml:1.18 Error: overloaded variable not defined at type
symbol: *
type: 'Z

Because there is not enough information in this program to determine whether the * is for integers or for reals, an error message is generated to complain about the inability to "resolve" the overloading.

The simple fix for this kind of error is simply to declare the type of one of the arguments to (or the result of) the arithmetic operation. For example, here are three versions that work:

fun square' x = x * x : int
fun square'' (x : int) = x * x
fun square''' x : int = x * x

The first version explicitly declares the type of the product x * x, since the annotation applies to the whole expression rather than only to the second argument of the * operator. The second version declares the type of the argument. Finally, the third version declares the type of the result of the square''' function. All three versions give the SML type inference mechanism enough information to infer the types of the identifiers in the declarations.
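
The same technique selects the real version of the operator. A brief sketch, with hypothetical function names:

```sml
(* Annotating the parameter as real makes * the multiplication on reals. *)
fun squareReal (x : real) = x * x

(* Annotating it as int makes * the multiplication on integers. *)
fun squareInt (x : int) = x * x

val r = squareReal 1.5
val n = squareInt 7
```

Here squareReal has type real -> real and squareInt has type int -> int. Note that SML does not convert between the two implicitly, so squareReal 7 is a type error; write squareReal 7.0 or squareReal (Real.fromInt 7) instead.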

It is not uncommon to spend quite a long time tracking down the source of a type error. (Actually, the time spent doing this is almost always much less than the time it takes to track down the same error without the benefit of static typechecking!) A common way to narrow down the possibilities, and also to improve the precision of the error messages produced by the compiler, is to annotate the program with explicit types, in the way that we have done above. It is particularly helpful to annotate the types of function parameters, as we have done in square'' above. This is similar to the declaration of parameter types in languages such as C and Pascal. Of course, in those languages the declarations are required; in SML they are optional.

The value restriction

One of the most fundamental changes in the 1997 revision of the SML language is that it now enforces something called the value restriction. Essentially, this restricts polymorphism to expressions that clearly are values, specifically single identifiers and functions. When this restriction is violated, the error message "nongeneralizable type variable" is given. For example, the following program results in this error:

fun id x = x

fun map f nil = nil
  | map f (h::t) = (f h) :: (map f t)

val f = map id

The message given is

myprog.sml:6.1-6.14 Error: nongeneralizable type variable
f : 'Y list -> 'Y list

which indicates that the expression map id is polymorphic, but not syntactically a value (that is, not an identifier or lambda expression), and hence the attempt to use it as a polymorphic value (by binding f to it) violates the value restriction. The reasons for this restriction are beyond the scope of this document, but are explained in several papers as well as the textbook by Paulson.
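
A common way to satisfy the restriction, sketched here, is to give f an explicit parameter (sometimes called eta-expansion). A function declaration is syntactically a value, so f can again be polymorphic. The names ints and strs are introduced only to show f used at two different types.

```sml
fun id x = x

fun map f nil = nil
  | map f (h::t) = (f h) :: (map f t)

(* Instead of  val f = map id,  make the list argument explicit. *)
fun f l = map id l

(* f now has type 'a list -> 'a list and can be used at several types. *)
val ints = f [1, 2, 3]
val strs = f ["a", "b"]
```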

Syntax errors

Because the syntax of SML is rather complex, there are several common errors that novices tend to make. One of the most common has to do with the syntax of patterns in clausal-form function declarations and case expressions. Consider the following code:

datatype 'a btree = Leaf of 'a
                  | Node of 'a btree * 'a btree
fun preorder Leaf(v) = [v]
  | preorder Node(l,r) = preorder l @ preorder r

The SML/NJ system complains vigorously over this:

myprog.sml:4.5-5.48 Error: data constructor Leaf used without argument in pattern
myprog.sml:4.5-5.48 Error: data constructor Node used without argument in pattern
myprog.sml:4.1-5.48 Error: pattern and expression in val rec dec don't agree (tycon mismatch)

pattern: 'Z -> ('Z * 'Z) list
expression: 'Z -> 'Z * 'Z -> ('Z * 'Z) list
in declaration:
preorder = (fn arg => (fn <pat> => <exp>))

The problem here is that Leaf and Node are patterns that are syntactically separate from, respectively, the (v) and (l,r) patterns. The (admittedly strange) syntax of SML requires extra parenthesization:

fun preorder (Leaf v) = [v]
  | preorder (Node(l,r)) = preorder l @ preorder r

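The parentheses are required wherever SML expects an atomic pattern, most notably for each argument in a clausal-form function declaration and for a constructor application nested inside another constructor application. In the arms of case expressions and exception handlers, a pattern such as Leaf v is accepted without the extra parentheses, but adding them is harmless and is a safe habit.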
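
A minimal sketch of parenthesized constructor patterns in a case expression and in an exception handler. The function names size and describe, the exception BadInput, and the sample tree are all hypothetical.

```sml
datatype 'a btree = Leaf of 'a
                  | Node of 'a btree * 'a btree

(* Case expression: each constructor pattern is parenthesized. *)
fun size t =
  case t of
      (Leaf _) => 1
    | (Node (l, r)) => size l + size r

exception BadInput of string

(* Exception handler: the same parenthesized form is accepted here. *)
fun describe (f : unit -> int) =
  Int.toString (f ()) handle (BadInput msg) => "bad input: " ^ msg

val sampleTree = Node (Leaf 1, Node (Leaf 2, Leaf 3))
val leaves = size sampleTree
```

With these declarations, leaves is 3, and describe (fn () => raise BadInput "x") yields the string "bad input: x".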

Another rather confusing part of the syntax has to do with the interaction between case expressions, exception handlers, and clausal-form function declarations. Consider the following function, taken in slightly modified form from the SML/NJ library (which is described later):

datatype 'a option = NONE | SOME of 'a
fun filter pred l =
      let fun filterP (x::r, l) =
                case (pred x) of
                   SOME y => filterP(r, y::l)
                 | NONE => filterP(r, l)
            | filterP ([], l) = rev l
      in
        filterP (l, [])
      end

In this example, the local function filterP is defined in two clauses, the first handling the case of a non-empty list argument, and the second handling the empty list. In the first clause, a case expression is used. The syntactic ambiguity arises from the fact that it takes too much "lookahead" to figure out whether or not the second clause of filterP is actually the third arm of the case expression. This leads to the following rather cryptic error message:

myprog.sml:8.23-8.28 Error: syntax error: deleting EQUALOP ID
myprog.sml:9.3-9.13 Error: syntax error: deleting IN ID

As before, parenthesization fixes the problem:

fun filter pred l =
      let fun filterP (x::r, l) =
                (case (pred x) of
                    SOME y => filterP(r, y::l)
                  | NONE => filterP(r, l))
            | filterP ([], l) = rev l
      in
        filterP (l, [])
      end

Alternatively, in this example we can also exchange the two clauses of filterP:

fun filter pred l =
      let fun filterP ([], l) = rev l
            | filterP (x::r, l) =
                case (pred x) of
                   SOME y => filterP(r, y::l)
                 | NONE => filterP(r, l)
      in
        filterP (l, [])
      end

As with many programming languages, the basic advice to follow is: When in doubt, parenthesize.
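
For completeness, here is a sketch of the parenthesized version in use. It relies on the option type already provided by the standard basis rather than re-declaring it, and the predicate evenTimesTen is a hypothetical example.

```sml
fun filter pred l =
      let fun filterP (x::r, l) =
                (case (pred x) of
                    SOME y => filterP(r, y::l)
                  | NONE => filterP(r, l))
            | filterP ([], l) = rev l
      in
        filterP (l, [])
      end

(* Keep the even numbers, multiplied by ten; drop the odd ones. *)
fun evenTimesTen x = if x mod 2 = 0 then SOME (x * 10) else NONE

val kept = filter evenTimesTen [1, 2, 3, 4]
```

The value kept is [20, 40]: results are accumulated in reverse and restored to their original order by rev in the final clause.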



Exporting Heaps

The SML language encourages modularity, and in practice separate modules tend to be placed into separate files. While this is useful during development, it becomes highly inconvenient when you finally "ship" your finished program to your users. The standard way to ship a program, then, is to save an image of the system heap after all of your files have been loaded. This is referred to as "exporting" the heap, and results in a single file that contains the state of your SML world at the time you performed the export operation.

You can export a heap with the function exportML. For example, to save the heap image in a file called mysml, the following should be typed to the SML/NJ prompt:

- SMLofNJ.exportML "mysml";

This will save the current state of the SML/NJ system into the file mysml. This can then be executed later by running the sml system with the command-line option "@SMLload=mysml". This will restart the SML/NJ system at the point at which the exportML took place. (Note that exportML is not supported for the Macintosh System 7 version.)

There is also a function called exportFn, which saves an SML state as a function that takes in the shell command-line arguments when restarted. The type of exportFn is

SMLofNJ.exportFn : string * (string * string list -> OS.Process.status) -> unit

The first argument is the name of the file to contain the exported heap image. The second argument is a function that takes the command name and the list of command-line arguments (as strings) and returns a process-status value (usually OS.Process.success or OS.Process.failure).
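
A sketch of a function with the shape that exportFn expects. The name main, its behavior, and the file name myprog are hypothetical. The export call itself is left as a comment, since running it writes a heap image file.

```sml
(* Takes the command name and the argument list; returns a process status. *)
fun main (cmdName : string, args : string list) : OS.Process.status =
  (print (cmdName ^ " received " ^ Int.toString (length args) ^ " argument(s)\n");
   OS.Process.success)

(* To export, one would type something like the following at the prompt:
     SMLofNJ.exportFn ("myprog", main);
*)
```

Returning OS.Process.failure instead of OS.Process.success is the usual way for such a function to signal an error to the shell.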



Tools

In addition to the standard basis, the SML/NJ system comes with several tools and libraries. The ml-lex and ml-yacc programs perform automatic generation of lexical analyzers and LALR(1) parsers, respectively. Documentation for these and other useful tools can be found at the SML/NJ documentation page.

An extensive library of useful data structures and functions is also available, at http://cm.bell-labs.com/cm/cs/what/smlnj/doc/smlnj-lib/index.html.

Finally, concurrency and interaction with the X window system are supported by the Concurrent ML and eXene extensions to SML, available at http://cm.bell-labs.com/cm/cs/who/jhr/sml/eXene/index.html.



petel@cs.cmu.edu