Lists are one example of the notion of a *recursive datatype*. ML provides
a general mechanism, the `datatype`

declaration, for introducing recursive
types. Earlier we introduced `type`

declarations as an abbreviation
mechanism. Types are given names as documentation and as a convenience to the programmer,
but doing so is semantically inconsequential --- one could replace all uses of the type
name by its definition and not effect the behavior of the program. In contrast the `datatype`

declaration provides a means of introducing a *new* type that is distinct from all
other types and that does not merely stand for some other type. It is the means by
which the ML type system may be extended by the programmer.

The `datatype`

declaration in ML has a number of facets. A `datatype`

declaration introduces

- One or more "new" type constructors. The type constructors introduced may, nor may not, be (mutually) recursive.
- One or more "new" value constructors for each of the type constructors introduced by the declaration.

The type constructor may take zero or more arguments; a zero-argument, or *nullary*,
type constructor is just a type. Each value constructor may also take zero or more
arguments; a nullary value constructor is just a constant. The type and value
constructors introduced by the declaration are "new" in the sense that they are
distinct from all other type and value constructors previously introduced; if a datatype
re-defines an "old" type or value constructor, then the old definition is
shadowed by the new one, rendering the old ones inaccessible in the scope of the new
definition.

Here's a simple example of a nullary type constructor with four nullary value constructors.

datatype suit = Spades | Hearts | Diamonds | Clubs

This declaration introduces a new type `suit`

with four nullary value
constructors, `Spades`

, `Hearts`

, `Diamonds`

, and `Clubs`

.
This declaration may be read as introducing a type `suit`

such that a
value of type `suit`

is either `Spades`

, or `Hearts`

, or `Diamonds`

,
or `Clubs`

. There is no significance to the ordering of the constructors
in the declaration; we could just as well have written

datatype suit = Hearts | Diamonds | Spades | Clubs

(or any other ordering, for that matter). It is conventional to capitalize the names of value constructors, but this is not required by the language.

Given the declaration of the type suit, we may define functions on it by case analysis on the value constructors using a clausal function definition. For example, we may define the suit ordering in Bridge by the function

fun outranks (Spades, Spades) = false

| outranks (Spades, _) = true

| outranks (Hearts, Spades) = false

| outranks (Hearts, Hearts) = false

| outranks (Hearts, _) = true

| outranks (Diamonds, Clubs) = true

| outranks (Diamonds, _) = false

| outranks (Clubs, _) = false

This defines a function of type

suit * suit -> bool

that determines whether or not the first `suit`

outranks the second.

Data types may also be *parameterized* by another type. For example,

datatype 'a option = NONE | SOME of 'a

introduces the unary type constructor `'a option`

. The values of type *typ
*`option`

are:

- The constant
`NONE`

, and - Values of the form
`SOME`

*val,*where*val*is a value of type*typ*.

For example, some values of type string option are `NONE`

, `SOME`

"abc", and `SOME`

"def".

The option type constructor is pre-defined in Standard ML. One common use of
option types is to handle functions with an optional argument. For example, here is
a function to compute the base-*b *exponential function for natural number
exponents that defaults to base 2:

fun expt (NONE, n) = expt (SOME 2, n)

| expt (SOME b, 0) = 1

| expt (SOME b, n) =

if n mod 2 = 0 then expt (SOME b*b, n div 2) else b * expt (SOME b, n-1)

The advantage of the option type in this sort of situation is that it avoids the need
to make a special case of a particular argument, *e.g.*, using `0`

as
first argument to mean "use the default exponent".

A related use of option types is in aggregate data structures. For example, an address book entry might have a record type with fields for various bits of data about a person. But not all data is relevant to all people. For example, someone may not have a spouse, but they all have a name. For this we might use a type definition of the form

type entry = { name:string, spouse:string option, ... }

so that one would create an entry for an unmarried person with a `spouse`

field of `NONE`

.

The next level of generality is the recursive type definition. For example, one
may define a type *typ *`tree`

of binary trees with values of type *typ*
at the nodes using the following declaration:

datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

This declaration corresponds directly to the informal definition of binary trees with
values of type *typ* at the nodes:

- The empty tree
`Empty`

is a binary tree. - If
*tree*and_{1}*tree*are binary trees, and_{2}*val*is a value of type*typ*, then`Node`

`(`

*tree*,_{1}*val*,*tree*_{2}`)`

is a binary tree. - Nothing else is a binary tree.

The distinguishing feature of this definition is that it is *recursive* in the
sense that binary trees are constructed out of other binary trees, with the empty tree
serving as the base case.

Incidentally, a *leaf* in a binary tree is here represented as a node both of
whose children are the empty tree. Our definition of binary trees is analogous to
starting the natural numbers with zero, rather than one. In fact you can think of
the children of a node in a binary tree as the "predecessors" of that node, the
only difference compared to the usual definition of predecessor being that a node has two,
rather than one, predecessors.

To compute with a recursive type one ordinarily defines recursive functions. For
example, here is the function to compute the *height* of a binary tree:

fun height Empty = 0

| height (Node (lft, _, rht)) = 1 + max (height lft, height rht)

Notice that `height`

is called recursively on the children of a
node, and is defined outright on the empty tree. This pattern of definition is
called *structural induction*. The function `height`

is said to be
defined by induction on the structure of its argument, a tree. The general idea is
to define the function directly for the base cases of the recursive type (*i.e.*,
value constructors with no arguments or whose arguments do not involve values of the type
being defined), and to define it for non-base cases in terms of its definitions for the
constituent values of that type. We will see numerous examples of this as we go
along.

Here's another example. The *size* of a binary tree is the
number of nodes occurring in it. Here's a straightforward definition in ML:

fun size Empty = 0

| size (Node (lft, _, rht)) = 1 + size lft + size rht

The function size is defined by structural induction on trees.

A word of warning. One reason to capitalize value constructors is to avoid a
pitfall in the ML syntax. Suppose we gave the following definition of `size`

:

fun size empty = 0

| size (Node (lft, _, rht)) = 1 + size lft + size rht

What happens? The compiler will warn us that the second clause of the definition
is *redundant*! Why? Because `empty`

, spelled with a
lower-case "e", is a *variable*, not a *constructor*, and hence
matches *any* tree whatsoever. Consequently the second clause never applies.
By capitalizing constructors we can hope to make mistakes such as these more
evident, but in practice you are bound to run into this sort of mistake.

The `tree`

datatype is appropriate for binary trees: those for which every
node has exactly two children. (Of course, either or both children might be the
empty tree, so we may consider this to define the type of trees with *at most* two
children; it's a matter of terminology which interpretation you prefer.) It should
be obvious how to define the type of *ternary* trees, whose nodes have at most
three children, and so on for other fixed arities. But what if we wished to define a
type of trees with a *variable* number of children? In a so-called *variadic
tree* some nodes might have three children, some might have two, and so on. This
can be achieved in at least two ways. One way combines lists and trees, as follows:

datatype 'a tree = Empty | Node of 'a * 'a tree list

Each node has a *list* of children, so that distinct nodes may have different
numbers of children. Notice that the empty tree is distinct from the tree with one
node and no children because there is no data associated with the empty tree, whereas
there is a value of type `'a`

at each node.

Another approach is to simultaneously define a variadic tree to be either empty, or a node collecting together a forest to form a tree, and a forest to be either empty or a variadic tree together with another forest. This leads to the following definition:

datatype 'a tree = Empty | Node of 'a * 'a forest

and 'a forest = Nil | Cons of 'a tree * 'a forest

This example illustrates the introduction of two *mutually recursive datatypes*,
which is why we present it here. Mutually recursive datatypes beget mutually
recursive functions defined on them. Here's a definition of the size (number of
nodes) of a variadic tree:

fun size_tree Empty = 0

| size_tree (Node (_, f)) = 1 + size_forest f

and size_forest Nil = 0

| size_forest (Cons (t, f')) = size_tree t + size_forest f'

Notice that we define the size of a tree in terms of the size of a forest, and *vice
versa*, just as the type of trees is defined in terms of the type of forests.

Many other variations are possible. Suppose we wish to define a notion of binary
tree in which data items are associated with branches, rather than nodes. Here's a `datatype`

declaration for such trees:

datatype 'a tree = Empty | Node of 'a branch * 'a branch

and 'a branch = Branch of 'a * 'a tree

Notice that in contrast to our first definition of binary trees in which the branches
from a node to its children were *implicit*, these branches are now *explicit*
since they are labelled with data items. For example, we can collect up into a list
the data items labelling the branches of such a tree using the following code:

fun collect Empty = nil

| collect (Node (Branch (ld, lt), Branch (rd, rt))) =

ld :: rd :: (collect lt) @ (collect rt)

Returning to the original definition of binary trees (with data items at the nodes),
observe that the *type* of the data items at the nodes must be the same for every
node of the tree. For example, a value of type `int tree`

has an integer
at every node, and a value of type string tree has a string at every node. Therefore
it makes no sense to evaluate the expression

Node (Empty, 43, Node (Empty, "43", Empty))

since the result, if it were to be accepted, would be a "heterogeneous" tree with integers at some nodes and strings at others. Such structures are ruled out in ML as type-incorrect.

In 95% of the cases this apparent restriction is no restriction at all; it is quite
rare to encounter heterogeneous data structures in real programs. For example, a
dictionary with strings as keys might be represented as a binary search tree with strings
at the nodes; there is no need for heterogeneity to represent such a data structure.
But what about the other 5%? What if one really wanted to have a tree with integers
at some nodes and strings at others? How would one represent such a thing in
ML? To see the answer, first think about how one might manipulate such a data
structure. When accessing a node, we would need to check at run-time whether the
data item is an integer or a string; otherwise we would not know whether to, say, add 1 to
it, or concatenate "1" to the end of it. This suggests that the data item
must be *labelled* with sufficient information so that we may determine the type of
the item at run-time. We must also be able to recover the underlying data item
itself so that familiar operations (such as addition or string concatenation) may be
applied to it. This is neatly achieved using a datatype declaration. Suppose
we wish to represent the type of integer-or-string trees. First, we define the type
of values to be integers or strings, marked with a constructor indicating which:

datatype int_or_string = Int of int | String of string

Then we define the type of interest as follows:

type int_or_string_tree = int_or_string tree

*Voila!* Perfectly natural and easy --- heterogeneity is really a special
case of homogeneity!

`Datatype`

declarations and pattern matching are extremely useful for
defining and manipulating the *abstract syntax* of a language. For example,
we may define a small language of arithmetic expressions using the following declaration:

datatype expr = Numeral of int | Plus of expr * expr | Times of expr * expr

This definition has only three clauses, but one could readily imagine adding others. Here is the definition of a function to evaluate expressions of the language of arithmetic expressions written using pattern matching:

fun eval (Numeral n) = Numeral n

| eval (Plus (e1, e2)) =

let

val Numeral n1 = eval e1

val Numeral n2 = eval e2

in

Numeral (n1+n2)

end

| eval (Times (e1, e2)) =

let

val Numeral n1 = eval e1

val Numeral n2 = eval e2

in

Numeral (n1*n2)

end

The combination of `datatype`

declarations and pattern matching contributes
enormously to the readability of programs written in ML. A less obvious, but more
important, benefit is the error checking that the compiler can perform for you if you use
these mechanisms in tandem. As an example, suppose that we extend the type `expr`

with a new component for the reciprocal of a number, yielding the following revised
definition:

datatype expr =

Numeral of int | Plus of expr * expr | Times of expr * expr | Recip of expr

First, observe that the "old*" *definition of `eval`

is no
longer applicable* *to values of type* *`expr`

! For
example, the expression

eval (Plus (Numeral 1, Numeral 2))

is ill-typed, even though it doesn't use the `Recip`

constructor. The
reason is that the re-declaration of `expr`

introduces a "new" type
that just happens to have the same name as the "old" type, but is in fact
distinct from it. This is a boon because it reminds us to recompile the old code
relative to the new definition of the `expr`

type.

Second, upon recompiling the definition of `eval`

we encounter an *inexhaustive
match* warning: the old code no longer applies to every value of type `expr`

according to its new definition! We are of course lacking a case for `Recip`

,
which we may provide as follows:

fun eval (Numeral n) = Numeral n

| eval (Plus (e1, e2)) = ... as before ...

| eval (Times (e1, e2)) = ... as before ...

| eval (Recip e) =

let val Numeral n = eval e in Numeral (1 div n) end

The value of the checks provided by the compiler in such cases cannot be overestimated.
When recompiling a large program after making a change to a `datatype`

declaration the compiler will automatically point out *every line of code* that
must be changed to conform to the new definition; it is impossible to forget to attend to
even a single case. This is a tremendous help to the developer, especially if she is
not the original author of the code being modified and is another reason why the static
type discipline of ML is a positive benefit, rather than a hindrance, to programmers.