Sample Code for this Chapter

It is difficult to write a program that works well.  A significant part of the problem is to state precisely what it means for a program to work correctly.  What assumptions do we make about the way in which it is invoked?  What guarantees does it make about its results?  How much time and space does it require?  Answers to these questions are called specifications --- descriptions of the expected behavior of a program.  Checking that a particular program satisfies a given specification is called verification.  There are many forms of specification and many techniques for verification of programs.  One form of specification with which you are by now very familiar is a type specification; verification of a type specification is called type checking.  We've seen that type specification and type checking are useful tools for helping us to get programs right.   Another form of specification is an asymptotic time and space bound on a procedure, expressed as a function of the argument to the procedure.  For example, we may state that the function sort : int list -> int list takes time T(n) = O(n lg n) and space S(n) = O(n) for an input of size n.  Verification of a complexity bound is often a tricky business.  Typically we define a recurrence relation governing the time or space complexity of the program, then solve the recurrence using asymptotic methods to obtain the result.
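For instance, for a typical merge sort (offered here purely as an illustration, not as a claim about any particular implementation) the running time satisfies the recurrence

T(n) = 2 T(n/2) + O(n),   with T(1) = O(1),

whose solution by standard methods is T(n) = O(n lg n).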

Type specifications and complexity specifications are useful tools, but it is important to keep in mind that neither says very much about whether the code works properly.  We might define an incorrect sort routine (say, one that always returns its input untouched), verify that its type is int list -> int list, and check that it runs in time O(n lg n), yet the code doesn't sort its input, despite its name!  Clearly more refined forms of specification are needed to state precisely the expected behavior of a program, and some methods are needed to verify that a program satisfies a given specification.  We've explored such forms of specification and verification earlier in these notes, for example when we checked the correctness of various forms of the integer exponential function.  In this chapter we'll put these ideas to work to help us to devise a correct version of the regular expression matcher sketched in the Overview, correcting a subtle error that may not be immediately apparent from inspecting or even testing the code.  The goal of this chapter is to demonstrate the use of specification and verification to discover and correct an error in a program through a technique that we call proof-directed debugging.  We first devise a precise specification of the regular expression matcher, a difficult problem in itself, then attempt to verify that the matching program satisfies this specification.  The attempt to carry out a proof breaks down, and suggests a counterexample to the specification.  We then consider various methods of handling the problem, ultimately settling on a change of specification rather than a change of implementation.

Let us begin by devising a specification for the regular expression matcher.  As a first cut we write down a type specification.  We seek to define a function match of type regexp -> string -> bool that determines whether or not a given string matches a given regular expression.  More precisely, we wish to satisfy the following specification:

For every regular expression r and every string s, match r s terminates, and evaluates to true iff s in L(r).
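For concreteness, here is one way to declare these types in Standard ML.  This is only a sketch: the constructor names are the ones used by the matcher code later in this chapter, but the concrete form of the declaration is otherwise an assumption, not something fixed by the discussion so far.

datatype regexp =
    Zero                        (* 0: matches nothing *)
  | One                         (* 1: matches only the null string *)
  | Char of char                (* a: matches the single character a *)
  | Plus of regexp * regexp     (* r1+r2: alternation *)
  | Times of regexp * regexp    (* r1r2: concatenation *)
  | Star of regexp              (* r*: iteration *)

(* the matcher to be defined is to have the type
   match : regexp -> string -> bool *)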

Recall that the language of a regular expression r is a set of strings L(r) defined as follows:

L(0) = 0

L(1) = 1

L(a) = a

L(r1r2) = L(r1) L(r2)

L(r1+r2) = L(r1) + L(r2)

L(r*) = 1 + L(r) + L(rr) + L(rrr) + ...

where 0 = {}, 1 = {""}, a = {"a"}, L1 L2 = { s1s2 | s1 in L1 and s2 in L2 }, and L1+L2 = { s | s in L1 or s in L2 }.  The language L(r*) can be characterized as the smallest language L such that L = 1 + L(r) L.  For if s in L(r*) as defined above, then s in L(r^i) for some i>=0.  We may show by induction on i that s in L.  If i=0, then s="" in 1, and if i>0, then s=tu with t in L(r) and u in L(r^(i-1)).  By induction u in L, and hence s in L(r)L, and hence s in 1+L(r)L = L.  Conversely, if s in L, then either s in 1, in which case s in L(r*), or s=tu with t in L(r) and u in L.  Inductively u in L(r*), and hence s in L(r)L(r*), and hence s in L(r*).

We saw in the Overview that a natural way to define the procedure match is to use a technique called continuation passing.  We defined an auxiliary function match_is with the type regexp -> char list -> (char list -> bool) -> bool that takes a regular expression, a list of characters (essentially a string, but in a form suitable for incremental processing), and a continuation, and yields a boolean.  The idea is that match_is takes a regular expression r, a character list cs, and a continuation k, and determines whether or not some initial segment of cs matches r, passing the remaining characters cs' to k if there is such an initial segment, and yielding false otherwise.  Put more precisely,

For every regular expression r, character list cs, and continuation k, if cs=cs'@cs'' with cs' in L(r) and k cs'' evaluating to true, then match_is r cs k evaluates to true; otherwise, match_is r cs k evaluates to false.

Unfortunately, this specification is too strong to ever be satisfied by any implementation of match_is!  Can you see why?  The difficulty is that if k is not guaranteed to terminate for all inputs, then there is no way that match_is can behave as required.  If there is no input on which k terminates, the specification requires that match_is return false.  It should be intuitively clear that we can never implement such a function.  Formally, we can reduce the halting problem to the matching problem so defined, which suffices to show that no such match_is procedure exists.  Instead, we must restrict attention to total continuations, those that terminate for all inputs.  This leads to the following revised specification:

For every regular expression r, character list cs, and total continuation k, if cs=cs'@cs'' with cs' in L(r) and k cs'' evaluating to true, then match_is r cs k evaluates to true; otherwise, match_is r cs k evaluates to false.

Observe that the condition "If cs=cs'@cs'' with ..., then ..." contains an implicit existential quantification.  Written out in full, we might say "If there exist cs' and cs'' such that cs = cs'@cs'' with ..., then ...".  This is an important observation because it makes clear that we must search for a suitable splitting of cs into two parts such that the first part is in L(r) and the second is accepted by k.  There may, in general, be many ways to partition the input so as to satisfy both of these requirements; we need only find one such way.  Note, however, that if cs = cs' @ cs'' with cs' in L(r) but k cs'' yielding false, we must reject this partitioning and search for another.  In other words we cannot simply consider any partitioning whose initial segment matches r; we can consider only those that also induce k to accept the corresponding final segment.

Suppose for the moment that match_is satisfies this specification.  Does it follow that match satisfies the original specification?   Recall that match is defined as follows:

fun match r s =
    match_is r (String.explode s) (fn nil => true | _ => false)

Notice that the initial continuation is indeed total, and that it yields true (accepts) iff it is applied to the null string.  Therefore, if match_is satisfies its specification, then match satisfies the following property obtained by plugging in the initial continuation:

For every regular expression r and string s, if s in L(r), then match r s evaluates to true, and otherwise match r s evaluates to false.

This is precisely the property that we desire for match.   Thus match is correct (satisfies its specification) if match_is is correct (satisfies its specification).

So far so good.  But does match_is satisfy its specification?  If so, we are done.  How might we check this?  Recall the definition of match_is given in the Overview:

fun match_is Zero _ k = false
    | match_is One cs k = k cs
    | match_is (Char _) nil _ = false
    | match_is (Char c) (d::cs) k = if c=d then k cs else false
    | match_is (Times (r1, r2)) cs k =
      match_is r1 cs (fn cs' => match_is r2 cs' k)
    | match_is (Plus (r1, r2)) cs k =
      match_is r1 cs k  orelse  match_is r2 cs k
    | match_is (Star r) cs k =
      k cs  orelse  match_is r cs (fn cs' => match_is (Star r) cs' k)

Since match_is is defined by a recursive analysis of the regular expression r, it is natural to proceed by induction on the structure of r.   That is, we treat the specification as a conjecture about match_is, and attempt to prove it by structural induction on r.

We first consider the three base cases.  Suppose that r is 0.  Then no string is in L(r), so match_is must return false, which indeed it does.  Suppose that r is 1.  Since the null string is an initial segment of every string, and the null string is in L(1), we must yield true iff k cs yields true.  Again, this is precisely how match_is is defined.  Suppose that r is a.  Then to succeed, cs must have the form a cs' with k cs' evaluating to true; otherwise we must fail.  The code for match_is checks that cs has the required form and, if so, passes cs' to k to determine the outcome, and otherwise yields false.  Thus match_is behaves correctly for each of the three base cases.

We now consider the three inductive steps.  For r=r1+r2, we observe that some initial segment of cs matches r and causes k to accept the corresponding final segment iff either some initial segment matches r1 and drives k to accept or some initial segment matches r2 and drives k to accept.  By induction match_is works as specified for r1 and r2, which is sufficient to justify the correctness of match_is for r=r1+r2.  For r=r1r2, the proof is slightly more complicated.  By induction match_is behaves according to the specification if it is applied to either r1 or r2, provided that the continuation argument is total.  Note that the continuation k' given by (fn cs' => match_is r2 cs' k) is total, since by induction the inner recursive call to match_is always terminates.  Suppose that there exists a partitioning cs=cs'@cs'' with cs' in L(r) and k cs'' evaluating to true.  Then cs'=cs1'@cs2' with cs1' in L(r1) and cs2' in L(r2), by definition of L(r1r2).  Consequently, match_is r2 (cs2'@cs'') k evaluates to true, and hence match_is r1 (cs1'@cs2'@cs'') k' evaluates to true, as required.  If, however, no such partitioning exists, then one of three situations must obtain.  Either no initial segment of cs matches r1, in which case the outer recursive call yields false, as required.  Or for every initial segment matching r1, no initial segment of the corresponding final segment matches r2, in which case the inner recursive call yields false on every call, and hence the outer call yields false, as required.  Or else every pair of successive initial segments of cs matching r1 and r2 results in k evaluating to false, in which case the inner recursive call always yields false, and hence the continuation k' always yields false, and hence the outer recursive call yields false, as required.  Be sure you understand the reasoning involved here; it is quite tricky to get right!
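As a quick sanity check of the cases considered so far (and assuming the declarations of regexp, match, and match_is given above), a few illustrative evaluations behave as expected; the binding names here are arbitrary:

val test1 = match (Char #"a") "a"                      (* true  *)
val test2 = match (Char #"a") "b"                      (* false *)
val test3 = match (Plus (Char #"a", Char #"b")) "b"    (* true  *)
val test4 = match (Times (Char #"a", Char #"b")) "ab"  (* true  *)
val test5 = match (Times (Char #"a", Char #"b")) "b"   (* false *)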

We seem to be on track, with one more case to consider, r=r1*.   This case would appear to be a combination of the preceding two cases for alternation and concatenation, with a similar argument sufficing to establish correctness.   But there is a snag: the second recursive call to match_is leaves the regular expression unchanged!  Consequently we cannot apply the inductive hypothesis to establish that it behaves correctly in this case, and the obvious proof attempt breaks down.  (Write out the argument to see where you get into trouble.)  What to do?   A moment's thought suggests that we proceed by an inner induction on the length of the string, based on the theory that if some initial segment of cs is in L(r), then either that initial segment is the null string (base case), or cs=cs'@cs'' with cs' in L(r1) and some initial segment of cs'' in L(r) (induction step).  We then handle the base case directly, and handle the inductive case by assuming that match_is behaves correctly for cs'' and showing that it behaves correctly for cs.   But there is a flaw in this argument!  The string cs'' need not be shorter than cs in the case that cs' is the null string!  In that case the inductive hypothesis does not apply, and we are once again unable to complete the proof.  But this time we can use the failure of the proof to obtain a counterexample to the specification!  For if r=1*, for example, then match_is r cs k fails to terminate whenever k cs yields false!  In general, if r=r1* with "" in L(r1), then match_is r cs k may fail to terminate.  In other words, match_is does not satisfy the specification we have derived for it!  Our conjecture is false!
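To see the problem concretely, consider the following call (again assuming the declarations above).  The starred expression accepts the null string, so the matcher makes no progress and loops forever:

val loops = match (Star One) "a"
(* the initial continuation rejects [#"a"], so match_is retries
   (Star One) on the very same character list, and the evaluation
   never terminates *)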

We have used the failure of an attempt to prove that match_is satisfies a reasonable specification of its behavior to discover a bug in the code --- the matcher does not always terminate.  What to do?  One approach is to explicitly check for failure to make progress when matching against an iterated regular expression.   This will work, but at the expense of cluttering the code and imposing additional run-time overhead.  You should write out a version of the matcher that works this way, and check that it indeed satisfies the specification we've given above.  An alternative is to observe that the proof of correctness sketched above goes through, provided that the regular expression satisfies the condition that no iterated regular expression matches the null string.  That is, r* is admitted as a valid regular expression only if "" is not in L(r).  Call a regular expression satisfying this condition standard.  As an exercise check that the proof sketched above goes through under the additional assumption that r is a standard regular expression.
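Here, for illustration, is one possible shape for the first approach, the explicit progress check.  This is only a sketch: only the Star clause differs from the code above, and it relies on the observation that match_is only ever passes a suffix of its argument to its continuation, so a length comparison suffices to detect the lack of progress.

fun match_is Zero _ k = false
    | match_is One cs k = k cs
    | match_is (Char _) nil _ = false
    | match_is (Char c) (d::cs) k = if c=d then k cs else false
    | match_is (Times (r1, r2)) cs k =
      match_is r1 cs (fn cs' => match_is r2 cs' k)
    | match_is (Plus (r1, r2)) cs k =
      match_is r1 cs k  orelse  match_is r2 cs k
    | match_is (Star r) cs k =
      k cs  orelse
      match_is r cs (fn cs' =>
        (* iterate only if matching r consumed at least one character *)
        length cs' < length cs  andalso  match_is (Star r) cs' k)

The length test is precisely the run-time overhead alluded to above.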

Thus the matcher behaves correctly for all standard regular expressions.   But what about those non-standard ones?  A simple observation is that every regular expression is equivalent to one in standard form.  That is, we never really need to consider non-standard regular expressions.  Instead we can pre-process the regular expression to put it into standard form, then call the matcher on the standardized regular expression.  This pre-processing is based on the following definitions.  First, we define null(r) to be the regular expression 1 if r accepts the null string, and the regular expression 0 if not.   Then we define nonnull(r) to be a regular expression r' in standard form such that L(r') = L(r) \ {""} --- that is, r' accepts the same strings as r, except for the null string.  Thus for every regular expression r, we have

L(r) = L(null(r)+nonnull(r)).

Moreover, the regular expression null(r)+nonnull(r) is in standard form.

Here is the definition of null:

null(0) = 0
null(1) = 1
null(a) = 0
null(r1+r2) = null(r1) ++ null(r2)
null(r1r2) = null(r1) ** null(r2)
null(r*) = 1

where we define 0++1=1++0=1++1=1 and 0++0=0 and 0**1=1**0=0**0=0 and 1**1=1.

Here is the definition of nonnull:

nonnull(0) = 0
nonnull(1) = 0
nonnull(a) = a
nonnull(r1+r2) = nonnull(r1)+nonnull(r2)
nonnull(r1r2) = null(r1)nonnull(r2) + nonnull(r1)null(r2) + nonnull(r1)nonnull(r2)
nonnull(r*) = nonnull(r) nonnull(r)*

Check that the stated properties of these regular expressions indeed hold true, and use these equations to define a pre-processor to put every regular expression into standard form.
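Here, for illustration, is a sketch of such a pre-processor in Standard ML, following the equations above and assuming the regexp datatype declared earlier; the names null, nonnull, and standardize are chosen here for convenience.

(* null r is One if the null string is in L(r), and Zero otherwise *)
fun null Zero             = Zero
  | null One              = One
  | null (Char _)         = Zero
  | null (Plus (r1, r2))  =
      (case (null r1, null r2) of (Zero, Zero) => Zero | _ => One)
  | null (Times (r1, r2)) =
      (case (null r1, null r2) of (One, One) => One | _ => Zero)
  | null (Star _)         = One

(* nonnull r accepts the same strings as r, except for the null string,
   and every starred subexpression it contains is non-nullable *)
fun nonnull Zero             = Zero
  | nonnull One              = Zero
  | nonnull (Char c)         = Char c
  | nonnull (Plus (r1, r2))  = Plus (nonnull r1, nonnull r2)
  | nonnull (Times (r1, r2)) =
      Plus (Times (null r1, nonnull r2),
            Plus (Times (nonnull r1, null r2),
                  Times (nonnull r1, nonnull r2)))
  | nonnull (Star r)         = Times (nonnull r, Star (nonnull r))

(* standardize r is equivalent to r and is in standard form *)
fun standardize r = Plus (null r, nonnull r)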

This chapter is based on the paper entitled Proof-Directed Debugging, which is scheduled to appear as a Functional Pearl article in the Journal of Functional Programming.
