15-212-ML : Homework Assignment 3

Due Wed Oct. 7, 12 noon (electronically); papers at recitation.

Maximum Points: 100 (+30 extra credit)

Guidelines

Strive for elegance! Not every program that runs deserves full credit. Make sure to state in comments the invariants, which are sometimes only implicit in the informal presentation of an exercise. If auxiliary functions are required, describe concisely what they implement. Do not reinvent the wheel, and try to make your functions small and easy to understand. Use tasteful layout and avoid long-winded and contorted code.
Make sure that your file compiles and runs. A program which doesn't run will not get full credit and is likely to incur a heavy penalty.
Homework must be all your own work.
Late homework will be accepted only until the start of lecture on Thursday, with a 25% penalty.
If you have any questions about this assignment, contact Philip Wickline at philipw@cs.cmu.edu, Adam Megacz at megacz@usa.net, or use cmu.andrew.academic.cs.15-212-ML.discuss.

Problem 1: Dictionaries (35 points + 10 extra credit)

Binary trees are a particularly useful and versatile data structure. As described in the lecture notes, a binary search tree is a binary tree with values of an ordered type at the nodes, arranged so that for every node in the tree, the value at that node is greater than the value at any node in its left subtree and smaller than the value at any node in its right subtree. Thus an inorder traversal of the tree will yield an enumeration of the values in the tree from least to greatest. This representation invariant makes binary search trees efficient structures to search, taking time proportional to lg(n) to find an element in a balanced tree, where n is the number of elements in the tree.

However, without taking special care, trees that are built from arbitrary insertions of data are not necessarily balanced, meaning that searches may take time O(n) in the worst case. There are many mechanisms for building balanced binary search trees. One such mechanism, red-black trees, is described in the lecture notes. For problem 1 you will implement balanced binary trees using a balancing criterion based upon tree height.

We define the height, h(t), of trees inductively on the structure of the tree:

h(empty) = 0
h(t) = 1 + max(h(t.left), h(t.right))

where t.left and t.right are the left and right children of t, empty is the empty tree, and max is the function which returns the greater of its two arguments. In other words, h(t) is zero if the tree is empty, and one more than the height of the taller child otherwise.
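The definition of h translates almost literally into ML. The following is only a sketch on a hypothetical plain tree type (the names tree, Empty, Node, and height are assumptions made for this illustration; the dictionary type used later caches the height at each node instead of recomputing it):

```sml
(* Hypothetical plain tree type, used only for this illustration.
   It does not cache heights the way the assignment's dict type does. *)
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* height follows the inductive definition of h clause by clause *)
fun height Empty = 0
  | height (Node (l, _, r)) = 1 + Int.max (height l, height r)

val leaf3 = Node (Empty, 3, Empty)
val t = Node (leaf3, 5, Empty)

val h0 = height (Empty : int tree)   (* the empty tree *)
val h1 = height leaf3                (* a single node *)
val h2 = height t                    (* a node with one child *)
```

Recomputing the height this way takes time linear in the size of the tree, which is why the assignment asks you to store it at each node.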

The idea is to store at each node the height of that node in the tree. Then, when new trees are constructed from old trees, the new tree can be balanced by means of one of the rotations described below. In addition to the binary search tree invariant described above, you must maintain the following invariant at each node with height greater than 2:

h(l) <= h(r) * w and h(l) * w >= h(r)

where w is a constant called the weight ratio, l is the left child, and r is the right child. In other words, at every such node, neither child is more than w times as tall as the other.

There are four rotations that you will use to keep trees balanced: single left, single right, double left, and double right. The two left rotations are presented pictorially below; the right rotations are just the mirror images of these.

A single rotation lifts Z relative to X and Y, while a double rotation lifts Y. Therefore, when balancing trees, you should use a double rotation if Y is taller, or a single rotation if Z is taller.
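As a rough sketch of the two left rotations, assume the root a has left subtree X and a right child b whose subtrees are Y and Z. The type tree, the node labels a, b, c, and the helper inorder are hypothetical, and the sketch ignores the cached heights that your dictionary must keep up to date:

```sml
(* Hypothetical plain tree type without cached heights *)
datatype 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* single left:  a(X, b(Y, Z))  ==>  b(a(X, Y), Z)
   Z moves up one level, X moves down one level. *)
fun singleLeft (Node (x, a, Node (y, b, z))) = Node (Node (x, a, y), b, z)
  | singleLeft t = t   (* shape does not apply: leave the tree unchanged *)

(* double left:  a(X, b(c(Y1, Y2), Z))  ==>  c(a(X, Y1), b(Y2, Z))
   Here Y = c(Y1, Y2), and the root of Y becomes the new root. *)
fun doubleLeft (Node (x, a, Node (Node (y1, c, y2), b, z))) =
      Node (Node (x, a, y1), c, Node (y2, b, z))
  | doubleLeft t = t   (* shape does not apply: leave the tree unchanged *)

(* inorder traversal, used to check that a rotation keeps the ordering *)
fun inorder Empty = []
  | inorder (Node (l, v, r)) = inorder l @ (v :: inorder r)

val before = Node (Empty, 1,
                   Node (Node (Empty, 2, Empty), 3, Node (Empty, 4, Empty)))
val afterSingle = singleLeft before
val afterDouble = doubleLeft before
```

Both rotations leave the inorder traversal unchanged, so the binary search tree invariant survives; only the shape of the tree differs.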

Question 1.1 (35 points)

Write a structure Dict with the basic types described in the DICT signature (5 points) and the empty, lookup, insert (20 points), and fold (10 points) values.

signature DICT =
sig
type key = string
type 'a entry = key * 'a
type 'a dict

val empty : 'a dict
val lookup : 'a dict * key -> 'a option
val insert : 'a dict * 'a entry -> 'a dict
val fold : ('a * 'b * 'b -> 'b) -> 'b -> 'a dict -> 'b

(*val delete : 'a dict * key -> 'a dict (* extra credit *) *)
end; (* signature DICT *)

Use w = 3 and use the following datatype as the representation type of an 'a dict.

datatype 'a dict =
E
| T of 'a entry * int * 'a dict * 'a dict; (* data * height * left child * right child *)

(* Invariants (writing height(t) for the height of tree t):
   1. t : 'a dict is a binary search tree
   2. For every tree t = T (e, h, left, right), h = height(t)
   3. For every tree t = T (e, h, left, right),
      height(left) <= w*height(right) and height(right) <= w*height(left)
*)
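When testing, it may help to check invariants 2 and 3 mechanically. The following is a sketch only: the names height, heightsOk, and balancedOk are hypothetical helpers, not part of DICT, and the balance check is applied only at nodes of height greater than 2, following the statement of the invariant earlier in this problem.

```sml
type key = string
type 'a entry = key * 'a
datatype 'a dict =
    E
  | T of 'a entry * int * 'a dict * 'a dict

val w = 3

(* recompute the height from scratch, ignoring the cached field *)
fun height E = 0
  | height (T (_, _, l, r)) = 1 + Int.max (height l, height r)

(* invariant 2: every cached height agrees with the real height *)
fun heightsOk E = true
  | heightsOk (t as T (_, h, l, r)) =
      h = height t andalso heightsOk l andalso heightsOk r

(* invariant 3, checked only at nodes of height greater than 2 *)
fun balancedOk E = true
  | balancedOk (t as T (_, _, l, r)) =
      let
        val hl = height l
        val hr = height r
      in
        (height t <= 2 orelse (hl <= w * hr andalso hr <= w * hl))
        andalso balancedOk l andalso balancedOk r
      end

fun leaf k = T ((k, 0), 1, E, E)
val small = T (("m", 0), 2, leaf "a", E)                 (* height 2 *)
val chain = T (("a", 0), 3, E, T (("b", 0), 2, E, leaf "c"))
val wrongHeight = T (("m", 0), 5, E, E)
```

These checkers recompute heights repeatedly and are far too slow for use inside insert; they are meant only for validating the trees your functions return.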

empty should be the empty tree
lookup (tree, key) should return SOME of the value associated with key in tree if it is in the tree, and NONE otherwise
insert (tree, (key, value)) should take a balanced tree tree and an entry (key, value) and return a balanced tree with the new entry in it. If key already exists in the tree, the old entry should be replaced with the new.
fold f init d should go through the binary tree d and return init if it encounters a leaf (i.e. the empty tree). For every other node in the tree it should return f(datum,vLeft,vRight), where datum is the datum of the node, vLeft is the result of applying fold to the left subtree of that node, and vRight is the result of applying fold to the right subtree.
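To see what the fold specification gives a client, here is one possible reading of it together with two uses. This is a sketch under assumptions: the body of fold is only one way to meet the description above, and size and values are hypothetical client functions that are not part of DICT.

```sml
type key = string
type 'a entry = key * 'a
datatype 'a dict =
    E
  | T of 'a entry * int * 'a dict * 'a dict

(* one possible reading of the specification: f sees the datum only,
   never the key, as the type in the signature requires *)
fun fold f init E = init
  | fold f init (T ((_, datum), _, l, r)) =
      f (datum, fold f init l, fold f init r)

(* number of entries in the dictionary *)
fun size d = fold (fn (_, n, m) => 1 + n + m) 0 d

(* all data, in the order of an inorder traversal *)
fun values d = fold (fn (v, ls, rs) => ls @ (v :: rs)) [] d

val d = T (("m", 2), 2, T (("a", 1), 1, E, E), T (("z", 3), 1, E, E))
val n = size d
val vs = values d
```

Note that f never receives the key, so a client that needs keys must be able to recover them from the data themselves.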

NOTE: Make sure to annotate all your functions with the invariants they expect for the arguments! Otherwise the graders may not be able to understand and assess the correctness of your implementation!

Hint: Your code will be much easier to understand if you use a helper function balance (not visible in the signature!) which takes an entry a and two balanced trees, X and Y, whose heights are not too much out of balance (you need to decide how much that is, it will be an invariant of the function), and constructs the balanced tree derived from the tree

using one of the rotations, if necessary. Again, be sure to state all of the invariants balance expects from its arguments.

Question 1.3 (Extra Credit, 10 points)

Write the function delete which takes a tree and a key and returns a balanced version of the tree created by removing the entry with the given key from the tree. No exception should be raised if the key is not in the tree.

Problem 2: An Address Book (30 points)

For this problem you will write a simple address book utility using the dictionary structure implemented in problem 1. The purpose is to become familiar with programming using a strict interface. You will need to know how to use record types for this exercise. Read pp. 32-35 in Paulson if you don't already know about records.

The following signature ADDRESSBOOK describes address books:

signature ADDRESSBOOK =
    sig
        type key = string
        type entry =
            {Key : key,
             LastName : string,
             FirstName : string,
             Nickname : string option,
             Email : string,
             OfficeBuilding : string,
             OfficeNumber : int}
        type address_book

        val empty : address_book
        val addEntry : address_book * entry -> address_book
        val lookup : address_book * key -> entry option

        exception NotFound of key
        val updateNickName : address_book * key * string -> address_book
        val updateEmail : address_book * key * string -> address_book
        val updateOfficeBuilding : address_book * key * string -> address_book
        val updateOfficeNumber : address_book * key * int -> address_book
       (* question 2.2 & 2.3 *)
        val getRange : address_book * key * key -> entry list
    end

The address book will store entry records with information about the people listed in the address book. The entries should be stored with the concatenation of the last name and the first name as keys. For instance the entry with LastName = "Wickline" and FirstName = "Philip" should have Key = "WicklinePhilip", and should be stored with key "WicklinePhilip" in the collection of entries.
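The key convention can be captured in a small helper. In this sketch makeKey and makeEntry are hypothetical conveniences that are not part of ADDRESSBOOK, and the office building and number are placeholders rather than real data:

```sml
type key = string
type entry =
    {Key : key,
     LastName : string,
     FirstName : string,
     Nickname : string option,
     Email : string,
     OfficeBuilding : string,
     OfficeNumber : int}

(* the key is the last name followed directly by the first name *)
fun makeKey (last : string, first : string) : key = last ^ first

(* hypothetical convenience constructor that fills in Key consistently *)
fun makeEntry {last, first, email, building, number} : entry =
  {Key = makeKey (last, first),
   LastName = last,
   FirstName = first,
   Nickname = NONE,
   Email = email,
   OfficeBuilding = building,
   OfficeNumber = number}

val philip = makeEntry {last = "Wickline", first = "Philip",
                        email = "philipw@cs.cmu.edu",
                        building = "(placeholder)", number = 0}
```

Building entries through one function like this keeps the Key field from drifting out of sync with the name fields.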

Question 2.1 (10 points)

Implement a structure AddressBook with the types described in ADDRESSBOOK and the empty, addEntry, lookup, and various updateFoo functions.

empty should be the empty address book
addEntry(book, entry) creates a new address book extending book with the key #Key entry mapped to entry
lookup(book, key) returns SOME of the entry associated with key in book, if one exists, and NONE otherwise
updateFoo(book, key, value) should return a new address book which is identical to book except that the Foo field of the record associated with key contains value. If key does not occur in book, then the exception NotFound key should be raised.
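Standard ML has no record-update syntax, so changing one field means rebuilding the record. The sketch below shows that step and the conversion from a failed lookup to NotFound; withEmail and require are hypothetical helper names, and how they connect to your dictionary is left to you.

```sml
type key = string
type entry =
    {Key : key,
     LastName : string,
     FirstName : string,
     Nickname : string option,
     Email : string,
     OfficeBuilding : string,
     OfficeNumber : int}

exception NotFound of key

(* copy every field except Email, which is replaced *)
fun withEmail ({Key, LastName, FirstName, Nickname, Email = _,
                OfficeBuilding, OfficeNumber} : entry, email : string) : entry =
  {Key = Key, LastName = LastName, FirstName = FirstName,
   Nickname = Nickname, Email = email,
   OfficeBuilding = OfficeBuilding, OfficeNumber = OfficeNumber}

(* turn the result of a lookup into an entry or a NotFound exception *)
fun require (k : key, NONE) = raise NotFound k
  | require (_, SOME e) = e

val old : entry =
  {Key = "WicklinePhilip", LastName = "Wickline", FirstName = "Philip",
   Nickname = NONE, Email = "old@example.org",
   OfficeBuilding = "(placeholder)", OfficeNumber = 0}
val new = withEmail (old, "philipw@cs.cmu.edu")
```

The other update functions follow the same pattern with a different field replaced.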

Question 2.2 (10 points)

Write the function getRange which when given an address book and a pair of keys, returns the list of all entries in the book whose keys are alphabetically between the two keys, inclusive. For example, if the address book book has entries with keys "WatkinSarah", "SmithJose", and "SmithJoseph", the call getRange(book, "SmithJose", "Taylor") should return the entries associated with the keys "SmithJose" and "SmithJoseph", in no particular order. What is the time complexity of this function? Why is it less efficient than a simple lookup?
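The range test itself is ordinary string comparison. As a small sketch (inRange is a hypothetical helper name), the example above can be checked on a plain list of keys:

```sml
(* true when k lies between lo and hi, inclusive, in string order *)
fun inRange (lo : string, hi : string) (k : string) =
  lo <= k andalso k <= hi

val keys = ["WatkinSarah", "SmithJose", "SmithJoseph"]
val hits = List.filter (inRange ("SmithJose", "Taylor")) keys
```

ML compares strings character by character using character codes, so the ordering is case sensitive and a proper prefix sorts before any of its extensions.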

Question 2.3 (10 points)

Extend the signature and implementation of dictionaries to add facilities to make the getRange function more efficient. Call the new signature DICT1 and the structure Dict1. Modify your address book structure to use this new dictionary structure, and call the new address book structure AddressBook1.

Problem 3: Regular expressions (35 points + 20 extra credit)

In class six different operators have been introduced to describe regular expressions R:

Characters          a
Concatenation       r₁ r₂
The empty string    One
Alternative         r₁ + r₂
Empty set           Zero
Repetition          r*

This has been implemented as the following datatype.

datatype regexp =
    Char of char
  | Times of regexp * regexp
  | One
  | Plus of regexp * regexp
  | Zero
  | Star of regexp;

From time to time it is helpful to have some more constructs available to form regular expressions, such as

A wildcard symbol _ which accepts any single character.

L(_) = {a | a is a character in the alphabet}

Constructor: Underscore

Intersection r₁ & r₂ which accepts a string only if it is in L(r₁) and simultaneously in L(r₂).

L(r₁ & r₂) = {s | s in L(r₁) and s in L(r₂)}

Constructor: And

A wildcard T which matches any string.

L(T) = {s | s is a string over the alphabet}

Constructor: Top

Negation ~ r which accepts every string that is not in L(r).

L(~ r) = {s | s not in L(r)}

Constructor: Not
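Put together, the extended datatype might look like the sketch below. The constructor names come from the list above, but their argument types are assumptions modelled on Plus and Star, and the three example values are hypothetical; check regexp.sml for the actual declaration.

```sml
datatype regexp =
    Char of char
  | Times of regexp * regexp
  | One
  | Plus of regexp * regexp
  | Zero
  | Star of regexp
  | Underscore                  (* _ : any single character *)
  | And of regexp * regexp      (* r1 & r2 : intersection *)
  | Top                         (* T : any string *)
  | Not of regexp               (* ~ r : negation *)

(* _ _ : strings of exactly two characters *)
val twoChars = Times (Underscore, Underscore)

(* (a T) & (T b) : strings that start with a and end with b *)
val aThenB = And (Times (Char #"a", Top), Times (Top, Char #"b"))

(* ~ One : every non-empty string *)
val nonEmpty = Not One
```

Expressions like these make convenient test data once the new cases of the matcher are in place.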

The regular expression matcher accept is available in the file regexp.sml and should be extended to deal with these new constructors. Remind yourself of the specification of the acc function and make sure your new cases fit this specification. You should also generate some test data to validate your implementation. Note that for the purposes of our implementation, the alphabet is simply any ML value of type char.

Question 3.1 (5 points)

Implement the case for the one-character wildcard _.

Question 3.2 (15 points)

Implement the case for intersection r₁ & r₂.

Question 3.3 (15 points)

Implement the case for T.

Question 3.4 (Extra credit, 10 points)

Prove the correctness of the case for intersection in your implementation, following the pattern in the handout.

Question 3.5 (Extra credit, 10 points)

Implement the case for Negation.

Handin instructions

Put your SML code into a single file named ass3.sml in your ass3 directory. All of your definitions should be in this one file. Please keep a backup for your records. Your handin directory for this assignment is

/afs/andrew/scs/cs/15-212-ML/studentdir/<your andrew id>/ass3/ass3.sml