15-212-X : Homework Assignment 6

Due Tue Nov 26, Noon (electronically).
No late assignments will be accepted!

Maximum Points: 100 [+20 extra credit]

Guidelines

While we acknowledge that beauty is in the eye of the beholder, you should nonetheless strive for elegance in your code. Not every program which runs deserves full credit. Make sure to state invariants in comments which are sometimes implicit in the informal presentation of an exercise. If auxiliary functions are required, describe concisely what they implement. Do not reinvent wheels and try to make your functions small and easy to understand. Use tasteful layout and avoid long-winded and contorted code.
Make sure that your file compiles and runs. A program which doesn't run will not get full credit and is likely to incur a heavy penalty.
Homeworks must be all your own work.
If you have any questions about the assignment, contact Iliano Cervesato at iliano@cs.cmu.edu or use cmu.andrew.academic.15-212-X.discuss.

Problem 1: Extraction of Documentation from ML Signatures (60 pts)

ML signatures provide typing information for the objects exported by structures. They are also an ideal place to store informal specifications that should be fulfilled by the implementation of these objects: invariants, intended meaning, operational behavior, etc. are all useful documentation that can be conveniently written as comments in the signature (or elsewhere). However, a signature quickly becomes hard to read if overcrowded with comments.

In this part of the assignment, you will be asked to write a program that reads a file containing an ML signature, gets rid of all the comments, and writes an HTML file containing slots to be filled in with documentation for the various specifications.

Question 1.1: Lexical Analysis (30 pts)

We will be interested in scanning SML files containing one signature and nothing else. For simplicity, we will consider a slightly restricted form of the grammar of ML signatures. Although simplified, it is rich enough to handle most signatures you are likely to write, including the code for this assignment. The grammar is as follows:

<sigs>     ::= <sigdecl>
             | <sigdecl> ;
<sigdecl>  ::= signature <idf> = <sigexp>
<sigexp>   ::= sig <spec> end
             | <idfseq>
<spec>     ::= val <idf> : <idfseq>
             | type <typedesc>
             | eqtype <typedesc>
             | datatype <datadesc>
             | exception <idfseq>
             | structure <idf> : <sigexp>
             | include <idf>
             | <spec> <spec>
<typedesc> ::= <idfseq>
             | <idfseq> = <idfseq>
<datadesc> ::= <idfseq> = <idfseq>
             | <datadesc> and <datadesc>
<idfseq>   ::= <idf>
             | <misc>
             | <idf> <idfseq>
             | <misc> <idfseq>

The terminal symbols you are requested to recognize are the keywords and punctuation symbols that appear literally in the grammar, that is, everything not enclosed in angle brackets. In addition to this, the lexical analysis should recognize and report identifiers (<idf>) and miscellaneous characters (<misc>), and handle (possibly nested) comments; we will come back to these points below. This grammar focuses on the identifiers that are declared in a signature and glosses over the remaining parts of their declaration. For example, when analysing a val declaration, we look for this keyword, then for an identifier followed by a colon, but we ignore the structure of the type that comes afterward: we interpret it simply as a sequence of identifiers and special characters.

In the SML'96 specification, identifiers are divided into alphanumeric and symbolic:

an alphanumeric identifier is any string of letters (a...zA...Z), digits (0...9), primes (') and underscores (_) that does not begin with a digit. In ML, only type variables can begin with a prime, but we will ignore this distinction.
a symbolic identifier is any string consisting of the following characters: !%&$#+-/:<=>?@\`^|*.

Alphanumeric identifiers are bounded by spaces, symbolic identifiers or special characters. Similarly, symbolic identifiers are bounded by spaces, alphanumeric identifiers and special characters.

A special character is any one of the following: ( ) [ ] { } . , "
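The three character classes above can be captured by small predicates. The following is only a sketch: the function names are hypothetical and not part of the LEXER signature, and we assume that the period belongs to the special characters rather than to the symbolic ones.

```sml
(* Hypothetical helpers; the names are illustrative only. *)

(* Characters that may occur in an alphanumeric identifier. *)
fun isAlnumIdChar c =
  Char.isAlpha c orelse Char.isDigit c orelse c = #"'" orelse c = #"_"

(* Characters that may occur in a symbolic identifier.
   Assumption: the period is NOT included here. *)
fun isSymbolicChar c = Char.contains "!%&$#+-/:<=>?@\\`^|*" c

(* Special characters, each reported individually as MISC. *)
fun isSpecialChar c = Char.contains "()[]{}.,\"" c
```

An alphanumeric identifier must additionally not start with a digit, which the lexer has to check on the first character only.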

A comment starts with (* and ends with *). Comments can be nested.
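One possible way to handle nesting is to carry the current nesting depth along while skipping characters. The sketch below works on a plain char list rather than on MStream.stream so that it is self-contained; the function skipComment and the exception Unterminated are hypothetical names (your lexer would raise Error instead).

```sml
exception Unterminated   (* hypothetical; a real lexer would raise Error *)

(* skipComment (depth, cs) is called just after a comment opener has
   been consumed, with depth >= 1. It returns the characters that
   follow the matching comment closer. *)
fun skipComment (0, cs) = cs
  | skipComment (d, #"(" :: #"*" :: cs) = skipComment (d + 1, cs)
  | skipComment (d, #"*" :: #")" :: cs) = skipComment (d - 1, cs)
  | skipComment (d, _ :: cs) = skipComment (d, cs)
  | skipComment (_, nil) = raise Unterminated

(* Usage: the outer comment was opened once, so we start at depth 1. *)
val rest =
  String.implode (skipComment (1, String.explode " a (* b *) c *) rest"))
```

The same idea carries over to streams by replacing the list patterns with the corresponding stream observations.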

You are requested to write a lexer that, given a stream of characters from a file containing an SML signature, produces a stream of tokens as described above. More precisely, you are requested to implement a functor Lexer () :> LEXER realizing the following signature (it can be found in the file lexer.sml):

signature LEXER =
sig
  datatype token =
    IDF of string           (* Identifiers *)
  | DOC of string * string  (* (* EXPRESSION ... MEANING ... *) *)
  | MISC of string          (* Misc characters *)
  | SIGNATURE               (* signature *)
  | EQUAL                   (* = *)
  | SIG                     (* sig *)
  | END                     (* end *)
  | SEMICOLON               (* ; *)
  | INCLUDE                 (* include *)
  | VAL                     (* val *)
  | COLON                   (* : *)
  | EXCEPTION               (* exception *)
  | TYPE                    (* type *)
  | DATATYPE                (* datatype *)
  | EQTYPE                  (* eqtype *)
  | STRUCTURE               (* structure *)
  | AND                     (* and *)

  exception Error of string

  val lex : char MStream.stream -> token MStream.stream
  val toString : token MStream.stream -> string
end; (* signature LEXER *)

The constructors of type token represent the input tokens displayed on their right. Both alphanumeric and symbolic identifiers are scanned to the token IDF, whose argument is the identifier itself. Ignore the token DOC unless you want to tackle Question 1.4. Special characters are returned as the argument of the token MISC.
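As an illustration of the mapping from scanned strings to tokens, the sketch below classifies a complete identifier once it has been read. The helper classify is a hypothetical name; the token datatype is repeated here without DOC only to keep the example self-contained. We assume that a symbolic identifier is scanned in full before classification, so that a string such as "=>" becomes an IDF rather than EQUAL followed by something else.

```sml
(* Copy of the token type from LEXER, with DOC left out for brevity. *)
datatype token =
    IDF of string | MISC of string
  | SIGNATURE | EQUAL | SIG | END | SEMICOLON | INCLUDE | VAL | COLON
  | EXCEPTION | TYPE | DATATYPE | EQTYPE | STRUCTURE | AND

(* classify s maps a fully scanned identifier to its reserved token,
   or to IDF s if it is not reserved. Hypothetical helper. *)
fun classify "signature" = SIGNATURE
  | classify "sig"       = SIG
  | classify "end"       = END
  | classify "include"   = INCLUDE
  | classify "val"       = VAL
  | classify "exception" = EXCEPTION
  | classify "type"      = TYPE
  | classify "datatype"  = DATATYPE
  | classify "eqtype"    = EQTYPE
  | classify "structure" = STRUCTURE
  | classify "and"       = AND
  | classify "="         = EQUAL
  | classify ":"         = COLON
  | classify s           = IDF s
```

The semicolon is neither an identifier character nor a special character, so it would be handled separately by the scanner.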

The exception Error should be raised for either of the following two reasons: a character not mentioned above has been encountered, or comments are not properly nested (in particular, if the input stream ends while reading a comment).

The function lex performs lexical analysis as described above. The file parser.lex describes the stream of lexical tokens expected from running lex on the parser you are going to use (parser.sml). A smaller example is contained in example.lex.

The function toString rewrites a stream of tokens, one token per line, as a string. It is a great debugging tool.

You can find implementations of streams in the file stream.sml and of the functions to transform a file into a stream of characters in the file mstream_io.sml.

Question 1.2: Output Generation (20 pts)

A parser for a superset of the grammar described in Question 1.1 can be found in the file parser.sml. You are not requested to implement it! Once you have instantiated the functor Parser (structure Lexer : LEXER) :> PARSER implemented in it, say as the structure Parser, you will be able to use the function Parser.parse. When given a stream of tokens produced by your lexer from Question 1.1, this function constructs a parse tree (of type Parser.sigs) and the associated documentation (of type Parser.documentation). Ignore the documentation unless you go through Question 1.4. An error message is produced if the input stream of tokens is not allowed according to the above grammar.

The type of parse trees is given as follows:

datatype sigs = Sigs of sigdecl list
and sigdecl = Sigdecl of string * sigexp              (* one element *)
and sigexp = SigexpSpec of spec
           | SigexpIdf of idfseq
and spec = SpecVal of string * idfseq
         | SpecType of typedesc
         | SpecEType of typedesc
         | SpecDType of datadesc
         | SpecEx of idfseq
         | SpecStr of string * sigexp
         | SpecIncl of string
         | SpecSpec of spec * spec
and typedesc = Typedesc of idfseq * idfseq option
and datadesc = Datadesc of (idfseq * idfseq) list     (* non empty *)
and idfseq = Idfseq of string list                    (* non empty *)

Each datatype corresponds to the non-terminal symbol with the same name in the grammar in Question 1.1. Each constructor corresponds to a grammatical production. Constructors and productions are given in the same order. There are the following exceptions to this rule:

Sigs takes a list of sigdecl's as an argument. The parser you are going to use can handle files containing several signatures. In your case, this list will always contain exactly one item.
The two productions for <typedesc> have been combined in the constructor Typedesc of type typedesc by making the second argument optional.
The iteration in <datadesc> is modeled by using lists. Such a list cannot be empty, and its elements are in the same order as they appear in the input token stream of Parser.parse.
Similarly, sequences of identifiers and miscellaneous characters are packaged in a list. Again, it cannot be empty and its elements are in the same order as they appear in the input token stream of Parser.parse.
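To make the correspondence concrete, here is a sketch of the parse tree one would expect for the small signature "signature S = sig type 'a t val size : 'a t -> int end". The datatype is repeated so that the example stands alone; the exact strings stored in each Idfseq are an assumption about how the provided parser packages identifiers and miscellaneous characters.

```sml
datatype sigs = Sigs of sigdecl list
and sigdecl = Sigdecl of string * sigexp
and sigexp = SigexpSpec of spec
           | SigexpIdf of idfseq
and spec = SpecVal of string * idfseq
         | SpecType of typedesc
         | SpecEType of typedesc
         | SpecDType of datadesc
         | SpecEx of idfseq
         | SpecStr of string * sigexp
         | SpecIncl of string
         | SpecSpec of spec * spec
and typedesc = Typedesc of idfseq * idfseq option
and datadesc = Datadesc of (idfseq * idfseq) list
and idfseq = Idfseq of string list

(* Assumed tree for:
   signature S = sig type 'a t  val size : 'a t -> int end *)
val example =
  Sigs [Sigdecl ("S",
    SigexpSpec
      (SpecSpec
         (SpecType (Typedesc (Idfseq ["'a", "t"], NONE)),
          SpecVal ("size", Idfseq ["'a", "t", "->", "int"]))))]
```

Note that the type description carries NONE as its second component because the declaration has no defining equation.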

Given a parse tree for an ML signature, you are requested to generate an HTML file reporting this signature according to a specific format described below and with documentation slots for the various declarations. An example of the expected output can be found in the file parser.html.

The resulting HTML document should be divided into three parts separated by horizontal lines (HTML tag <HR>): some header (something similar to or better than what is provided in the example), the formatted signature, and a documentation area.

The minimum formatting we require is as follows.

Signature-specific reserved words (sig, end, val, type, eqtype, datatype, and, exception, structure and include) should be highlighted. Use the HTML tags <B> ... </B> for this purpose.
Distinct declarations should appear on different lines (use the <P> tag for this purpose).
The identifiers being declared should have a link to an anchor with the same name in the documentation part of the HTML code. Local anchors are created by means of the <A NAME="...">...</A> tags. They are referenced through the <A HREF="#...">...</A> tags. Notice that in a declaration such as datatype 'a tree = ..., only tree should be highlighted.
Information slots should be items in a definition list (tags <DL>...</DL>). The key (tag <DT>) should be the identifier being documented, and the definition (tag <DD>) should contain an invitation to fill in the documentation.
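The formatting requirements above boil down to a handful of string-building helpers. The following sketch shows one way to write them; all function names are hypothetical, the wording of the invitation is made up, and no escaping of characters such as < or & is attempted (symbolic identifiers would need it).

```sml
(* Hypothetical helpers for the HTML fragments described above. *)
fun bold s = "<B>" ^ s ^ "</B>"

(* Link from the signature part to the documentation slot. *)
fun linkTo name = "<A HREF=\"#" ^ name ^ "\">" ^ name ^ "</A>"

(* Anchor placed in the documentation part. *)
fun anchor name = "<A NAME=\"" ^ name ^ "\">" ^ name ^ "</A>"

(* One entry of the definition list; the invitation text is arbitrary. *)
fun slot name =
  "<DT>" ^ anchor name ^ "\n<DD>(please document " ^ name ^ " here)\n"

(* One line of the formatted signature for a val specification. *)
fun valLine (name, ty) =
  "<P>" ^ bold "val" ^ " " ^ linkTo name ^ " : "
  ^ String.concatWith " " ty ^ "\n"

val line = valLine ("fact", ["int", "->", "int"])
```

The slots produced by slot would be wrapped in a single <DL>...</DL> pair in the documentation area.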

Any additional formatting is clearly welcome!

You are requested to implement the functor Ml2Html (structure Parser : PARSER) :> ML2HTML that realizes the signature ML2HTML below (you can find it in the file ml2html.sml).

signature ML2HTML =
sig
  structure Parser : PARSER		(* parameter *)

  exception Error of string

  val ml2html : Parser.sigs * Parser.documentation -> string
end;  (* signature ML2HTML *)

Given a structure Ml2Html realizing this signature, the function Ml2Html.ml2html generates HTML code as described above. As already said, you should not be concerned with the second argument of this function unless you tackle Question 1.4.

The exception Error should be raised if the input parse tree violates the previously stated constraints (for example if a list of identifiers is empty).

Question 1.3: Putting it together (10 pts)

We are almost there. You are now requested to write a functor Top (structure MStreamIO : MSTREAM_IO) :> TOP implementing the signature TOP below. This signature can be found in the file top.sml.

signature TOP =
sig
  structure Parser : PARSER

  val document : string -> unit
end;  (* signature TOP *)

Once realized, the function document should take as input a file name ending with the extension .sig or .sml, scan it, parse it, and produce an HTML file according to the specifications in Question 1.2. The name of the output file should be identical to the name of the input file, with the extension changed to .html. An error message should be printed if any error occurs (the input file cannot be opened, or the lexer, the parser, or the output generator raises an error). Remember to close all your files in case something goes wrong.
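Two pieces of this task are independent of the lexer and parser: computing the output file name and making sure the output file is closed even when something fails. The sketch below uses OS.Path and TextIO from the Standard ML Basis Library; the names htmlName and writeFile are hypothetical, and raising Fail for an unexpected extension is just one possible choice.

```sml
(* htmlName file replaces a .sig or .sml extension with .html. *)
fun htmlName file =
  let
    val {base, ext} = OS.Path.splitBaseExt file
  in
    if ext = SOME "sig" orelse ext = SOME "sml"
    then OS.Path.joinBaseExt {base = base, ext = SOME "html"}
    else raise Fail ("unexpected extension: " ^ file)
  end

(* writeFile closes the output stream whether or not writing succeeds. *)
fun writeFile (name, contents) =
  let
    val out = TextIO.openOut name
  in
    (TextIO.output (out, contents); TextIO.closeOut out)
    handle e => (TextIO.closeOut out; raise e)
  end
```

The function document would then wrap the whole pipeline in a handler that prints a message instead of letting the exception escape.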

Question 1.4 (Extra-credit): Documentation extraction (20 pts)

Augment the programs for the previous questions to handle special documentation comments in the input file. The form of these comments is

(* EXPRESSION key MEANING meaning *)

The lexer should recognize these comments and produce tokens of the form DOC(key,meaning). Assume these comments will not be nested in other comments.

The parser will take these tokens into account and return them together with the generated parse tree as a list of pairs (key,meaning) (of type documentation).

The output generation function accepts documentation in this format and fills the appropriate slots of the output HTML code with it. For example, assume that your input signature declares the value fact and that the documentation list contains an item of the form ("fact n", "computes the factorial of n"). Then the key corresponding to fact should be set to the first component of this pair ("fact n") rather than simply to "fact", and the definition part should be set to the second component of the pair.
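The text does not say how a documentation item is matched to a declared identifier. One plausible rule, used in the sketch below purely as an assumption, is to compare the identifier with the first whitespace-delimited word of the key. The names headWord and lookupDoc are hypothetical, and documentation is taken to be a list of string pairs as described above.

```sml
type documentation = (string * string) list

(* First whitespace-delimited word of a key, if any. *)
fun headWord key =
  case String.tokens Char.isSpace key of
    w :: _ => SOME w
  | nil => NONE

(* lookupDoc (idf, docs) returns the first item whose key starts
   with the word idf. The matching rule is an assumption. *)
fun lookupDoc (idf, docs : documentation) =
  List.find (fn (key, _) => headWord key = SOME idf) docs

val docs = [("fact n", "computes the factorial of n")]
val found = lookupDoc ("fact", docs)
```

When lookupDoc returns NONE, the output generator would fall back to the plain identifier and the usual invitation to fill in the slot.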

If you intend to answer this question, state it clearly at the beginning of all your modules.

Problem 2: Operator Precedence Parsing (40 pts)

In class we have seen how to build an operator precedence parser for a specific language. In this exercise we are concerned with how this can be generalized to provide a general module for operator precedence parsing. For this, we commit to a specific view of a parser as a stream transducer. You may also wish to consult Sections 9.2-9.4 in the textbook for some related code and explanations.

We think of a parser as transforming a stream of tokens into a stream of abstract syntax trees. This intuition can be captured by a function which reads some tokens from a stream, produces the first element of the output stream plus the remaining stream of tokens. Since a general module should be independent of the set of tokens or abstract syntax trees, we define

type ('a, 'b) parser = 'a MStream.stream -> ('b * 'a MStream.stream)

where we think of 'a as the type of tokens, and 'b as the type of abstract syntax trees.
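To get a feeling for this type, here is a tiny parser in the same style. It is only a sketch: plain lists stand in for MStream.stream so that the code is self-contained, and the parser number is a hypothetical example that reads a decimal numeral from a list of characters.

```sml
(* Lists replace MStream.stream in this sketch. *)
type ('a, 'b) parser = 'a list -> 'b * 'a list

(* number consumes the leading digits and returns their value
   together with the unread remainder of the input. *)
val number : (char, int) parser =
  fn cs =>
    let
      fun go (acc, c :: rest) =
            if Char.isDigit c
            then go (10 * acc + (Char.ord c - Char.ord #"0"), rest)
            else (acc, c :: rest)
        | go (acc, nil) = (acc, nil)
    in
      case cs of
        c :: _ => if Char.isDigit c then go (0, cs)
                  else raise Fail "digit expected"
      | nil => raise Fail "unexpected end of input"
    end

val (n, remaining) = number (String.explode "42+1")
```

Applying such a function repeatedly to the remainder is what turns a stream of tokens into a stream of results.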

The task of a precedence parser is to transform a stream of operators and operands and build the correct abstract syntax tree, according to the precedence of the operators and grouping constructs such as parentheses. For infix operators, we also need an associativity which determines if a @ b @ c is parsed as (a @ b) @ c (left associative) or a @ (b @ c) (right associative), if @ is an infix operator. If we think of an operand as simply an operator without arguments, we obtain the following definitions.

type prec = int                         (* Precedence *)
datatype assoc = Left | Right | None    (* Associativity *)

datatype 'b operator =
  Infix of prec * assoc * ('b * 'b -> 'b)  (* binary operator *)
| Prefix of prec * ('b -> 'b)              (* unary prefix operator *)
| Postfix of prec * ('b -> 'b)             (* unary postfix operator *)
| Atom of 'b                               (* nullary operator = operand *)
| LeftDelimiter                            (* left delimiter, often "(" *)
| RightDelimiter                           (* right delimiter, often ")" *)
| Terminator                               (* end of 'b operators *)

We see that the meaning of each operator or operand is supplied as a function on abstract syntax trees of appropriate arity. The precedence parser now has type

val precParse : ('a, 'b operator) parser -> ('a, 'b) parser

that is, given a parser for operators it returns a parser for abstract syntax trees.
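As an illustration of how operators carry their own meaning, the sketch below maps a few tokens to operators, choosing int for 'b so that the "abstract syntax tree" is simply the value of the expression. The token type tok, the function toOperator, and the reading of the postfix operator as factorial are all assumptions made for this example only.

```sml
type prec = int
datatype assoc = Left | Right | None

datatype 'b operator =
    Infix of prec * assoc * ('b * 'b -> 'b)
  | Prefix of prec * ('b -> 'b)
  | Postfix of prec * ('b -> 'b)
  | Atom of 'b
  | LeftDelimiter
  | RightDelimiter
  | Terminator

(* Hypothetical token type for a small arithmetic language. *)
datatype tok = PLUS | STAR | EXCL | INTEGER of int
             | LPAREN | RPAREN | SEMICOLON

fun fact n = if n <= 1 then 1 else n * fact (n - 1)

(* Each token becomes an operator whose function computes on ints. *)
fun toOperator PLUS        = Infix (4, Left, op +)
  | toOperator STAR        = Infix (5, Left, op * )
  | toOperator EXCL        = Postfix (7, fact)
  | toOperator (INTEGER n) = Atom n
  | toOperator LPAREN      = LeftDelimiter
  | toOperator RPAREN      = RightDelimiter
  | toOperator SEMICOLON   = Terminator
```

Choosing a datatype of expressions for 'b instead of int yields a parser that builds syntax trees with exactly the same precedence machinery.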

Question 2.1: Parser generation (25 pts)

Implement a structure PrecParser :> PREC_PARSER (see file precedence.sml) for operator precedence parsing. It should take into account operator precedence and associativity and generate errors for illegal or ambiguous expressions. Expressions are ambiguous if neither precedence nor associativity clarifies the meaning of an expression. Make sure to cover ALL cases of error and ambiguity. See also the examples below.
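The core decision in an operator precedence parser is what to do when an infix operator is already waiting and another one arrives. The sketch below shows one common convention for two infix operators; the names action and resolve are hypothetical, and treating mixed associativities at equal precedence as ambiguous is an assumption, not a requirement stated above. Prefix and postfix operators need their own cases.

```sml
type prec = int
datatype assoc = Left | Right | None
datatype action = Reduce | Shift | Ambiguous

(* resolve ((p1, a1), (p2, a2)): an infix operator with precedence p1
   and associativity a1 is pending, and one with (p2, a2) comes next.
   Reduce applies the pending operator first; Shift postpones it. *)
fun resolve ((p1 : prec, a1), (p2 : prec, a2)) =
  if p1 > p2 then Reduce
  else if p1 < p2 then Shift
  else
    case (a1, a2) of
      (Left, Left)   => Reduce
    | (Right, Right) => Shift
    | _              => Ambiguous
```

For example, with two left-associative operators of equal precedence the pending one is reduced first, which yields (a @ b) @ c, while two non-associative operators of equal precedence are reported as ambiguous.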

Question 2.2: Scanning an example (5 pts)

Consider the language

                     Precedence   Associativity    Token
e ::= e1 & e2            1            right         AMPERSAND
    | e1 | e2            1            right         BAR
    | ~ e                2                          TILDE
    | e1 = e2            3            none          EQUAL
    | e1 < e2            3            none          LESS
    | e1 + e2            4            left          PLUS
    | e1 - e2            4            left          MINUS
    | e1 * e2            5            left          STAR
    | e1 / e2            5            left          SLASH
    | e1 ^ e2            6            right         UPARROW
    | # e                7                          HASH
    | e !                7                          EXCL
    | e %                7                          PERCENT
    | <integer>                                     INTEGER(n)
    | ( e )                                         LPAREN RPAREN

exp ::= e ;                                         SEMICOLON

Modify the lexer code from class (see the file /afs/andrew/scs/cs/15-212-X/code/lecture21.sml) for this revised language and set of tokens.

Question 2.3: Parsing the example (10 pts)

Use the PrecParser implementation to obtain a parser for this language with the indicated precedence and associativity. This requires you to write an operator parser first. There are no specific requirements regarding the error messages, but ambiguity should be reported with a different message from other kinds of errors. Note that MINUS is no longer overloaded as it was in the code for Lecture 21.

Some examples:

~ 3!! < 5^3^2/6 | ~~(#3 = 4 & 3/4*4!); => ((~(((3!)!) < ((5^(3^2))/6))) | (~(~(((#3) = 4) & ((3/4)*(4!))))))
3 = 4 = 5; => error (ambiguous, = is non-associative infix)
#3!; => error (ambiguous, could be (#3)! or #(3!), since the precedences of # and ! are equal)

Hand-in instructions

Put your SML code in your handin directory, which is

/afs/andrew/scs/cs/15-212-X/studentdir/<your andrew id>/ass6,