15-212-X : Homework Assignment 6

Due Tue Nov 26, Noon (electronically).
No late assignments will be accepted!

Maximum Points: 100 [+20 extra credit]


Problem 1: Extraction of Documentation from ML Signatures (60 pts)

ML signatures provide typing information for the objects exported by structures. They are also an ideal place to store informal specifications that should be fulfilled by the implementation of these objects: invariants, intended meaning, operational behavior, etc. are all useful documentation that can be conveniently written as comments in the signature (or elsewhere). However, a signature quickly becomes hard to read if overcrowded with comments.

In this part of the assignment, you will write a program that reads a file containing an ML signature, strips all the comments, and writes an HTML file containing slots to be filled in with documentation for the various specifications.

Question 1.1: Lexical Analysis (30 pts)

We will be interested in scanning SML files containing one signature and nothing else. For simplicity, we will consider a slightly restricted form of the grammar of ML signatures. Although simplified, it is rich enough to handle most signatures you are likely to write, including the code for this assignment. The grammar is as follows:

The terminal symbols you are requested to recognize are highlighted. In addition, the lexical analysis should recognize and report identifiers (<idf>) and miscellaneous characters (<misc>), and handle (possibly nested) comments; we will come back to these points below.

This grammar focuses on the identifiers that are declared in a signature and glosses over the remaining parts of their declaration. For example, when analysing a val declaration, we look for this keyword, then for an identifier followed by a colon, but we ignore the structure of the type that comes afterward: we interpret it simply as a sequence of identifiers and special characters.

In the SML'96 specification, identifiers are divided into alphanumeric and symbolic:

Alphanumeric identifiers are bounded by spaces, symbolic identifiers, or special characters. Similarly, symbolic identifiers are bounded by spaces, alphanumeric identifiers, or special characters.

A special character is any of ()[]{}.,".
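The character classes above can be sketched as a small classifier. This is only an illustration: the symbolic characters listed here are those of the Definition of Standard ML, and the assignment's restricted grammar may use a different set; the names class and classify are hypothetical and not part of lexer.sml.

```sml
(* Hypothetical helper, not part of the LEXER signature. *)
datatype class = Alpha | Symbolic | Special | Space | Other

(* Assumption: symbolic characters follow the Definition of Standard ML;
   special characters are the ones listed in this handout. *)
fun classify c =
  if Char.isAlphaNum c orelse c = #"'" orelse c = #"_" then Alpha
  else if Char.contains "!%&$#+-/:<=>?@\\~`^|*" c then Symbolic
  else if Char.contains "()[]{}.,\"" c then Special
  else if Char.isSpace c then Space
  else Other
```

A lexer built this way would keep reading characters of the same class to form an identifier, and would raise Error on any character classified as Other.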

A comment starts with (* and ends with *). Comments can be nested.
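Nested comments are usually handled by counting how many comments are currently open. The following is a minimal sketch of that idea, assuming the input is a plain char list rather than the stream type from stream.sml; skipComment and Unterminated are hypothetical names.

```sml
(* Hypothetical names; a real lexer would raise the Error exception
   of the LEXER signature instead. *)
exception Unterminated

(* skipComment (d, cs): cs starts just after a comment opener and
   d is the number of comments currently open. Returns the input
   that follows the matching closer. *)
fun skipComment (0, cs) = cs
  | skipComment (_, []) = raise Unterminated
  | skipComment (d, #"(" :: #"*" :: cs) = skipComment (d + 1, cs)
  | skipComment (d, #"*" :: #")" :: cs) = skipComment (d - 1, cs)
  | skipComment (d, _ :: cs) = skipComment (d, cs)
```

The lexer would call skipComment (1, rest) after seeing an opener; reaching the end of the input with a positive depth corresponds to the improperly nested case.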

You are requested to write a lexer that, given a stream of characters from a file containing an SML signature, produces a stream of tokens as described above. More precisely, you are requested to implement a functor Lexer () :> LEXER realizing the following signature (it can be found in the file lexer.sml):

The constructors of type token represent the input tokens displayed on their right. Both alphanumeric and symbolic identifiers are scanned to the token IDF, whose argument is the identifier itself. Ignore the token DOC unless you want to tackle Question 1.4. Special characters are returned as the argument of the token MISC.

The exception Error should be raised for either of two reasons: a character not mentioned above has been encountered, or comments are not properly nested (in particular, if the input stream ends while reading a comment).

The function lex performs lexical analysis as described above. The file parser.lex shows the stream of lexical tokens that lex is expected to produce when run on the parser you are going to use (parser.sml). A smaller example is contained in example.lex.

The function toString rewrites a stream of tokens, one token per line, as a string. It is a great debugging tool.

You can find implementations of streams in the file stream.sml and of the functions to transform a file into a stream of characters in the file mstream_io.sml.

Question 1.2: Output Generation (20 pts)

A parser for a superset of the grammar described in Question 1.1 can be found in the file parser.sml. You are not requested to implement it! Once you have instantiated the functor Parser (structure Lexer : LEXER) :> PARSER defined in it, say as the structure Parser, you will be able to use the function Parser.parse. When given a stream of tokens produced by your lexer from Question 1.1, this function constructs a parse tree (of type Parser.Sigs) and the associated documentation (of type Parser.documentation). Ignore the documentation unless you attempt Question 1.4. An error message is produced if the input stream of tokens does not conform to the above grammar.

The type of parse trees is given as follows:

Each datatype corresponds to the non-terminal symbol with the same name in the grammar in Question 1.1. Each constructor corresponds to a grammatical production. Constructors and productions are given in the same order. There are the following exceptions to this rule:

Given a parse tree for an ML signature, you are requested to generate an HTML file reporting this signature according to a specific format described below and with documentation slots for the various declarations. An example of the expected output can be found in the file parser.html.

The resulting HTML document should be divided into three parts separated by horizontal lines (HTML tag <HR>): some header (something similar to or better than what is provided in the example), the formatted signature, and a documentation area.

The minimum formatting we require is as follows.

Any additional formatting is clearly welcome!

You are requested to implement the functor Ml2Html (structure Parser : PARSER) :> ML2HTML that realizes the signature ML2HTML below (you can find it in the file ml2html.sml).

Given a structure Ml2Html realizing this signature, the function Ml2Html.ml2html generates HTML code as described above. As already said, you should not be concerned with the second argument of this function unless you tackle Question 1.4.

The exception Error should be raised if the input parse tree violates the previously stated constraints (for example if a list of identifiers is empty).

Question 1.3: Putting it together (10 pts)

We are almost there. You are now requested to write a functor Top (structure MStreamIO : MSTREAM_IO) :> TOP implementing the signature TOP below. This signature can be found in the file top.sml. Once realized, the function document should take as input a file name ending with the extension .sig or .sml, scan it, parse it, and produce an HTML file according to the specifications in Question 1.2. The name of the output file should be identical to the name of the input file, with the extension changed to .html. An error message should be printed if any error occurs (the input file cannot be opened, or the lexer, the parser, or the output generator raises an error). Remember to close all your files in case something goes wrong.
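The output file name can be derived with the OS.Path structure of the Standard ML Basis Library. A minimal sketch follows; htmlName is a hypothetical helper, and returning NONE for other extensions is an assumption about how the error case might be signalled.

```sml
(* Hypothetical helper: maps "foo.sig" or "foo.sml" to SOME "foo.html",
   and anything else to NONE. *)
fun htmlName file =
  let
    val {base, ext} = OS.Path.splitBaseExt file
  in
    if ext = SOME "sig" orelse ext = SOME "sml"
    then SOME (OS.Path.joinBaseExt {base = base, ext = SOME "html"})
    else NONE
  end
```

For instance, htmlName "parser.sml" yields SOME "parser.html", while htmlName "notes.txt" yields NONE and could be reported as an error by document.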

Question 1.4 (Extra-credit): Documentation extraction (20 pts)

Augment the programs for the previous questions to handle special documentation comments in the input file. The form of these comments is:

The lexer should recognize these comments and produce tokens of the form DOC(key,meaning). Assume these comments will not be nested in other comments.

The parser will take these tokens into account and return them together with the generated parse tree as a list of pairs (key,meaning) (of type documentation).

The output generation function accepts documentation in this format and fills the appropriate slots of the output HTML code with it. For example, if your input signature declares the value fact and the documentation list contains an item of the form ("fact n", "computes the factorial of n"), then the key corresponding to fact should be set to the first component of this pair ("fact n") rather than simply to "fact", and the definition part should be set to the second component of the pair.
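The slot-filling rule in the fact example might be sketched as a lookup in the documentation list. How a key such as "fact n" is matched against a declared identifier is not spelled out here; the sketch below assumes the first whitespace-separated word of the key names the identifier. The function slot is hypothetical.

```sml
(* Hypothetical helper. Assumption: the first word of a key is the
   identifier it documents. Returns the (key, definition) pair to
   place in the HTML slot for the identifier name. *)
fun slot (docs : (string * string) list) name =
  let
    fun headWord key =
      case String.tokens Char.isSpace key of
        w :: _ => w
      | [] => ""
  in
    case List.find (fn (key, _) => headWord key = name) docs of
      SOME (key, meaning) => (key, meaning)
    | NONE => (name, "")
  end
```

With no matching entry the slot falls back to the bare identifier and an empty definition, which matches the behavior expected when Question 1.4 is not attempted.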

If you intend to answer this question, state it clearly at the beginning of all your modules.

Problem 2: Operator Precedence Parsing (40 pts)

In class we have seen how to build an operator precedence parser for a specific language. In this exercise we are concerned with how this can be generalized to provide a general module for operator precedence parsing. For this, we commit to a specific view of a parser as a stream transducer. You may also wish to consult Sections 9.2-9.4 in the textbook for some related code and explanations.

We think of a parser as transforming a stream of tokens into a stream of abstract syntax trees. This intuition can be captured by a function which reads some tokens from a stream and produces the first element of the output stream plus the remaining stream of tokens. Since a general module should be independent of the set of tokens or abstract syntax trees, we define

where we think of 'a as the type of tokens, and 'b as the type of abstract syntax trees.
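To make the transducer view concrete, here is a small sketch in which a plain list stands in for the stream type; the assignment's own definition presumably uses the streams from stream.sml, so the type below is an assumption made for illustration. The names pairSum and run are hypothetical.

```sml
(* Assumption: a list stands in for a stream of tokens. *)
type ('a, 'b) parser = 'a list -> 'b * 'a list

(* Consumes two integer tokens and produces their sum as the result,
   together with the tokens that remain. *)
val pairSum : (int, int) parser =
  fn (x :: y :: ts) => (x + y, ts)
   | _ => raise Fail "pairSum: need two tokens"

(* Applies a parser repeatedly until the input is exhausted. *)
fun run (p : ('a, 'b) parser) (ts : 'a list) : 'b list =
  case ts of
    [] => []
  | _ => let val (x, rest) = p ts in x :: run p rest end
```

Here run pairSum [1, 2, 3, 4] consumes the tokens two at a time, which is the sense in which a parser turns one stream into another.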

The task of a precedence parser is to transform a stream of operators and operands and build the correct abstract syntax tree, according to the precedence of the operators and grouping constructs such as parentheses. For infix operators, we also need an associativity which determines if a @ b @ c is parsed as (a @ b) @ c (left associative) or a @ (b @ c) (right associative), if @ is an infix operator. If we think of an operand as simply an operator without arguments, we obtain the following definitions.

We see that the meaning of each operator or operand is supplied as a function on abstract syntax trees of appropriate arity. The precedence parser now has type

that is, given a parser for operators it returns a parser for abstract syntax trees.

Question 2.1: Parser generation (25 pts)

Implement a structure PrecParser :> PREC_PARSER (see file precedence.sml) for operator precedence parsing. It should take into account operator precedence and associativity and generate errors for illegal or ambiguous expressions. Expressions are ambiguous if neither precedence nor associativity clarifies the meaning of an expression. Make sure to cover ALL cases of error and ambiguity. See also the examples below.
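The interplay of precedence, associativity, and ambiguity can be illustrated by the decision made when an infix operator arrives while another is already on the stack. This is a sketch under assumed names: the datatypes assoc and action and the function decide are hypothetical and need not match those in precedence.sml.

```sml
(* Hypothetical types, for illustration only. *)
datatype assoc = Left | Right | Non
datatype action = Reduce | Shift | Ambiguous

(* decide (p1, a1) (p2, a2): the operator on the stack has precedence
   p1 and associativity a1; the incoming operator has p2 and a2.
   Higher numbers bind tighter. *)
fun decide (p1, a1) (p2, a2) =
  if p1 > p2 then Reduce
  else if p1 < p2 then Shift
  else
    case (a1, a2) of
      (Left, Left) => Reduce
    | (Right, Right) => Shift
    | _ => Ambiguous
```

For a @ b @ c with @ left associative, decide returns Reduce and the result is (a @ b) @ c; with @ right associative it returns Shift and the result is a @ (b @ c). Equal precedence with mixed or missing associativity is the ambiguous case.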

Question 2.2: Scanning an example (5 pts)

Consider the language

Modify the lexer code from class (see the file /afs/andrew/scs/cs/15-212-X/code/lecture21.sml) for this revised language and set of tokens.

Question 2.3: Parsing the example (10 pts)

Use the PrecParser implementation to obtain a parser for this language with the indicated precedence and associativity. This requires you to write an operator parser first. There are no specific requirements regarding the error messages, but ambiguity should be reported with a different message from other kinds of errors. Note that MINUS is no longer overloaded as it was in the code for Lecture 21.

Some examples:

Hand-in instructions