Programming Assignment: SL Parser

15-212, Spring 1998


This assignment is due before ?? on ??. This assignment is arguably the most intricate (but also the most interesting) of the three SL assignments. Get started early!

Follow the instruction on the course web page for submitting your solution. Submit one solution per group. All your source files should be in the same directory. Do not use subdirectories. To submit your solution, you should copy just your source files to the submission directory. In particular, do not carry out your program development in the submission area.

Points will be given for good documentation and style. See the coding style document on the course web page. For this assignment, there'll be a 80-20 split for correctness and style, respectively, and there will be another 40 points allocated for the written answers from part 1, for a total of 140 points.
See the course web page on collaboration and late submission policies. Make sure your working directory is private and cannot be read by others!

Overview of the Assignment:

You already know that given a context-free grammar G and a string T, a parser can do the following: Your job will be to construct a parser to do both of the above tasks. Your program will take as its input a program in SL; if the program is syntactically valid, your parser will produce a parse tree for it. We will grade your program by feeding it certain SL programs--some valid, some not--and check that your program outputs a valid parse for syntactically valid programs, and a meaningful error message for invalid programs.

The Grammar for SL:

Let us first examine the context-free grammar (SIGMA, V, S, R) for SL:

SIGMA = { PROGRAM, FUNCTION, PARAMLIST, PARAMREST, BLOCK, STATEMENT, BREAK, CALL, ARGLIST, ARGREST, IF, ELSE, LET, READ, RETURN, WHILE, WRITE, EXPR, BINOP, UNOP, number, identifier, function, break, call, if, else, let, read, return, while, write, (, ), {, }, ,, ;, =, +, -, *, /, %, <, >, <=, >=, ==, !=, &, |, ~, !}

V = { number, identifier, function, break, call, if, else, let, read, return, while, write, (, ), {, }, ,, ;, =, +, -, *, /, %, <, >, <=, >=, ==, !=, &, |, ~, !}

S = PROGRAM

R is given by:

As always, e is epsilon, the empty string. Also, note that an SL program is a sequence of function declarations followed by the main body.

This description describes the syntax (how to write valid programs) of SL, but not the semantics (what a syntactically valid program actually does). We'll get to the semantics in the next assignment, but drawing on your knowledge of Java, the semantics of SL should be fairly obvious just from the grammar.

Your first task is to answer the following problems based on the grammar above. Put your answers in a text file called proj3-grammar.txt . Just one write-up per group is needed. See the note at the very end of this document concerning the header information you need at the top of this file.

  1. Write the grammatically correct SL expressions for the following Java arithmetic expressions (watch your operator precedence for some of these!), and also give their final values. You may want to run it through Java to double-check. Assume that a==1, b==2, c==3, d==4, e==0.

  2. Mr. Bar is confused about the difference between Java grammar and SL grammar. Locate everything that is wrong with his not-quite-SL program below, and for each: 1) indicate which SL grammar rule(s) it disobeys; and 2) rewrite it correctly in SL.

    Extra credit (1 pt): What is the final output of this program?

  3. (Note: This question is actually a parser question, not a grammar question, so you may want to wait until you're writing the parser to answer this one.)

Setting Up the Parser

You will implement a top-down parser for this assignment. First, you must create a Parser class, the main class for this assignment: This is the primary class for this assignment. When you create a Parser object, you pass it a stream that it needs to parse. The parser should create for itself a lexer so that it can chop up the stream into tokens. When parse() is called, the parser actually performs the parsing of the stream, then holds onto the root of the resulting parse tree in tree.

print() simply prints out the parse tree in a readable text format. The tree should be printed out using pre-order, depth-first traversal. (The criterion for readability is that someone who is familiar with this project -- e.g., a TA -- can stare at the output and deduce what your printout means without having to ask you for an explanation.) This method will also, without a doubt, be your friend in debugging.

So far so good. The only tricky part seems to be parse(). Before we get into that, however, we need to look at this mysterious class called ParseNode. Notice that tree is of that type, so perhaps you have already guessed that ParseNode is what the parse tree is made of; they are in fact the nodes of the parse tree, each containing pointers to children nodes. Let's look at it a bit more closely:

ParseNode is an abstract class, so you can't instantiate it. What you will do is create many subclasses of ParseNode (more details below). Each node maintains a list of pointers to its children. numChildren() returns the number of children that this node has. getChild(i) returns the ith child of this node. addChild(node) appends node to the list of children of this node. getType() returns the type of the node, as defined by one of the constants in GrammarTypes (again, see below for details). print() prints out the contents of the node in a readable format; this will be useful for debugging.

You'll need some data structure to store an ordered list of children. Once you have this, you can implement the necessary subclasses of ParseNode. There is a subclass of each node type defined in GrammarTypes, for a total of 15 subclasses (each in a separate file, of course):

The getType() method for each of these will be simple. For example:

So writing these classes will take a lot of cutting and pasting, but it'll be pretty simple. Here are a few twists, though, to keep things interesting:

Now there's one last issue to take care of: what do we do with children? For each of the 15 subclasses, you'll need to come up with a layout of what its children are. For example, the children of SL_WhileNode are (in order): SL_ExprNode, followed by SL_BlockNode. Why? Because a while statement is made up of two things in this order: first, a conditional expression, followed by a body to be executed. The first child of SL_WhileNode contains a conditional expression, while the second child of SL_WhileNode contains a block of statements.

Another example is SL_FunctionNode, which has 1+n+1 children. The first is of type SL_IdentNode and represents the name of the function; next are n SL_IdentNodes, each naming the parameters for the function, in order; finally, the last child is of type SL_BlockNode, and it represents the body of the function. You should work out such schematics for the 13 remaining node types.

From Grammar to Code

And now, we get to the heart of the project, the Parse.parse() method. To construct the parser, you should basically write a bunch of mutually recursive routines that look like the grammar. Say, just as an example, that you're interested in parsing a for statement with this grammar (which is not legal SL): Then you'd write something like:

You will have a bunch of private methods that look like the above. Some will match terminal tokens, which you can check by calling getToken() on your lexer and doing a comparison; others will match nonterminals, which translate into more recursive calls on the appropriate functions.

As written above, this parser won't be very useful. Why? Because it doesn't construct the parse tree as it goes along. So at the end of the parsing, the parser will be able to tell you whether your program is syntactically well-formed, but it won't be able to tell you which grammar rules it used to parse it. This is exactly like the CYK algorithm: it's a language recognizer, and nothing more. In this case, you'll want to build up a tree-like data structure (rooted at the Parser's tree object) as you parse the input.

You'll notice as you're constructing your parse tree that GrammarTypes doesn't define a node type for every possible nonterminal and terminal in the grammar. This is because not all of them are really needed. For example, STATEMENT nodes are not needed, because we can separately instantiate each of the possible types of statements: BREAK, CALL, IF, etc. You should be able to implement SL quite handily with just the 15 types of nodes defined above.

One last note: consider a case where you have an arbitrarily long list of something, like an argument list or a parameter declaration list. Here's an example using a hypothetical grammar for a list of numbers:

Do you see why this fails? The problem is in calling lex.getToken() to test to see if the next token is the list termination character, '$'. If it is, we're happy, because we can break out of the while loop. But what if it isn't? That means that there's at least one more number in the list, and as a matter of fact, we just read it when we called lex.getToken(); because that gave me the next number in the list, it obviously didn't match '$'. But now we go to parse the number, except we've lost it, because it was read for testing and is no longer in the stream!

The problem here is that we cannot detect the end of an arbitrarily long list of elements unless we can "peek" ahead one token. You've already seen this happen in your lexer, where you had to peek ahead one character to distinguish between the ">" and ">=" operators. Now you must look ahead by a single token, instead of a single character; but the idea is still the same. This is why we asked you to implement the pushBack() method in the lexer.

Loose Ends

One issue of some concern is that of error detection. If you have a syntactically flawed program, you should output a message that includes the line number and actual line text, and as useful a diagnosis as possible. You should use exceptions wisely for this purpose (see below for more details on exceptions). Of course, a lexically flawed program should behave as before: the lexer should take care of lexical errors. You should halt execution after the first lexical or syntactic error. As always, you must not throw any exceptions at the main() level.

As with the previous project, you will need to write a front-end main class to be able to test drive your parser. Call it Proj3.java. It should read in an SL source code file from standard input, parse it, and print out the final parse tree using pre-order, depth-first traversal. Nonexistent or inscrutable print output will be penalized.

Your final tally of files to hand in:

These files should be in the same submission directory. Do not use subdirectories.

Don't forget that all your files must begin with a header that contains:


Sample SL programs and parse trees.
Parse tree format.

Keep an eye on the class bboard for clarifications and updates.
END OF ASSIGNMENT