Programming Assignment: SL Parser

15-212, Spring 1998

This assignment is due before ?? on ??. This assignment is arguably the most intricate (but also the most interesting) of the three SL assignments. Get started early!

Follow the instruction on the course web page for submitting your solution. Submit one solution per group. All your source files should be in the same directory. Do not use subdirectories. To submit your solution, you should copy just your source files to the submission directory. In particular, do not carry out your program development in the submission area.

Points will be given for good documentation and style. See the coding style document on the course web page. For this assignment, there'll be a 80-20 split for correctness and style, respectively, and there will be another 40 points allocated for the written answers from part 1, for a total of 140 points.

See the course web page on collaboration and late submission policies. Make sure your working directory is private and cannot be read by others!

Overview of the Assignment:

You already know that given a context-free grammar G and a string T, a parser can do the following:

Decide whether G can generate T (i.e. "T is in the grammar G").
Produce a parse tree of T, which shows how to rewrite the start symbol S in G to produce T.

Your job will be to construct a parser to do both of the above tasks. Your program will take as its input a program in SL; if the program is syntactically valid, your parser will produce a parse tree for it. We will grade your program by feeding it certain SL programs--some valid, some not--and check that your program outputs a valid parse for syntactically valid programs, and a meaningful error message for invalid programs.

The Grammar for SL:

Let us first examine the context-free grammar (SIGMA, V, S, R) for SL:

SIGMA = { PROGRAM, FUNCTION, PARAMLIST, PARAMREST, BLOCK, STATEMENT, BREAK, CALL, ARGLIST, ARGREST, IF, ELSE, LET, READ, RETURN, WHILE, WRITE, EXPR, BINOP, UNOP, number, identifier, function, break, call, if, else, let, read, return, while, write, (, ), {, }, ,, ;, =, +, -, *, /, %, <, >, <=, >=, ==, !=, &, |, ~, !}

V = { number, identifier, function, break, call, if, else, let, read, return, while, write, (, ), {, }, ,, ;, =, +, -, *, /, %, <, >, <=, >=, ==, !=, &, |, ~, !}

S = PROGRAM

R is given by:

PROGRAM -> FUNCTION* BLOCK
FUNCTION -> function identifier ( PARAMLIST ) BLOCK
PARAMLIST -> identifier PARAMREST | e
PARAMREST -> , identifier PARAMREST | e
BLOCK -> { STATEMENT* }
STATEMENT -> BREAK | CALL ; | IF | LET | READ | RETURN | WHILE | WRITE
BREAK -> break ;
CALL -> call identifier ( ARGLIST )
ARGLIST -> EXPR ARGREST | e
ARGREST -> , EXPR ARGREST | e
IF -> if EXPR BLOCK ELSE
ELSE -> else BLOCK | e
LET -> let identifier = EXPR ; | let identifier = CALL ;
READ -> read identifier ;
RETURN -> return EXPR ;
WHILE -> while EXPR BLOCK
WRITE -> write EXPR ;
EXPR -> number | identifier | ( EXPR ) | ( UNOP EXPR ) | ( BINOP EXPR EXPR )
BINOP -> + | - | * | / | % | & | | | < | > | <= | >= | == | !=
UNOP -> ~ | !

As always, e is epsilon, the empty string. Also, note that an SL program is a sequence of function declarations followed by the main body.

This description describes the syntax (how to write valid programs) of SL, but not the semantics (what a syntactically valid program actually does). We'll get to the semantics in the next assignment, but drawing on your knowledge of Java, the semantics of SL should be fairly obvious just from the grammar.

Your first task is to answer the following problems based on the grammar above. Put your answers in a text file called proj3-grammar.txt . Just one write-up per group is needed. See the note at the very end of this document concerning the header information you need at the top of this file.

Write the grammatically correct SL expressions for the following Java arithmetic expressions (watch your operator precedence for some of these!), and also give their final values. You may want to run it through Java to double-check. Assume that a==1, b==2, c==3, d==4, e==0.
- 1+2+3+4+5
- 1*(2+3)-4/5
- -1+2+-3+4+-5
- a && (!b || c) || !d && e
- (a==1 && b==-2 || c!=-3) && !d!=!-e
- a>=1 || b<=c || d>3 || e%5<4
Mr. Bar is confused about the difference between Java grammar and SL grammar. Locate everything that is wrong with his not-quite-SL program below, and for each: 1) indicate which SL grammar rule(s) it disobeys; and 2) rewrite it correctly in SL.
Extra credit (1 pt): What is the final output of this program?
(Note: This question is actually a parser question, not a grammar question, so you may want to wait until you're writing the parser to answer this one.)
- A) As we've noted before, a parser is really just a push-down automaton (DPDA). Briefly demonstrate this with regard to the top-down parser that you are now writing by answering these questions. In the parser:
  - What corresponds to the DPDA states?
  - What corresponds to the push-down store?
  - What corresponds to the input tape?
  - What is in each square of the input tape?
- B) The one lexeme lookahead feature that we use in our parser may strike you as being bogus, because in DPDAs, you can't push an already read input back into the input stream. Explain briefly why this is actually acceptable; i.e., explain how you would simulate this lookahead feature using a DPDA.

Setting Up the Parser

You will implement a top-down parser for this assignment. First, you must create a Parser class, the main class for this assignment:

public class Parser { public Parser(InputStream s) { /* to be filled in by you */ } public void parse() { /* t.b.f.i.b.y. */ } public void print() { /* t.b.f.i.b.y. */ } private ParseNode tree; /* add other stuff as you need them */ }

This is the primary class for this assignment. When you create a Parser object, you pass it a stream that it needs to parse. The parser should create for itself a lexer so that it can chop up the stream into tokens. When parse() is called, the parser actually performs the parsing of the stream, then holds onto the root of the resulting parse tree in tree.

print() simply prints out the parse tree in a readable text format. The tree should be printed out using pre-order, depth-first traversal. (The criterion for readability is that someone who is familiar with this project -- e.g., a TA -- can stare at the output and deduce what your printout means without having to ask you for an explanation.) This method will also, without a doubt, be your friend in debugging.

So far so good. The only tricky part seems to be parse(). Before we get into that, however, we need to look at this mysterious class called ParseNode. Notice that tree is of that type, so perhaps you have already guessed that ParseNode is what the parse tree is made of; they are in fact the nodes of the parse tree, each containing pointers to children nodes. Let's look at it a bit more closely:

public abstract class ParseNode implements GrammarTypes { public ParseNode() {...} public int numChildren() {...} public ParseNode getChild(int i) {...} public void addChild(ParseNode node) {...} public abstract int getType(); public abstract void print(PrintStream ps); /* define other stuff as you need them */ }

public interface GrammarTypes {

public final static int G_NOP = 0; public final static int G_NUMBER = 1; public final static int G_IDENT = 2; public final static int G_PROGRAM = 3; public final static int G_FUNCTION = 4; public final static int G_BLOCK = 5; public final static int G_BREAK = 6; public final static int G_CALL = 7; public final static int G_IF = 8; public final static int G_LET = 9; public final static int G_READ = 10; public final static int G_RETURN = 11; public final static int G_WHILE = 12; public final static int G_WRITE = 13; public final static int G_EXPR = 14; }

ParseNode is an abstract class, so you can't instantiate it. What you will do is create many subclasses of ParseNode (more details below). Each node maintains a list of pointers to its children. numChildren() returns the number of children that this node has. getChild(i) returns the ith child of this node. addChild(node) appends node to the list of children of this node. getType() returns the type of the node, as defined by one of the constants in GrammarTypes (again, see below for details). print() prints out the contents of the node in a readable format; this will be useful for debugging.

You'll need some data structure to store an ordered list of children. Once you have this, you can implement the necessary subclasses of ParseNode. There is a subclass of each node type defined in GrammarTypes, for a total of 15 subclasses (each in a separate file, of course):

public class SL_NopNode extends ParseNode {...}
public class SL_NumberNode extends ParseNode {...}
public class SL_IdentNode extends ParseNode {...}
public class SL_ProgramNode extends ParseNode {...}
public class SL_FunctionNode extends ParseNode {...}
public class SL_BlockNode extends ParseNode {...}
public class SL_BreakNode extends ParseNode {...}
public class SL_CallNode extends ParseNode {...}
public class SL_IfNode extends ParseNode {...}
public class SL_LetNode extends ParseNode {...}
public class SL_ReadNode extends ParseNode {...}
public class SL_ReturnNode extends ParseNode {...}
public class SL_WhileNode extends ParseNode {...}
public class SL_WriteNode extends ParseNode {...}
public class SL_ExprNode extends ParseNode {...}

The getType() method for each of these will be simple. For example:

public class SL_NopNode extends ParseNode {

public int getType() { return G_NOP; } }

So writing these classes will take a lot of cutting and pasting, but it'll be pretty simple. Here are a few twists, though, to keep things interesting:

To SL_NumberNode, add a constructor that allows you to set what number or value this node represents. Also add a getValue() method which returns the number. (Remember that the value is an arbitrary precision integer, which can be implemented using the BigInteger class in java.math.)
To SL_IdentNode, add a constructor that allows you to initialize the node with a string name. Also add a String getName() method.
To SL_FunctionNode, add several methods. int numArgs() should return the number of arguments this function accepts. String getArgName(i) should return the name of the ith parameter. ParseNode getBody() should return the body block of the function.
To SL_ExprNode, add a constructor that allows you to initialize the node with any operator type (both binary and unary allowed). Also add an int getOperator() method.

Now there's one last issue to take care of: what do we do with children? For each of the 15 subclasses, you'll need to come up with a layout of what its children are. For example, the children of SL_WhileNode are (in order): SL_ExprNode, followed by SL_BlockNode. Why? Because a while statement is made up of two things in this order: first, a conditional expression, followed by a body to be executed. The first child of SL_WhileNode contains a conditional expression, while the second child of SL_WhileNode contains a block of statements.

Another example is SL_FunctionNode, which has 1+n+1 children. The first is of type SL_IdentNode and represents the name of the function; next are n SL_IdentNodes, each naming the parameters for the function, in order; finally, the last child is of type SL_BlockNode, and it represents the body of the function. You should work out such schematics for the 13 remaining node types.

From Grammar to Code

And now, we get to the heart of the project, the Parse.parse() method. To construct the parser, you should basically write a bunch of mutually recursive routines that look like the grammar. Say, just as an example, that you're interested in parsing a for statement with this grammar (which is not legal SL):

FOR -> for IDENT = EXPR to EXPR do BLOCK Then you'd write something like:

parse_FOR() { match_for(); // match the keyword "for" parse_IDENT(); // match an identifier match_assn(); // match the assignment symbol parse_EXPR(); // match an expression match_to(); // match the keyword "to" parse_EXPR(); // match an expression match_do(); // match the keyword "do" parse_BLOCK(); // match a block }

You will have a bunch of private methods that look like the above. Some will match terminal tokens, which you can check by calling getToken() on your lexer and doing a comparison; others will match nonterminals, which translate into more recursive calls on the appropriate functions.

As written above, this parser won't be very useful. Why? Because it doesn't construct the parse tree as it goes along. So at the end of the parsing, the parser will be able to tell you whether your program is syntactically well-formed, but it won't be able to tell you which grammar rules it used to parse it. This is exactly like the CYK algorithm: it's a language recognizer, and nothing more. In this case, you'll want to build up a tree-like data structure (rooted at the Parser's tree object) as you parse the input.

You'll notice as you're constructing your parse tree that GrammarTypes doesn't define a node type for every possible nonterminal and terminal in the grammar. This is because not all of them are really needed. For example, STATEMENT nodes are not needed, because we can separately instantiate each of the possible types of statements: BREAK, CALL, IF, etc. You should be able to implement SL quite handily with just the 15 types of nodes defined above.

One last note: consider a case where you have an arbitrarily long list of something, like an argument list or a parameter declaration list. Here's an example using a hypothetical grammar for a list of numbers:

LIST = ELEMENT $

number

parse_LIST() { while (lex.getToken() != '$') { // This is a bug! parse_ELEMENT(); } } parse_ELEMENT() { match_number(); }

Do you see why this fails? The problem is in calling lex.getToken() to test to see if the next token is the list termination character, '$'. If it is, we're happy, because we can break out of the while loop. But what if it isn't? That means that there's at least one more number in the list, and as a matter of fact, we just read it when we called lex.getToken(); because that gave me the next number in the list, it obviously didn't match '$'. But now we go to parse the number, except we've lost it, because it was read for testing and is no longer in the stream!

The problem here is that we cannot detect the end of an arbitrarily long list of elements unless we can "peek" ahead one token. You've already seen this happen in your lexer, where you had to peek ahead one character to distinguish between the ">" and ">=" operators. Now you must look ahead by a single token, instead of a single character; but the idea is still the same. This is why we asked you to implement the pushBack() method in the lexer.

Loose Ends

One issue of some concern is that of error detection. If you have a syntactically flawed program, you should output a message that includes the line number and actual line text, and as useful a diagnosis as possible. You should use exceptions wisely for this purpose (see below for more details on exceptions). Of course, a lexically flawed program should behave as before: the lexer should take care of lexical errors. You should halt execution after the first lexical or syntactic error. As always, you must not throw any exceptions at the main() level.

As with the previous project, you will need to write a front-end main class to be able to test drive your parser. Call it Proj3.java. It should read in an SL source code file from standard input, parse it, and print out the final parse tree using pre-order, depth-first traversal. Nonexistent or inscrutable print output will be penalized.

Your final tally of files to hand in:

proj3-grammar.txt
all the *.java files needed for the parser

These files should be in the same submission directory. Do not use subdirectories.

Don't forget that all your files must begin with a header that contains:

your names
your group ID
the project name (SL Parser)
your grader's name
the name of the file

Sample SL programs and parse trees.
Parse tree format.

Keep an eye on the class bboard for clarifications and updates.

END OF ASSIGNMENT