Programming Assignment: SL Parser
15-212, Spring 1998
|
This assignment is due before ?? on ??. This assignment is
arguably the most intricate (but also the most
interesting) of the three SL assignments. Get started early!
|
|
Follow the instruction on the course web page for
submitting your solution. Submit one solution per
group. All your source files should be in the same
directory. Do not use subdirectories. To submit your
solution, you should copy just your source files to the
submission directory. In particular, do not carry out
your program development in the submission area.
|
|
Points will be given for good documentation and style.
See the coding style document on the course web page. For
this assignment, there'll be a 80-20 split for correctness
and style, respectively, and there will be another 40
points allocated for the written answers from part 1, for
a total of 140 points.
|
|
See the course web page on collaboration and late
submission policies. Make sure your working directory is
private and cannot be read by others!
|
Overview of the Assignment:
You already know that given a context-free grammar G and a string T, a parser
can do the following:
- Decide whether G can generate T (i.e. "T is in the grammar G").
- Produce a parse tree of T, which shows how to rewrite the start symbol S
in G to produce T.
Your job will be to construct a parser to do both of the above tasks. Your
program will take as its input a program in SL; if the program is syntactically
valid, your parser will produce a parse tree for it. We will grade your program
by feeding it certain SL programs--some valid, some not--and check that your
program outputs a valid parse for syntactically valid programs, and a
meaningful error message for invalid programs.
The Grammar for SL:
Let us first examine the context-free grammar (SIGMA, V, S, R) for SL:
SIGMA = { PROGRAM, FUNCTION, PARAMLIST, PARAMREST, BLOCK,
STATEMENT, BREAK, CALL, ARGLIST, ARGREST, IF, ELSE, LET, READ,
RETURN, WHILE, WRITE, EXPR, BINOP, UNOP, number,
identifier, function, break, call, if, else, let, read,
return, while, write, (, ), {, }, ,, ;, =, +, -, *, /, %, <,
>, <=, >=, ==, !=, &, |, ~, !}
V = { number, identifier, function, break, call, if,
else, let, read, return, while, write, (, ), {, }, ,, ;, =, +, -,
*, /, %, <, >, <=, >=, ==, !=, &, |, ~, !}
S = PROGRAM
R is given by:
- PROGRAM -> FUNCTION* BLOCK
- FUNCTION -> function identifier ( PARAMLIST ) BLOCK
- PARAMLIST -> identifier PARAMREST | e
- PARAMREST -> , identifier PARAMREST | e
- BLOCK -> { STATEMENT* }
- STATEMENT -> BREAK | CALL ; | IF | LET | READ
| RETURN | WHILE | WRITE
- BREAK -> break ;
- CALL -> call identifier ( ARGLIST )
- ARGLIST -> EXPR ARGREST | e
- ARGREST -> , EXPR ARGREST | e
- IF -> if EXPR BLOCK ELSE
- ELSE -> else BLOCK | e
- LET -> let identifier = EXPR ; | let
identifier = CALL ;
- READ -> read identifier ;
- RETURN -> return EXPR ;
- WHILE -> while EXPR BLOCK
- WRITE -> write EXPR ;
- EXPR -> number | identifier | ( EXPR ) |
( UNOP EXPR ) | ( BINOP EXPR EXPR )
- BINOP -> + | - | * | / | % | & | | | < | > |
<= | >= | == | !=
- UNOP -> ~ | !
As always, e is epsilon, the empty string. Also, note that an SL program
is a sequence of function declarations followed by the main body.
This description describes the syntax (how to write valid programs) of
SL, but not the semantics (what a syntactically valid program actually
does). We'll get to the semantics in the next assignment, but drawing on
your knowledge of Java, the semantics of SL should be fairly obvious just from
the grammar.
Your first task is to answer the following problems based on the
grammar above. Put your answers in a text file called
proj3-grammar.txt . Just one write-up per group is
needed. See the note at the very end of this document concerning
the header information you need at the top of this file.
- Write the grammatically correct SL expressions for the
following Java arithmetic expressions (watch your operator
precedence for some of these!), and also give their final
values. You may want to run it through Java to double-check.
Assume that a==1, b==2, c==3, d==4, e==0.
- 1+2+3+4+5
- 1*(2+3)-4/5
- -1+2+-3+4+-5
- a && (!b || c) || !d && e
- (a==1 && b==-2 || c!=-3) && !d!=!-e
- a>=1 || b<=c || d>3 || e%5<4
- Mr. Bar is confused about the difference between
Java grammar and SL grammar. Locate everything that is wrong
with his not-quite-SL program below, and for
each: 1) indicate which SL grammar rule(s) it disobeys; and 2)
rewrite it correctly in SL.
{
function A(a,b) {
if (a < b)
else
return 0;
}
main {
}
function B(a) {
n = - A(a,3) + 2 * A(a,4);
return n+6;
}
}
Extra credit (1 pt): What is the final output of this
program?
- (Note: This question is actually a parser question, not a
grammar question, so you may want to wait until you're writing
the parser to answer this one.)
- A) As we've noted before, a parser is really just a
push-down automaton (DPDA). Briefly demonstrate this
with regard to the top-down parser that you are now writing
by answering these questions. In the parser:
- What corresponds to the DPDA states?
- What corresponds to the push-down store?
- What corresponds to the input tape?
- What is in each square of the input tape?
- B) The one lexeme lookahead feature that we use in our
parser may strike you as being bogus, because in DPDAs, you
can't push an already read input back into the input stream.
Explain briefly why this is actually acceptable;
i.e., explain how you would simulate this lookahead feature
using a DPDA.
Setting Up the Parser
You will implement a top-down parser for this assignment. First, you must
create a Parser class, the main class for this assignment:
public class Parser {
public Parser(InputStream s) { /* to be filled in by you */ }
public void parse() { /* t.b.f.i.b.y. */ }
public void print() { /* t.b.f.i.b.y. */ }
private ParseNode tree;
/* add other stuff as you need them */
}
This is the primary class for this assignment. When you create a
Parser object, you pass it a stream that it needs to
parse. The parser should create for itself a lexer so that it can
chop up the stream into tokens. When parse() is called,
the parser actually performs the parsing of the stream, then holds
onto the root of the resulting parse tree in tree.
print() simply
prints out the parse tree in a readable text format. The tree should
be printed out using pre-order, depth-first traversal. (The
criterion for readability is that someone who is familiar with this
project -- e.g., a TA -- can stare at the output and deduce what your
printout means without having to ask you for an explanation.) This
method will also, without a doubt, be your friend in debugging.
So far so good. The only tricky part seems to be parse().
Before we get into that, however, we need to look at this
mysterious class called ParseNode. Notice that
tree is of that type, so perhaps you have already guessed
that ParseNode is what the parse tree is made of; they are
in fact the nodes of the parse tree, each containing pointers to
children nodes. Let's look at it a bit more closely:
public abstract class ParseNode implements GrammarTypes {
public ParseNode() {...}
public int numChildren() {...}
public ParseNode getChild(int i) {...}
public void addChild(ParseNode node) {...}
public abstract int getType();
public abstract void print(PrintStream ps);
/* define other stuff as you need them */
}
public interface GrammarTypes {
public final static int G_NOP = 0;
public final static int G_NUMBER = 1;
public final static int G_IDENT = 2;
public final static int G_PROGRAM = 3;
public final static int G_FUNCTION = 4;
public final static int G_BLOCK = 5;
public final static int G_BREAK = 6;
public final static int G_CALL = 7;
public final static int G_IF = 8;
public final static int G_LET = 9;
public final static int G_READ = 10;
public final static int G_RETURN = 11;
public final static int G_WHILE = 12;
public final static int G_WRITE = 13;
public final static int G_EXPR = 14;
}
ParseNode is an abstract class, so you can't instantiate it. What you
will do is create many subclasses of ParseNode (more details below).
Each node maintains a list of pointers to its children. numChildren()
returns the number of children that this node has. getChild(i)
returns the ith child of this node. addChild(node)
appends node to the list of children of this node. getType()
returns the type of the node, as defined by one of the constants in
GrammarTypes (again, see below for details). print() prints
out the contents of the node in a readable format; this will be useful for
debugging.
You'll need some data structure to store an ordered list of children. Once you
have this, you can implement the necessary subclasses of ParseNode. There is a
subclass of each node type defined in GrammarTypes, for a total of 15
subclasses (each in a separate file, of course):
- public class SL_NopNode extends ParseNode {...}
- public class SL_NumberNode extends ParseNode {...}
- public class SL_IdentNode extends ParseNode {...}
- public class SL_ProgramNode extends ParseNode {...}
- public class SL_FunctionNode extends ParseNode {...}
- public class SL_BlockNode extends ParseNode {...}
- public class SL_BreakNode extends ParseNode {...}
- public class SL_CallNode extends ParseNode {...}
- public class SL_IfNode extends ParseNode {...}
- public class SL_LetNode extends ParseNode {...}
- public class SL_ReadNode extends ParseNode {...}
- public class SL_ReturnNode extends ParseNode {...}
- public class SL_WhileNode extends ParseNode {...}
- public class SL_WriteNode extends ParseNode {...}
- public class SL_ExprNode extends ParseNode {...}
The getType() method for each of these will be simple. For
example:
public class SL_NopNode extends ParseNode {
public int getType() { return G_NOP; }
}
So writing these classes will take a lot of cutting and pasting,
but it'll be pretty simple. Here are a few twists, though, to keep
things interesting:
- To SL_NumberNode, add a constructor that allows
you to set what number or value this node represents. Also add
a getValue() method which returns the number.
(Remember that the value is an arbitrary precision
integer, which can be implemented using the
BigInteger
class in java.math.)
- To SL_IdentNode, add a constructor that allows
you to initialize the node with a string name. Also add a
String getName() method.
- To SL_FunctionNode, add several methods. int
numArgs() should return the number of arguments this
function accepts. String getArgName(i) should return
the name of the ith parameter. ParseNode
getBody() should return the body block of the function.
- To SL_ExprNode, add a constructor that allows you
to initialize the node with any operator type (both binary and
unary allowed). Also add an int getOperator()
method.
Now there's one last issue to take care of: what do we do with children? For
each of the 15 subclasses, you'll need to come up with a layout of what its
children are. For example, the children of SL_WhileNode are (in
order): SL_ExprNode, followed by SL_BlockNode. Why? Because
a while statement is made up of two things in this order: first, a
conditional expression, followed by a body to be executed. The first child of
SL_WhileNode contains a conditional expression, while the second child
of SL_WhileNode contains a block of statements.
Another example is SL_FunctionNode, which has 1+n+1 children. The
first is of type SL_IdentNode and represents the name of the function;
next are n SL_IdentNodes, each naming the parameters for the function,
in order; finally, the last child is of type SL_BlockNode, and it
represents the body of the function. You should work out such schematics for
the 13 remaining node types.
From Grammar to Code
And now, we get to the heart of the project, the Parse.parse() method.
To construct the parser, you should basically write a bunch of mutually
recursive routines that look like the grammar. Say, just as an example, that
you're interested in parsing a for statement with this grammar (which
is not legal SL):
FOR -> for IDENT = EXPR to EXPR do BLOCK
Then you'd write something like:
parse_FOR() {
match_for(); // match the keyword "for"
parse_IDENT(); // match an identifier
match_assn(); // match the assignment symbol
parse_EXPR(); // match an expression
match_to(); // match the keyword "to"
parse_EXPR(); // match an expression
match_do(); // match the keyword "do"
parse_BLOCK(); // match a block
}
You will have a bunch of private methods that look like
the above. Some will match terminal tokens, which you can check by
calling getToken() on your lexer and doing a comparison;
others will match nonterminals, which translate into more recursive
calls on the appropriate functions.
As written above, this parser won't be very useful. Why? Because
it doesn't construct the parse tree as it
goes along. So at the end of the parsing, the parser
will be able to tell you whether your program is syntactically
well-formed, but it won't be able to tell you which grammar rules
it used to parse it. This is exactly like the CYK algorithm: it's a
language recognizer, and nothing more. In this case,
you'll want to build up a tree-like data structure (rooted at the Parser's
tree object) as you parse the input.
You'll notice as you're constructing your parse tree that
GrammarTypes doesn't define a node type for every possible
nonterminal and terminal in the grammar. This is because not all
of them are really needed. For example, STATEMENT nodes are not
needed, because we can separately instantiate each of the possible
types of statements: BREAK, CALL, IF, etc. You should be able to
implement SL quite handily with just the 15 types of nodes
defined above.
One last note: consider a case where you have an arbitrarily long
list of something, like an argument list or a parameter declaration
list. Here's an example using a hypothetical grammar for a list of
numbers:
ELEMENT = number ELEMENT | e
parse_LIST() {
while (lex.getToken() != '$') { // This is a bug!
}
}
parse_ELEMENT() {
}
Do you see why this fails? The problem is in calling
lex.getToken() to test to see if the next token is the
list termination character, '$'. If it is, we're happy, because we
can break out of the while loop. But what if it isn't? That means
that there's at least one more number in the list, and as a matter
of fact, we just read it when we called lex.getToken();
because that gave me the next number in the list, it obviously
didn't match '$'. But now we go to parse the number, except we've
lost it, because it was read for testing and is no longer in the
stream!
The problem here is that we cannot detect the end of an arbitrarily
long list of elements unless we can "peek" ahead one token. You've
already seen this happen in your lexer, where you had to peek ahead
one character to distinguish between the ">" and ">="
operators. Now you must look ahead by a single token, instead of a
single character; but the idea is still the same. This is why we asked you
to implement the pushBack() method in the lexer.
Loose Ends
One issue of some concern is that of error detection. If you have a
syntactically flawed program, you should output a message that includes the
line number and actual line text, and as useful a diagnosis as possible. You
should use exceptions wisely for this purpose (see below for more details on
exceptions). Of course, a lexically flawed program should behave as before: the
lexer should take care of lexical errors. You should halt execution after the
first lexical or syntactic error. As always, you must not throw any
exceptions at the main() level.
As with the previous project, you will need to write a front-end main class to
be able to test drive your parser. Call it Proj3.java. It
should read in an SL source code file from standard input, parse it, and print
out the final parse tree using pre-order, depth-first
traversal. Nonexistent or inscrutable print output will be penalized.
Your final tally of files to hand in:
- proj3-grammar.txt
- all the *.java files needed for the parser
These files should be in the same submission directory. Do not use
subdirectories.
Don't forget that all your files must begin with a header that contains:
- your names
- your group ID
- the project name (SL Parser)
- your grader's name
- the name of the file
Sample SL programs and parse trees.
Parse tree format.
Keep an eye on the class bboard for clarifications and
updates.
END OF ASSIGNMENT