| This assignment is due before ??am on ?? |
| Follow the instruction on the course web page for submitting your solution. Submit one solution per group. All your source files should be in the same directory. Do not use subdirectories. To submit your solution, you should copy just your source files to the submission directory. In particular, do not carry out your program development in the submission area. |
| Points will be given for good documentation and style. See the coding style document on the course web page. For this assignment, there'll be a 60-40 split for correctness and style, respectively. |
| See the course web page on collaboration and late submission policies. Make sure your working directory is private and cannot be read by others! |
A compiler consists of three broad stages, which will roughly correspond to the three assignments:
if else while), delimiters (such as
+ - * / = ; { >= ), identifiers (such as x foo
nStudents), and integers (such as 8 11 0 178123222432).
Tokens form the terminal symbols of the context free grammar for the language.
KEYWORDS = (if|else|while|break|read|write|function|let|call|return)
IDENTIFIERS=([a-z] | [A-Z]) ([a-z] | [A-Z] | [0-9])*
- KEYWORDS
INTEGERS = [0-9][0-9]*
# + - * / % ~ < > != == >= <= =
logical: & | !
others: { } ( ) ; ,
BigInteger in java.math.)
Whitespace characters (tabs, newlines, spaces) may be present between, but not
within lexemes. That is, re turn (two identifiers) is different
from return (a keyword), and 123 456 is two integer
strings, not one. Also, one or more whitespaces are required between
any adjacent pair of keywords, identifiers, and integers. (Otherwise
return 23 would be indistinguishable from
return23.)
To allow for comments in SL code, the rest of the line including and following a # should be ignored by the lexer.
This brief description of SL doesn't describe the syntax (how to write valid programs) or semantics (what a program actually does) of SL, but only the lexicon (the valid lexeme strings) of SL. However, this is all you need to write a lexer for SL.
InputStream. The
lexical analyzer (or lexer) breaks up the input text stream into
lexemes and returns a token object for each one. Basically, the lexer
is a variant of an FSA: it is a finite state transducer. Its FSA
reads the input characters, and whenever it recognizes or accepts a
certain lexeme, it outputs or returns a corresponding token object. You can think of a token as a type or class of lexemes. SL has the following tokens:
The major part of this assignment is to construct the classes described below. (You may, of course, construct further ones if deemed necessary). Classes Token and Lexer will be needed in subsequent assignments.
Token getToken():
scans for and returns a token object for the next lexeme found in the input.
Also, this function should return an appropriate indication upon end of input
(such as null).
void putBack(Token t): puts the token t back into
the stream, so to speak. Notice that this backing-up capability is only one
token deep; in other words, if you do multiple putBack() calls, only
the most recent one matters. For example:
123a is illegal. Upon any such
error, the getToken() method should print an error message giving
the line number where the error occurred, and the offending lexeme or substring
itself. It should then immediately exit, ending program execution.
main method for this assignment. It is
just a wrapper for exercising the lexer. It will not be used in future
assignments. It is here that you should repeatedly call getToken and
generate the required output.
getToken, it should print one line containing two
strings, the lexeme and the token.
To illustrate, let's say the file sample.sl contains the following:
{
while (i < 0)))) {i = anIdent234 12
if i22!=0 return 4;
}
Then, after compiling your code into Proj2.class and
typing cat sample.sl | java Proj2, you should get the following
output:
{ {
while while
( (
i Ident
< <
0 Number
) )
) )
) )
) )
{ {
i Ident
= =
anIdent234 Ident
12 Number
if if
i22 Ident
!= !=
0 Number
return return
4 Number
; ;
} }
(We have deliberately chosen the above example input to emphasize that the
lexical analyzer is concerned merely with separating the tokens, not with
syntax checking.)
Some programs are not lexically correct. For example, since SL doesn't allow string literals, the following SL code is lexically incorrect:
{
while (i < 0)))) {i = anIdent234 12
if i22!=0 return "foobar";
}
Your lexical analyzer should, if it detects a
lexically illegal input, print out an error message. This should include the
source line number containing the error, and the erroneous text. It should
then halt execution immediately.
You may not use any of the tokenization tools supplied with the Java class library (such as StringTokenizer or StreamTokenizer) to break the input into tokens.
You should never, under any condition, throw an Exception from main.