Programming Assignment: SL Lexer

15-212, Spring 1998

This assignment is due before ??am on ??

Follow the instruction on the course web page for submitting your solution. Submit one solution per group. All your source files should be in the same directory. Do not use subdirectories. To submit your solution, you should copy just your source files to the submission directory. In particular, do not carry out your program development in the submission area.

Points will be given for good documentation and style. See the coding style document on the course web page. For this assignment, there'll be a 60-40 split for correctness and style, respectively.

See the course web page on collaboration and late submission policies. Make sure your working directory is private and cannot be read by others!

Prelude to the next few assignments:

This assignment is the first in a series of three on building a compiler for a small language we will call SL. As you tackle these projects, you will get to put the theory you have learned into practice. The projects will build on one another, and thus it will be to your advantage to write clean, modular, and well documented code.

A compiler consists of three broad stages, which will roughly correspond to the three assignments:

A lexical analyzer that breaks up the raw character stream in a source program into a stream of tokens, which include keywords (such as if else while), delimiters (such as + - * / = ; { >= ), identifiers (such as x foo nStudents), and integers (such as 8 11 0 178123222432). Tokens form the terminal symbols of the context free grammar for the language.
A parser that takes the stream of tokens generated by the lexical analyzer and constructs a parse tree for it, according to the CFG for the programming language.
A code generator that turns the parse tree into compiled code that can be executed or interpreted.

The three stages can be related to FSMs, PDAs, and Turing machines. You will learn more about these relationships in the forthcoming weeks. In this assignment you will build the lexical analyzer. By the end of the semester, you'll have built an SL interpreter you can call your own.

The language SL

SL is a procedural language, a simplified version of C. SL supports only one type, namely integers. The basic lexical units or lexemes in SL are given by the following set of regular expressions.

KEYWORDS = (if|else|while|break|read|write|function|let|call|return)
IDENTIFIERS=([a-z] | [A-Z]) ([a-z] | [A-Z] | [0-9])^* - KEYWORDS
INTEGERS = [0-9][0-9]^*
Also (dropping the regular expression notation for clarity), SL contains various operator or delimiter symbols:
comment: #
numeric: + - * / % ~
relation: < > != == >= <=
assignment: =
logical: & | !
others: { } ( ) ; ,

Note that according to the above definition, both identifiers and integers can be arbitrarily long strings. That is, you may not assume a bound on the length of variable names or integer values. (Hint: take a look at the class BigInteger in java.math.)

Whitespace characters (tabs, newlines, spaces) may be present between, but not within lexemes. That is, re turn (two identifiers) is different from return (a keyword), and 123 456 is two integer strings, not one. Also, one or more whitespaces are required between any adjacent pair of keywords, identifiers, and integers. (Otherwise return 23 would be indistinguishable from return23.)

To allow for comments in SL code, the rest of the line including and following a # should be ignored by the lexer.

This brief description of SL doesn't describe the syntax (how to write valid programs) or semantics (what a program actually does) of SL, but only the lexicon (the valid lexeme strings) of SL. However, this is all you need to write a lexer for SL.

Overview of the Assignment

The lexical analyzer will take input from an InputStream. The lexical analyzer (or lexer) breaks up the input text stream into lexemes and returns a token object for each one. Basically, the lexer is a variant of an FSA: it is a finite state transducer. Its FSA reads the input characters, and whenever it recognizes or accepts a certain lexeme, it outputs or returns a corresponding token object.

You can think of a token as a type or class of lexemes. SL has the following tokens:

A distinct token corresponding to each keyword string in SL.
The token Ident, representing the entire set of identifier strings.
The token Number, representing the entire set of integer strings.
A distinct token for each operator or delimiter symbol string in SL.

Tokens have associated attributes. For example, the Ident token has an associated identifier string. The Number token has an associated numeric value.

The major part of this assignment is to construct the classes described below. (You may, of course, construct further ones if deemed necessary). Classes Token and Lexer will be needed in subsequent assignments.

Class Token

This is the main class for all tokens. Different kinds of tokens may, as stated above, have different data associated with them, and you need to set up a class structure to handle this appropriately.

Class Lexer

The lexical analyzer. Among its members it should include the following public methods:

Token getToken(): scans for and returns a token object for the next lexeme found in the input. Also, this function should return an appropriate indication upon end of input (such as null).
void putBack(Token t): puts the token t back into the stream, so to speak. Notice that this backing-up capability is only one token deep; in other words, if you do multiple putBack() calls, only the most recent one matters. For example:

The input stream might contain data that cannot be recognized as any lexeme. For example, a quote character is illegal input (there is no quote symbol in SL). Likewise, the "lexeme" 123a is illegal. Upon any such error, the getToken() method should print an error message giving the line number where the error occurred, and the offending lexeme or substring itself. It should then immediately exit, ending program execution.

Class Proj2

This class contains the main method for this assignment. It is just a wrapper for exercising the lexer. It will not be used in future assignments. It is here that you should repeatedly call getToken and generate the required output.

Output From Your Program

Your code should print the following information. For each token returned when it invokes getToken, it should print one line containing two strings, the lexeme and the token.

To illustrate, let's say the file sample.sl contains the following:

	{
	    while (i < 0)))) {i = anIdent234 12
	    if i22!=0 return 4;
        }

Then, after compiling your code into Proj2.class and typing cat sample.sl | java Proj2, you should get the following output:

	{		{
	while		while
	(		(
	i		Ident
	<		<
	0		Number
	)		)
	)		)
	)		)
	)		)
	{		{
	i		Ident
	=		=
	anIdent234	Ident
	12		Number
	if		if
	i22		Ident
	!=		!=
	0		Number
	return		return
	4		Number
	;		;
	}		}

(We have deliberately chosen the above example input to emphasize that the lexical analyzer is concerned merely with separating the tokens, not with syntax checking.)

Some programs are not lexically correct. For example, since SL doesn't allow string literals, the following SL code is lexically incorrect:

	{
	    while (i < 0)))) {i = anIdent234 12
	    if i22!=0 return "foobar";
        }

Your lexical analyzer should, if it detects a lexically illegal input, print out an error message. This should include the source line number containing the error, and the erroneous text. It should then halt execution immediately.

Ground rules

What you may not do is store the entire input before processing it. Why impose this restriction? Let's imagine you have a computer with 16MB of physical memory and a 100MB disk drive. There's no reason whatsoever that you shouldn't be able to run a 80MB SL program, say, through your lexer -- after all, the lexer only needs to look ahead a small, fixed number of characters, so certainly any size input should be ok. But by storing the entire input in memory before processing, the above approach can not handle inputs larger than the amount of physical memory in the computer. What you're doing by storing the entire input in the lexer is imposing an unnecessary, artificial limitation on the user. (In the parser and interpreter, however, the entire input will have to be stored in memory.)

You may not use any of the tokenization tools supplied with the Java class library (such as StringTokenizer or StreamTokenizer) to break the input into tokens.

You should never, under any condition, throw an Exception from main.

Keep an eye on the class bboard for clarifications and updates.

END OF ASSIGNMENT