24 Apr 1996

Outline

How to build a lexical analyzer

Lexical Analyzers and Transition Diagrams

A lexical analyzer is part of a compiler. In fact, the first thing that a compiler does to a program (after stripping out comments and extra blank space) is to pass it through the lexical analyzer. The lexical analyzer breaks the program into "tokens" which include the following types of things:

keywords
identifiers
constants
relational operators

Others are: arithmetic operators, logical operators, punctuation (colon, semicolon), parenthesis, brackets, squiggly brackets, etc.

[ In case anyone is curious, the output of the lexical analyzer is then fed into the next stage, which is called a syntactic analyzer. The job of the syntactic analyzer is to parse the context-free grammer that specifies the language. ]

We can specify, or define, each of these types of tokens using a "transition diagram" which is essentially a finite automaton. In fact, the only time I have ever coded up a finite automaton was in a lexical analyzer (sorry Jim -- I've never coded up the string matching algorithm).

Here are the transition diagrams for the four types of tokens listed above. We'll look at a subset of the tokens that appear in C.

keywords

I want you to draw a picture of a transition diagram that will recognize the following keywords: while, for, if, else. Base your diagram on the table given below.

The states will be numbers 0-18. State 0 is the start state.

Transition Table for Keywords:

State   Input    Next State           Action
  0       w           1
          f           7
          i          11
          e          14
      - other -   - fail -

  1       h           2
      - other -   - fail -

 2-4 recognize i, l, e

  5   blank or (      6        - return 1, retract -
      - other -   - fail -
 
etc.

The rest of the table treats the identifers "for", "if", and "else" in a similar way.

Return means that we have found a keyword. We'll return 1 for "while", 2 for "for", 3 for "if", and 4 for "else". Retract means to back-up by one position the pointer to the next input character.

Fail means that the current token is not a keyword. In this case we move the pointer to the next input character all the way back to where it was when we entered state 0, and then we'll try a different transition diagram (for identifiers, constants, and operators).

identifiers

State   Input    Next State           Action
 19   - letter -     20
      - other -   - fail -

 20   - letter or    20           
        digit -

      - not letter   21        - return (5,insert()), retract -
        or digit -

Whenever we reach the end of an identifier, we insert it into a table and return a pointer to the entry for that identifier.

constants

The table for constants is similar (a constant is a sequence of digits). It uses states 22, 23, 24.

relational operators

State   Input    Next State           Action
 25       <         26
          >         29
          !
          =
 26       =         27         - return (6,2) -
      - other -     28         - return (6,1), retract -
 29       =         30         - return (6,4) -
      - other -     31         - return (6,3), retract -
        digit -
 32       =         33         - return (6,5) -
      - other -  - fail -
 34       =         35         - return (6,6) -
      - other -  - fail -

When we recognize a relational operator, we return two values, the type (6), and which particular operator we've recognized, <, <=, >, > =, !=, ==.

Implementation Issues

How would we implement these transition diagrams? One way is to write a little piece of code for each state. These pieces of code might be part of a large switch statement in C.

So far we've seen three basic types of states.

state 0 -- many transitions out to different states. (There would be more if we had included all of the keywords.)

To implement this type of state, set up (in advance) an array that indicates where to go on each input.
state 1 -- transition on one character only.

Here we just check to see if the right character is there.
state 19 -- test for a letter or a digit

Here we just check if the character lies in the right range.

One last issue: How do we implement the retract and back-up operations? We need to buffer the input.