A lexical analyzer is part of a compiler. In fact, the first thing that a compiler does to a program (after stripping out comments and extra blank space) is to pass it through the lexical analyzer. The lexical analyzer breaks the program into "tokens" which include the following types of things:
[ In case anyone is curious, the output of the lexical analyzer is then fed into the next stage, which is called a syntactic analyzer. The job of the syntactic analyzer is to parse the context-free grammer that specifies the language. ]
We can specify, or define, each of these types of tokens using a "transition diagram" which is essentially a finite automaton. In fact, the only time I have ever coded up a finite automaton was in a lexical analyzer (sorry Jim -- I've never coded up the string matching algorithm).
Here are the transition diagrams for the four types of tokens listed above. We'll look at a subset of the tokens that appear in C.
The states will be numbers 0-18. State 0 is the start state.
Transition Table for Keywords:
State Input Next State Action
0 w 1
f 7
i 11
e 14
- other - - fail -
1 h 2
- other - - fail -
2-4 recognize i, l, e
5 blank or ( 6 - return 1, retract -
- other - - fail -
etc.
The rest of the table treats the identifers "for", "if", and "else" in a similar way.
Return means that we have found a keyword. We'll return 1 for "while", 2 for "for", 3 for "if", and 4 for "else". Retract means to back-up by one position the pointer to the next input character.
Fail means that the current token is not a keyword. In this case we move the pointer to the next input character all the way back to where it was when we entered state 0, and then we'll try a different transition diagram (for identifiers, constants, and operators).
State Input Next State Action
19 - letter - 20
- other - - fail -
20 - letter or 20
digit -
- not letter 21 - return (5,insert()), retract -
or digit -
Whenever we reach the end of an identifier, we insert it into a table
and return a pointer to the entry for that identifier.
State Input Next State Action
25 < 26
> 29
!
=
26 = 27 - return (6,2) -
- other - 28 - return (6,1), retract -
29 = 30 - return (6,4) -
- other - 31 - return (6,3), retract -
digit -
32 = 33 - return (6,5) -
- other - - fail -
34 = 35 - return (6,6) -
- other - - fail -
When we recognize a relational operator, we return two values, the
type (6), and which particular operator we've recognized, <, <=, >, >
=, !=, ==.
How would we implement these transition diagrams? One way is to write a little piece of code for each state. These pieces of code might be part of a large switch statement in C.
So far we've seen three basic types of states.
To implement this type of state, set up (in advance) an array that indicates where to go on each input.
Here we just check to see if the right character is there.
Here we just check if the character lies in the right range.
One last issue: How do we implement the retract and back-up operations? We need to buffer the input.