Earlier, I posted a description of a Regular Expression -> DFA algorithm.
With some slight alterations, this algorithm has been modified somewhat and
clarified so that it now generates DFA's basically by directly applying the
subset construction on the regular expression itself.

The minimilization step is still kept separate.  It's possible to merge
this step with the other two, but in contrast to the above, there will not
be any significant gains as a result.

   You can get a copy of the software by anonymous ftp at csd4.csd.uwm.edu
in the directory pub/regex.  You WILL need an ANSI-C compiler to compile it.
The library functions in <stdlib.h> (particularily realloc()) are assumed to
have the semantics indicated by ANSI.  The program regex.c is the older
version, and regex2.c is the newer version.
 
THE ALGORITHM
   Regular Expressions are defined using the following syntax:
 
           (0) 0 ------- to denote the empty set
           (1) 1 ------- to denote the empty string
           (2) x ------- any valid C identifier
           (3) L ------- string literal
           (3) [A] ----- denotes the regular expression 1 | A.
           (4) A+ ------ denotes the regular expression A A*.
           (5) A* ------ ITERATION (the set 1 | A | A A | A A A | ...)
           (6) A | B --- ALTERATION (the union of A and B)
           (7) A B ----- CONCATENATION
 
For example, these are valid regular expressions:
 
            (a [b+ a*])+ | c* a b
 
            a* (b a*)*
 
   If E is a regular expression, then it will be reduced to the following normal
form:
                     E = 1 | E1 | E2 | ...
 
where each alterand, E1, E2, ... has the form x E', where x is a symbol.
The following reduction rules are applied to carry out this transformation:
 
           x         -> x 1
           [A]       -> 1 | A
           A+        -> A A*
           A*        -> 1 | A+
           A | B     -> merge(a, b) where A reduces to a, and B to b.
           0 A       -> 0
           1 A       -> A
           [A] B     -> B | A B
           A+ B      -> A (A* B)
           A* B      -> B | A+ B
           (A | B) C -> A C | B C
           (A B) C   -> A (B C)
 
The merge operation combines expressions starting with the same symbol,
e.g.:
 
        merge (a 0 | b 1, b 2 | c 3) = a 0 | b (1 | 2) | c 3.
 
The formal definition is fairly straightforward.
 
The expression, A, in each term of the form x A in the sum is considered a
state in the DFA.  Each newly generated state is reduced to normal form in the
same way, and this in turn may generate more new states.  When all the states
that were generated from the top-level expression have been fully reduced, the
result is an DFA representing the original expression.  This process will
always halt.

For example, take this expression
 
E = a* (b a*)*
 
for convenience, represent it as follows:
 
E = x0 x1, where x0 = a*, x1 = x2*, x2 = b x0
 
then E reduces as follows:
 
E = x0 x1 = a* x1 -> x1 | a x3, where x3 = a* x1 = x0 x1 = E
   x1 = x2* -> 1 | x2+
      x2+ -> x2 x0* = x2 x1 = (b x0) x1 -> b (x0 x1) = b E
   thus: x1 -> merge(1, b E) = 1 | b E.
thus: E -> merge(1 | b E, a E) = 1 | a E | b E
 
Therefore E reduces to 1 | a E | b E.
 
In any implementation of this algorithm, it is ABSOLUTELY essential to avoid
the formation of duplicate expressions.  Subexpressions are therefore hashed
and a table lookup precedes the formation of any expression.  Without this,
the example above would get into an infinite loop creating, for instance, an
infinite number of copies of E = (x0 x1).
 
The software I wrote takes these steps to derive an DFA from the regular
expresison E.  And it just so happens that for this example, E is
the only "state" in the DFA, so this is the minimal DFA.
 
Compare with the method described in Berry and Sethi ("From Regular
Expressions to Deterministic Automata" Theoretical Computer Science 1986).

After the conversion, standard techniques are used to minimlaize the DFA.
