Newsgroups: comp.lang.scheme.scsh,comp.lang.scheme
From: pereira@research.att.com (Fernando Pereira)
Subject: Re: Regexp notation
X-Nntp-Posting-Host: baldy.research.att.com
Message-ID: <pereira-1201972217510001@baldy.research.att.com>
Sender: news@research.att.com (netnews <9149-80593> 0112740)
Organization: AT&T Research
References: <qijg20caw6l.fsf@lambda.ai.mit.edu> <qijenfubntr.fsf@lambda.ai.mit.edu> <qijd8vebe9x.fsf@lambda.ai.mit.edu> <5b4kgn$b8h@news.jf.intel.com>
Date: Mon, 13 Jan 1997 02:17:51 GMT

In article <5b4kgn$b8h@news.jf.intel.com>, haertel@ichips.intel.com (Mike
Haertel) wrote:

> In article <qijd8vebe9x.fsf@lambda.ai.mit.edu>,
> Olin Shivers <shivers@ai.mit.edu> wrote:
> >OK, I'm convinced it's handy. But it's also tricky. I do not think it is
> >trivial to do complement or intersection on RE's using the standard
> >notation. Systems that work by translating to classical string notation
> >can no longer do a simple translation. The only way that I can think of
> >to do this is
> >to (1) compute the NDFA for the regexp(s), (2) do the complement or
> >intersection op on the NDFA, (3) run the state-minimisation algorithm on the
> >result (because intersection makes a big NDFA), (4) run the NDFA -> regexp
> >algorithm. This is a mess! Very slow, too. Could you preserve submatches info
> >across all these operations? Anybody volunteering to write the code?
> 
> I strongly recommend that you abandon any attempt to do intersection
> or complement of regular languages.  The reason is that intersection
> can shorten the length of a regexp by an exponential amount, or
> equivalently, grow the automaton by an exponential amount.

You are wrong about intersection. The number of transitions of the
intersection of two automata is at most the product of the numbers of
transitions of the automata, and the intersection of two DFAs is a DFA. As
for complementation, the problem is not in complementation itself, but
rather in the fact that it needs a DFA, and the subset construction for
determinization can cause an exponential blowup. However, there are many
very useful cases in which a combination of specialized algorithms and
representations can avoid such blowups. A particularly useful case in
pattern matching is that of automata for Sigma* A and their complements (A
is some DFA), which can be represented compactly with failure functions (an
extension of the techniques in the Aho-Corasick pattern matching
algorithm). My experience building various text and speech-processing
applications using full regular algebra is that the worst case can often
be avoided, and that people often give up in advance, intimidated by the
worst case, before they examine the problem closely enough to know whether
it could be solved with the best current techniques, which are much more
sophisticated than those in standard textbooks (even converting a regexp
to a DFA can be done *much* better than what one can find in any textbook
I know, and techniques such as failure functions or lazy automata
algorithms are much less well-known than they deserve to be).
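
To make the intersection bound concrete, here is a minimal Python sketch of
the standard product construction on two complete DFAs, followed by
complementation by swapping accepting and non-accepting states. It is an
illustration only, not code from any system mentioned above: the DFA class,
the function names intersect and complement, and the two example automata
are all hypothetical. Only state pairs reachable from the start pair are
built, so the result never has more transitions than the product of the
inputs' transition counts, and is often smaller.

```python
from collections import deque


class DFA:
    """Complete DFA: delta maps every (state, symbol) pair to a state."""

    def __init__(self, alphabet, delta, start, accepting):
        self.alphabet = alphabet
        self.delta = delta
        self.start = start
        self.accepting = set(accepting)

    def states(self):
        # All states mentioned as start, source, or target of a transition.
        sources = {s for (s, _) in self.delta}
        return {self.start} | sources | set(self.delta.values())

    def accepts(self, word):
        state = self.start
        for ch in word:
            state = self.delta[(state, ch)]
        return state in self.accepting


def intersect(a, b):
    """Product construction, exploring only reachable state pairs."""
    assert a.alphabet == b.alphabet
    start = (a.start, b.start)
    delta, accepting = {}, set()
    seen, queue = {start}, deque([start])
    while queue:
        p, q = queue.popleft()
        if p in a.accepting and q in b.accepting:
            accepting.add((p, q))
        for ch in a.alphabet:
            nxt = (a.delta[(p, ch)], b.delta[(q, ch)])
            delta[((p, q), ch)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return DFA(a.alphabet, delta, start, accepting)


def complement(a):
    """Complement of a complete DFA: swap accepting and non-accepting."""
    return DFA(a.alphabet, dict(a.delta), a.start, a.states() - a.accepting)


# Hypothetical example over {a, b}:
# even_a accepts words with an even number of a's,
# ends_b accepts words whose last symbol is b.
SIGMA = frozenset("ab")
even_a = DFA(SIGMA, {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}, 0, {0})
ends_b = DFA(SIGMA, {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, 0, {1})

both = intersect(even_a, ends_b)
not_both = complement(both)
```

Note that complement is only this cheap because the input is already a
complete DFA; starting from a nondeterministic automaton, the subset
construction would have to come first, and that is where the exponential
blowup can occur.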
