Newsgroups: comp.lang.scheme.scsh,comp.lang.scheme
Path: cantaloupe.srv.cs.cmu.edu!rochester!cornellcs!newsstand.cit.cornell.edu!newstand.syr.edu!news.maxwell.syr.edu!news.bbnplanet.com!cpk-news-hub1.bbnplanet.com!worldnet.att.net!cbgw2.lucent.com!nntphub.cb.lucent.com!alice!allegra!akalice!baldy.research.att.com!user
From: pereira@research.att.com (Fernando Pereira)
Subject: Re: Regexp notation
X-Nntp-Posting-Host: baldy.research.att.com
Message-ID: <pereira-1201971318410001@baldy.research.att.com>
Sender: news@research.att.com (netnews <9149-80593> 0112740)
Organization: AT&T Research
References: <qijg20caw6l.fsf@lambda.ai.mit.edu> <qijenfubntr.fsf@lambda.ai.mit.edu>
Date: Sun, 12 Jan 1997 17:18:41 GMT
Lines: 81
Xref: glinda.oz.cs.cmu.edu comp.lang.scheme.scsh:370 comp.lang.scheme:17906

In article <qijenfubntr.fsf@lambda.ai.mit.edu>, shivers@ai.mit.edu wrote:
>     From: Alan@lcs.mit.EDU (Alan Bawden)
>     Subject: Regexp notation
>     Newsgroups: comp.lang.scheme.scsh
>     Date: 8 Jan 1997 19:51:00 -0500
> 
>             `(let* ((any (* "dog"))) ,(computed-regexp-goes-here))
> 
>     Looking at examples like this make me realize that what I really
>     want isn't an S-expression notation for regular languages.  What I
>     really want is a toolkit of procedures that operate on regular
>     languages.  E.g.
> 
>       (let ((digit (re:range #\0 #\9)))
>         (re:concatenate (re:or "+" "-" "")
>                         (re:one-or-more digit)))
> 
> I considered this, as well. But I rejected it because I don't think you
> do want the ability to operate on regular languages. For example, regular
> languages are closed under intersection and set-complement. But when was
> the last time you thought to yourself "I'd like to match all the strings that
> *aren't* matched by regexp R," or "I'd like to match all the strings that
> match pattern R1 *and* pattern R2"?

All the time. The Boolean closure of regular sets allows the concise
expression of very complex patterns. This is especially useful when
combined with the calculus of "rational transductions," which are to
finite-state transducers as regular sets are to finite-state acceptors.
For example, let (re:id reg) be the identity transduction restricted to
the regular set reg, and (re:prod in out) the transduction that maps any
string in the regular set in to any string in the regular set out. Let pat
be a regular set, left a suitable left bracket, right a suitable right
bracket, sigma the input alphabet, and eps the empty string. Then

   (let* ((sigmastar (re:zero-or-more sigma))
          (notpat (re:id (re:diff sigmastar
                                  (re:concat sigmastar pat sigmastar))))
          (match (re:concat (re:prod eps left)
                            (re:id pat)
                            (re:prod eps right))))
     (re:concat
        (re:zero-or-more 
           (re:concat notpat match))
        notpat))

is a transduction that brackets every occurrence of pat in the input with
left and right. The idea is that notpat matches any string not containing
pat and transduces it to itself while match transduces a string in pat to
that string preceded by left and followed by right. Then the body of the
let* parses any string as n1 p1 n2 ... pk-1 nk where each pi is an
occurrence of pat and each ni does not contain an occurrence of pat, and
transduces it to n1 left p1 right n2 ... left pk-1 right nk. This is just
a simple example; many more complex pattern matching and rewriting tasks
can be done this way.
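
To make the Boolean closure concrete, here is a minimal sketch of an
acceptor-only matcher in portable Scheme. It uses Brzozowski derivatives,
so complement and intersection are handled as directly as union. It is a
toy written for this illustration, not the package or the re: procedures
discussed in this thread: the mk- and rx- names, the list representation,
and the symbols none, eps and any are all assumptions, and it covers only
acceptors, not transductions.

```scheme
;; Toy regular-set matcher based on Brzozowski derivatives (illustrative only).
;; An expression is: none (empty set), eps (empty string), any (one char),
;; a character, or (cat r s), (or r s), (and r s), (not r), (star r).

;; Smart constructors that apply a few obvious simplifications.
(define (mk-cat a b)
  (cond ((or (eq? a 'none) (eq? b 'none)) 'none)
        ((eq? a 'eps) b)
        ((eq? b 'eps) a)
        (else (list 'cat a b))))

(define (mk-or a b)
  (cond ((eq? a 'none) b)
        ((eq? b 'none) a)
        ((equal? a b) a)
        (else (list 'or a b))))

(define (mk-and a b)
  (cond ((or (eq? a 'none) (eq? b 'none)) 'none)
        ((equal? a b) a)
        (else (list 'and a b))))

;; Does r accept the empty string?
(define (nullable? r)
  (cond ((eq? r 'eps) #t)
        ((or (eq? r 'none) (eq? r 'any) (char? r)) #f)
        (else
         (case (car r)
           ((star) #t)
           ((cat and) (and (nullable? (cadr r)) (nullable? (caddr r))))
           ((or) (or (nullable? (cadr r)) (nullable? (caddr r))))
           ((not) (not (nullable? (cadr r))))))))

;; The derivative of r by c accepts s exactly when r accepts c followed by s.
(define (deriv r c)
  (cond ((or (eq? r 'none) (eq? r 'eps)) 'none)
        ((eq? r 'any) 'eps)
        ((char? r) (if (char=? r c) 'eps 'none))
        (else
         (let ((a (cadr r)))
           (case (car r)
             ((star) (mk-cat (deriv a c) r))
             ((not) (list 'not (deriv a c)))
             ((or) (mk-or (deriv a c) (deriv (caddr r) c)))
             ((and) (mk-and (deriv a c) (deriv (caddr r) c)))
             ((cat)
              (let ((head (mk-cat (deriv a c) (caddr r))))
                (if (nullable? a)
                    (mk-or head (deriv (caddr r) c))
                    head))))))))

;; Take one derivative per input character, then test for the empty string.
(define (matches? r str)
  (let loop ((r r) (cs (string->list str)))
    (if (null? cs)
        (nullable? r)
        (loop (deriv r (car cs)) (cdr cs)))))

;; Literal string as a concatenation of its characters.
(define (rx-string s)
  (let loop ((cs (reverse (string->list s))) (acc 'eps))
    (if (null? cs) acc (loop (cdr cs) (mk-cat (car cs) acc)))))

;; Acceptor analogue of the example: strings that do not contain pat.
(define sigmastar '(star any))
(define pat (rx-string "ab"))
(define contains-pat (mk-cat sigmastar (mk-cat pat sigmastar)))
(define notpat (mk-and sigmastar (list 'not contains-pat)))
```

With these definitions, (matches? notpat "xyz") evaluates to #t and
(matches? notpat "xaby") to #f. The simplifications are deliberately
minimal, so derivative terms can still grow on long inputs; a serious
implementation would normalize terms or compile them to automata.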

> A further, serious difficulty is the
> issue of how you would translate such a spec into a classic regexp string
> to interface to traditional matching engines. Also, how would you do
> sub-expression matching with complement and intersection in the mix?

Indeed. That's why Michael Riley, Mehryar Mohri and I have developed at
AT&T Labs a general-purpose weighted rational transduction package that
has been successfully used in speech recognition, text processing and OCR.
The package is written in C but we also created a Scheme binding for it
that allows us to trivially do stuff like the example I gave above.
Because of its
generality and orientation (its main application is speech
processing) the package is not as efficient for character string
processing as more restricted re-matchers, but it is practical for many
applications, including some of the largest speech recognizers ever built,
which involve finite automata with more than 10^7 transitions. A crucial
implementation feature is that every operation that can be lazy is lazy, so
that intermediate results do not need to be fully constructed unless
required. In the example above, none of the operations would actually look
inside their arguments until the result transduction is applied to some
input, and then only the relevant paths in the underlying transducers and
acceptors would be expanded.
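
The lazy style can be illustrated with the standard on-the-fly product
construction for the intersection of two deterministic acceptors. This is
only a sketch of the general idea in portable Scheme, not a description of
how the package works internally: the acceptor representation (a vector of
start state, step procedure and final-state predicate) and every name
below are made up for the illustration.

```scheme
;; An acceptor is #(start step final?), where (step state char) returns the
;; next state or #f if there is none, and (final? state) returns a boolean.
(define (make-acceptor start step final?) (vector start step final?))
(define (acc-start a) (vector-ref a 0))
(define (acc-step a) (vector-ref a 1))
(define (acc-final? a) (vector-ref a 2))

;; Intersection: a product state is a pair of component states. Building
;; the product only wraps a and b in closures; no state pair is created
;; until some input actually reaches it.
(define (acc-intersect a b)
  (make-acceptor
   (cons (acc-start a) (acc-start b))
   (lambda (q c)
     (let ((qa ((acc-step a) (car q) c))
           (qb ((acc-step b) (cdr q) c)))
       (and qa qb (cons qa qb))))
   (lambda (q)
     (and ((acc-final? a) (car q))
          ((acc-final? b) (cdr q))))))

;; Run an acceptor over a string, expanding states along one path only.
(define (accepts? a str)
  (let loop ((q (acc-start a)) (cs (string->list str)))
    (cond ((not q) #f)
          ((null? cs) ((acc-final? a) q))
          (else (loop ((acc-step a) q (car cs)) (cdr cs))))))

;; Two small acceptors: an even number of #\a, and a final character #\b.
(define even-as
  (make-acceptor 0
                 (lambda (q c) (if (char=? c #\a) (- 1 q) q))
                 (lambda (q) (= q 0))))

(define ends-with-b
  (make-acceptor 0
                 (lambda (q c) (if (char=? c #\b) 1 0))
                 (lambda (q) (= q 1))))

(define both (acc-intersect even-as ends-with-b))
```

Here (accepts? both "aab") evaluates to #t, while (accepts? both "ab") and
(accepts? both "aa") evaluate to #f. Defining both does no work on the
component acceptors; each call to accepts? visits only the state pairs on
the path selected by its input.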

We don't have a full document on the package and we haven't released it
yet to the outside, but for a preview check out
<http://xxx.lanl.gov/ps/cmp-lg/9603001> and
<http://xxx.lanl.gov/ps/cmp-lg/9608018>.
