Newsgroups: comp.lang.prolog
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!gatech!newsxfer.itd.umich.edu!news.mathworks.com!tank.news.pipex.net!pipex!howland.reston.ans.net!EU.net!sun4nl!freya.let.rug.nl!vannoord
From: vannoord@let.rug.nl (Gertjan van Noord)
Subject: Announcing FSA Utilities 1.00
Sender: news@let.rug.nl (News system at let.rug.nl)
Message-ID: <1995Oct4.154110.27114@let.rug.nl>
Date: Wed, 4 Oct 1995 15:41:10 GMT
Nntp-Posting-Host: saga.let.rug.nl
Organization: Faculteit der Letteren, Rijksuniversiteit Groningen, NL
Lines: 604


                        ANNOUNCING FSA UTILITIES 1.00

   This announcement is also available at:

       http://www.let.rug.nl/~vannoord/fsa/

   A few months ago there was some discussion about whether finite state
   automaton operations (intersection, determinization, minimalization,
   ...) could be implemented efficiently in Prolog.

   I implemented a few of these things to see what can be done (and
   mostly as an exercise for myself). The package is implemented in
   SICStus Prolog, but it should not be too difficult to adapt it to
   other Prologs. Upon installation of the package a Prolog saved state
   is built that can be used either as an interactive shell (as usual) or
   as a Unix filter. You could also view the package as a kind of library.

   In trying to obtain (some) efficiency I implemented a kind of
   hash table in Prolog, cf. library/hash.pl. Comments are very welcome,
   especially suggestions on how this could be improved.

   The Man page of the utilities is given below. 

   The package is available by anonymous ftp from ftp.let.rug.nl in 
   directory pub/prolog-app/FSA. The following functionality is implemented:
     * Compilation of an FSA or FST into an efficient Prolog program. If
       the input FSA/FST is deterministic, the resulting Prolog program
       is deterministic as well.
     * Complement. Computes an FSA for the complement of the language
       defined by an input FSA.
     * Concat. Computes an FSA for the concatenation of the languages
       defined by the input FSAs.
     * Determinization. A given FSA is determinized (using subset
       construction). There is limited functionality to determinize
       finite state transducers as well; this procedure is not guaranteed
       to terminate though. Note that FSTs cannot be determinized in
       general.
     * Intersection of two FSAs.
     * Kleene closure of a given FSA.
     * Composition of two given FSTs.
     * Composition of an FSA and an FST.
     * Check a string for acceptance by a given FSA; produce the
       transduced string for a given FST.
     * Minimalization of an FSA.
     * Produce all strings accepted by an FSA (if desired in order of
       increasing length); produce all pairs of strings accepted by an FST.
     * Limited capabilities for FSAs with attached probabilities, such as
       empty edge removal.
     * Limited capabilities to translate regular expressions into FSA /
       FST.
     * NEW! NEW! NEW! The program is able to produce a representation of
       a finite state automaton compatible with the daVinci 1.4 graph
       visualisation program. daVinci automatically computes a layout
       for the finite-state automaton that minimizes the number of
       crossing edges. PostScript output can easily be generated from
       the result. Kaplan and Kay's example is shown by daVinci 1.4 as
       follows.
       
       [IMAGE REMOVED]
     * The package contains an interface to a TK Widget to pretty-view
       finite state automata. This is Figure 4 from Kaplan and Kay's
       famous paper in Computational Linguistics after composition by the
       FSA program:
       
       [IMAGE REMOVED]
       
       Other examples are linked from the web version of this
       announcement. Note that this extension uses ProTcl. If you don't
       have ProTcl, you can still use the package without the possibility
       of browsing finite state automata.
     * The same method for producing a TK Widget is now used to produce
       LaTeX (picture) output. The latter two capabilities (Tk and LaTeX)
       were implemented before I became aware of the possibilities of
       daVinci (and in fact of the field of `graph drawing'), so don't
       expect much in terms of automatic visualization.
       



The MAN page:   
   
 FSA(Utilities)                                               FSA(Utilities)




 NAME
      fsa - various utilities to manipulate finite state automata

 SYNOPSIS
      fsa

      fsa [Operation] [-v[erbose]]

      NB: for older SICStus versions, options have to be introduced with +
      instead of -.

 DESCRIPTION
      The fsa program can be used to manipulate finite state automata. The
      option indicates which manipulation (determinization, intersection,
      etc.) is chosen. If no options are given, the program acts as an
      interpreter, using the SICStus Prolog top-level. The -v option implies
      that some debugging information is written to standard error.

      FSA can produce representations of finite state automata that can be
      used with the graph visualisation program daVinci 1.4 (cf. the
      -daVinci option below).

 OPTIONS
      fsa accepts the following set of mutually exclusive options:

      -a[ccepts] Words
              A finite state automaton is read from standard input (either
              in compiled or interpreted mode, cf. below). The program
              determines whether Words is accepted by this automaton or not.
              If it is, the program returns successfully; otherwise it exits
              with status 1.

      -c[ompile] -ct
              Compile a finite state automaton (-ct for a finite state
              transducer) into a set of Prolog clauses. Standard input
              consists of the automaton to be compiled. The compiled clauses
              are written to standard output. If the input is a
              deterministic FSA then the compiled clauses can be used by
              Prolog deterministically for recognition (clauses are indexed
              such that the Prolog interpreter `sees' the determinism).

      -complement
              An fsa representing the complement of the language defined by
              the deterministic fsa read from standard input is written to
              standard output.

      -c[oncat] File1 File2
              An fsa representing the concatenation of the languages defined
              by the fsa in File1 and File2 is written to standard output.





                                    - 1 -        Formatted:  October 2, 1995






      -d[eterminize] -td
              Determinize a finite state automaton (-d) or a finite state
              transducer (-td). Standard input consists of the automaton
              to be determinized. The determinized automaton is written to
              standard output.

      -daVinci, -davinci
              A term-representation compatible with the requirements of the
              daVinci 1.4 graph-visualisation program is written to standard
              output, on the basis of the finite-state automaton read in
              from standard input.

      -da     Transitions are added to the deterministic finite state
              automaton (read from standard input) such that there is a
              transition for each symbol in each state (the resulting
              transition function is total).

      -dr     All transitions that cannot be continued to a final state are
              removed from the finite state automaton (read from standard
              input).

      -g[enerate]
              An arbitrary finite state automaton is written to standard
              output.

      -i[ntersect] File File
              The two files are assumed to contain finite state automata.
              The (determinized) intersection is written to standard output.

      -k[leene]
              An fsa representing the Kleene closure of the language defined
              by the fsa read from standard input is written to standard
              output.

      -compose File File
              The two files are assumed to contain finite state transducers.
              The composition is written to standard output.

      -compose_fsa File File
              The first file contains a deterministic finite state
              automaton. The second file contains a finite state transducer.
              The composition is written to standard output (not
              determinized).

      -compose_string Words
              A finite-state transducer is read from standard input and
              composed with the (FSA representation of) the string Words.
              The composition is written to standard output (not
              determinized).





      -m[inimalize]
              The (deterministic) finite state automaton (read from standard
              input) is minimalized. The result is written to standard
              output. It is no longer necessary to use the -da option
              first.

      -n,-rename
              A finite state automaton is read from standard input. An
              equivalent one is written to standard output, but all state
              names are of the form qi, where i is an integer. This is
              useful to simplify complicated state names that result from
              various other manipulations.

      -p[roduce] -tp[roduce] [-o[rder]]
              A finite state automaton is read from standard input. All
              possible strings accepted by this FSA are written to standard
              output. The input need not be deterministic. The -tp option is
              for transducers, in which case pairs of strings are written
              to standard output. If the -o option is present then strings
              are produced in order of increasing length.

      -pj     A probabilistic finite state automaton is read from standard
              input. An equivalent one is written to standard output in
              which all jumps have been removed. Jumps are written as
              ordinary transitions with symbol $E. Note that cyclic jumps
              are not allowed. Final states have an associated probability
              too (chance of halting here) in a second argument. Transitions
              have a fourth argument indicating probability. Probabilities
              can be given either as negative logs or as probabilities
              proper. By default the system assumes negative logs. Use the
              directives

              :- user:flag(scores,_,log).

              or

              :- user:flag(scores,_,prob).

              in your input file to indicate which kind of numbers are being
              used.

      -r [RegExp]
              A regular expression (either RegExp, or read from standard
              input) is translated into an equivalent FSA. This FSA is
              written to standard output.

      -cr     A regular expression (read from standard input) is translated
              into a compiled regular expression (of the sort expected by
              the -r switch). Preliminary.





      -tex [-q[uality] Integer] -angle Angle -xd
              Writes a LaTeX picture to standard output, in much the same
              spirit as the Tk output shown on the screen. Higher quality
              is slower but might produce better results.

      -transduce Words, -ftransduce File Words
              Produces all possible transductions of Words on the basis of
              the finite-state transducer read from standard input. The
              latter can be either compiled (a .pl file) or interpreted. The
              second version of this command takes an extra file name
              argument (can be the root of a

      -transitive
              A binary relation is read in from standard input. The
              transitive closure of that relation is written to standard
              output.

      -tk [[-q Integer] [-xd Dist] [-angle
              The finite state automaton is read from File and shown in a Tk
              Widget (only available under ProTcl). Experimental. The -xd
              option can be used to alter the default X distance of nodes
              (default: 120). The -angle option indicates the angle
              (default: 0.25). The quality option indicates the required
              quality. Default is 0, which indicates that no attempt is made
              at reducing the number of crossing branches; larger integers
              will (dramatically) increase processing time, resulting
              (hopefully) in slightly better output.

      -u[nion] File1 File2
              An fsa representing the union of the languages defined by the
              fsa in File1 and File2 is written to standard output.
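
      As an illustration of what -da and -complement compute, here is a
      minimal Python sketch of the textbook construction: make the
      transition function total by adding a sink state, then swap final
      and non-final states. It is not taken from the package; the
      function names, the dictionary representation and the state name
      "sink" are assumptions made for this example.

```python
# Illustrative sketch only; not part of the FSA package.
# A deterministic automaton is given as a set of states, an alphabet,
# a start state, a set of final states and a dict (state, symbol) -> state.

def complete(states, alphabet, trans, sink="sink"):
    """Return (states, trans) with a total transition function.

    Assumes that `sink` is not already used as a state name.
    """
    total = dict(trans)
    all_states = set(states)
    missing = [(s, x) for s in all_states for x in alphabet
               if (s, x) not in total]
    if missing:
        all_states.add(sink)
        for s in all_states:
            for x in alphabet:
                total.setdefault((s, x), sink)
    return all_states, total


def complement(states, alphabet, start, finals, trans):
    """Complete the automaton, then swap final and non-final states."""
    all_states, total = complete(states, alphabet, trans)
    return all_states, start, all_states - set(finals), total


def accepts(start, finals, total, symbols):
    """Run a complete deterministic automaton on a sequence of symbols."""
    state = start
    for sym in symbols:
        state = total[(state, sym)]
    return state in finals


# (aa)* over the alphabet {a, b}; its complement accepts everything else.
_, c_start, c_finals, c_trans = complement(
    {"p0", "p1"}, {"a", "b"}, "p0", {"p0"},
    {("p0", "a"): "p1", ("p1", "a"): "p0"})

print(accepts(c_start, c_finals, c_trans, "a"))   # True
print(accepts(c_start, c_finals, c_trans, "aa"))  # False
```

      Swapping final states without completing the automaton first would
      miss every string that falls off the partial transition function,
      which is why the completion step comes first.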


 REPRESENTATION
      The files are all in Prolog notation. Finite state automata are
      defined by Prolog clauses for:

      a single start state:

      start(State)

      a number of final states:

      final(State)

      a number of transitions:

      trans(State0,Sym,State)

      and a number of jumps:




      jump(State0,State)

      The transition relation does not need to be total. The -da option can
      be used to transform an fsa into an equivalent one with a total
      transition relation, but the minimalizer no longer requires this.

      States and symbols must be ground Prolog terms. Finite state
      transducers are represented in the same way, except that the second
      argument position of the trans/3 relation is a pair A/B of symbols
      where A and B are ground Prolog terms.
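
      To make the roles of start, final, trans and jump concrete, the
      following Python sketch mirrors the four relations as plain data
      and checks acceptance by following jumps without consuming input.
      It is an illustration only, not code from the package; the
      automaton is the one shown under EXAMPLE FILES, and the helper
      names are made up for this example.

```python
# Illustrative sketch only; not part of the FSA package.
START = "q0"                                  # start(q0).
FINALS = {"q2"}                               # final(q2).
TRANS = {("q0", "a", "q1"), ("q1", "a", "q0"),
         ("q2", "b", "q3"), ("q3", "b", "q2")}  # trans(State0,Sym,State).
JUMPS = {("q0", "q2")}                        # jump(State0,State).


def closure(states):
    """Return all states reachable from `states` via zero or more jumps."""
    seen = set(states)
    todo = list(states)
    while todo:
        s = todo.pop()
        for (s0, s1) in JUMPS:
            if s0 == s and s1 not in seen:
                seen.add(s1)
                todo.append(s1)
    return seen


def accepts(symbols):
    """Simulate the (possibly non-deterministic) automaton on `symbols`."""
    current = closure({START})
    for sym in symbols:
        step = {s1 for (s0, x, s1) in TRANS if s0 in current and x == sym}
        current = closure(step)
    return bool(current & FINALS)


print(accepts(list("aabb")))  # True
print(accepts(list("ab")))    # False
```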

      A regular expression is a single Prolog term built up with the
      functors:

      atom/1 argument is a symbol.

      kleene/1 argument is a regular expression term. Defines Kleene
      closure.

      or/2 where both arguments are regular expressions. Defines
      disjunction.

      concat/2 where both arguments are regular expressions. Defines
      concatenation.
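
      The meaning of the four functors can be spelled out with a small
      Python matcher that works directly on such terms. This is a sketch
      for illustration, not the translation to automata that the package
      performs: terms are written as nested tuples such as
      ("atom", "a"), which is an encoding chosen for this example.

```python
# Illustrative sketch only; not part of the FSA package.
# Terms: ("atom", Sym), ("kleene", R), ("or", R1, R2), ("concat", R1, R2).

def ends(regex, symbols, i):
    """Return every position j such that `regex` matches symbols[i:j]."""
    op = regex[0]
    if op == "atom":
        if i < len(symbols) and symbols[i] == regex[1]:
            return {i + 1}
        return set()
    if op == "or":
        return ends(regex[1], symbols, i) | ends(regex[2], symbols, i)
    if op == "concat":
        return {k for j in ends(regex[1], symbols, i)
                for k in ends(regex[2], symbols, j)}
    if op == "kleene":
        # Zero or more repetitions: collect all reachable end positions.
        reached, todo = {i}, [i]
        while todo:
            j = todo.pop()
            for k in ends(regex[1], symbols, j):
                if k not in reached:
                    reached.add(k)
                    todo.append(k)
        return reached
    raise ValueError("unknown functor: %r" % (op,))


def matches(regex, symbols):
    """True if `regex` matches the whole symbol sequence."""
    return len(symbols) in ends(regex, symbols, 0)


AA = ("concat", ("atom", "a"), ("atom", "a"))
BB = ("concat", ("atom", "b"), ("atom", "b"))
EXAMPLE = ("concat", ("kleene", AA), ("kleene", BB))

print(matches(EXAMPLE, list("aabb")))  # True
print(matches(EXAMPLE, list("ab")))    # False
```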


 EXAMPLE FILES
      A finite state automaton defining the regular language a^2n b^2m
      (that is, (aa)*(bb)*):

      start(q0).

      final(q2).

      trans(q0,a,q1).
      trans(q1,a,q0).
      trans(q2,b,q3).
      trans(q3,b,q2).

      jump(q0,q2).

      Suppose the finite state automaton above lives in a file called
      ex1.nd. In that case the command

      fsa -d <ex1.nd

      prints a determinized fsa to standard output:

      start(q0).

      final(q0).

      final(q1).

      final(q2).

      trans(q0,b,q3).

      trans(q0,a,q4).

      trans(q3,b,q1).



      trans(q1,b,q3).

      trans(q4,a,q2).

      trans(q2,b,q3).

      trans(q2,a,q4).
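
      The subset construction behind -d can be sketched in Python as
      follows, using the automaton from ex1.nd. This is a textbook
      version written for illustration, not the package's Prolog code:
      the new states are sets of old states rather than the names
      q0..q4 printed above, and the number of states it yields need not
      match the program's output.

```python
# Illustrative sketch only; not part of the FSA package.
START = "q0"
FINALS = {"q2"}
TRANS = {("q0", "a", "q1"), ("q1", "a", "q0"),
         ("q2", "b", "q3"), ("q3", "b", "q2")}
JUMPS = {("q0", "q2")}


def closure(states):
    """All states reachable from `states` via zero or more jumps."""
    seen, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for (s0, s1) in JUMPS:
            if s0 == s and s1 not in seen:
                seen.add(s1)
                todo.append(s1)
    return frozenset(seen)


def determinize():
    """Subset construction: each new state is a set of old states."""
    start = closure({START})
    dtrans, finals, todo, seen = {}, set(), [start], {start}
    alphabet = sorted({x for (_, x, _) in TRANS})
    while todo:
        subset = todo.pop()
        if subset & FINALS:
            finals.add(subset)
        for sym in alphabet:
            target = closure({s1 for (s0, x, s1) in TRANS
                              if s0 in subset and x == sym})
            if not target:
                continue  # keep the transition function partial
            dtrans[(subset, sym)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return start, finals, dtrans


def d_accepts(dfa, symbols):
    """Run the determinized automaton; a missing transition rejects."""
    start, finals, dtrans = dfa
    state = start
    for sym in symbols:
        state = dtrans.get((state, sym))
        if state is None:
            return False
    return state in finals


dfa = determinize()
print(d_accepts(dfa, "aabb"))  # True
print(d_accepts(dfa, "ab"))    # False
```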

      An equivalent regular expression is defined as:

      concat(kleene(concat(atom(a),atom(a))),
             kleene(concat(atom(b),atom(b)))).

      There is now a friendlier notation, but it is very much alpha. Cf. the
      file cregex.pl. The previous example could be written:

      regex(main, (a and a)* and (b and b)*, []).

      We allow constrained regular expressions and arbitrary Prolog
      constraints; again, refer to cregex.pl for `documentation'.


 MAKEFILE dependencies
      See the Examples directories for examples and Makefiles where
      file suffixes are used to indicate whether the finite state automaton
      is deterministic, minimal etc. The following conventions are used:

       suffix          meaning

       fsa:

       .rx            regular expression (user notation)
       .crx           regular expression (internal notation)
       .nd            non-deterministic fsa
       .d             deterministic fsa
       .m             minimal fsa

       fst:
       .trx           regular expression (user notation)
       .tcrx          regular expression (internal notation)
       .tnd           non-deterministic fst
       .td            deterministic fst

       fsa + probabilities:
       .p_d           probabilistic fsa
       .p_dne         probabilistic fsa (e-free)

      A Makefile with straightforward dependencies is given in the source
      directory as Makefile.examples





 EXAMPLE SCRIPTS (recognizers)
      If the file t1.nd consists of a definition of a non-deterministic
      finite state automaton, then the following computes the deterministic
      equivalent, and writes it to t1.d:

      fsa -d < t1.nd > t1.d

      More complex examples:

      This one takes a non-deterministic fsa and writes a compiled Prolog
      program to standard output on the basis of the minimal (determinized)
      equivalent:

      fsa -d < t1.nd | fsa -m | fsa -c > t1.pl

      This checks whether the string "aaa" is accepted by the program in
      t1.pl:

      fsa -a a a a < t1.pl

      Similarly, but now for the non-deterministic fsa in t1.nd:

      fsa -a a a a < t1.nd

      The following commands produce all strings recognized by the
      compiled fsa in the file t1.pl (and by the fsa in t1.nd and t1.d,
      respectively):

      fsa -p <t1.pl

      fsa -p <t1.nd

      fsa -p <t1.d

      Intersection, followed by compilation of the result followed by a
      check whether the string "aaaaaa" is accepted:

      fsa -intersect d2.nd d3.nd | fsa -d | fsa -c | fsa -a a a a a a a
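
      For readers unfamiliar with intersection of automata, the usual
      product construction can be sketched in Python as below. This is
      not the package's implementation, and it assumes deterministic
      inputs without jumps. The contents of d2.nd and d3.nd are not
      shown in this man page, so the two automata used here (an even
      number of a's, and a multiple of three a's) are made up for the
      example.

```python
# Illustrative sketch only; not part of the FSA package.
# An automaton is (start, finals, trans) with trans: (state, sym) -> state.

def intersect(a, b):
    """Product construction for two deterministic automata."""
    (start_a, finals_a, trans_a), (start_b, finals_b, trans_b) = a, b
    start = (start_a, start_b)
    finals, trans, todo, seen = set(), {}, [start], {start}
    while todo:
        state = todo.pop()
        p, q = state
        if p in finals_a and q in finals_b:
            finals.add(state)
        for (p0, sym), p1 in trans_a.items():
            # Both automata must be able to read the same symbol.
            if p0 == p and (q, sym) in trans_b:
                target = (p1, trans_b[(q, sym)])
                trans[(state, sym)] = target
                if target not in seen:
                    seen.add(target)
                    todo.append(target)
    return start, finals, trans


def accepts(automaton, symbols):
    """Run a deterministic automaton; a missing transition rejects."""
    start, finals, trans = automaton
    state = start
    for sym in symbols:
        state = trans.get((state, sym))
        if state is None:
            return False
    return state in finals


EVEN = ("e0", {"e0"}, {("e0", "a"): "e1", ("e1", "a"): "e0"})
THREE = ("t0", {"t0"},
         {("t0", "a"): "t1", ("t1", "a"): "t2", ("t2", "a"): "t0"})
BOTH = intersect(EVEN, THREE)

print(accepts(BOTH, "aaaaaa"))  # True
print(accepts(BOTH, "aaa"))     # False
```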

      Generate an arbitrary fsa, determinize and minimalize it, and give
      simple names to the states:

      fsa -g | fsa -d | fsa -m | fsa -n


 EXAMPLE SCRIPTS (transducers)
      Transducers are often specified by a regular expression. Such regular
      expressions often live in a file with a .rx suffix. In order to build
      a deterministic transducer from such a file test.rx, we can do:

      fsa -cr <test.rx | fsa -r | fsa -td > test.d

      Finite-state transducers can also be compiled into efficient Prolog
      programs, e.g.:

      fsa -ct <test.d >test.pl

      Two finite-state transducers can be composed with the option -compose:

      fsa -compose fst1.nd fst2.nd > fst12.nd




      A finite state automaton can be composed with a finite state
      transducer:

      fsa -compose_fsa ex1.nd fst12.nd > result.nd

      Finally, a string can be `composed' with a finite state transducer.
      The resulting finite-state automaton is written to standard output.

      fsa -compose_string a a a b b b <fst12.nd >result.nd

      If you want to obtain all possible resulting strings, then use the
      -transduce option:

      fsa -transduce a a a b b b <fst12.nd
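
      What "all possible resulting strings" means can be shown with a
      small Python sketch that applies a transducer to an input string.
      It is an illustration only: it ignores jumps and empty symbols,
      and the transducer below is invented for the example, since the
      contents of the fst12 file are not given here. Transitions are
      written as (State0, (A, B), State) triples, mirroring the A/B
      pairs of the trans relation.

```python
# Illustrative sketch only; not part of the FSA package.

def transduce(start, finals, trans, symbols):
    """Return every output sequence the transducer relates to `symbols`."""
    configs = {(start, ())}
    for sym in symbols:
        configs = {(s1, out + (b,))
                   for (state, out) in configs
                   for (s0, (a, b), s1) in trans
                   if s0 == state and a == sym}
    return sorted(out for (state, out) in configs if state in finals)


# Hypothetical one-state transducer: a becomes b; b becomes a or b.
T = {("q0", ("a", "b"), "q0"),
     ("q0", ("b", "a"), "q0"),
     ("q0", ("b", "b"), "q0")}

print(transduce("q0", {"q0"}, T, ["a", "b"]))
# [('b', 'a'), ('b', 'b')]
```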



 EXAMPLE SCRIPTS (recognizers with probabilities)
      The option -pj can be used to remove all jumps:

      fsa -pj <test.p_d >test.p_dne
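
      The two score notations mentioned under -pj are related as in the
      following Python sketch: multiplying probabilities along a path
      corresponds to adding their negative logs. The man page does not
      say which logarithm base the package uses, so the natural
      logarithm here is an assumption, as are the function names.

```python
# Illustrative sketch only; not part of the FSA package.
import math


def to_neglog(p):
    """Convert a probability proper into a negative log score."""
    return -math.log(p)  # natural logarithm assumed


def from_neglog(score):
    """Convert a negative log score back into a probability."""
    return math.exp(-score)


# Probability of a path: the product of its transition probabilities ...
path = [0.5, 0.25, 0.8]
direct = math.prod(path)

# ... or, equivalently, the sum of the negative log scores.
summed = sum(to_neglog(p) for p in path)

print(math.isclose(from_neglog(summed), direct))  # True
```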


 BUGS
      Many. Please let me know.



 REFERENCES
      John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata
      Theory, Languages and Computation. Addison Wesley 1979.

      Ronald M. Kaplan and Martin Kay. Regular Models of Phonological Rule
      Systems. Computational Linguistics 20 (3) 1994.

      For information on the daVinci program, cf.
      http://www.informatik.uni-bremen.de/~inform/forschung/daVinci/


 COPYRIGHT
      Copyright (c) 1995 by Gertjan van Noord.

 AUTHOR
      Gertjan van Noord, vannoord@let.rug.nl, Comments welcome.










