.de VS
.KS
.nf
.\\$1D \\$2 \\$1
.ft 1
.ps 8 
.if \\n(VS>=40 .vs \\n(VSu
.if \\n(VS<=39 .vs \\n(VSp
.cs 1 24
.cs 2 24
.cs 3 24
.lg 0
..
.de VE
.ce 0
.if \\n(BD .DF
.nr BD 0
.in \\n(OIu
.KE
.ps \\n(PS
.lg 1
.cs 1
.cs 2
.cs 3
.if \\n(TM .ls 2
.sp \\n(DDu
.fi
..
.ND
.nh
.nr PS 11
.nr VS 13
.nr PO 1i
.ds CH 
.LP
.DS C 
.sp 6
\s+3 The Edinburgh/Cambridge\s-3
\s+3 Morphological Analyser and Dictionary\s-3
\s+3 System\s-3
.sp 0.5v
[Version 3.0]

.sp 0.5v
\s+4System Description\s-4
.sp 5
G.D. Ritchie
A.W. Black
.sp
Department of
Artificial Intelligence,
University of Edinburgh
.sp 3
S.G. Pulman
G.J. Russell
.sp
Computer Laboratory,
University of Cambridge
.sp 8 
This work was supported by SERC/Alvey grant GR/C/79114.
.sp 2
\s-2COPYRIGHT\s+2:\ \(co G.D. Ritchie, A.W. Black, S.G. Pulman, G.J. Russell
.sp 3
July 1987
.DE
.bp 1
.ds CF - % -
.SH
CONTENTS
.sp 2
.LP
.ft 3
1. Introduction
.sp
2. Spelling Rules
.RS
.nf
.ft 3
2.1 Historical Development
2.2 Compilation of Spelling Rules
.RS
.nf
.ft 3
2.2.1 An Example
.RE
.nf
.ft 3
2.3 Interpretation
2.4 Implementation
.RE
.nf
.ft 3
.sp
3. Word Grammar and Unification
.RS
.nf
.ft 3
3.1 Historical Development
3.2 Categories and Features
3.3 Word Grammar Rules
3.4 Unrestricted Unification Grammar
3.5 Term Unification Grammar
3.6 Implementation
.RE
.nf
.ft 3
.sp
4. Lexicon
.RS
.nf
.ft 3
4.1 Lexical Entries
4.2 Lexical Rules
4.3 Implementation
.RE
.nf
.ft 3
.sp
5. Analysis
.RS 
.nf
.ft 3
5.1 Analysis Process
5.2 Data Structures
5.3 Some Functions
.RE
.nf
.ft 3
.sp
6. Implementation
.sp
7. Enhancements
.RS
.nf
.ft 3
7.1 Spelling Rules
7.2 Word Grammar
7.3 Lexical Entries
7.4 Analysis
.RE
.nf
.ft 3
.sp
References
.fi
.ft 1
.ds RH Section 1
.bp
.NH 0
Introduction
.LP
This document describes the implementation of the Edinburgh/Cambridge
Morphological Analyser and Dictionary System.  This document is
not intended as a general user manual and the user is referred to the 
User Manual (Ritchie et al 1987) for details on how to actually use the
system and develop
morphological analysers and lexicons.  The intended readers of this
document are those who wish to know about the low level implementation
of the system, rather than the pragmatics of using it.  This document
describes the algorithms used in the actual analyser as well as the 
various compilers (spelling rules and word grammar) used in the system.
As well as general descriptions of the processes involved, details of
internal LISP functions used in the system are given.  The information
in this document (with the help of the comments in the code)
should be sufficient for reimplementation and/or
significant modification of the current system.
.LP
It should not be necessary for those people who are developing 
lexicons and analysers to read this.  Details of how to write
and debug lexicons and analysers are described in the User Manual.
.LP
A lexicon description consists of three sections: a set of spelling
rules, a word grammar, and a list of lexical entries.  Each of these
three parts must be compiled and then all three loaded before any 
analysis may be done.
.LP
The next three sections describe the three compilation processes in turn;
section 5 then describes the analysis process and section 6 gives some
details of the implementation.
.ds RH Section 2
.bp
.NH 1
Spelling Rules
.LP
The system supports a form of spelling rule which allows the user to 
describe orthographic changes that occur between morphemes.  For example,
when the morphemes \f2move\f1 and \f2+ed\f1 combine, the string \f2e+\f1
is deleted to form \f2moved\f1.  The spelling rule formalism lies in
the paradigm of two-level morphology (Koskenniemi 1983).  In this model
there are two distinct \f2tapes\f1 \(em a surface tape and a lexical tape.
The surface tape represents the word as it would appear in a sentence
while the lexical tape represents a \*Qnormal form\*U.
Spelling rules define matching relationships between these two levels.
For example we may have the following
.VS L
   lexical tape: m o v e + e d
   surface tape: m o v 0 0 e d
.VE
The \f20\f1 symbol represents a null character which does not actually
appear in the surface (or lexical) word, but merely on the matching tape.
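.LP
As an illustration only, the relation between the two tapes and the words
they stand for can be sketched in a few lines of Python (the system itself
is written in LISP; the function name \f3tapes_to_words\f1 and the
representation of a match as a list of character pairs are assumptions made
for this sketch).  The null symbol is simply dropped from each tape.
.VS L
```python
NULL = "0"  # the null symbol used on the matching tapes


def tapes_to_words(pairs):
    """Recover the lexical and surface words from aligned (lexical, surface) pairs."""
    lexical = "".join(lex for lex, _ in pairs if lex != NULL)
    surface = "".join(sur for _, sur in pairs if sur != NULL)
    return lexical, surface


# The alignment shown above: move+ed matched against moved.
alignment = list(zip("move+ed", "mov00ed"))
lexical_word, surface_word = tapes_to_words(alignment)
print(lexical_word, surface_word)
```
.VE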
.LP
In our current system users specify rules in a notation very close to that
of Koskenniemi's high level formalism as described in his thesis (Koskenniemi
1983) and in Koskenniemi (1985). A rule consists
of a \f3rule pair\f1 (which consists
of a lexical and a surface character), an \f3operator\f1,
a \f3left context\f1 and a \f3right context\f1.
There are three types of rule:
.XP
\f2Context Restriction\f1:
These are of the form
.DS L
pair \(->    LeftContext --- RightContext
.DE
This specifies that the rule pair may appear \f2only\f1 in the given context.
.XP
\f2Surface Coercion\f1:
These are of the form
.DS L
pair \(<-    LeftContext --- RightContext
.DE
This specifies that if the given
contexts \f2and\f1 lexical character appear then the surface character
\f2must\f1 appear.
.XP
\f2Combined Rule\f1:
This final rule type is a combination of the above two forms and
is written
.DS L
pair \(<-\(->    LeftContext --- RightContext
.DE
This form of rule specifies that the surface character of the rule
pair \f2must\f1 appear if
the left and right contexts appear and the lexical character appears, and
also that this is the \f2only\f1 context in which the rule pair is allowed.
.LP
The operator types may be thought of as a form of implication.  Contexts
are specified as regular expressions of lexical and surface pairs.  For
example the following rule:
.DS L
Epenthesis
    +:e \(<-\(->    {s:s x:x z:z < {s:s c:c} h:h>} --- s:s
.DE
specifies (some of) the cases when an \f2e\f1 is inserted at the
conjunction of a stem morpheme and the suffix \f2+s\f1 (representing plurals
for nouns and third person singular for verbs).  The braces in the left 
context denote alternative choices, while the angled brackets denote sequences.
The above rule may be summarised as \*Qan \f2e\f1 must be inserted in the
surface string when it has \f2s\f1, \f2x\f1, \f2z\f1, \f2ch\f1 or \f2sh\f1
in its left context and \f2s\f1 in its right\*U.  (see the User Manual
section 6 for more details).
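.LP
A context such as the left context of the Epenthesis rule denotes a set of
pair sequences.  The sketch below, in Python rather than the LISP of the
actual system, enumerates that set for contexts built only from braces and
angled brackets.  The tuple encoding with the tags \f3alt\f1 and \f3seq\f1
and the function \f3expand\f1 are inventions for this illustration, and any
other operators the rule notation may offer are not covered.
.VS L
```python
from itertools import product


def expand(expr):
    """Return the set of pair sequences matched by a context expression.

    A string is a single lexical:surface pair, ("alt", ...) stands for
    braces (alternatives) and ("seq", ...) for angled brackets (sequences).
    """
    if isinstance(expr, str):
        return {(expr,)}
    kind, *parts = expr
    if kind == "alt":
        return set().union(*(expand(p) for p in parts))
    if kind == "seq":
        result = set()
        for combo in product(*(expand(p) for p in parts)):
            result.add(tuple(pair for seq in combo for pair in seq))
        return result
    raise ValueError("unknown operator: " + str(kind))


# Left context of the Epenthesis rule: {s:s x:x z:z <{s:s c:c} h:h>}
left_context = ("alt", "s:s", "x:x", "z:z",
                ("seq", ("alt", "s:s", "c:c"), "h:h"))
for sequence in sorted(expand(left_context)):
    print(" ".join(sequence))
```
.VE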
.NH 2
Historical Development
.LP
At the start of our project, users had to write spelling rules as transducers
as in the KIMMO system (Karttunen 1983).  Apart from being
extremely difficult to do, this can be relatively unprincipled.  Although
Koskenniemi defined a high level rule notation for describing
morphographemic relationships, which he describes in his thesis (Koskenniemi
1983), implementation was by hand compilation into low level automata.
This could work but it meant the user had to have two layers of representation,
one at the high level and one at the low level.
.LP
After Koskenniemi's later work (Koskenniemi 1985) we implemented a rule 
compiler that allowed the user to specify rules of the form described in 
the previous section.  The problem with the original implementation of
this basic compiler was its speed: it was very slow, taking around twenty
minutes to compile our description into automata.
.LP
The compiler was in essence very simple.  It first converted the
regular expressions in the contexts of rules into standard finite state
automata whose symbols were made from a lexical and a surface character.
An automaton was then built for each rule in a way dependent on
the rule type.  The automata built were (or should be) the same as the ones
the user specified in previous versions of the system.  The main
problem was that the built automata had to run continually thus reapplying
themselves rather than simply matching a particular pattern.  This meant
that instead of a simple translation to a regular expression which recognised
the particular pattern described by the rule, arcs had to be added which
allowed the rule to restart at each stage in the match.  This meant 
that rather than just a pattern matcher being created a matcher
for the \f2closure\f1 of the pattern had to be created.
.LP
Because of this, the sort of automaton that came out of the compilation process
was very large (each state typically had arcs for all pairs in the 
alphabet).  These automata were non-deterministic.  After generating the 
automata they were determinised, which is a computationally expensive task.
In our compilation of around 16 spelling rules conversion to basic automata
took about 2 minutes and determinising took 18 minutes.  After 
determinising it is advisable to minimise the automata, which again is
a standard and simple transformation.  In our system this was never done,
not because it wasn't needed but because we were already looking for alternative
ways of compilation.  Minimisation would never reduce the number of 
state transitions needed to recognise (or reject) a string but would 
only reduce the number of states in an automaton.
.LP
The reason the automata produced by the compiler were significantly larger
than those that the user would hand specify was that the hand specified
ones contained set notations that were not expanded until run time while
the compiler expanded all sets thus producing far more arcs.
.LP
In the compiler described by Karttunen et al. (1987), a technique
similar to our earlier implementation is used. A major difference is the 
addition of an \*QElsewhere principle\*U 
which allows many arcs to be grouped together and hence reduces the size
of the automata significantly, thus allowing determinising and minimising to 
take less time.
.LP
In the description of the implementation of the KIMMO system (Karttunen 1983)
a way of merging automata was described.  We experimented with this, but
it still built a large automaton that took an unacceptably long time to
determinise.
.LP
After some experimentation we decided to follow the work of Bear (1985),
so that instead of closures of patterns being created, simple patterns are
created and each rule is \*Qre-started\*U at each stage in a match.  Also 
following Bear's work the generated automata from the compilation are not
simple classical automata but are a special form of automaton where
the states are marked and hence require a special interpretation (see next
section for a detailed description).
.LP
One very important criterion we had when looking for better compilation
processes is that the semantics (and syntax) of the high level rules
must not change.  So although we now use a different compilation (and
interpretation) method to the one first described by Koskenniemi (1985),
no change was required in our rule notation and our actual description
in spite of our implementation changes.
.LP
We also spent some time looking at alternative spelling rule notations.
They all fell into the paradigm of two level morphology but had 
different specification notations. The only one carried through to full 
implementation was that described in Black et al. (1987).  However it was
decided not to incorporate this fully into a released version of the system
and it exists only in an internal version (2.5).
.NH 2
Compilation of Spelling Rules
.LP
This compilation produces a form of automaton which is not
the classical type but a slightly more complex form instead.  In this
formalism each state is marked with typing information which is used
by the special interpreter of the automaton.
.LP
Before the actual compilation starts a certain amount of pre-processing
takes place.
The spelling rules are first checked for syntactic errors.   All \f2where\f1
clauses are then expanded by simple duplication and replacement of the
\f2where\f1 variable with each possible value.  This
means that each user-written
rule is now represented by a \f2group\f1 of rules.
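.LP
The expansion of a \f2where\f1 clause by duplication and replacement can be
pictured as follows.  This is a Python sketch, not the LISP of the system.
The dictionary layout of a rule, the restriction to flat contexts, the
function \f3expand_where\f1 and the sample rule itself are all hypothetical
and chosen only to show the mechanism.
.VS L
```python
def expand_where(name, rule, variable, values):
    """Duplicate a rule once for each possible value of its where variable."""
    def substitute(pair, value):
        # Replace the variable wherever it stands for a character of a pair.
        return tuple(value if ch == variable else ch for ch in pair)

    group = []
    for value in values:
        group.append({
            "name": name,
            "pair": substitute(rule["pair"], value),
            "left": [substitute(p, value) for p in rule["left"]],
            "right": [substitute(p, value) for p in rule["right"]],
        })
    return group


# Hypothetical rule with a where variable X ranging over b and d.
rule = {"pair": ("X", "X"), "left": [("+", "0")], "right": [("e", "e")]}
group = expand_where("Example", rule, "X", ["b", "d"])
for member in group:
    print(member["name"], member["pair"], member["left"], member["right"])
```
.VE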
.LP
Next, the set of \f2feasible pairs\f1 is found.  Feasible pairs are all
pairs of lexical and surface characters that are used in the
description.  This set is found by finding all \f2concrete pairs\f1
(pairs not containing sets) and the \f2defaults\f1 set.  The
\f2defaults\f1 set is made from the identity pairs of the intersection
of the lexical and surface alphabets and any pairs specifically
declared as default pairs.  The feasible pairs set is important to
the expansion of variables in other pairs, and is effectively
the alphabet of the automaton.
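.LP
The construction of the feasible set just described amounts to a few set
operations.  The following Python sketch (not the system's LISP) assumes
that pairs are held as two-element tuples; the function name
\f3feasible_pairs\f1 and the small alphabets are made up for the example.
.VS L
```python
def feasible_pairs(lexical_alphabet, surface_alphabet,
                   concrete_pairs, declared_defaults=()):
    """Feasible set = concrete pairs found in the rules + the defaults set."""
    shared = set(lexical_alphabet) & set(surface_alphabet)
    # Identity pairs over the characters common to both alphabets,
    # plus any pairs explicitly declared as defaults.
    defaults = {(ch, ch) for ch in shared} | set(declared_defaults)
    return set(concrete_pairs) | defaults


lexical = set("abcde") | {"+"}
surface = set("abcde")
feasible = feasible_pairs(lexical, surface,
                          concrete_pairs={("+", "e")},
                          declared_defaults={("+", "0")})
print(sorted(feasible))
```
.VE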
.LP
The new structure built from the rules is effectively a finite state machine.
It does however have a slightly more complex interpretation with special
classes of state.  Any state can be marked with any of three properties,
TERMINAL, LICENCE, and FINAL.  The machine is non-deterministic as
determinising would probably make the compilation stage take a number of hours.
Even then non-determinism would still exist in the continuous re-application
of the rules.
.LP
Compilation depends on the rule type.
.XP
\f2Context Restriction Rules\f1:
.VS L
     pair => LC --- RC
.VE 
The compilation basically produces patterns which recognise 
the string \f2LC pair RC\f1.  \f2LC\f1 is converted into an automaton
in which each state is marked with FINAL.  \f2RC\f1 is also converted to
an automaton, in which the starting state is marked LICENCE, the final state
is marked TERMINAL and all other states have no markings.  Lastly an arc,
labelled with \f2pair\f1, is added to join the two contexts.
.XP
\f2Surface Coercion Rules\f1:
.VS L
     pair <= LC --- RC
.VE
This effectively looks for erroneous patterns.  It does not license any
pairs.  A pattern is made from this which recognises strings that have
the given contexts and lexical character but do not have the specified surface
character.  Automata are built for \f2LC\f1 and \f2RC\f1; the last state of
\f2RC\f1 is the special state ERROR.  The two contexts are joined by arcs
for all feasible pairs that have the same lexical character as the rule
pair but a different surface character.  All states in the automaton are marked
with FINAL (except ERROR); none of the states are marked LICENCE or
TERMINAL.
.XP
\f2Combined Rules\f1:
.VS L
     pair <=> LC --- RC
.VE
This is effectively made up of the combination of the two above rules.
An automaton is made from \f2LC\f1 in which all states are marked with FINAL.
Two different automata are created for \f2RC\f1. First
an arc labelled with \f2pair\f1 is added from the end of \f2LC\f1 to
one automaton representing  \f2RC\f1 starting
with a LICENCE state and
ending in a TERMINAL state.  Secondly arcs for all feasible pairs that have the 
same lexical character as \f2pair\f1 but different surface characters, are
added from the end of \f2LC\f1 to the start of another automaton representing
\f2RC\f1.  This second \f2RC\f1 automaton ends in ERROR and has each 
state marked with FINAL (except ERROR).
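.LP
The combined rule construction can be sketched as below.  This is Python,
not the LISP of \f3makesp.l\f1, and it makes simplifying assumptions: both
contexts are plain sequences of pairs rather than regular expressions, the
right context is non-empty, and the automaton is a dictionary from state
names to marks and arcs.  The name \f3compile_combined\f1 and its helpers
are hypothetical.  The final state of the first right context copy is
marked both TERMINAL and FINAL, as in the example of section 2.2.1.
.VS L
```python
def compile_combined(pair, left, right, feasible):
    """Build the automaton for  pair <=> left --- right  (linear contexts only)."""
    if not right:
        raise ValueError("this sketch needs a non-empty right context")
    automaton = {"ERROR": {"marks": set(), "arcs": {}}}

    def new_state(*marks):
        name = "s" + str(len(automaton))
        automaton[name] = {"marks": set(marks), "arcs": {}}
        return name

    def add_arc(source, label, target):
        automaton[source]["arcs"].setdefault(label, set()).add(target)

    def add_path(first, labels, marks, last=None):
        """Chain labels from first; the final arc enters last when given."""
        state = first
        for i, label in enumerate(labels):
            final_arc = i == len(labels) - 1
            target = last if (last and final_arc) else new_state(*marks)
            add_arc(state, label, target)
            state = target
        return state

    # Left context: every state is marked FINAL.
    start = new_state("FINAL")
    lc_end = add_path(start, left, ["FINAL"])

    # First copy of RC: starts in LICENCE, ends in TERMINAL, joined by pair.
    licence = new_state("LICENCE")
    terminal = new_state("TERMINAL", "FINAL")
    add_path(licence, right, [], last=terminal)
    add_arc(lc_end, pair, licence)

    # Second copy of RC: FINAL states ending in ERROR, entered by pairs with
    # the same lexical character but a different surface character.
    watch = new_state("FINAL")
    add_path(watch, right, ["FINAL"], last="ERROR")
    for lex, sur in feasible:
        if lex == pair[0] and sur != pair[1]:
            add_arc(lc_end, (lex, sur), watch)
    return automaton, start


feasible = {("x", "x"), ("s", "s"), ("+", "e"), ("+", "0")}
automaton, start = compile_combined(("+", "e"), [("x", "x")], [("s", "s")], feasible)
for name in sorted(automaton):
    print(name, sorted(automaton[name]["marks"]), automaton[name]["arcs"])
```
.VE
.LP
A context restriction rule would keep only the first copy of the right
context, and a surface coercion rule only the second.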
.LP
Although the automata for the individual rules are created independently they are
actually
all added into the same structure.  Duplicate arcs are removed as the arcs
are added.  This means that there is one structure representing all the
spelling rules.  This automaton's starting state is held in the global
variable D-INITSTATE.
.LP
In addition to the rules, transitions are also added to this automaton
for all \f2unrestricted pairs\f1, from the initial state to a state marked
as LICENCE, FINAL and TERMINAL. \f2Unrestricted pairs\f1 are all pairs 
in the feasible
set that do not have an associated context restriction (or combined) rule.
This means that during matching, restricted pairs are only licensed when
in their appropriate context.
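.LP
Selecting the unrestricted pairs is a simple filter over the feasible set.
In this Python sketch (again not the system's LISP) a rule is reduced to an
operator and its rule pair; that reduction, the function name
\f3unrestricted_pairs\f1 and the sample surface coercion rule are
assumptions made for the illustration.
.VS L
```python
def unrestricted_pairs(feasible, rules):
    """Feasible pairs with no context restriction or combined rule."""
    restricted = {pair for operator, pair in rules
                  if operator in ("=>", "<=>")}
    return set(feasible) - restricted


feasible = {("s", "s"), ("x", "x"), ("+", "0"), ("+", "e")}
# A combined rule restricts +:e; a surface coercion rule restricts nothing.
rules = [("<=>", ("+", "e")), ("<=", ("+", "0"))]
unrestricted = unrestricted_pairs(feasible, rules)
print(sorted(unrestricted))
```
.VE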
.NH 3
An Example
.LP
To illustrate the process here is an example of the compilation
of a simple spelling rule into an automaton.  The feasible set is
.VS L
   {a:a, b:b, c:c, d:d, e:e, ... y:y, z:z, +:0, +:e}
.VE
And the rule to be compiled is
.VS L
   Example
      +:e <=> {<{ c:c s:s } h:h> s:s x:x z:z } --- s:s
.VE
This rule is a \f2combined rule\f1 so we use the third compilation 
described above.  The automaton built for this rule effectively deals
with recognising two forms of pattern: one that allows the pair
\f2+:e\f1 to appear only in the specified context, and one that checks
that, if the context exists, only \f2+:e\f1 appears within it.
.LP
The first task is to convert the left context into an automaton.  This 
automaton will be shared between the two pattern matchers.  The algorithm
for converting a regular expression into an automaton can be found in many
text books on formal language theory and automata (for example Hopcroft and
Ullman (1979) p 32).  An automaton which represents the above left context
is
.VS L
             c:c  s:s  h:h  x:x  z:z  +:e  +:0

         s1  s2  s2,s3      s3   s3
         s2            s3
         s3
.VE
Note the nondeterminism in the two possible transitions from \f2s1\f1 for
the symbol \f2s:s\f1.
In addition to this set of simple transitions type information must be
added about each state.
.VS L
         s1  FINAL 
         s2  FINAL 
         s3  FINAL 
.VE
The next stage is to build \f2two\f1 automata for the right context
.VS L
             c:c  s:s  h:h  x:x  z:z  +:e  +:0

         s4       s5
and
         s6      ERROR
.VE
and these states are marked as follows
.VS L
         s4  LICENCE
         s5  TERMINAL FINAL
         s6  FINAL
.VE
\f2ERROR\f1 is a special state, the usage of which is described below in the 
interpretation section.  Now we must join the left context automaton to the 
first right context one by adding a transition from the end of the left
context one to the start of the right labelled with the rule pair
.VS L
             c:c  s:s  h:h  x:x  z:z  +:e  +:0

         s3                            s4
.VE
To detect places where the left and right contexts exist but the 
rule pair does not, the end of the left context must be connected to the 
start of the second right context automaton.  A transition is added for all
feasible pairs that have the same lexical character as the rule pair but a
different surface character.  In this example the only feasible 
pair which meets this criterion is \f2+:0\f1, so we must add the 
transition
.VS L
             c:c  s:s  h:h  x:x  z:z  +:e  +:0

         s3                                 s6
.VE
So the complete automaton that represents the sample rule is
.VS L
             c:c  s:s  h:h  x:x  z:z  +:e  +:0

         s1  s2  s2,s3      s3   s3
         s2            s3
         s3                            s4   s6
         s4       s5
         s6      ERROR
where
         s1  FINAL
         s2  FINAL
         s3  FINAL
         s4  LICENCE
         s5  TERMINAL FINAL
         s6  FINAL
.VE
All other rules are compiled in a similar way and are compiled into the 
same automaton.
.NH 2
Interpretation
.LP
A configuration of a match consists of a list
of states of the automaton.  The initial configuration of the machine starts
with only the initial state.  At each stage of the match all states in the
configuration are tested against the given pair to find a set of new states.
A configuration typically consists of only 3 or 4 states when our 15 spelling
rules are used.  It could be said that 3 or 4 rules are potentially
active at any time.  There are two types of state, \f2simple states\f1 and
\f2commit groups\f1.  A commit group is a collection of simple states.
After a move, the new configuration must meet some simple 
conditions to be valid, then some post-processing is done before the 
configuration is returned ready for the next match.  A new configuration
must pass the following conditions:
.IP
It must contain at least one simple state that is
marked LICENCE.
.IP
It must not contain the state ERROR.
.IP
All commit groups in a configuration must contain at least one
state.
.LP
After it has passed these tests the following modifications are made before
it is passed on to the next stage in the match.
.IP
Simple states which are marked TERMINAL are removed.  These states
represent the completion of a rule.
.IP
Commit groups containing a state marked TERMINAL are removed.
.IP
All remaining simple states marked LICENCE are collected into a
new commit group.
.LP
A complete surface to lexical match is valid if the terminal configuration
contains no commit groups.   The markings FINAL are not actively used
in this algorithm.  All states other than those in commit groups will be
marked FINAL.  The reason for the marking is partly historic and partly
so it is possible to understand the compilation process.  There is however
no time lost because of it at run time.  At compile time there is some loss
of speed but this is not so important (probably less than a second).
.LP
The interpretation means that each pair must be licensed in a match before
it is accepted and any post-conditions (from the right context) on the 
licence must also be met (via
the commit groups) before the complete strings are accepted.
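.LP
The interpretation can be tried out on the automaton of section 2.2.1 with
the following Python sketch.  It is not the LISP code of the system and
rests on several assumptions: the initial state (here \f3s1\f1) is tried
afresh at every step to model the re-starting of rules; states inside a
commit group advance in the same way as simple states; LICENCE states
collected into a commit group are removed from the simple states; and the
state for unrestricted pairs is given the made-up name \f3U\f1.  The two
tapes are assumed to be already aligned, with \f20\f1 for nulls.
.VS L
```python
import string

MARKS = {
    "s1": {"FINAL"}, "s2": {"FINAL"}, "s3": {"FINAL"},
    "s4": {"LICENCE"}, "s5": {"TERMINAL", "FINAL"}, "s6": {"FINAL"},
    "U": {"LICENCE", "FINAL", "TERMINAL"}, "ERROR": set(),
}
ARCS = {
    "s1": {"c:c": {"s2"}, "s:s": {"s2", "s3"}, "x:x": {"s3"}, "z:z": {"s3"}},
    "s2": {"h:h": {"s3"}},
    "s3": {"+:e": {"s4"}, "+:0": {"s6"}},
    "s4": {"s:s": {"s5"}},
    "s6": {"s:s": {"ERROR"}},
}
# Unrestricted pairs (a:a ... z:z and +:0) lead from the initial state to U.
for ch in string.ascii_lowercase:
    ARCS["s1"].setdefault(ch + ":" + ch, set()).add("U")
ARCS["s1"].setdefault("+:0", set()).add("U")
INITIAL = "s1"


def move(states, pair):
    """All states reachable from any of the given states on pair."""
    targets = set()
    for state in states:
        targets |= ARCS.get(state, {}).get(pair, set())
    return targets


def step(config, pair):
    """Advance (simple_states, commit_groups) by one pair; None if invalid."""
    simple, groups = config
    new_simple = move(simple | {INITIAL}, pair)
    new_groups = [move(group, pair) for group in groups]

    # Validity conditions on the new configuration.
    if "ERROR" in new_simple.union(*new_groups):
        return None
    if not any("LICENCE" in MARKS[s] for s in new_simple):
        return None
    if any(not group for group in new_groups):
        return None

    # Post-processing before the next match.
    kept_groups = [g for g in new_groups
                   if not any("TERMINAL" in MARKS[s] for s in g)]
    kept_simple = {s for s in new_simple if "TERMINAL" not in MARKS[s]}
    licensed = {s for s in kept_simple if "LICENCE" in MARKS[s]}
    if licensed:
        kept_groups.append(licensed)
    return kept_simple - licensed, kept_groups


def accepts(lexical, surface):
    """True if the aligned tapes form a valid complete match."""
    config = (set(), [])
    for lex, sur in zip(lexical, surface):
        config = step(config, lex + ":" + sur)
        if config is None:
            return False
    return not config[1]  # valid only when no commit groups remain


for lexical, surface in [("fox+s", "foxes"), ("fox+s", "fox0s"),
                         ("dog+s", "dog0s"), ("dog+s", "doges")]:
    print(lexical, surface, accepts(lexical, surface))
```
.VE
.LP
With this single rule the sketch accepts \f2fox+s\f1 against \f2foxes\f1
and \f2dog+s\f1 against \f2dog0s\f1, and rejects the other two alignments:
one reaches ERROR, the other contains an unlicensed \f2+:e\f1.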
.NH 2
Implementation
.LP
The compilation functions are in the file \f3makesp.l\f1.  The coding is
relatively straightforward.  The first task is reading in the spelling
rule file.  This takes a significant amount of the time and a more
efficient reader could reduce the time for compilation considerably.
However, given the advantage of a very portable reader and the fact that
compilation takes only around 30 seconds for our 15 rule description,
the time of compilation is not an important factor.  In the Common LISP
implementation, however, there could be a significant advantage.
.LP
One part of the coding that appears to be important is that the
automaton is built up by destructively updating a structure rather than 
simply collecting the transitions together.  This saves a lot of garbage
collection and gave a roughly five-fold speed up.
.LP
The set of transitions produced from the compilation are held in a tree 
and written
to the compiled file as a function that when evaluated will produce
a (probably circular) structure where a state is represented by 
an assoc list.
.LP
The compilation produces 7 structures (LISP s-expressions) which are
written to the 
output file <name>.sp.ma.  These are
.XP
Version stamp: This contains the version number, the LISP system that
the file was written under and whether the unification type is
unrestricted unification (\f3UU\f1) or term (\f3TU\f1) unification 
(though the unification
type is not important in the spelling rules).
This is used to stop incompatible compiled files being used.  
.XP
Lexical Alphabet: a list of symbols representing the lexical alphabet.
.XP
Surface Alphabet: a list of symbols representing the surface alphabet.
.XP
Transition List: a LISP function that when evaluated will generate 
a tree structure
containing the automaton transitions.  In this tree structure a \f2state\f1 is
represented by an assoc list whose first element contains information
about the state (its type, i.e. TERMINAL, LICENCE, and/or FINAL, the spelling
rule this state originated from, and a name); the 
rest of the assoc pairs are automaton symbols (concatenations of the
symbol \f3'D\f1 and the lexical and surface characters) and new states.
This means the structure generated by this function (which is saved in
\f3D-TRANSITIONSLIST\f1) will normally be a circular structure.
.XP
Unusual Feasibles: a list of symbols (concatenations of \f3'D\f1 and a
lexical and surface character) where the lexical and surface characters
are not the same.  This is no longer actually used but was used to
find out quickly during matching what pairs are feasible.
.XP
Surface to lexical sets:  this is an assoc list in which the first element
of each entry is a surface character and the rest is a list of all lexical characters
that can correspond to that surface character.  This allows the search
strategy used in matching a surface string with the lexicon to be more
efficient in that only feasible pairs are tested with the spelling
rules.
.XP
Automaton Symbols:  this is an assoc list (indexed by surface
characters) of assoc lists (indexed by lexical characters).  This is used
to find the symbol that is the result of concatenating the surface and lexical
characters with \f3'D\f1.  It was found that looking this result up during
analysis was quicker than generating it.
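.LP
The last two structures are lookup tables derived from the feasible pairs.
A Python analogue, using dictionaries in place of LISP assoc lists, might
look like this.  The function names are hypothetical, and the sketch assumes
the symbol is formed as \f3D\f1 followed by the lexical and then the surface
character.
.VS L
```python
def surface_to_lexical(feasible):
    """Map each surface character to the lexical characters it may match."""
    table = {}
    for lex, sur in sorted(feasible):
        table.setdefault(sur, []).append(lex)
    return table


def automaton_symbols(feasible):
    """Nested table, indexed by surface then lexical character, holding the
    precomputed automaton symbol so it need not be built during analysis."""
    table = {}
    for lex, sur in feasible:
        table.setdefault(sur, {})[lex] = "D" + lex + sur
    return table


feasible = {("e", "e"), ("s", "s"), ("+", "e"), ("+", "0")}
candidates = surface_to_lexical(feasible)
symbols = automaton_symbols(feasible)
# Only the lexical characters listed for a surface character are tried.
for lex in candidates["e"]:
    print(lex, symbols["e"][lex])
```
.VE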
.LP
The above structures are all read in by the function \f3D-LoadSpRules\f1
(defined in \f3mafuncs.l\f1) into the global variables.
.LP
The functions involved in the interpretation of the rules are defined
in the file \f3spmoveau\f1 which
is included in the files \f3autorun.l\f1 (morpheme segmentation) and
\f3mconcat.l\f1 (morpheme concatenation).  There are basically three
functions that are called from outside this file.
.XP
\f3D-Tokenize\f1: this function takes a symbol made up of surface
characters and returns a list consisting of a list of surface characters
which it contains and an initial configuration (which is actually \f3nil\f1).
.XP
\f3D-CheckPairMatch\f1: this takes a lexical and surface character
and a spelling rule configuration and returns a new configuration if it is
valid or the symbol \f3'ERROR\f1. 
.XP
\f3D-Final\f1: this takes a configuration and will return \f3t\f1 if
the configuration is a valid final configuration with respect to the 
rules and returns \f3nil\f1 otherwise.
.ds RH Section 3
.bp
.NH 1
Word Grammar and Unification
.LP
The word grammar in a lexical description describes which morphemes
are allowed to join together.  It is basically a unification feature grammar 
and
may be of two types.  There is an option when building the system to
choose between two forms of unification (and hence category types).  The
first is called unrestricted unification which follows the form of unification
found in theories like GPSG (Gazdar et al. 1985), FUG (Kay 1985), LFG
(Kaplan and Bresnan 1982) and other formalisms.  The second
option is term unification which is found in Prolog and has been used
as a base formalism for implementing general feature grammars (Thompson 1987,
Briscoe et al. 1986).
.NH 2
Historical Development
.LP
Given a basic formalism which segments words into morphemes, as
described in the previous section, there still remains the task of 
describing which morpheme combinations are valid.  Obviously
\f2walk\f1 combined with \f2+ly\f1 to form \f2walkly\f1 is
in some sense \f2morphographemically\f1 correct (i.e. it complies
with the spelling rules) but is not \f2morphotactically\f1 correct.
These two processes, segmentation into morphemes and defining what are
valid combinations, are distinct and we have made this distinction explicit
in our rule formalisms.
.LP
In the work of Koskenniemi (1983) this morphotactic component is described
in terms of continuation classes held within the lexicon.  In his 
description each word class is held in a separate lexicon.
During analysis when the end of a morpheme is found on the lexical string
the lexical entry also contains the list of names of feasible continuing
lexicons.  This means that at the end of a verb stem entry there could be a 
continuation list consisting of the name of the verb inflectional suffix
lexicon and the name of the lexicon containing derivational suffixes that 
attach to verbs.
.LP
This means that his system has a finite state description for
morphotactics. \**
.FS
Note this is not the reason that the Koskenniemi 
spelling rule formalism is sometimes termed \*Qfinite-state morphology\*U.
The term \*Qfinite-state morphology\*U comes from the use of transducers
in morphographemics,
not the finite state morphotactics being discussed in this section.
.FE
In English, which has significantly simpler morphology (both morphographemics
and morphotactics) than Finnish (Koskenniemi's example language), there
are examples which cannot readily be described using a finite-state 
morphotactics.  These occur when for example a prefix can only combine
with a noun, but the next morpheme is a verb which only later becomes a noun
by derivation.  Thus if the continuation class for prefix contained only
\f2noun\f1 it would not find derived nouns.  An example of this is the 
prefix \f2non-\f1 which combines with nouns yet the string segment 
\f2non-conform\f1 is acceptable when it is followed by the suffix
\f2+ist\f1.
.LP
Because of this we decided to allow the user to specify the morphotactics
by an explicit context-free grammar rather than a finite-state
grammar implicit within the lexicon structure.  However there are
advantages in the finite-state model when morphemes are short (one character
or so) and have the same realisation on the surface.
.LP
Our grammar formalism is a feature based unification grammar which is 
in essence context-free. \**
.FS
The requirement that there is a finite number of categories (i.e. categories
can only hold themselves to a pre-defined finite depth) is not actually
checked for because it is computationally very expensive but it is 
\*Qunderstood\*U
to be the case.
.FE
Originally we had a very simple mechanism that used no feature conventions
or defaults.  Our initial word grammar consisted of a rule for each type
of inflection/derivation.  Later we introduced feature passing
conventions which allowed us to reduce the number of rules to only a very
few (two).  Later as we dealt with more phenomena (such as compounding)
this crept up to 5.  We feel that having a formalism as powerful as 
a full unification feature grammar is probably not necessary to describe
English morphotactics but in a general morphological tool it is justified.
.NH 2
Categories and Features
.LP
The main method of identifying syntactic classes in the system is
by the use of features and categories.
All features and their values must be declared.  Features can be
declared with a defined set of atomic values or as category valued.
Categories are sets of features and values.  Features may also have
variable values.  Variables must also be declared either as having a 
finite atomic range or as ranging over categories.  For a variable to be
a feature value, the range of the variable must be a subset of the range
of the feature.  Thus a form of typed-unification is used as the 
combining function during analysis (see section 5.3).
.LP
There are two basic ways to write categories (this is true in either
the unrestricted or term unification versions of the system).  One 
is a simple LISPish notation where categories are simply lists of
feature value pairs, each of which is a two-element list
containing an atomic feature name and its value (atomic, category or variable).
For example
.VS L
     ((N +) (V -) (BAR 0) (PLU +) (COUNT -))
.VE
Alternatively categories are written in a style closer to that in GPSG
(Gazdar et al. 1985), where categories consist of a name (which is
an alias for some features) followed by a \f2feature bundle\f1 in 
square brackets.  For example assuming \f2Noun\f1 is an alias for
\f2((N +) (V -) (BAR 0))\f1 a category in this alternate form may be
.VS L
     Noun[PLU +,COUNT -]
.VE
Note that the order in which features are given is arbitrary.
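.LP
The relation between the two notations can be illustrated by a small Python
sketch (not the system's LISP reader) that expands an alias and appends the
feature bundle.  It handles atomic values only; the table \f3ALIASES\f1 and
the function \f3expand_category\f1 are hypothetical.  Values are kept as
strings, so that a value such as 0 is a symbol rather than an integer.
.VS L
```python
ALIASES = {"Noun": [("N", "+"), ("V", "-"), ("BAR", "0")]}


def expand_category(text):
    """Turn a GPSG-style category into a list of (feature, value) pairs."""
    name, _, rest = text.partition("[")
    features = list(ALIASES[name.strip()])
    bundle = rest.rstrip().rstrip("]")
    for item in bundle.split(","):
        if item.strip():
            feature, value = item.split()
            features.append((feature, value))
    return features


category = expand_category("Noun[PLU +,COUNT -]")
print(category)
```
.VE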
.LP
All categories returned or printed by the system are given in the 
LISPish form as finding the appropriate alias to use is a non-trivial
task in general.  In fact during analysis the GPSG form is transformed
into the LISPish form.
.LP
In both the versions of the system this LISPish form is further
translated into
an internal category form (see the next two sections for details of these
internal forms) and converted back before being returned to the caller.
.LP
Variables (in both versions) are treated in the same way.  When a
category is being converted into its internal form variables are normalised.
Variables are represented by a list where the first element is
an integer number (which is unique for the variable \f2within\f1 the 
category) followed by the range of the variable.  The range will be a list of
atomic values or the atom \f3category\f1.\**
.FS
This means category valued variables are represented by LISP dotted pairs.
.FE
The distinguishing feature of the variable is not the integer number but
the cons cell of which it is the car.  Thus variables 
are the same when they are actually the same structure (i.e. LISP \f3eq\f1)
and not just when they are a structure with the same values.
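.LP
The difference between identity and equality of variables can be shown with
a Python analogue, in which \f3is\f1 plays the part of LISP \f3eq\f1 and
\f3==\f1 compares values.  The helper \f3make_variable\f1 and the choice of
features are illustrative assumptions, not part of the system.
.VS L
```python
def make_variable(index, value_range):
    """A variable is a fresh list: an integer followed by its range."""
    return [index] + list(value_range)


x = make_variable(1, ["+", "-"])
y = make_variable(1, ["+", "-"])
# PLU and COUNT share one variable; N holds a different one that happens
# to have the same number and range.
category = [("PLU", x), ("COUNT", x), ("N", y)]

same = category[0][1] is category[1][1]      # the same structure
distinct = category[0][1] is category[2][1]  # different structures
equal = category[0][1] == category[2][1]     # yet equal in value
print(same, distinct, equal)
```
.VE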
.LP
Feature names, values and user variable names are atomic; even if the user
uses numbers these are treated by the system as symbols that have a print
name which is a number but are not integers themselves.
.NH 2
Word Grammar Rules
.LP
Word grammar rules consist of a mother category and one or more daughter
categories.  Rules with a null right hand side are not allowed.
User rules may contain \f2rule category variables\f1 which are declared
to range over a set of aliases.  These can be used to make rules refer
to a number of different categories.  This form of variable is expanded
during grammar compilation thus producing multiple grammar rules per user
written rule.  The feature value variables are not expanded at compile
time but are bound during analysis by unification.
.LP
During compilation a grammar rule is transformed into a list consisting of
.XP
variable flag: this is \f3t\f1 if the grammar rule contains variables, and
\f3nil\f1 otherwise.  This is used by the analyser at run time to decide 
whether a rule has variables in it that have to be made unique.
.XP
rule name: the grammar rule name as specified by the user.  This user name
may be suffixed by a number if the user form of the rule contained
rule category variables, as these are expanded at compile time.
.XP
rule categories: a list of categories where the first one is the mother
and the rest are the daughters of the rule.
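.LP
As an illustration of the compiled rule form, the following Python
sketch expands rule category variables and builds the three-element
structure described above.  It is a sketch under assumptions: feature
value variables are written as strings starting with \f3?\f1, the
numbering of the suffix is invented for the example, and the names
\f3compile_rule\f1 and \f3has_variable\f1 are not those of the system.
.VS L
```python
from itertools import product

def has_variable(category):
    """Feature value variables are strings starting with a question mark."""
    return any(str(value).startswith("?") for _, value in category)

def compile_rule(name, categories, cat_vars):
    """Return a list of (variable flag, rule name, rule categories).

    categories: mother first, then the daughters; an item may be a
    string naming a rule category variable.
    cat_vars: variable name -> categories it ranges over (aliases
    already expanded).
    """
    used = [v for v in cat_vars if v in categories]
    if not used:
        flag = any(has_variable(c) for c in categories)
        return [(flag, name, categories)]
    compiled = []
    choices = product(*(cat_vars[v] for v in used))
    for number, choice in enumerate(choices, start=1):
        binding = dict(zip(used, choice))
        expanded = [binding[c] if isinstance(c, str) else c
                    for c in categories]
        flag = any(has_variable(c) for c in expanded)
        compiled.append((flag, name + str(number), expanded))
    return compiled

noun = [("N", "+"), ("V", "-")]
verb = [("N", "-"), ("V", "+")]
suffix = [("INFL", "?I")]
rules = compile_rule("Suffix", ["X", "X", suffix], {"X": [noun, verb]})
```
.VE
.LP
One user rule with a rule category variable ranging over two aliases
thus yields two compiled rules, while the feature value variable is
left in place to be bound by unification during analysis.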
.NH 2
Unrestricted Unification Grammar
.LP
The important distinction in unrestricted unification is that there is not
an explicit set of category types.  In this version categories are effectively
shorthand for any valid categories that do not clash.
.LP
Within the system the functions \f3D-MakeCategory\f1 and \f3D-MakePCategory\f1
convert from the printable form of a category into the internal format 
and vice versa.  In the unrestricted unification version the internal 
form is virtually the
same as the printable form but it is useful to view the two types of category
(printable and internal) as distinct types.  The only change that occurs
in the unrestricted unification version is the conversion of variables 
into the form described in the previous section. 
.LP
The result from converting the internal form into a printable form is not
suitable as input to the category building function again.  This is because
of variables.  The function \f3D-MakePCategory\f1 inserts into
all variables the symbol \f3'<UNBOUND-VARIABLE>\f1 but otherwise leaves
variables as they are.  These variables are not standard user declared 
variables and hence \f3D-MakeCategory\f1 will not convert them properly.
.LP
Most of the word grammar compilation process involves setting up structures
that will allow the analyser to run faster.
.LP
One bottleneck in a chart parser is finding which grammar rules
are applicable and should be added to the chart.  This stage is
called \f2proposing\f1.  That is, when a complete constituent
is found it is necessary to search the grammar for all rules that this
constituent might expand (assuming the parser is running bottom-up).  This
can be an expensive process, since done naively it requires a unification
operation for each rule in the grammar.  To get round this problem a
different access method has been implemented for indexing the grammar.
At compile time a special discrimination list is built.
.LP
The discrimination list is indexed by feature name, pointing to indices
which are indexed by feature value into lists of grammar rule names, such
that looking up a feature and then a value will give a list of grammar rules
which \f2cannot\f1 unify with a category that has that feature value pair.
.LP
For example given the two grammar rules
.VS L
   (VerbSuffix
      ((CAT Verb) (INFL ?INFL)) ->
         ((CAT Verb) (INFL -)),
         ((CAT VSuffix) (INFL ?INFL)) )

   (NounSuffix
      ((CAT Noun) (INFL ?INFL)) ->
         ((CAT Noun) (INFL -)),
         ((CAT NSuffix) (INFL ?INFL)) )
.VE
We construct a discrimination list from the left daughters.\**
.FS
We use the left daughter because the chart parser is running bottom up,
if we wished to run top-down the list should be created from the mother
categories
.FE
Thus producing
.VS L
   ((CAT
      (Verb NounSuffix)
      (Noun VerbSuffix)
      (NSuffix NounSuffix VerbSuffix)
      (VSuffix NounSuffix VerbSuffix))
    (INFL 
      (PLU NounSuffix VerbSuffix)
      (PAST NounSuffix VerbSuffix) 
      (ING NounSuffix VerbSuffix)))
.VE
This can be read as \*Qif the proposing category has a marking \f2(CAT Noun)\f1
then the rule \f2VerbSuffix\f1 is not suitable\*U; likewise if the proposing
category has a marking \f2(CAT NSuffix)\f1 then both the rules 
\f2NounSuffix\f1 and \f2VerbSuffix\f1 are unsuitable.
.LP
The matching process that happens at run time is as follows.  Start with 
an initial \f2suitable rules\f1 list of all the grammar rules.
For each feature in the proposed category
find out what rules are unsuitable and reduce the list of suitable rules
accordingly at each stage.
This reduces the cost of the proposing stage to about that of one unification.
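.LP
The construction and use of a discrimination list can be sketched in
Python as follows.  This is an illustration only, not the LISP code of
the system: the feature ranges in \f3FEATURES\f1 are assumed from the
example above, variables are written as strings starting with \f3?\f1,
and only atomic feature values are handled.
.VS L
```python
FEATURES = {
    "CAT": ["Verb", "Noun", "NSuffix", "VSuffix"],
    "INFL": ["-", "PLU", "PAST", "ING"],
}
LEFT_DAUGHTERS = {
    "VerbSuffix": {"CAT": "Verb", "INFL": "-"},
    "NounSuffix": {"CAT": "Noun", "INFL": "-"},
}

def is_variable(value):
    return isinstance(value, str) and value.startswith("?")

def build_discrimination(left_daughters, features):
    """feature -> value -> rules that cannot unify with that pair."""
    table = {}
    for feature, values in features.items():
        by_value = {}
        for value in values:
            clashing = sorted(
                rule for rule, cat in left_daughters.items()
                if feature in cat
                and not is_variable(cat[feature])
                and cat[feature] != value)
            if clashing:          # keep only entries that discriminate
                by_value[value] = clashing
        if by_value:
            table[feature] = by_value
    return table

def propose(category, table, all_rules):
    """Start with every rule and strike out the unsuitable ones."""
    suitable = set(all_rules)
    for feature, value in category.items():
        if not is_variable(value):
            suitable -= set(table.get(feature, {}).get(value, []))
    return sorted(suitable)

table = build_discrimination(LEFT_DAUGHTERS, FEATURES)
proposed = propose({"CAT": "Noun", "INFL": "-"}, table, LEFT_DAUGHTERS)
```
.VE
.LP
Here \f3proposed\f1 contains only \f2NounSuffix\f1, and the table has
no entry for the \f2INFL\f1 value \f3-\f1 because that value rules
nothing out.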
.LP
Some points
should also be mentioned here.  Note in the above example that
the \f2INFL\f1 sub-list contains no entry for \*Q\f3-\f1\*U.
This is because if a proposed category has that feature and value there
are no rules that are unsuitable.  All sublists
that can make no discrimination are removed from the list.  This emphasizes
the fact that in unrestricted unification \f2clashing\f1 features are the
distinguishing factor.
.LP
Also it should be explained how the discrimination deals with variables in
the rules and proposing category.  Variables are treated as if they
range over the whole feature range (which may not be true) thus they
cannot discriminate.  This means that matching against the discrimination
list may be over-general and more rules may be found than are actually
required.  This is acceptable because a real unification will be done later
in the parse so it will not change the number of parses though it may slow
the analysis down.  Also the time required to make the discrimination
list deal with variables (and category valued features) completely is
probably not worth the possible extra time that is lost when stray rules
are added to the chart.  Of course this does depend on the actual grammar
and if the first daughters of the grammar rules are actually under-constrained
then the analysis process will be slower \(em this would also be true if 
unification were used at propose time instead.
.LP
The unrestricted unification version of the system offers the use of feature
passing conventions which allow additional constraints to be made on the 
analysis trees.  These are simply declared as lists of feature names.  See
section 5.3 for details of how they are interpreted during parsing.
.LP
The other option that is available only in the unrestricted unification
version is
\f3LCategory\f1 definitions.  These are a form of defaults which are
applied after basic analysis of a word.  They are in some sense a
definition of category types but are not strong enough to reject analyses.
They consist of a minimum category for identifying a type and a list
of features that must also exist in such a category.  Categories 
in analyses that match the first part of an LCategory rule are checked
against each of the required features.  If a required feature does not
exist it is added
with a blank value.  The blank value is whatever is returned by the
macro \f3D-BlankVariable\f1 which is defined in the file \f2subrout\f1.
In the standard system this returns a symbol starting with \f3@\f1.  This
macro may be redefined (as a macro or function) depending on the requirements
of the user of the system \(em but this can only be done at install time.
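.LP
The effect of an LCategory definition can be pictured with the
following Python sketch.  It makes two assumptions that the text does
not settle: matching the minimum category is treated as simple
inclusion of its feature value pairs, and the blank value is the fixed
string held in \f3BLANK\f1 (standing in for whatever
\f3D-BlankVariable\f1 returns).
.VS L
```python
BLANK = "@blank"   # stand-in for the value returned by D-BlankVariable

def matches_minimum(category, minimum):
    """True if every pair of the minimum category is in the category."""
    return all(category.get(f) == v for f, v in minimum.items())

def apply_lcategories(category, lcategories):
    """Add each required feature that is missing, with a blank value."""
    result = dict(category)
    for minimum, required in lcategories:
        if matches_minimum(category, minimum):
            for feature in required:
                result.setdefault(feature, BLANK)
    return result

# One hypothetical definition: nouns must carry PLU and COUNT.
lcats = [({"N": "+", "V": "-"}, ["PLU", "COUNT"])]
noun = apply_lcategories({"N": "+", "V": "-", "PLU": "+"}, lcats)
```
.VE
.LP
Note that nothing is ever rejected: a category that does not match the
minimum category is returned unchanged.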
.LP
The reason for LCategory definitions was to allow the system to be used 
with a term unification parser (Briscoe et al 1986) where there had to 
be a fixed set of features on varying category types.  This solution
was thought of as an immediate solution but by no means the best one.
It is possible and would be significantly better to write a description
such that the appropriate features were already added to the category types.
So it is wise to look upon LCategory Definitions as a dubious method and either
write a proper description or use the term unification version (which 
was not available when we introduced LCategory definitions).
.NH 2
Term Unification Grammar
.LP
At a late stage in the project it was decided to experiment with term 
unification instead of just unrestricted unification.  In term unification,
for categories to unify they must have the same number of features
(and their values must not clash).  Thus the unification becomes like
that used in the language Prolog where terms (hence the name) must have the 
same arity before they can successfully unify.
.LP
In our implementation of term unification we went a stage further in
that category types must be explicitly declared; types would exist implicitly
in any term unification system but it seems useful for both the
user and efficiency of the system to do this.  
.LP
Category definitions consist of an atomic name and a list of feature names.
All categories in the grammar (and lexicon) must be of one and 
only one of these
category types.  After alias expansion categories are converted into 
an internal form.  First the category type must be found.  The feature names
of the given category are checked against the category definitions, and the
\f2one\f1 definition of whose features they are a subset gives the type.
Then for each feature in the type that does 
not appear in the user-specified category a variable value is added which 
ranges over the range of that feature.  The actual category structure
consists of a simple list where the car is the category type (the name
of the category type from the category definition) and the rest of the
list is the \f2values\f1 of the features in the order the features were 
specified in the category definition.  No feature names are necessary
because the category type is sufficient to distinguish the feature names
when necessary.
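.LP
The conversion into this positional form can be sketched in Python.
The category definitions and feature ranges below are invented for the
example, and the function names are hypothetical; the sketch only
mirrors the steps just described.
.VS L
```python
CATEGORY_DEFS = {            # hypothetical category definitions
    "Noun": ["N", "V", "PLU", "COUNT"],
    "Suffix": ["N", "V", "INFL"],
}
FEATURE_RANGES = {"N": ["+", "-"], "V": ["+", "-"], "PLU": ["+", "-"],
                  "COUNT": ["+", "-"], "INFL": ["+", "-"]}

def find_type(pairs, defs=CATEGORY_DEFS):
    """The type is the one definition containing all the feature names."""
    names = {feature for feature, _ in pairs}
    types = [t for t, feats in defs.items() if names <= set(feats)]
    if len(types) != 1:
        raise ValueError("category must belong to exactly one type")
    return types[0]

def to_internal(pairs, defs=CATEGORY_DEFS, ranges=FEATURE_RANGES):
    """Category type followed by the values in declared feature order."""
    ctype = find_type(pairs, defs)
    given = dict(pairs)
    counter = 0
    internal = [ctype]
    for feature in defs[ctype]:
        if feature in given:
            internal.append(given[feature])
        else:
            counter += 1                       # fresh variable over the range
            internal.append([counter] + ranges[feature])
    return internal

def to_printable(internal, defs=CATEGORY_DEFS):
    """Reattach the feature names using the category type."""
    return list(zip(defs[internal[0]], internal[1:]))

internal = to_internal([("PLU", "+"), ("N", "+")])
```
.VE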
.LP
When categories are returned or printed by the system they are reconstructed
with their feature names.  Unbound variable values are returned as described 
in the previous section.
.LP
Term unification categories are interesting in that they have a strong
type and therefore you cannot write rules that can range over varying 
category types without using rule category variables (see above).
.LP
Unification is now a simpler process in that there is no need to find the 
appropriate feature in each category as now their positions are known.
However there is the problem that all categories are now fully specified
and (depending on the description) larger, even though most of the values
are probably variables.  This means that there is more feature value 
unification to be done.  The result of this is that for our current
descriptions term unification is slower than unrestricted unification.  This
may change if a description more suitable to term unification were written.
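.LP
A much simplified Python sketch of positional unification is given
below.  It assumes the internal form is a list whose first element is
the category type, that variables are lists of an integer followed by
a range, and that no variable is shared between positions; the real
unifier must also handle shared variables and category valued
features.
.VS L
```python
def unify_values(a, b):
    """Atoms must be equal; a variable is a list [number, value, ...]."""
    a_var, b_var = isinstance(a, list), isinstance(b, list)
    if not a_var and not b_var:
        return a if a == b else None
    if a_var and b_var:
        common = [v for v in a[1:] if v in b[1:]]
        return [a[0]] + common if common else None
    atom, var = (b, a) if a_var else (a, b)
    return atom if atom in var[1:] else None

def unify_terms(x, y):
    """Same type and arity are required; values are unified in step."""
    if x[0] != y[0] or len(x) != len(y):
        return None
    result = [x[0]]
    for a, b in zip(x[1:], y[1:]):
        value = unify_values(a, b)
        if value is None:
            return None
        result.append(value)
    return result

unified = unify_terms(["Noun", "+", [1, "+", "-"]],
                      ["Noun", [1, "+", "-"], "-"])
```
.VE
.LP
No search for feature names is needed, but every position is visited
even when both values are variables, which is the extra work referred
to above.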
.LP
Apart from unification the other major advantage is in proposing.  Because
all categories have an atomic type the grammar can be treated as a form
of (modified) context-free grammar thus opening the possibility of
optimisations.  Given any category type proposed to the grammar
it is possible to find out which grammar rules it may match by merely looking
at the category type of the first daughter of each rule.  This will of course
be over-general but very fast.  In fact one can go further and calculate
at grammar compile time for each category type which rules are appropriate
and then taking the category type of the mother of the rule find out
which rules are appropriate and so on.
.LP
This can be implemented by building a matrix of category types of first
daughter to category types of mother, then finding the transitive closure
(Aho and Ullman 1972 p7) of the matrix.  This then gives the relation category 
type \f2can make\f1 category type.
.LP
Using this relation we can replace the proposing stage of the parsing by
only one call when a new lexical entry is added to the chart.  Thus
at that stage all relevant rules can be added to the chart (if not already 
there) and no other proposing need be necessary at any other part of the
algorithm.
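.LP
The \f2can make\f1 relation and the resulting table of rules to start
can be sketched in Python as follows.  The closure is computed here
with the algorithm of Warshall, which is one standard way of finding a
transitive closure; the rule names and category types are invented for
the example and the function names are hypothetical.
.VS L
```python
def can_make(rules):
    """rules: list of (name, mother type, first daughter type)."""
    types = sorted({t for _, m, d in rules for t in (m, d)})
    reach = {a: {b: False for b in types} for a in types}
    for _, mother, daughter in rules:
        reach[daughter][mother] = True
    for k in types:                      # transitive closure
        for i in types:
            if reach[i][k]:
                for j in types:
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

def rules_to_start(rules):
    """category type -> rules to add when it is found in the lexicon."""
    reach = can_make(rules)
    table = {}
    for ctype in reach:
        made = {ctype} | {t for t, ok in reach[ctype].items() if ok}
        table[ctype] = sorted(n for n, _, d in rules if d in made)
    return table

RULES = [("R1", "Word", "Stem"),
         ("R2", "Stem", "Prefix"),
         ("R3", "Word", "Word")]
start = rules_to_start(RULES)
```
.VE
.LP
When a \f2Prefix\f1 is found in the lexicon all three rules are
started at once, so nothing further need be proposed when a
\f2Stem\f1 or \f2Word\f1 is later completed.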
.NH 2
Implementation
.LP
The grammar compilation functions are basically held in the file 
\f3mkwgram.l\f1 though they include code that is shared with the lexicon
compiler.  The main part that is shared is the declarations parser, in
file \f3dclsconv\f1.  This
does have the unfortunate problem that some declarations that are specific
to the grammar may also be declared in the lexicon, but the lexicon declaration
has no effect.  This is probably a bug and only valid declarations should
be allowed in a file.
.LP
The grammar compiler is probably the fastest compiler out of the three
compilers in the system.  It takes the file \f3<name>.gr\f1 and writes 
the file \f3<name>.gr.ma\f1 with a number of s-expressions which
are read in by the analyser.  The s-expressions represent
.XP
Version stamp: This contains the version number, the LISP system that
the file was written under and whether the unification type is
unrestricted unification (\f3UU\f1) or term (\f3TU\f1) unification.
This is used to stop incompatible compiled files being used.  
.XP
Grammar: a list of grammar rules, each of the form described
in 3.3 above.  Note that the categories in the grammar rules are not
yet converted into internal categories.  This is because it is necessary
that variables which are the same are the same cons cell, but if such a 
structure is printed out and read in this will no longer be true;  hence
the conversion of the categories is done when the grammar is loaded into the
analyser.
.XP
Aliases: this contains an assoc list of alias name to its value.  These are
not actually used in the analyser as all aliases are expanded during 
compilation.
.XP
Features: this contains an assoc list of feature names and their possible
values.  Their values are either an enumerated list or the atom 
\f3category\f1.
.XP
Variables: this is the same format as the features list.  This is an assoc
list of user declared variable name and its range.  This is used in the 
conversion of the variables to their internal form.
.XP
Category valued features:  this is a simple list of category valued features.
This will always contain at least the feature \f3STEM\f1.
.XP
Head features:  this is a simple list of features which are to be affected
by the head feature convention.  This will always be \f3nil\f1 in the 
term unification version as no feature passing conventions are allowed.
.XP
Daughter features:  this is a simple list of features which are to be affected
by the daughter convention.  This will always be \f3nil\f1 in the 
term unification version as no feature passing conventions are allowed.
.XP
Feature defaults: this is a list of pairs consisting of feature name
and value for default.
.XP
Distinguished category: this contains the distinguished category 
for the grammar.  In the case of the unrestricted unification
version this is a category
which has not yet been converted into the internal form (see reason 
above in grammar).  In the term unification version this contains a list of 
category types which are to be distinguished.  This should never be \f3nil\f1
in the term unification version.
.XP
Morphological only features: a simple list of the features that are
declared in the \f3MorphologyOnly\f1 feature class.  These features are
removed from the top category of analyses before being returned to 
the caller.
.XP
LCategory definitions:  this is always \f3nil\f1 in the term 
unification version of the system.  In the unrestricted unification
version this
contains the list of declared LCategory definitions.  Each definition is
a pair where the first element is a category (not yet in internal
form) and the second is a simple list of features that must be in that 
category.
.XP
Discrimination list:  in the unrestricted unification 
version this contains the discrimination
list for the grammar as described above.  In the term unification version 
this will always be \f3nil\f1.
.XP
Category definitions:  this is always \f3nil\f1 in the unrestricted 
unification version.
In the term unification version this contains the user declared category
types.  These are in the form of a list of category definitions, where a
category definition consists of a list where the first element is the name
of the category and the rest is the list of features it contains.  This 
should never be \f3nil\f1 (in the term unification version).
.XP
Can make tree: this is \f3nil\f1 in the unrestricted unification
version.  In the term
unification version this contains the matrix for the relation category type 
\f2can make\f1 category type.  It is constructed as described in section 
3.5 above.  The format is an assoc list, indexed by atomic category types
(as declared by the \f3CatDef\f1 definitions).  Each element has a category
type as its car and the cdr is a list of rule names which should be started
when that category type is found in the lexicon.
.LP
As with the other compiled files the above s-expressions are read in
by the loading functions into global variables.
.ds RH Section 4
.bp
.NH 1
Lexicon
.LP
The third part of a lexical description consists of the entries themselves.
In addition to the list of entries this section also includes lexical 
redundancy rules.  These are of three types which can be used to manipulate
basic entries into fully specified ones by adding defaults etc.
.NH 2
Lexical Entries
.LP
Compiled entries are written to disk and not stored in core.  This is necessary
as the LISP system is not large enough to hold all the information
in core.  These entries are indexed by a tree structured index made from
the citation forms of entries.
.LP
The basic compilation process takes the lexical rules and then reads each
lexical entry and expands it according to the rules.  It then adds the 
citation form to the index tree.  The lexical entry is normally written
out to the file \f3<name>.en.ma\f1 and the current byte position of the 
start of the written expanded entry is held in the lexicon tree.  There
is however a flag \f3D-INCOREFLAG\f1 which when set will cause the compiler
to keep the entry in core rather than write it to disk.  This may be 
set on and off by the compiler directives \f3incore on\f1 and \f3incore off\f1.
This is intended for common words (typically closed class) which will
be accessed
often.  When an entry is compiled incore it will be faster to analyse.
.LP
The entries are processed one by one and not all read into core at once
(as there is not enough room).  The system can easily cope with dictionaries
of up to 15,000 entries but larger than that the LISP system (that is Franz)
runs out of symbol space.  There are a number of possible ways round this.
One way is to use a larger LISP system.  It is possible (and documented
in the Franz installation guide) to build a double sized LISP which would
reduce the problem so that at least 30,000 word dictionaries could be compiled.
This however is still a limitation.  Another more general solution is to 
compile separate sub-lexicons.  The system allows the loading of multiple
lexicons and hence it is possible to compile smaller lexicons separately
and load them together only at run time.  Conceptually it would possibly
be better to think of separate lexicons for separate word classes but that
produces a more inefficient set of lexicons.  Compiling separate lexicons
for different initial letters of entries is probably a better idea.  Once
large lexicons are being dealt with separate word classes will get difficult
to handle as many words will appear in different lexicons.  Also 
sub-lexicons for each initial letter loaded in via the \f3D-AddLexicon\f1
function produce almost as efficient a lexicon as compiling the 
whole lexicon together.
.LP
The lexicon tree is a standard efficient technique for indexing words
(Thorne et al. 1968, Karttunen 1983).  The constructed index is such
that each node is labelled with a character and the characters from the
nodes of any path from the root of the tree to a marked end node represent
a citation form.  For example an index tree for the citation forms
\f2car\f1, \f2carp\f1, \f2coat\f1, \f2cone\f1 and \f2care\f1 would be as
follows.  A \*Q\f3.\f1\*U marks an end node
.VS L
.eo
                 e.
                /
          a - r. 
         /      \\ 
        c        p.
         \\  
          o - a - t.
           \\  
            n - e.
.ec 
.VE
In LISP, trees are represented as assoc lists.  Each assoc pair has as
its car the character label and as its cdr the associated list of
sub-trees.  The sub-trees are ordered alphabetically.  End nodes are
identified by making the first sub-tree have a special character label
(defined by the macro \f3DK-ENDTREE\f1) which because it is longer than
one character can never appear in a citation form.  This sub-tree is
different in that its cdr is a list of entries which have that
citation form.  This list of entries is actually a list of structures.
Each of these structures consists of two items.  The car is a flag (\f3t\f1 or
\f3nil\f1) which indicates whether the entry is \f2non-inflectable\f1,
that is, whether it has been declared in that class in the lexicon file.
Entries that are in that class can only appear at the end of a surface
string and if found before the end are not read in or passed back to
the chart parser for further analysis.  The rest of the entry structure
(cdr) is either a number, which is a byte index into the
\f3<name>.en.ma\f1 file where the start of the lexical entry is, or if
it is not a number then it is the fully expanded lexical entry actually
in core (that is, if the flag \f3D-INCOREFLAG\f1 was set when this
entry was compiled).
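.LP
The index tree can be sketched in Python using nested lists in place
of LISP assoc lists.  This is an illustration only: the marker
\f3END\f1 stands in for the value of \f3DK-ENDTREE\f1, the functions
\f3add_entry\f1 and \f3lookup\f1 are hypothetical, and each entry
structure is shown as a pair of the non-inflectable flag and a byte
index.
.VS L
```python
END = "AA"   # stand-in for the value of DK-ENDTREE

def add_entry(subtrees, citation, entry):
    """Insert an entry structure under its citation form."""
    if not citation:
        if not subtrees or subtrees[0][0] != END:
            subtrees.insert(0, [END, []])
        subtrees[0][1].append(entry)
        return
    for node in subtrees:
        if node[0] == citation[0]:
            break
    else:
        node = [citation[0], []]
        subtrees.append(node)
        # alphabetical order, with the end marker kept first
        subtrees.sort(key=lambda n: (n[0] != END, n[0]))
    add_entry(node[1], citation[1:], entry)

def lookup(subtrees, citation):
    """Return the entry structures stored for a citation form."""
    for ch in citation:
        for label, children in subtrees:
            if label == ch:
                subtrees = children
                break
        else:
            return []
    if subtrees and subtrees[0][0] == END:
        return subtrees[0][1]
    return []

tree = []
for number, word in enumerate(["car", "carp", "coat", "cone", "care"]):
    add_entry(tree, word, (False, number * 100))
```
.VE
.LP
After these calls the sub-trees below \f2c-a-r\f1 are the end marker
followed by \f2e\f1 and \f2p\f1, as in the diagram above.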
.LP
It was suggested that the ordering of sub-trees should actually be based
on statistical frequencies but this was never done.  It may be this would
give a significant speed up but it is unclear.  The accessing actually
does not depend on the order (except that the end marker must be first)
but did in an earlier version.
.NH 2
Lexical Rules
.LP
There are three types of lexical rules in the system, \f3Completion Rules\f1,
\f3Multiplication Rules\f1 and \f3Consistency Checks\f1.  Completion rules
add or modify entries.  Multiplication rules add new entries to the 
lexicon.  Consistency checks are not actually active (in that they cannot
change entries in any way) but are available
for checking the internal consistency of entries, for example checking
feature dependencies which are dependent on the lexical description.
.LP
All three rule sets are very similar; they all have the same form
\(em a matching pattern (to identify which entries they apply to) and
an action.
.LP
Entry completion rules and multiplication rules can appear in either order
(although each type of rule must be kept together); consistency checks
must come last.  The rules are applied in the order they are declared.  As
completion rules are usually used for fleshing out entries and adding 
defaults, while multiplication rules add new entries, it is not clear whether
the user wants to expand entries then duplicate or duplicate entries
then expand.  So in spite of using fixed orders in earlier versions
of the system we now allow either order.
.LP
An important point to realise about these rules is that they are symbolic
manipulators of entries.  Although they can be used to match syntactic
entries the matching function has nothing to do with unification.  The
rules act on the fully expanded entries after aliases have been applied.  
They cannot
themselves include aliases (this is a restriction due to not getting
round to doing it rather than a fundamental reason).  Note also that 
the rules must be specified in a LISPish form rather than the GPSG style
of categories.
.LP
Patterns consist of (possibly conjunctions of) skeletal lexical entries.
Variables may be used within these patterns, bound and then used again
(in matching or rebuilding entries after matching).  A pattern is 
a 5-tuple where each field corresponds to a field in an entry.  The
values for the citation form, phonological form, semantic field
and user field are restricted to being either literals, variables (denoted
by atoms preceded with a \*Q\f3_\f1\*U)
or wild card variable (denoted by a single \*Q\f3_\f1\*U) (or negations
of these - denoted by \*Q\f3~\f1\*U).  The syntactic field however
can be a structure containing any of these.  For example
.VS L
     (_ _ ((N +) (V -) (BAR _bar) _rest) _ _)
.VE
matches the lexical entry
.VS L
     (men men ((INFL -) (V -) (BAR 0) (N +) (PLU +)) MAN NIL)
.VE
with the variable \f3_bar\f1 bound to \f30\f1 and \f3_rest\f1 bound
to \f3((INFL -)(PLU +))\f1.  Note the \f2rest\f1 variable may have any name.  Matching
of the syntactic pattern to the syntactic field is done from left to
right - this can be significant.  If a \f2rest\f1 variable appears it may
only sensibly appear as the last thing in the pattern (as it
will match all the remaining features in the field leaving none for any
remaining part of the pattern).  Negative specifications are denoted
by a preceding \*Q\f3~\f1\*U on either whole skeletons or on feature pairs.
Variables bound within negative patterns are not passed on for later
use (cf. Prolog).
.LP
The results of a pattern matching are basically a set of bindings.  In 
the multiplication rules and completion rules this is used to rebuild
entries \(em either new additional ones (with the multiplication rules) or
replacing old entries (with the completion rules).  Both of these
types of rule specify skeletons which define new entries from the 
resulting bindings of the match.  The \*Q\f3&\f1\*U symbol
may be used to represent whatever
is in that field in the entry being matched.  Thus given the above
pattern and entry a skeleton
.VS L
   (& & ((N +) (V -) (BAR _bar) (COUNT -) _rest) & &)
.VE
would construct an entry thus
.VS L
   (men men ((N +)(V -)(BAR 0)(COUNT -)(INFL -)(PLU +) ) MAN NIL)
.VE
The consistency checks are different in that after the pattern and
the operator \f3demands\f1 another pattern is given.  The interpretation
is that if an entry (after expansion by other rules) matches the first
pattern of a consistency check it \f2must\f1 match the second part.  If it
does not a warning message is printed and the entry is not added to 
the lexicon tree.
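.LP
The matching and rebuilding steps can be illustrated with a Python
sketch covering the example pattern and skeleton above.  It is a
simplification made under stated assumptions: negation and
conjunctions of patterns are left out, a pattern without a
\f2rest\f1 variable is allowed to leave features unmatched, and the
function names are hypothetical.
.VS L
```python
def is_var(x):
    return isinstance(x, str) and x.startswith("_")

def match_atom(pattern, value, bindings):
    if pattern == "_":                       # wild card
        return True
    if is_var(pattern):
        if pattern in bindings:
            return bindings[pattern] == value
        bindings[pattern] = value
        return True
    return pattern == value

def match_syntax(pattern, features, bindings):
    remaining = list(features)
    for item in pattern:                     # left to right
        if is_var(item):                     # a rest variable
            bindings[item] = remaining
            remaining = []
            continue
        name, wanted = item
        for pair in remaining:
            if pair[0] == name and match_atom(wanted, pair[1], bindings):
                remaining.remove(pair)
                break
        else:
            return False
    return True

def match_entry(pattern, entry):
    """Return the bindings of a successful match, or None."""
    bindings = {}
    for index, (p, e) in enumerate(zip(pattern, entry)):
        matcher = match_syntax if index == 2 else match_atom
        if not matcher(p, e, bindings):
            return None
    return bindings

def build_entry(skeleton, entry, bindings):
    fields = []
    for index, (s, e) in enumerate(zip(skeleton, entry)):
        if s == "&":                         # copy the matched field
            fields.append(e)
        elif index == 2:
            syntax = []
            for item in s:
                if is_var(item):
                    syntax.extend(bindings[item])
                else:
                    syntax.append((item[0], bindings.get(item[1], item[1])))
            fields.append(syntax)
        else:
            fields.append(bindings.get(s, s))
    return tuple(fields)

ENTRY = ("men", "men", [("INFL", "-"), ("V", "-"), ("BAR", "0"),
                        ("N", "+"), ("PLU", "+")], "MAN", "NIL")
PATTERN = ("_", "_", [("N", "+"), ("V", "-"), ("BAR", "_bar"), "_rest"],
           "_", "_")
SKELETON = ("&", "&", [("N", "+"), ("V", "-"), ("BAR", "_bar"),
                       ("COUNT", "-"), "_rest"], "&", "&")
bindings = match_entry(PATTERN, ENTRY)
rebuilt = build_entry(SKELETON, ENTRY, bindings)
```
.VE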
.LP
Just before the entries are written to file the syntactic entry is converted 
to the internal category form.  Because of the representation of variables
(as cons cells) two variables although the same at compile time will \f2not\f1
be the same when the analyser reads them in.  This is unfortunate and is
not checked for (an oversight on our part).  So entries should not
contain variables which have to be the same.  The alternative of
expanding them at analysis time was thought to be too slow and it was felt
better to do all expansion at compile time.
.NH 2
Implementation
.LP
As with the other two compilation processes the lexical compilation reads
a source file, here \f3<name>.le\f1, and produces a number of s-expressions in
files.  The lexical compilation produces two files: \f3<name>.en.ma\f1
(which contains the expanded lexical entries) and \f3<name>.le.ma\f1,
which contains the word tree and other structures (see below).
.LP
Note the entries file (\f3<name>.en.ma\f1) has direct byte indexes into
it from the word tree.  Do not change any of the information in this
entries file otherwise the word tree will cease to index properly.  Actually
the important point is that the position of the start of each entry must
not change - but it is safest not to edit the file at all.
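.LP
The way byte indexes tie the word tree to the entries file can be
shown with a short Python sketch.  It uses an in-memory file and
writes one entry per line, which is an assumption of the example
rather than the format of \f3<name>.en.ma\f1; the function names are
hypothetical.
.VS L
```python
import io

NEWLINE = bytes([10])        # one entry per line in this sketch

def write_entries(entries, out):
    """Write the entries; return the byte position of each start."""
    positions = []
    for entry in entries:
        positions.append(out.tell())
        out.write(entry.encode("ascii") + NEWLINE)
    return positions

def read_entry(source, position):
    """Seek to a stored byte position and read one entry."""
    source.seek(position)
    return source.readline().rstrip().decode("ascii")

entries_file = io.BytesIO()
index = write_entries(["(men men MAN)", "(car car CAR)"], entries_file)
second = read_entry(entries_file, index[1])
```
.VE
.LP
Inserting or deleting even one byte before an entry would make its
stored position point into the middle of some other text.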
.LP
The file \f3<name>.le.ma\f1 contains the following.
.XP
Version stamp: This contains the version number, the LISP system that
the file was written under and whether the unification type is
unrestricted unification (\f3UU\f1) or term (\f3TU\f1) unification.
This is used to stop incompatible compiled files being used.  
.XP
The lexicon tree:  this consists of a structure where the car is the 
name of the file that contains the lexical entries (though this is not
actively used because the path name may be different at run-time).  The 
cdr is an assoc list where each element is a subtree.
A subtree's car is a lexical 
character and the cdr is a list of sublists.  Where a sublist's
car is \f3AA\f1 (as defined by the macro \f3DK-ENDTREE\f1 in file
\f3keywords\f1), the cdr is a list of lexical entry structures.  A
lexical entry structure's car
is \f3t\f1 if the lexical entry can only appear at the
end of a word and \f3nil\f1 otherwise.  The cdr is either a number (which
is a byte index into the file \f3<name>.en.ma\f1) or the actual lexical entry
itself (if it was declared \f3incore\f1).
.XP
Features: this contains an assoc list of feature names and their possible
values.  Their values are either an enumerated list or the atom
\f3category\f1.
.XP
Category valued features:  this is a simple list of category valued features.
This will always contain at least the feature \f3STEM\f1.
.XP
Category definitions:  this is always \f3nil\f1 in the unrestricted
unification version.
In the term unification version this contains the user declared category
types.  These are in the form of a list of category definitions, where a
category definition consists of a list where the first element is the name
of the category and the rest is the list of features it contains.  This
should never be \f3nil\f1 in the term unification version.
.ds RH Section 5
.bp
.NH 1
Analysis
.LP
This section describes the analysis algorithm and how it is 
implemented as well as identifying some of the major functions and
data structures that are used in the analysis process.
.NH 2
Analysis Process
.LP
The analysis of a word takes place at two levels.  The lower level
is the segmenting of the word into morphemes.  Then these morphemes are fed
into a chart parser that checks them against the word grammar.
.LP
The segmentation function (\f3D-Recog\f1 defined in the file \f3autorun.l\f1)
finds the next morphemes
given a surface string and a spelling rule state.  The current 
remainder of the surface string is compared against the lexicon tree
(see previous section) for the next possible morphemes.  The basic 
searching of the lexicon tree is done by a recursive function. 
.VS L
    \f3Morpheme Segmentation Function\f1

    SearchTree(surface string, list of subtrees, spelling rule state)
       if first subtree is end marker
          construct a structure for each morpheme that
          ends here and the current spelling rule state
       for each feasible lexical character that can correspond
       to the first character of the surface string
          find a subtree with that lexical character
          if this pair is valid wrt the current spelling rule state
             call SearchTree with
               null concatenated to remainder of surface string
               the subtree
               the new spelling rule state
             call SearchTree with
               the remainder of the surface string
               the subtree
               the new spelling rule state
.VE
The found morphemes are collected in the global variable \f3D-MORPHEMES\f1
\(em it was found collecting the results returned from functions was 
inefficient.  The returned form of morphemes is defined in the function
\f3D-CreateReturn\f1.  Another optimisation is that there are two 
\f2SearchTree\f1 routines, one dealing with the case when null has been
added to the surface (\f3D-SearchRelevantSubtreesNull\f1) and another
for the simple case (\f3D-SearchRelevantSubtrees\f1).  Also it should be
noted that when the feasible lexical character is a null it has to be
dealt with specially.
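.LP
A heavily simplified Python sketch of the tree search is given below.
The spelling rules are replaced by a stub, \f3identity_step\f1, that
accepts a lexical and surface character only when they are identical,
so the null cases of the algorithm are omitted; the tree holds just
the forms \f2car\f1 and \f2care\f1, and all names are hypothetical.
As in the system, results are collected in a list rather than
returned up through the recursion.
.VS L
```python
END = "AA"   # stand-in for the value of DK-ENDTREE

# Index tree for "car" and "care"; entries are shown as plain strings.
E_SUB = [[END, ["care-entry"]]]
R_SUB = [[END, ["car-entry"]], ["e", E_SUB]]
TREE = [["c", [["a", [["r", R_SUB]]]]]]

def identity_step(state, lexical, surface):
    """Stub spelling rules: valid only if the two characters match."""
    return state if lexical == surface else None

def search_tree(surface, subtrees, state, found, step=identity_step):
    if subtrees and subtrees[0][0] == END:
        # a morpheme ends here: record it with the remainder and state
        for entry in subtrees[0][1]:
            found.append((entry, surface, state))
    if not surface:
        return
    for label, children in subtrees:
        if label == END:
            continue
        new_state = step(state, label, surface[0])
        if new_state is not None:
            search_tree(surface[1:], children, new_state, found, step)

def next_morphemes(surface, tree, state="start"):
    found = []
    search_tree(surface, tree, state, found)
    return found

morphemes = next_morphemes("cares", TREE)
```
.VE
.LP
For the surface string \f2cares\f1 this finds \f2car\f1 with
remainder \f2es\f1 and \f2care\f1 with remainder \f2s\f1, each paired
with the spelling rule state reached at that point.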
.LP
The word grammar is
used to parse the result of the segmentation using a chart parser (for
a general description of chart parsing see Winograd (1983) pp 116-129 and
Thompson and Ritchie (1984)).  The chart is hardwired to run bottom-up.
In earlier versions of the system it was found that running bottom-up was
faster than running top-down, because there were many word grammar
rules and not a very good indexing method for them.  It is now unclear
whether either strategy has an advantage, but we have stuck with
bottom-up.
.LP
The morpheme segmentation is only done on demand: not
all possible segmentations of a word are found, as this would
be computationally expensive.
There are many tricks and optimisations to try to make this as efficient
as possible but for the initial part of this explanation these are
ignored.  See the later parts of this section for some of the optimising
techniques.
.LP
The basic segmentation of a word is not like the classical well-formed
sub-string table created for a sentence in a chart parser.  The problem
with word segmentation is that the basic analysis path is not linear but may
have many
branches.  The segmentation function takes the remainder of the surface string
and a spelling rule state and returns a list of possible next morphemes
(each with its own remainder and new spelling rule state, or the atom
\f3END\f1).  Lexical entries are grouped together only where they have the
same citation form \f2and\f1 have been found by the same matching process.
Finding the same entries is therefore not enough for them to be grouped
together.  The important point is that
segmentations with different spelling rule states must be treated as different
next morphemes.
.LP
What actually occurs when looking up morphemes is that the base chart is
built as a tree structure rather than as a simple linear list.  Thus
given the surface form \f2preached\f1 we may get the basic chart
built as follows.
.VS L
.eo

                 ache(N)  
               /         \\
              2            3 - ed 
             / \\         /        \\
         pre     ache(V)           \\
       /                            \\
     1                               5
      \\                             /
        --- preach --- 4 -- ed ----/

.ec
.VE
Vertices are identified by numbers.  Note that the two edges from
vertex 2 both lead to vertex 3, rather than splitting the search space
again, because
they both have the same citation form (\f3ache\f1).  Vertices 3 and 4
are different although they have the same following morpheme.  It was
decided that finding out where vertices in the chart could join up was
too difficult.  Vertices can be merged only if they have the same
remaining surface string and an equivalent spelling rule state, and the
second condition is difficult to test.  The check is probably not worth
making, especially as words are typically short and so leaving vertices
unmerged causes little extra work.  If the sentence segmentation option
is taken then such a split could cause a lot of duplicate work.  The
string always has only one end vertex (vertex 5 in this case).
Note that because the search strategy is directed by the grammar not all
segmentations will be found: those paths which cannot lead to
a complete analysis are not searched.
.LP
The basic analysis algorithm is as follows:
.VS L
  \f3Parse Function\f1

   1  Initialise structures
   2  Find first morphemes of string and add to agenda
   3  while agenda not nil
   4    take top edge from agenda and make it current
   5    If current is complete
   6       check feature conventions and apply defaults
   7    If feature conventions are ok
   8       Combine current edge with chart adding any
   9          new edges to the agenda
  10       Add current edge to chart (various indices)
  11  Find full parses
.VE
This is encoded in the LISP function \f3D-Parse\f1 in the file \f3parser.u\f1
or \f3parser.t\f1.
Line 8 requires further expansion.  The combination with other chart
edges goes as follows:
.VS L
  \f3Combine Function\f1

   1  If current edge is complete
   2    check each incomplete edge ending at the place
   3       where the current edge starts.  If current label
   4       can unify with the next required category on the 
   5       incomplete edge construct a new edge and add it to 
   6       the agenda
   7    Propose the label to the grammar (see below)
   8  else if current edge is incomplete
   9    if its end vertex has not been extended
  10       call the morpheme segmenter with the end vertex
  11          surface string remainder and spelling rule configuration
  12       for each next morpheme found 
  13          build a new vertex (with the remainder and spelling rule
  14             state from the segmentation) and a new edge and
  15             add it to the agenda
  16    check each complete edge starting at the place
  17       where the current edge ends.  If current next required 
  18       category can unify with the complete edge's label
  19       then construct a new edge and add it to the agenda
.VE
Thus the chart only extends itself on demand and hence will not 
blindly search possible dead ends in segmenting a string but only 
searches where it may give rise to a complete word.
.LP
The proposing of a category to the grammar (line 7) is done in this way
only in the unrestricted unification version of the system (see section 3.4).
The term unification
version proposes categories to the grammar only when new morphemes are
found (see section 3.5).  That is, line 7 is deleted and proposing
categories to the grammar only occurs in line 14, when a new lexical edge
is added to the agenda.
.LP
Note this is a simplified version of the algorithm and the actual 
implementation has many extras to help it run faster but the above is 
basically what happens.
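.LP
The control structure of the parse and combine functions can be
illustrated with the following Python sketch.  It is a simplification
under stated assumptions and not the code of \f3D-Parse\f1: categories
are atomic strings compared for equality (no unification, feature
conventions or defaults), there are no spelling rules, a vertex is simply
the remaining surface string (so vertices with the same remainder are
shared), and the names \f3parse\f1, \f3segment\f1, \f3LEXICON\f1 and
\f3RULES\f1 are invented for the example.

```python
def parse(word, lexicon, rules, top):
    """lexicon: {morpheme: category}; rules: [(mother, (daughter, ...))].
    An edge is (label, start, end, needed); "" is the end vertex."""

    def segment(rest):
        # Stand-in segmenter: every lexicon entry that is a prefix of `rest`.
        return [(cat, rest[len(m):]) for m, cat in lexicon.items()
                if rest.startswith(m)]

    chart, extended = set(), {word}
    agenda = [(cat, word, rest, ()) for cat, rest in segment(word)]
    while agenda:
        edge = agenda.pop()
        if edge in chart:
            continue
        label, start, end, needed = edge
        new = []
        if not needed:
            # Complete edge: extend incomplete edges that end where it starts.
            for l2, s2, e2, n2 in chart:
                if n2 and e2 == start and n2[0] == label:
                    new.append((l2, s2, end, n2[1:]))
            # Propose the label to the grammar (bottom-up).
            for mother, rhs in rules:
                if rhs[0] == label:
                    new.append((mother, start, end, rhs[1:]))
        else:
            # Incomplete edge: segment from its end vertex only on demand.
            if end not in extended:
                extended.add(end)
                new += [(cat, end, rest, ()) for cat, rest in segment(end)]
            # Combine with complete edges that start where it ends.
            for l2, s2, e2, n2 in chart:
                if not n2 and s2 == end and l2 == needed[0]:
                    new.append((label, start, e2, needed[1:]))
        chart.add(edge)
        agenda.extend(new)
    return any(l == top and s == word and e == "" and not n
               for l, s, e, n in chart)


LEXICON = {"pre": "prefix", "preach": "stem", "ache": "stem", "ed": "suffix"}
RULES = [("word", ("stem", "suffix")), ("stem", ("prefix", "stem"))]
recognised = parse("preached", LEXICON, RULES, "word")
```

.LP
In this sketch the remainder \f3ed\f1 after \f3preach\f1 is segmented only
because an incomplete \f3word\f1 edge ends there and is waiting for a
suffix.  A remainder that no incomplete edge reaches is never segmented.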
.LP
When either of the string segment options is selected
(\f3D-STRINGSEGMENTCAT\f1 or \f3D-STRINGSEGMENTWS\f1) the algorithm described
above does not work.  This is because
it is directed by what the grammar is looking for and in the string segment
option there is no grammar rule that is looking for the whole string.
When one of these options is selected the initialisation of the chart
requires that \f2all\f1 segmentations must first be found and added to the
agenda.
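.LP
To show what finding \f2all\f1 segmentations involves, compared with
segmenting on demand, here is a minimal Python sketch.  It assumes a plain
list of morpheme strings and ignores spelling rule states altogether; the
function name \f3all_segmentations\f1 is an invention for the example and
does not correspond to a function in the system.

```python
def all_segmentations(surface, lexicon):
    """Return every way of splitting `surface` into morphemes from `lexicon`."""
    if surface == "":
        return [[]]
    results = []
    for morpheme in lexicon:
        # Empty morphemes are skipped so that the recursion always shortens
        # the remaining string.
        if morpheme and surface.startswith(morpheme):
            for rest in all_segmentations(surface[len(morpheme):], lexicon):
                results.append([morpheme] + rest)
    return results


segmentations = all_segmentations(
    "preached", ["pre", "ache", "preach", "ed", "d"])
```

.LP
Every branch is followed to the end of the string whether or not any
grammar rule could use it, which is the work that the grammar-directed
algorithm avoids.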
.NH 2
Data Structures
.LP
A chart basically consists of two types of data: \f2edges\f1 and \f2vertices\f1.
There are a number of global variables used to access parts of the chart:
.XP
\f3D-ALLEDGES\f1: a list of all edges created during the analysis.
.XP
\f3D-ALLVERTICES\f1: a list of all vertices created during the analysis.
.XP
\f3D-AGENDA\f1: a list of edges on the agenda of the chart.
.XP
\f3D-INITVERTEX\f1: the initial vertex of the chart.
.XP
\f3D-ENDVERTEX\f1: the end vertex of the chart.
.LP
An edge is represented as a list of eight elements:
.XP
\f3LABEL\f1: a category.
.XP
\f3START\f1: a vertex.
.XP
\f3END\f1: a vertex.
.XP
\f3REMAINDER\f1: a list of remaining required categories.  This is \f3nil\f1 in
complete edges.
.XP
\f3RECOG\f1: list of edges recognised so far.  In the unrestricted
unification version this
is actually a list of pairs consisting of the category formed as a result of
unification between the required category and the label of the daughter
edge and the daughter edge itself.  In the term unification version this
is not necessary and it only holds the list of daughter edges.  In term
unification no new structure is created as the result of a unification -
only a new set of bindings.
.XP
\f3RULENUM\f1:  the atomic name of the rule that was used to create
the edge.  Where the edge is a lexical edge this field contains the actual
lexical entry (i.e. a 5 element list).
.XP
\f3BIND\f1: list of bindings.  Bindings consist of a list of pairs,
each made up of a variable and its value.  Bindings are never
\f3nil\f1, so a dummy pair binding \f3t\f1 to \f3t\f1 always exists.  Note
that this does not restrict the names of variables, as they will always be
normalised in the chart and hence be a list structure.
.XP
\f3NAME\f1: A name for the edge.  This is only used for internal debug 
purposes.  (The chart debugger creates its own name for the edges).
.LP
There is a set of macros which access and update edge structures.  These
are defined in the file \f3subrout\f1.  They have the prefix \f3D-getedge\f1
or \f3D-putedge\f1 in front of the name of the field.  Note that the updating
is done destructively.
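.LP
The edge representation can be pictured with the following Python sketch.
It is only an illustration: the order of the fields in the list is assumed
to follow the order in which they are listed above, and \f3make_edge\f1,
\f3get_edge\f1, \f3put_edge\f1 and \f3is_complete\f1 are hypothetical
stand-ins for the system's macros, not their definitions.

```python
EDGE_FIELDS = ("LABEL", "START", "END", "REMAINDER",
               "RECOG", "RULENUM", "BIND", "NAME")


def make_edge(label, start, end, remainder, recog, rulenum,
              bind=None, name=None):
    """Build an edge as a plain eight-element list."""
    # Bindings are never empty: a dummy pair binding t to t is always there.
    return [label, start, end, remainder, recog, rulenum,
            bind if bind else [("t", "t")], name]


def get_edge(edge, field):
    """Read one field of an edge by name."""
    return edge[EDGE_FIELDS.index(field)]


def put_edge(edge, field, value):
    """Update one field in place (destructively) and return the edge."""
    edge[EDGE_FIELDS.index(field)] = value
    return edge


def is_complete(edge):
    """An edge is complete when no required categories remain."""
    return not get_edge(edge, "REMAINDER")


edge = make_edge("N", "v1", "v2", [], [], "rule1")
put_edge(edge, "LABEL", "FAILED")
```

.LP
Because the update is destructive, any other structure holding the same
edge sees the change, which is how a failed convention check can mark an
edge that is already referenced elsewhere.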
.LP
A vertex is represented as a 6 element list with the following fields:
.XP
\f3CLASSES\f1: will contain \f3t\f1 or \f3nil\f1 to say whether the
morpheme segmenter has been called from this vertex yet.  That is
whether this vertex has been extended or not.
The reason the field name is \f3CLASSES\f1 is historic, as originally there
was more than one lexicon in the system.
.XP
\f3EDGEINI\f1: a list of incomplete edges that end at this vertex.  This
is to help searching for suitable edges during the combining process.
.XP
\f3EDGEOUTC\f1: a list of complete edges that start at this vertex. This
is to help searching for suitable edges during the combining process.
.XP
\f3STATUS\f1: a pair containing the remainder of the surface string from
this vertex and the current state of the spelling rules, or the atom
\f3END\f1 if this is the end of string vertex.
.XP
\f3RULES\f1: a list of rule names that have been created from this vertex.
This is used by the left recursion check to ensure duplicate edges
are not formed.
.XP
\f3NAME\f1: an atomic name.  This is only used for internal debugging 
purposes.
.LP
As with the edge structure there are a set of routines for accessing and 
updating a vertex.  They are described in the file \f3subrout\f1.  They
are identified by the prefix \f3D-getvertex\f1 or \f3D-putvertex\f1
in front of the field name.
.LP
The two indices, \f3EDGEINI\f1 and \f3EDGEOUTC\f1, are the only two required
in running a chart left to right, so for the sake of efficiency the other
two possible indices (incomplete edges starting at a vertex and complete
edges ending at it) are not created, though they might have been useful for users
debugging word grammars.
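.LP
How the two indices support the combining process can be sketched in
Python as follows.  The field positions, the four-element edges and the
helper names \f3make_vertex\f1, \f3index_edge\f1 and \f3partners\f1 are
assumptions made for this illustration, and equality of atomic labels
stands in for unification.

```python
CLASSES, EDGEINI, EDGEOUTC, STATUS, RULES, NAME = range(6)


def make_vertex(status, name=None):
    """Build a vertex as a six-element list."""
    # CLASSES starts false: the segmenter has not yet been called from here.
    return [False, [], [], status, [], name]


def index_edge(edge):
    """File an edge (label, start, end, remainder) under the index it needs."""
    label, start, end, remainder = edge
    if remainder:
        end[EDGEINI].append(edge)       # incomplete: indexed on its end vertex
    else:
        start[EDGEOUTC].append(edge)    # complete: indexed on its start vertex


def partners(edge):
    """Edges already indexed that `edge` could combine with."""
    label, start, end, remainder = edge
    if remainder:
        return [e for e in end[EDGEOUTC] if e[0] == remainder[0]]
    return [e for e in start[EDGEINI] if e[3][0] == label]


v1 = make_vertex(("preached", 0), "v1")
v2 = make_vertex(("ed", 0), "v2")
v3 = make_vertex("END", "v3")
waiting = ("word", v1, v2, ["suffix"])   # incomplete edge ending at v2
suffix = ("suffix", v2, v3, [])          # complete edge starting at v2
index_edge(waiting)
index_edge(suffix)
```

.LP
Each lookup touches only the edges filed at one vertex, so neither kind of
combination needs to scan the list of all edges.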
.LP
It may be better in Common LISP to implement these structures as
actual \f2structures\f1 but because the system must also run on early
versions of Franz LISP (which do not have the \f2structure\f1 data type) this
was not done.  This could be changed easily: the macros
\f3D-MakeEdge\f1 and \f3D-MakeVertex\f1 (in file \f3subrout\f1), together
with the access and update macros, could be changed to produce the appropriate
access functions.
.LP
Because both edges and vertices are held as lists, rather than via names
of some sort, the chart cannot be readily printed to the screen.  At
this point the explanation of some design decisions may be helpful.
Using property lists on symbols, where symbols are edges and vertices,
is not a very good idea, because accessing
lists is significantly more efficient than accessing property lists.
This is true for Franz LISP, in which the system was developed, but it
may not be the case in other LISPs.
.NH 2
Some Functions
.LP
This section mentions some of the functions that are used in the 
analysis process.
.LP
The function \f3D-Recog\f1 is the interface between the word grammar parser
and the morpheme segmenter.  This function is defined in the file 
\f3autorun.l\f1.
.LP
The word grammar parser is defined by the function \f3D-Parse\f1 in the
file \f3parser.u\f1 or \f3parser.t\f1.  It takes a basic word (surface
form) and returns a vertex which is the initial vertex of the resulting
chart.  The file \f3analyse.l\f1 contains a hotchpotch of functions
which call \f3D-Parse\f1 depending on the various options set in the
global variable \f3D-LOOKUPFORMAT\f1.  This file also contains functions
which extract the analysed word (and structure if required) from the
resultant chart.
.LP
The feature passing conventions (only in the unrestricted unification 
version) and
defaults (in both versions) are applied to complete edges by the
function \f3D-CheckConventions\f1 defined in the parser file.  All
edges are passed through this function and if the system were to change
to include some other form of feature passing conventions or even
semantics this function could be modified to cater for them.
.LP
In the unrestricted unification version three functions are used (within
\f3D-CheckConventions\f1) to deal with the feature passing conventions.
The first test is the WSister test (a test that the value of \f3STEM\f1 on
a daughter is extended by its sister).  This is done by the function
\f3D-WSisterConventions\f1, which returns a set of bindings or
\f3nil\f1.  The WHead convention is checked by the function
\f3D-WHeadConvention\f1, which sets the LABEL of the edge
to its new value or to the atom \f3FAILED\f1 if it fails the conventions.
The final convention, WDaughter, is checked by the function 
\f3D-WDaughterConvention\f1 which again modifies the edge or sets
the LABEL to \f3FAILED\f1.
.LP
Unification is defined in the function \f3D-Unify\f1.  This is defined
in the file \f3unify.u\f1 in the unrestricted unification version and 
\f3unify.t\f1 in the term unification version.  This routine (in both 
versions) takes three arguments, two categories and \f2one\f1 set of bindings.
It is assumed that the second category has had its variables 
\f2dereferenced\f1
with respect to its bindings.  This is done to all labels on complete
edges during parsing in the function \f3D-CheckConventions\f1, by the 
function \f3D-DereferenceVariables\f1.  This saves the work of 
passing extra bindings into the unification routines.
.LP
The term unification function \f3D-Unify\f1 (defined in \f3unify.t\f1)
returns a list of bindings
or the symbol \f3FAILED\f1, while the unrestricted unification function
\f3D-Unify\f1 (defined in \f3unify.u\f1) returns a pair of the new category
and a list of bindings or the symbol \f3FAILED\f1.
This is because all features will exist on term
unification categories and hence no new structure need be built.
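.LP
The shape of the term unification interface (two categories and one set of
bindings in, bindings or \f3FAILED\f1 out) can be illustrated with a
generic Python sketch.  This is a textbook term unifier and not the code in
\f3unify.t\f1: representing categories as tuples, marking variables with a
leading \f3?\f1, holding bindings in a dictionary and omitting the occurs
check are all assumptions made for the example.

```python
FAILED = "FAILED"


def is_var(x):
    """Variables are strings starting with '?' (a convention of this sketch)."""
    return isinstance(x, str) and x.startswith("?")


def deref(x, bindings):
    """Follow variable bindings until a value or an unbound variable is reached."""
    while is_var(x) and x in bindings:
        x = bindings[x]
    return x


def unify(a, b, bindings):
    """Unify terms a and b; return the extended bindings or FAILED."""
    a, b = deref(a, bindings), deref(b, bindings)
    if a == b:
        return bindings
    if is_var(a):
        return {**bindings, a: b}
    if is_var(b):
        return {**bindings, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            bindings = unify(x, y, bindings)
            if bindings == FAILED:
                return FAILED
        return bindings
    return FAILED


result = unify(("N", "?num", "sg"), ("N", "pl", "?x"), {})
```

.LP
No new category is built: a successful call yields only a larger set of
bindings, which matches the description of the term unification version
above.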
.ds RH Section 6
.bp
.NH 1
Implementation
.LP
The system was written over the whole 3 years of the project with
around 30 distinct versions.  The version numbering was not very
consistent and so some versions have very little change from previous
ones while others are quite major.  Major version number changes were
made when the current system was felt to be relatively stable. Version 2.0
was the first version that we felt was suitable for distribution while
3.0 is looked upon as the last version before the (official) end of the
project.
.LP
The system was developed and debugged using Franz LISP (opus 42.15) on a
SUN2/120 with 4Mb of memory, so optimisations were made primarily for
that version of LISP and many of the lower level design decisions were
constrained by it.  It was felt that the system must also be runnable
on both the Franz LISP available on VAXes under Berkeley 4.2 (opus 38.79)
and Common LISP.  A Common LISP version of the system was not produced until
quite late on in the project (version 2.2) but was maintained and all of the
later versions can run in Common LISP.  This document basically describes
version 3.0 of the system.
.LP
The basic distribution consists of a directory containing five sub-directories.
These are:
.XP
\f3src\f1: the basic LISP source of the analyser system.
.XP
\f3common\f1: the files necessary for mapping the Franz LISP functions to
Common LISP and other Common LISP specific files.  This is the directory in
which the Common LISP version is built.
.XP
\f3man\f1: contains UNIX manual pages for the UNIX-level commands.
.XP
\f3examples\f1:  This contains example dictionaries.  There are two small
examples, an unrestricted unification version and a term unification version.
There are also two sub-directories of this examples directory, which contain
a GPSG type description with some 6800 lexical entries and a
Simple description which has a simpler (and faster) analysis and a
smaller coverage (around 3300 entries).
.XP
\f3doc\f1: contains the user manual and this system description.
.LP
The whole system is designed to be built on a UNIX system using the 
provided \f3makefile\f1.  There are a number of options which are available 
at make time.  The choices are version of LISP - Franz 42.15, Franz 38.79
or Common; and the choice of unification - unrestricted or term unification.
.LP
The Common LISP \f2make\f1 is set for Kyoto Common LISP.  The system has been
tested with many different Common LISPs and runs successfully, but in these
other Common LISPs a simple \f2make\f1 is not always successful.  See
section 9 of the User Manual for more details of installation in Common LISP.
.LP
An example program using the analyser system is given in
the \f3src\f1 directory.
It is called \f3example.l\f1.  It is a very simple program which loads the
sample GPSG lexicon and analyser and then looks up all the words in a given
file, printing the results in another.  The object of this is to show
how the analyser can be used within a larger LISP program.
.LP
The different versions of the system offering different forms of unification
are implemented by having four files which are specific to the version.
These files are suffixed with either \f3.u\f1 (unrestricted unification)
or \f3.t\f1 (term unification).  The four files are \f3unify\f1,
\f3specrouts\f1, \f3parser\f1 and \f3catrouts\f1.  At \f2make\f1 time the
appropriate suffixed files are copied to the non-suffixed form and
hence the system automatically includes the correct files for that
form of unification.
.ds RH Section 7
.bp
.NH 1
Enhancements
.LP
This section lists some enhancements that would improve the
system but because of lack of time these were not implemented.
.NH 2
Spelling Rules
.LP
On comparing the spelling rule formalism with that of Koskenniemi the
main difference is that our system lacks the Kleene star operator and
the optional operator.  This is because the compiler
cannot produce the empty transitions which are necessary to deal with these
operators.  It is possible to get the same effect by
duplicating the context, with one copy having Kleene plus round the
element and the other omitting the element (similarly with the optional
operator).  This fails when the optional (or starred)
element is the only element in the context, in which case one has to specify a
pair standing for any character.
There is a further problem where the phenomenon
one is trying to describe comes at the start or end of a word, when some
fake start/end markers have to be introduced.  This makes it messy.
.LP
The fix for this is to build a way for the interpreter to cope with 
empty transitions in the spelling rules.  Compiling them out would
not be trivial but in the long run would be better.
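.LP
The duplication workaround described above can be made concrete with a
small Python sketch.  The representation is an assumption made for
illustration only: a context is a list of (pair, operator) items, with the
operator written as an empty string, \f3+\f1, \f3?\f1 or \f3*\f1, and the
function name \f3expand_context\f1 is hypothetical.  It is not how the
spelling rule compiler represents contexts.

```python
from itertools import product


def expand_context(context):
    """Rewrite a context that uses "?" or "*" as a list of equivalent
    contexts that use only plain elements and Kleene plus."""
    choices = []
    for pair, op in context:
        if op == "?":
            choices.append([[], [(pair, "")]])     # absent, or exactly once
        elif op == "*":
            choices.append([[], [(pair, "+")]])    # absent, or one or more
        else:
            choices.append([[(pair, op)]])         # plain or "+": unchanged
    # One output context for every combination of the alternatives.
    return [sum(combo, []) for combo in product(*choices)]


expanded = expand_context([("s:s", ""), ("e:e", "?"), ("d:d", "")])
```

.LP
A context whose only element is optional or starred expands to a list that
includes the empty context, which is the case noted above where a pair
standing for any character has to be specified instead.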
.LP
Another problem that was noticed when comparing our formalism
with Koskenniemi's is that we allow only single character symbols in our
lexical and surface alphabets.  This is fine for English but other languages
require more, especially when the user wishes to distinguish phonological
effects in the lexicon.  Koskenniemi's system allows the use of multiple
character symbols in the alphabets.  This means the tokenizing function
has to be able to take a surface (or lexical) form and find
the proper alphabet characters in it.  This should be relatively easy
but care would have to be taken not to introduce any ambiguity in how
the form should be tokenized.
.LP
The spelling rule compiler currently can cope with multiple character
symbols but the tokenize function cannot (\f3D-Tokenize\f1 in file
\f3spmoveau\f1).  Extending it would require the surface alphabet to be
available at analyse-time, which it is not at present.
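.LP
One possible policy for tokenizing with multiple character symbols is to
take the longest matching symbol at each position.  The Python sketch below
illustrates that policy only.  It is not a description of
\f3D-Tokenize\f1, and the function name and the way the alphabet is passed
in are assumptions.

```python
def tokenize(form, alphabet):
    """Split `form` into alphabet symbols, always taking the longest match."""
    # Empty symbols are dropped so that every match consumes some input.
    symbols = sorted((s for s in alphabet if s), key=len, reverse=True)
    tokens, i = [], 0
    while i < len(form):
        for sym in symbols:
            if form.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            raise ValueError("no alphabet symbol matches at position %d" % i)
    return tokens


tokens = tokenize("chat", ["c", "h", "ch", "a", "t"])
```

.LP
Longest match removes the ambiguity between \f3c\f1 followed by \f3h\f1
and the single symbol \f3ch\f1, but it is greedy: it can fail on a form
that some other split would have tokenized, so an alphabet would still
need checking for such cases.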
.LP
Another possible problem is the distinction between capital and lower case
letters.  The system currently treats them as being distinct.  It is the
responsibility of the lexicon writer to give the proper mapping between the
two.  We have not investigated this, though feel that there may be 
problems.  The system must allow the lexicon writer the freedom to 
distinguish between them as well as mapping them to the same.  Words like
\f2liberal\f1 and \f2Liberal\f1 are distinct.  Also information about 
proper nouns (and common nouns in German) may depend on case distinction.
It may be that the system should offer some way of dealing with this.
.LP
A point about implementation: the spelling rule automaton is
implemented as a cyclic structure where a state is represented by an assoc
list indexed by symbol (the concatenation of lexical and
surface character) followed by a list of new states.  It may be faster
if this look up could be indexed directly (i.e. by a vector) rather than
by searching down the list as at present (via \f3assq\f1).
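.LP
The difference between the two kinds of look up can be seen in the
following Python sketch.  The symbol names, the numbering of symbols and
the helper names \f3assoc_lookup\f1 and \f3to_vector\f1 are assumptions
made for the example, and no timing claim is made for any particular LISP.

```python
def assoc_lookup(state, symbol):
    """Search an association list of (symbol, next_states) pairs in order."""
    for key, next_states in state:
        if key == symbol:
            return next_states
    return None


def to_vector(state, symbol_index):
    """Re-index one state as a vector with one slot per symbol."""
    table = [None] * len(symbol_index)
    for key, next_states in state:
        table[symbol_index[key]] = next_states
    return table


# Each symbol is the lexical character followed by the surface character;
# "e0" stands for a lexical e paired with a surface null.
symbol_index = {"aa": 0, "e0": 1, "dd": 2}
state_b = [("dd", [])]
state_a = [("aa", []), ("e0", [state_b])]
state_a[0][1].append(state_a)    # cyclic: state_a leads back to itself on aa
vector_a = to_vector(state_a, symbol_index)
```

.LP
The vector form finds the new states in one indexing step, at the cost of
a slot for every symbol in every state and of mapping each symbol to its
number beforehand.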
.NH 2
Word Grammar
.LP
In the term unification version no feature conventions apply.  This is 
because the interpretation of them becomes unclear in that both the 
WSister and WDaughter conventions depend on a notion of \*Qnot there\*U, 
which has little (declarative) meaning in term unification.  However the 
head convention does have a declarative reading and may in fact be useful.
It could be done by naming variables properly in the grammar rule at 
compile time and 
would not require the extra check during analysis.  
.NH 2
Lexical Entries
.LP
There still exists the problem of dealing with very large lexicons.  Currently
lexicons of up to 15,000 entries seem not too restrictive but it would be
useful to
deal with at least ten times that amount, to allow lexicons generated from
machine readable tapes to be used, like that in (Boguraev et al. 1987).
Some work would need
to be done to modify the system to cope with significantly larger lexicons.
.LP
One investigation done was modifying the reader (in file \f3readatom\f1) so
that it returns strings when the user file contains them rather than symbols
as it does at present.  This meant that the LISP system did not get filled with
extra symbols on the oblist if the citation forms in a lexicon were specified
as strings.  This however introduced the problem that \f3"a"\f1 and \f3a\f1
were no longer equal to each other.  This was acceptable when dealing
with the citation forms as the tokenize (\f3D-Tokenize\f1) function 
ensured the correct mapping
but features etc would not map properly.  It was felt that this introduced
too many side-effects that would be difficult to explain in the user manual 
and hence was removed.  Note all symbols read by the reader are interned on the 
oblist even if they are specified as strings in the user file.
.LP
The problems with larger lexicons would be in the building of the lexicon
index tree, which can grow
very large.  One way to solve this might be to look at ways of \*Qpaging\*U
parts of the lexicon tree in and out of a file.  The other problem would be in
symbol space.  Franz LISP as it is normally implemented cannot cope with
vast numbers of symbols so a way to hold the citation forms of entries in
a structure other than symbols is required.
.NH 2
Analysis
.LP
In the version used in the GPSG tools project (Briscoe et al. 1986) an
extra feature has been
added which speeds up the system significantly.  A cache has been introduced
so that words which have been looked up before are stored.  This of course
only works if relatively few words are to be looked up (a few hundred).  After
that the indexing of the cache becomes more expensive than the re-analysis
of the word, and closed class words can be analysed very quickly.
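.LP
The idea of a small cache in front of the analyser can be sketched in
Python as follows.  This is not the cache used in the GPSG tools version:
the function name \f3make_cached_lookup\f1, the \f3analyse\f1 argument and
the default limit of 300 (chosen only to match \*Qa few hundred\*U) are
assumptions made for the illustration.

```python
def make_cached_lookup(analyse, limit=300):
    """Wrap `analyse` so that results for up to `limit` words are stored."""
    cache = {}

    def lookup(word):
        if word in cache:
            return cache[word]
        result = analyse(word)
        # Stop adding entries once the cache is full, so that it never
        # grows beyond the size at which it is expected to pay off.
        if len(cache) < limit:
            cache[word] = result
        return result

    return lookup
```

.LP
A wrapper of this kind leaves the analyser itself unchanged, so the cache
can be removed or resized without touching the parsing code.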
.ds RH References
.bp
.SH
References
.sp 1
.XP
Aho, A. and Ullman, J. (1972) \f2The Theory of Parsing, Translation and
Compiling, Volume 1: Parsing\f1 Englewood Cliffs N.J.: Prentice-Hall.
.XP
Bear, J. (1985) \*QA Morphological Recogniser with Syntactic and Phonological
Rules.\*U  Unpublished paper. SRI International, Menlo Park, CA., USA.
.XP
Black, A.W., G.D. Ritchie, S.G. Pulman, and G.J. Russell (1987)
\*QFormalisms for Morphographemic Description\*U In: \f2Proceedings of
3rd Conference of the European Chapter of the Association for Computational
Linguistics\f1 Copenhagen, Denmark.
.XP
Boguraev, B., Briscoe, T., Carroll, J., Carter, D. and Grover, C. (1987)
\*QThe Derivation of a Grammatically Indexed Lexicon from the
Longman Dictionary of Contemporary English\*U In \f2Proceedings of
the 25th Meeting of the Association for Computational Linguistics\f1.
Stanford CA, USA.
.XP
Briscoe, E.J., I. Craig, and C. Grover. (1986)
\*QThe Use of the LOB Corpus in the Development of a Phrase Structure Grammar
of English.\*U
In: \f2 Proceedings of 6th ICAME\f1, Amsterdam.
(To be published eds. Meijs, W., and van der Steen, G.J.).
.XP
Gazdar, G., E. Klein, G.K. Pullum, and I.A. Sag (1985) \f2Generalised Phrase
Structure Grammar.\f1 Oxford: Blackwell.
.XP
Hopcroft J.E. and Ullman J.D. (1979) 
\f2Introduction to Automata Theory, Languages and Computation\f1
Reading, Mass: Addison-Wesley.
.XP
Kaplan R. and Bresnan J. (1982) \*QLexical-Functional Grammar: A Formal
System for Grammatical Representation\*U In \f2Mental Representation
of Grammatical Relations\f1 Bresnan J. Cambridge Mass: MIT Press.
.XP
Karttunen, L. (1983) \*QKIMMO: A two level morphological analyser\*U
\f2Texas Linguistics Forum\f1 22, Department of Linguistics,
University of Texas, Austin, Texas
.XP
Karttunen, L. Koskenniemi, K. and Kaplan, R. \*QA Compiler for Two-Level
Phonological Rules\*U. Unpublished paper. Xerox Palo Alto Research Center 
and Center for the Study of Language and Information, Stanford CA. June
1987.
.XP
Kay M. (1985) \*QParsing in Functional Unification Grammar\*U In:
\f2Natural Language Parsing\f1
Dowty, D.; Karttunen, L. and Zwicky, A.
London: Cambridge University Press.
.XP
Koskenniemi, K. (1983) \f2Two-level Morphology: a general computational
model for word-form recognition and production.\f1
Publication No.11, University of Helsinki, Finland.
.XP
Koskenniemi, K. (1985) \*QCompilation of Automata from Two-Level Rules.\*U talk
given at Workshop on Finite-State Morphology, CSLI, Stanford, CA July 1985.
.XP
Ritchie, G., Black, A., Pulman, S. and Russell, G. (1987) \*QThe
Edinburgh/Cambridge Morphological Analyser and Dictionary System:
User Manual. Version 3.0\*U Software Paper no. 10, Department of Artificial
Intelligence, University of Edinburgh.
.XP
Thompson, H. (1987) \*QFBF - A Grammatical micro-formalism:  Syntax,
Semantics and Use\*U forthcoming.
.XP
Thompson, H. and Ritchie, G. (1984) \*QImplementing Natural Language Parsers\*U.
In \f2Artificial Intelligence: Tools, Techniques and Applications\f1, ed.
O'Shea and Eisenstadt. New York: Harper and Row.
.XP
Thorne, J.P., Bratley, P. and Dewar, H. (1968) \*QThe Syntactic Analysis of
English by Machine\*U. In \f2Machine Intelligence 3\f1, ed. Michie.
Edinburgh: Edinburgh University Press.
.XP
Winograd, T. (1983) \f2Language as a Cognitive Process\f1. Reading Mass:
Addison-Wesley.
