





A Prolog-Based Tool For Grammar Analysis Of Western European Languages


                 J. Barchan and J. Wusteman


                 Dept. of Computer Science,
                   University of Exeter,


                          Abstract

     In this article, we give an overview of  the  Language-
INdependent  Grammatical  Error  Reporter (LINGER). The four
key techniques in LINGER are identified, the  system  design
and  behavior  described, and directions for future research
surveyed.








































                     February 11, 1988







                          CONTENTS
1. Previous  work  in  Computer-Assisted  Language  Learning
(CALL) at Exeter University

2. Overview of LINGER

   2.1 Techniques
      2.1.1 Modularity
      2.1.2 Generation of correct sentences
          2.1.2.1 Handling strong syntactic errors
          2.1.2.2 Handling weak syntactic errors
      2.1.3 Unknown words and incorrect endings
      2.1.4 Error messages
   2.2 System Design
      2.2.1 The dictionary
      2.2.2 The grammar
      2.2.3 The shell
   2.3 System Operation

3. Current and Future Developments

References
Acknowledgments


























1. Previous  work  in  Computer-Assisted  Language  Learning
(CALL) at Exeter University

     The impetus for the first system at  Exeter  came  from
the  work  by  Imlah  and du Boulay (1985) who developed the
French Robust Grammar Checker ("FROG").

     FROG was designed to handle declarative sentences typed
in  by  a student in "free-form" French. This was a signifi-
cant departure from previous programs which tended  to  con-
centrate  on  exercises  involving  insertion, correction or
substitution of words, or set translations. It also differed
from  the  majority of CALL programs in that it did not need
to pre-store a large number of correct or incorrect answers.
Instead,  the program contained a parser which could analyse
the sentences typed in. Hence, FROG was one of a  new  breed
of  CALL  programs which actually "knew" something about the
language they were teaching.

     The system developed at Exeter to  supersede  FROG  was
the French Grammar Analyser ("FGA") (Woodmansee, 1985). Many
of the best ideas in FROG were incorporated in FGA  but  the
resulting  system was a significant improvement on FROG. The
four design features which typified the goals and strengths
of FGA were:

1. It was designed in a highly modular fashion, thus  easing
the task of future development.

2. Its parsing mechanism was more robust; in particular, a
"pre-parse" stage analysed the input at the lexeme level
before parsing proper.

3. A small subset grammar and dictionary of French were
used. (FROG's subset tended to be somewhat over-ambitious,
thereby losing generality, clarity and the modularity
considered so essential in (1).)

4. It aimed to generate  constructive error messages for the
purpose of teaching.


     Various other projects were undertaken in the department
on error reporting systems for German and Italian. It was
recognized that developing separate systems for these
languages, all of which are related more or less closely to
Latin, involved a duplication of effort. As a result of this
observation, research was begun on the Language INdependent
Grammatical Error Reporter ("LINGER", Barchan 1987).









2. Overview of LINGER

     LINGER is  a  language-independent  system  to  analyse
natural  language sentences and report and correct grammati-
cal errors encountered. An important objective is  that  the
system  should be easily configured for a particular natural
language by an expert in that language but not  in  computer
science.

2.1 Techniques

     Four key techniques may be identified in the LINGER
system. These are:
1. Modularity
2. Generation of correct sentences
3. Handling of unknown words and incorrect endings
4. Grammar writers' control over the issuing and content of
error messages


2.1.1 Modularity

     The separation of dictionary, grammar and parsing
mechanism is a vital feature of a language-independent
system. Accordingly, two flexible formalisms have had to be
devised: one which allows the description of the features
pertaining to grammatical classes and one which permits the
use of these attributes in a variety of ways when specifying
a language's weak syntactic constraints. The invention of
these two formalisms and their integration is at the core of
the LINGER system.

     Furthermore, a clear distinction has been maintained
between weak and strong syntactic constraints. There are two
reasons for this division: firstly, specification of  strong
syntax  is accommodated conveniently by the DCG formalism so
has no need of the special notation devised for weak syntax;
secondly, since weak syntactic errors do not impair the for-
mation of a parse tree and can  be  more  effectively  dealt
with  after  a tree has been completed, they are better han-
dled separately from strong syntactic errors.


2.1.2 Generation of correct sentences

     One of the main features of LINGER is that it generates
a  complete  version  of the user's sentence (to the best of
its abilities) as well as issuing comments about any  errors
found.


2.1.2.1 Handling strong syntactic errors

     Since it was decided that abandoning the parse on the
first occurrence of a strong syntactic error is not
acceptable, some way of continuing when one is encountered
had to be found. The approach adopted has the distinct
advantage of allowing a complete parse tree (rather than
unconnected partial parses) to be generated in spite of the
error, as well as reporting the error. It involves a simple
but effective priority-based recovery algorithm which makes
a 'best guess' as to the most appropriate sufficient
correction. Imlah and du Boulay observe that there is a
point beyond which a set of words becomes too garbled to
make a reasonable guess as to the intended structure, and
that students are unlikely to produce utterances garbled to
such a degree. In accordance with this observation, it was
felt to be reasonable to limit the number of strong
syntactic errors in a single sentence to one.
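
     The one-error, priority-based recovery described above
might be sketched as follows. This is a minimal illustration
in Python, not LINGER's Prolog code: the toy lexicon, the
flat list of legal patterns standing in for a DCG, the three
repair types and their priority order are all assumptions
made for the example.

```python
# Toy strong-syntax recovery: at most one structural repair per sentence.
# LEXICON, PATTERNS and the repair priorities are hypothetical.

LEXICON = {"le": "det", "chat": "noun", "mange": "verb", "poisson": "noun"}
PATTERNS = [("det", "noun", "verb"), ("det", "noun", "verb", "det", "noun")]


def parses(cats):
    """A sequence is structurally legal if it matches a known pattern."""
    return tuple(cats) in PATTERNS


def single_repairs(cats):
    """Yield (priority, description, repaired) for every one-step repair."""
    n = len(cats)
    for i in range(n - 1):  # priority 1: swap two adjacent constituents
        swapped = cats[:i] + [cats[i + 1], cats[i]] + cats[i + 2:]
        yield 1, f"swapped positions {i} and {i + 1}", swapped
    for i in range(n):  # priority 2: drop a superfluous constituent
        yield 2, f"deleted {cats[i]} at {i}", cats[:i] + cats[i + 1:]
    for i in range(n + 1):  # priority 3: supply a missing constituent
        for cat in sorted(set(LEXICON.values())):
            yield 3, f"inserted {cat} at {i}", cats[:i] + [cat] + cats[i:]


def recover(words):
    """Return (categories, comment); comment is None if no repair was needed."""
    cats = [LEXICON[w] for w in words]
    if parses(cats):
        return cats, None
    candidates = [r for r in single_repairs(cats) if parses(r[2])]
    if not candidates:
        return None, "more than one strong syntactic error; giving up"
    _, comment, repaired = min(candidates, key=lambda r: r[0])
    return repaired, comment
```

     Under these assumptions, recover(["chat", "le", "mange"])
yields the categories det, noun, verb together with the
comment "swapped positions 0 and 1", while an input needing
two repairs is rejected rather than guessed at.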


2.1.2.2 Handling weak syntactic errors

     Once a parse tree has been formed which holds the basic
structure of the sentence and the roots of the words in the
input, the weak syntactic constraints are applied to
generate correct forms from the roots and thus rebuild a
grammatically correct sentence. The system can cope with
many weak syntactic errors and ensures that the final result
takes into account the interrelationship between multiple
corrections.
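
     The idea of regenerating surface forms from roots can
be illustrated with a small sketch. The form tables, the
restriction to a determiner-adjective-noun phrase and the
naive plural rule are assumptions for illustration only; they
do not reproduce LINGER's weak syntax notation.

```python
# Hypothetical mini-dictionary: each root lists its surface forms by features.
FORMS = {
    "le": {("m", "sg"): "le", ("f", "sg"): "la",
           ("m", "pl"): "les", ("f", "pl"): "les"},
    "petit": {("m", "sg"): "petit", ("f", "sg"): "petite",
              ("m", "pl"): "petits", ("f", "pl"): "petites"},
}
NOUNS = {"maison": "f", "chat": "m"}  # inherent gender of each noun root


def rebuild_noun_phrase(det_root, adj_root, noun_root, number):
    """Regenerate a det-adj-noun phrase so every word agrees with the noun."""
    features = (NOUNS[noun_root], number)
    noun = noun_root + ("s" if number == "pl" else "")  # naive plural rule
    return [FORMS[det_root][features], FORMS[adj_root][features], noun]


def correct(words, roots, number):
    """Compare the regenerated phrase with the input and comment on changes."""
    rebuilt = rebuild_noun_phrase(*roots, number)
    comments = [f"'{old}' should be '{new}'"
                for old, new in zip(words, rebuilt) if old != new]
    return rebuilt, comments
```

     Because both the determiner and the adjective are
generated from the noun's features, a single call such as
correct(["le", "petit", "maison"], ("le", "petit", "maison"),
"sg") repairs both words consistently, giving "la petite
maison" and one comment per changed word.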


2.1.3 Unknown words and incorrect endings

     An unknown word (that is, one for which no possible
root can be found in the dictionary) will be guessed to
belong potentially to any one of a number of 'open'
grammatical classes (cf. Tennant 1981, Harris 1985). ('Open'
classes include, for example, nouns and verbs but exclude
prepositions.) A word which does contain a recognized root
but whose suffix does not correspond to any legitimate
suffix for that root in the dictionary is not treated as
unknown but rather as having an incorrect ending. The
incorporation of a misspelling algorithm was rejected:
because unknown words are expected and handled, there would
be a strong danger of treating a perfectly correct but
unknown word as a misspelling of one or more words which do
exist in the dictionary (cf. Weischedel et al., 1978, who
incorporate a misspelling algorithm but do not address the
problem of unknown words).
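
     The three-way distinction drawn above (known word,
incorrect ending, unknown word) can be sketched as a lookup
routine. The root table, the suffix sets and the
longest-root-first matching strategy are hypothetical; the
source does not describe how LINGER matches roots.

```python
# Hypothetical dictionary: root -> (grammatical class, legal suffixes).
ROOTS = {
    "parl": ("verb", {"e", "es", "ons", "ez", "ent"}),
    "chat": ("noun", {"", "s"}),
}
OPEN_CLASSES = ["noun", "verb", "adjective"]  # closed classes are excluded


def classify(word):
    """Return (status, root, candidate classes) for one input word."""
    # Try the longest prefix first so that the longest known root wins.
    for cut in range(len(word), 0, -1):
        root, suffix = word[:cut], word[cut:]
        if root in ROOTS:
            word_class, suffixes = ROOTS[root]
            status = "known" if suffix in suffixes else "incorrect ending"
            return status, root, [word_class]
    # No root matched: guess every open class instead of attempting a
    # respelling, which could wrongly "correct" a legitimate word.
    return "unknown", None, list(OPEN_CLASSES)
```

     With this table, "parlons" is known, "parlont" has a
recognized root with an incorrect ending, and "ordinateur"
is treated as unknown and offered to every open class.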




2.1.4 Error Messages

     In addition to the corrected form of the sentence,
output to the user includes comments of two types: the
system itself will point out the nature of any corrections it
has  carried out; the grammar writer may include comments to
be issued if certain expected  mistakes  are  found  in  the
input.  The  key  point here is that the inclusion of such a
bug catalogue, and the  extent  of  its  coverage,  is  left
entirely  to  the  discretion  of  the  grammar writer: this
avoids the criticism levelled against systems which  require
the  anticipation of errors, while still permitting a degree
of tutorial control.
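
     The two sources of comment might be combined as in the
following sketch. The catalogue format (a predicate paired
with a message), the single French entry and the wording of
the system comment are assumptions; the point illustrated is
only that the catalogue is optional and may be empty.

```python
def de_le(words):
    """True when 'de' is immediately followed by 'le'."""
    return any(a == "de" and b == "le" for a, b in zip(words, words[1:]))


# Hypothetical bug catalogue: (predicate on the input words, tutorial comment).
BUG_CATALOGUE = [(de_le, "'de' followed by 'le' contracts to 'du'.")]


def comments_for(original, corrected, catalogue=()):
    """Collect the system's own comment first, then grammar-writer comments."""
    comments = []
    if original != corrected:
        comments.append(f"Corrected to: {' '.join(corrected)}")
    comments += [message for test, message in catalogue if test(original)]
    return comments
```

     Called without a catalogue, comments_for still reports
the correction itself; supplying BUG_CATALOGUE adds the
tutorial remark only when the anticipated mistake is present.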




2.2 System Design

     The configuration of the system consists of three  main
modules:  the  language  specific  dictionary,  the language
specific grammar (strong and weak syntax) and  the  language
independent shell, as shown in Figure 1.


2.2.1 The Dictionary

     The dictionary is one of  the  two  language  dependent
data  files  required  by  the language independent shell to
function for a particular language. It serves two  purposes:
firstly,  and  most obviously from its name, it contains all
the words in the specific language which are  known  to  the
system;  secondly,  it  holds  information  relevant to each
word, including what modifications can be made to that  word
together  with  their significance. This information is usu-
ally, but not necessarily, grammatical in nature, but  there
is a distinction between this information and that contained
in the language's grammar file. In the latter, the  informa-
tion  is  concerned  with  what legal grammatical structures
(e.g. sentences, noun phrases, verb  phrases  etc.)  may  be
formed  in the language and how they are put together, while
in the former it is concerned with what individual words are
permissible  in  the language, how they may be modified, and
what significance each such modification entails. Hence, the
distinction   is   that   the  grammar  file  specifies  the
language's non-terminals and provides  an entry-point to the
terminals,  while the dictionary file deals with the precise
form which the terminals may take.


2.2.2 The Grammar

     The grammar file is the  second  of  the  two  language
dependent  files  required by the language independent shell
to function for a  particular  language  and  includes  both
strong and weak syntax. It serves two functions: firstly, to
permit the grammar writer to specify what grammatical
constructs exist in a given language and how they may be
combined to form legal sentences, noun phrases, verb phrases
etc. in the language (strong syntax); and secondly, to allow
him to indicate what rules must be obeyed to produce
correctly formed sentences within the general framework of
the constructs permitted (weak syntax), e.g. appropriate
numbers, genders, cases, auxiliary verbs etc.

     If a grammar writer wishes to anticipate certain common
errors  he  may include messages to be presented to the end-
user if the input exhibits the appropriate features. Such  a
bug catalogue is represented in Figure 1 as a sub-section of
the weak syntax since the format for these comments is simi-
lar to that for weak syntactic checks.


2.2.3 The Shell

     The shell is  the  language  independent  core  of  the
LINGER  system.  All  language dependent information must be
located only in the dictionary and  grammar  files  and  so,
without  them, the shell cannot function. The shell contains
routines for such actions as  accepting  the  user's  input,
attempting  to  parse  the  input,  reforming  the  sentence
correctly, comparing the new version(s) with the input  ver-
sion, and producing the final output to the user. It expects
certain data to be present in the language  dependent  files
if  it  is  to function properly, but is said to be language
independent because none of the expected  data  is  language
specific. It controls the flow of the program and is respon-
sible for  ordering  and  guiding  decisions  at  points  of
choice.
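
     The division of labour between shell and language
files can be illustrated with a small sketch in which all
language-specific material is plain data and the driver
never mentions a word or a grammatical class by name. The
Language record, the two toy configurations and the flat
pattern list are assumptions for illustration; LINGER's own
files use its dictionary and grammar formalisms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Language:
    """Everything language-specific lives here, as plain data."""
    dictionary: dict  # word -> grammatical class
    patterns: tuple   # legal sequences of classes (strong syntax)


FRENCH = Language({"le": "det", "chat": "noun", "dort": "verb"},
                  (("det", "noun", "verb"),))
GERMAN = Language({"der": "det", "Hund": "noun", "bellt": "verb"},
                  (("det", "noun", "verb"),))


def shell(language, sentence):
    """Language-independent driver: accepts or rejects a sentence."""
    classes = tuple(language.dictionary.get(word, "unknown")
                    for word in sentence.split())
    return classes in language.patterns
```

     The same shell function serves both configurations:
shell(FRENCH, "le chat dort") and shell(GERMAN, "der Hund
bellt") both succeed, and without a Language value the shell
has nothing to work with.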

     Figure  2  gives  an  overall  representation  of   the
behavior of the shell.




2.3 System Operation

     LINGER's first action is to determine the language
desired by the user. Once the appropriate files have been
loaded, LINGER is said to be configured for that language for
the rest of that session of interaction with the user. LINGER
then repeatedly accepts input from the user and processes
it, giving fresh output and returning the database to its
initial state (i.e. as it was at the commencement of the
first processing of input), until the user terminates the
session. Input, processing and output are performed in real
time.

     Figure 3 gives a sample interaction with LINGER,  using
the French grammar and dictionary.





     Processing a  sentence  comprises  three  phases:  pre-
parsing, parsing and choosing the 'correct' sentence.

     Pre-parsing   involves   attempting    pattern-matching
between  words  found  in the input and those present in the
dictionary. Relevant information is extracted from the  dic-
tionary  for  subsequent  use by the system so that the dic-
tionary need not be consulted again.

     The parsing phase consists of two parts. The first is
to try to find a legal parse for the input: either the input
is structurally well-formed, in which case this succeeds in
a straightforward manner, or else there is a strong
syntactic error and a recovery mechanism must be invoked.
Once this succeeds, the result is essentially a parse tree
containing the non-terminals specified in the grammar and the
roots of the words encountered in the input. In the second
part, relevant 'checks' specified by the grammar writer are
sought and applied and, as a result, a new parse tree is
built whose words are derived from the roots in the input
and modified so that they comply with the checks. When this
is complete, a new, alternative legal parse is sought and,
if one exists, the whole process is repeated until no more
exist.

     Choosing the correct sentence involves looking  through
all  the  possible  parses  delivered by the parse phase and
choosing  which  one  of  them  is  to  be  considered  most
appropriate.  Final  selection between alternatives within a
parse still remaining after the application  of  the  checks
must  also  be  carried  out  at  this  stage. Once a single
corrected version has been chosen, it must be shown  to  the
user together with any comments pertaining to it.
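
     The final selection step might look like the sketch
below. The selection criterion used here (fewest words
changed relative to the input, ties broken by order of
discovery) is an assumption; the source does not state which
rule LINGER applies when choosing between parses.

```python
def choose(candidates, original):
    """Pick the corrected sentence that differs least from the input.

    Assumed criterion: the number of word positions that differ, plus
    any difference in length. This is not taken from the source.
    """
    def changes(candidate):
        differing = sum(a != b for a, b in zip(candidate, original))
        return differing + abs(len(candidate) - len(original))

    return min(candidates, key=changes)
```

     For the input "le petit maison", a singular reading
that changes two words would be preferred over a plural
reading that changes all three.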




3. Current and Future Developments

     Although LINGER is a  workable  system  as  it  stands,
there  are  several  areas in which development is necessary
before it can be regarded as a viable prototype for  practi-
cal use. Research is under way in the following areas:

(i) It is intended to  completely  rewrite  large  parts  of
LINGER's code, so as to improve efficiency, while  retaining
the basic overall structure of the system.

(ii)  Tutoring  strategies,  virtually  nonexistent  in  the
current system, are to be introduced.

(iii) Research is being carried out in an attempt to replace
the  pre-stored  natural  language  responses to errors with
more meaningful and relevant tutorial explanations.





(iv)  The  dictionaries  and  grammars  currently  available
(French,  German,  Spanish,  Italian)  are to be extended to
cover specific topics  and  situations.  This  will  improve
LINGER's eventual usefulness in a learning environment.

(v) An extensive English grammar and dictionary are to be
introduced, as is a bug catalogue based on common
misconceptions held by learners of English as a second
language (ESL). As an offshoot of the research on LINGER, it
is hoped that an intelligent grammar-checking word processor
package to aid ESL learners will be produced.





References


Barchan, J. (1987) "Language INdependent Grammatical Error
Reporter". M.Phil. Thesis, Dept. of Computer Science,
University of Exeter.

Harris, M. (1985) "Introduction to Natural Language
Processing". Reston: Reston Publishing Company Inc.

Imlah, W. and du Boulay, J. (1985) "Robust Natural Language
Parsing in Computer-Assisted Language Instruction". System,
Vol. 13, No. 2, pp. 137-147.

Tennant, H. (1981) "Natural Language Processing".
Princeton, N.J.: Petrocelli.

Weischedel, R., Voge, W. and James, M. (1978) "An Artificial
Intelligence Approach to Language Instruction". Artificial
Intelligence, Vol. 10, No. 3, pp. 225-240.

Woodmansee, B.J. (1985) "French Natural Language Parser".
Working Paper No. 139, Dept. of Computer Science, University
of Exeter.




Acknowledgments

     The authors wish to acknowledge the continuing support
of their colleagues: Keith Cameron, Paul O'Brien, Derek
Partridge, Josephine Uren and Masoud Yazdani. They are also
grateful for financial support from the SERC, ESRC and the
Manpower Services Commission.


































































