
 plhip - a positive version of left-head corner island parser (compiler)
			       Version 1.0

			  (C) 1993, Afzal Ballim


				 CONTENTS

1.....................Introduction
2.....................Grammar Rules
3.....................Compiling Grammar Rules
4.....................Parser Input
5.....................Parsing
6.....................Threshold Coverage


1. Introduction
===============

This document describes the plhip compiler which turns an annotated
"CFG"-like grammar into an island parser in Prolog.  The program has been
tested under Sicstus 0.6 but should run under any compatible Prolog.  The
system has the following features:

    o negation is NOT permitted; this is the major difference
      between plhip and lhip.  As a result, plhip is more efficient
      but slightly less powerful;

    o rules can have multiple "heads", which are searched
      for before non-heads (left to right order);

    o all other things being equal, clauses in a rule are
      evaluated left-to-right (heads first) and depth-first;

    o rules can have optional parts to them;

    o disjunction is permitted in rules;

    o Prolog code can be embedded in rules;

    o rules do not have to cover all of the input in order
      for them to succeed. More specifically, the constraints
      generated by islands are such that islands do not have
      to be adjacent, but may be separated by non-covered input.
      However, it is trivial to generate only complete-coverage
      parses if necessary;

    o there are two notions of non-coverage of the input,
      sanctioned and unsanctioned non-coverage. Unsanctioned
      non-coverage is that described in the previous item;

    o sanctioned non-coverage means that specific "ignore" rules have
      been applied to ignore parts of the input so that islands are
      adjacent.  These ignore rules can be called individually
      (specific ignore) or as a group (general ignore);

    o the bodies of ignore rules are the same as those of
      normal rules; the ignore marking is just a notational
      convention that flags certain rules as special;

    o grammar rules compile into a Prolog program;

    o real adjacency between rhs clauses can be specified
      (i.e., it is possible to specify that two rhs objects
      must have no intervening terminals);

    o it is possible to define global and local thresholds 
      for the proportion of the spanned input that must be 
      covered by rules.

2. Grammar Rules
================

A normal plhip grammar rule has the form

                  Head ~~> Body.

where Head is a valid Prolog term (e.g.  predicate(arg0,..,argn)).  Valid
"Body" definitions are given below.  This parser is an island parser using
the weak constraint of RELATIVE POSITION rather than ADJACENCY as the
default.  In other words, islands constrain each other with respect to
relative position, but do not have to be adjacent.  Real adjacency is
specified by joining RHS clauses with ``:''.  There is, therefore, the issue
of how much of the input is spanned by a grammar rule, and how much of it
is covered.

In the description here, a grammar rule is said to produce an island which
SPANS input terminals i to i+n if the island starts at terminal i, and the
terminal i+n is the terminal immediately to the right of the last terminal
of the island. A rule is said to COVER m items if m terminals are consumed
in the span of the rule. Thus m <= n. If m=n then the rule has completely
covered the span.
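
As a concrete reading of these definitions, the arithmetic can be sketched
in plain Prolog.  The predicate names below are illustrative only and are
not part of plhip; positions are assumed to be numbered from 1, with the
end of a span being exclusive.

    % span_length(+Start, +End, -N): an island spanning Start to End
    % (End exclusive) spans N terminals.
    span_length(Start, End, N) :-
        N is End - Start.

    % uncovered(+Start, +End, +Cover, -U): U terminals in the span
    % are not consumed by the rule (n - m in the text above).
    uncovered(Start, End, Cover, U) :-
        span_length(Start, End, N),
        Cover =< N,
        U is N - Cover.

    % complete(+Start, +End, +Cover): the rule covers its whole
    % span (m = n).
    complete(Start, End, Cover) :-
        uncovered(Start, End, Cover, 0).

For example, the query uncovered(1, 5, 3, U) gives U = 1: the island spans
four terminals and covers three of them, so complete(1, 5, 3) fails.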

This immediately leads to the notion of input items left non-covered by a
grammar.  A distinction is made here between two categories of non-coverage:

    o Unsanctioned non-coverage;

    o Sanctioned non-coverage.

The first of these categories, unsanctioned non-coverage, refers to those
items of the input that are not accounted for by the grammar.  The second of
these categories, sanctioned non-coverage, refers to items of the input that
are ignored by way of special ``ignore'' grammar rules.  Such a rule has the
form:

                 -Head ~~> Body.

The purpose of such rules is to sanction certain ``repairs'' to the input
that may be necessary due, for example, to the input being a transcription
of dialogue.  Ignore rules, contrary to their name, do not actually delete
input, they merely mark it as being covered by the grammar.  They are thus
more of a notational convenience than anything else.


2.1. The Body of a Rule
-----------------------

We will first of all give a syntactic description of the body of a rule
using a pseudo-CFG format.  In addition to these rules, a valid body must
contain at least one clause of one of the following atomic forms:

    pterm
    * pterm
    @ terminal

The rules are:

plhip_body --> plhip_clause
plhip_body --> plhip_clause,plhip_body
plhip_body --> plhip_body;plhip_body
plhip_body --> plhip_body : plhip_body
plhip_body --> (plhip_body;plhip_body)

plhip_clause --> pterm
plhip_clause --> * pterm
plhip_clause --> @ terminal
plhip_clause --> * @ terminal
plhip_clause --> - pterm
plhip_clause --> []
plhip_clause --> (? plhip_body ?)
plhip_clause --> !
plhip_clause --> {P}

The definitions of plhip_body, then, are similar to the definitions of the
body of a Prolog clause:  a plhip_body can contain 1 item, a conjunction of
items (using ``,'') or a disjunction (using ``;'').  In addition, a
plhip_body can contain two parts linked by ``:'' which indicates that the
constituents are adjacent (there are no untreated terminals between them).

We will go through each of the possibilities for a clause in turn.


2.1.1. plhip_clause --> pterm

This definition states that a plhip clause may be a simple Prolog term.
For example, the name of a rule,

                 np_rule

or the name of a rule with arguments,

              np_rule(Noun,Determiner,Mods).

We will refer to this as a normal rule invocation.


2.1.2. plhip_clause --> * pterm

This definition states that a clause can be a simple Prolog term preceded by
the symbol ``*'' and it is used to indicate that pterm is a HEAD rule
invocation.  A head rule invocation indicates that the current rule
critically depends on the success of the named sub-rule.  Those sub-rules
nominated as HEADS are evaluated before other rules in the parsing of input.
This is not necessarily the same as the notion of head used in HPSG, for
example.  In particular, there is no onus on head-rules to share any
structure with the parent rule.


2.1.3. plhip_clause --> @ terminal
       plhip_clause --> * @ terminal

These definitions are used to refer to terminals in the input.  The second
definition is used to declare a particular terminal as being a HEAD for
the rule. 


2.1.4. plhip_clause --> -pterm

A clause of this form is used to invoke a specific ignore rule. For
example, if an ignore rule ``hmm'' has been declared which accounts for
certain verbal pauses, then it can be invoked by

    -hmm.

An ignore rule will always succeed, but does not have to cover input.  In
other words if all else fails an ignore rule will succeed by covering
nothing.
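
A minimal sketch of how such an ignore rule might be declared and invoked,
using only the notation described in this section.  The rule bodies and the
names determiner/1 and noun/1 are illustrative assumptions, and the fragment
is only meaningful once plhip has been loaded:

    % Ignore rule: marks a verbal pause as covered input.
    -hmm
    ~~>
        ( @eh ; @um ).

    % A noun phrase that tolerates a pause after the determiner.
    np(np(N,Det))
    ~~>
        determiner(Det),
        -hmm,
        * noun(N).

With input such as [the,eh,ball], the intent is that ``eh'' is counted as
covered (sanctioned non-coverage) rather than being left unaccounted for.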


2.1.5. plhip_clause --> []

This clause is used to invoke any ignore rule at the indicated point.
All the defined ignore rules will be tried in order.


2.1.7. plhip_clause --> (? plhip_body ?)

This definition shows how optional parts of a rule are indicated. 


2.1.8. plhip_clause --> !

The Prolog cut symbol ``!'' can be directly used in plhip rules and has the
same functionality as in Prolog. However, care must be taken in using it
since it ``freezes'' the current rule as being the only one to succeed
within its search space.


2.1.9. plhip_clause --> {P}

Prolog code can be embedded in a rule by enclosing it in { }.  There is an
important point to note about this, as well as about the use of !.  When a
rule is compiled it becomes Prolog code.  However, the order of clauses in
the Prolog code may not be the same as that of the clauses in the rule.  In
particular, heads in a rule are moved to the front of the Prolog code.
Thus, embedded code and cuts will be evaluated AFTER heads which follow them
in the rule.
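
The effect of this reordering can be pictured with a small stand-alone
Prolog sketch.  It is not plhip's compiler: the list representation and the
head/1 wrapper are assumptions made purely for illustration.

    :- use_module(library(lists)).   % for append/3

    % split_heads(+Clauses, -Heads, -Others): separate the clauses
    % marked head(_) from the rest, preserving left-to-right order.
    split_heads([], [], []).
    split_heads([head(C)|Cs], [head(C)|Hs], Os) :- !,
        split_heads(Cs, Hs, Os).
    split_heads([C|Cs], Hs, [C|Os]) :-
        split_heads(Cs, Hs, Os).

    % heads_first(+Clauses, -Ordered): heads are moved to the front.
    heads_first(Clauses, Ordered) :-
        split_heads(Clauses, Heads, Others),
        append(Heads, Others, Ordered).

The query heads_first([code(test), head(noun), cut], O) gives
O = [head(noun), code(test), cut]: the embedded code and the cut, although
written first, now come after the head.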


2.2. Example Rules
------------------

    s(conjunct(Conj,Sl,Sr))
    ~~>
        s(Sl),
          * conjunction(Conj),
        s(Sr).

At first sight this rule appears left recursive.  However, the sub-rule
``conjunction(Conj)'' is marked as a head and therefore is evaluated before
either of ``s(Sl)'' or ``s(Sr)''.  Presuming that the conjunction-rule does
not end up invoking (directly or indirectly) the s-rule, then the s-rule
is not left-recursive. On the other hand, if the conjunction-rule were not
marked as a head, then the s-rule would be left recursive and would not
terminate.

    s(if(P,Q))
    ~~>
        * @if,
          s(P),
        * @then,
          s(Q).

This rule shows how multiple heads can be used. The terminals ``if'' and
``then'' will be searched for in the input. If found, they will form
boundaries for searching for s(P) and s(Q).

    np(propernoun(N,Mods))
    ~~>
        (? adjectives(Mods) ?),
        * noun(N).

This rule illustrates optional forms: there can be optional adjectives
before the noun.

    noun(X)
    ~~>
        ( * @pussy, (? @cat ?);
          * @cat),
        {X=cat}.

This rule illustrates the use of disjunction and embedded Prolog code.  It
should be noted that within the scope of a disjunction, a head is local to
the disjunct.

    noun(missionary_camp)
    ~~>
        @missionary : @camp.

This rule illustrates a typical use of adjacency, to specify compound nouns.
Adjacency is not restricted to such a use, however, and may be used
anywhere.


3. Compiling Grammar Rules
==========================

The plhip system was written under Sicstus Prolog, v0.6, but it should work
under any Edinburgh-syntax Prolog which supports the user-defined
term_expansion/2 predicate.  In addition it requires definitions for
member/2 and append/3.

Once the code for plhip has been loaded, a file containing grammar rules can
be consulted in the usual way.  The rules will automatically be converted
into Prolog code.  The code (if interpreted) can be listed for inspection.
A certain amount of syntax checking is performed by plhip, but don't expect
too much from it. 

The grammar is then ready for use in parsing.


4. Parser Input
===============

The input to the parser is a Prolog list, where each element of the list
represents a terminal in the input. For example, the sentence:

             John saw the red rabbit.

could be written as:

               ['John',saw,the,red,rabbit].

** Note that since the system compiles to Prolog code, words that begin
with capital letters must be quoted to prevent them being interpreted as
variables.

The plhip system does not specify how the input should look.  This is to
allow for various forms of pre-processing that the user might wish to
perform, such as morphological analysis.  So, the above sentence might be
pre-processed and written as:

     [w(np1,'John'),w(vvd,see),w(at0,the),w(aj0,red),w(nn1,rabbit)].

There is only one constraint on the input, which is to allow for ambiguity
in the input (as might be caused by POS tagging). An ambiguous terminal is
marked by an embedded list. So, the sentence ``Right John'' might be
represented as:

  [[w(vvb,right),w(nn1,right),w(av0,right),w(aj0,right)],w(np1,'John')].

If it is desired to have a Prolog list AS a terminal, then it must be
escaped by having it enclosed in a list. For example, if the morphological
analysis and part-of-speech tagging produced a result of the form:

                [word, head, tag]

then the above sentence could have the following representation:

[[[right,right,vvb],[right,right,nn1],[right,right,av0],[right,right,aj0]],
 ['John','John',np1]].
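
One way to read this convention, sketched in stand-alone Prolog: every
top-level element that is itself a non-empty list is a set of alternatives,
and anything else is a single terminal.  The predicate reading/2 is
hypothetical and only illustrates the encoding; it says nothing about how
plhip processes its input internally.

    :- use_module(library(lists)).   % for member/2

    % reading(+Input, -Terminals): choose one alternative at every
    % ambiguous position; backtracking enumerates all readings.
    reading([], []).
    reading([[A|As]|Rest], [T|Ts]) :- !,
        member(T, [A|As]),
        reading(Rest, Ts).
    reading([T|Rest], [T|Ts]) :-
        reading(Rest, Ts).

Under this reading, reading([[a,b],c], R) yields R = [a,c] and then
R = [b,c], while reading([[[word,head,tag]]], R) yields the single reading
R = [[word,head,tag]], i.e. the escaped list is one terminal.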


5. Parsing
==========

The current version of plhip provides a number of ways of applying a
grammar to some input. 

5.1. Interface Predicates
-------------------------

plhip_phrase(+Cat,+Sent)
-----------------------

The input Sent is checked against grammar rule Cat.  The predicate succeeds
if Sent satisfies Cat.  For example, with an appropriate grammar the
invocation

    plhip_phrase(np(NP),[the,red,ball])

might succeed with NP=np(ball,det(the),[red]).


plhip_phrase_rp(+Cat,+Sent)
--------------------------

This is the same as plhip_phrase/2 except that the chart is not cleared out
before the grammar is applied to the input.  For information about the
chart, see 5.2.


plhip_cv_phrase(+Cat,+Sent)
--------------------------

Succeeds if Sent can be parsed by Cat and all the input is covered.


plhip_phrase(+Cat,+Sent,-SpanStart,-SpanEnd,-Cover)
--------------------------------------------------

Same conditions as for plhip_phrase/2, except that SpanStart is bound to
the beginning of the input spanned by Cat (indexed from 1), SpanEnd is bound
to the point just beyond the end of the input spanned by Cat, and Cover is
bound to the number of items of the input covered by Cat in the span. For
example, the invocation

    plhip_phrase(np(NP),[the,eh,red,ball],SS,SE,C)

might succeed with
    NP=np(ball,det(the),[red])
    SS=1
    SE=5
    C =3

meaning that the input item ``eh'' is not covered by the grammar.

plhip_phrase_rp/5 is the same as this predicate except that it does not
clear the chart before applying the grammar rules.


plhip_mc_phrases(+Cat,+Sent,-Cover,-Phrases)
-------------------------------------------

The maximal coverage of Sent by Cat is Cover. Phrases is the set
of parses of Sent by Cat with coverage Cover.

plhip_minmax_phrases(+Cat,+Sent,-Cover,-Phrases)
-----------------------------------------------

Similar to plhip_mc_phrases/4 except that Phrases is the set of parses of
maximal coverage which span the fewest input items.
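
The selection these two predicates perform can be sketched in stand-alone
Prolog over an explicit list of results.  The parse(Start,End,Cover,Tree)
terms and the predicate names are assumptions for illustration; plhip's own
implementation may work quite differently.

    :- use_module(library(lists)).   % for member/2

    % max_cover(+Parses, -Max): the largest Cover among the parses.
    max_cover([parse(_,_,C,_)|Ps], Max) :-
        max_cover(Ps, C, Max).

    max_cover([], Max, Max).
    max_cover([parse(_,_,C,_)|Ps], Best0, Max) :-
        ( C > Best0 -> Best = C ; Best = Best0 ),
        max_cover(Ps, Best, Max).

    % mc(+Parses, -Cover, -Best): the parses with maximal coverage.
    mc(Parses, Cover, Best) :-
        max_cover(Parses, Cover),
        findall(parse(S,E,Cover,T),
                member(parse(S,E,Cover,T), Parses),
                Best).

    % minmax(+Parses, -Cover, -Best): maximal coverage, shortest span.
    minmax(Parses, Cover, Best) :-
        mc(Parses, Cover, MC),
        findall(L, ( member(parse(S,E,_,_), MC), L is E - S ), [L0|Ls]),
        min_len(Ls, L0, Min),
        findall(parse(S,E,Cover,T),
                ( member(parse(S,E,Cover,T), MC), E - S =:= Min ),
                Best).

    % min_len(+Lengths, +SoFar, -Min): smallest span length seen.
    min_len([], Min, Min).
    min_len([L|Ls], Min0, Min) :-
        ( L < Min0 -> Min1 = L ; Min1 = Min0 ),
        min_len(Ls, Min1, Min).

Given [parse(1,5,3,a), parse(1,4,3,b), parse(2,4,2,c)], mc/3 keeps the
first two parses (coverage 3), and minmax/3 keeps only parse(1,4,3,b),
which reaches the same coverage over a shorter span.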


plhip_seq_phrase(+Cat,+Sent,-Seq)
--------------------------------

Succeeds if Seq is a sequence of one or more parses of Sent by Cat such
that they are non-overlapping and each consumes input that precedes that
consumed by the next.
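
The ordering condition on Seq can be stated as a stand-alone check.  The
span(Start,End) terms (End exclusive) and the predicate name are
assumptions for illustration and do not reflect how plhip represents Seq:

    % ordered_disjoint(+Spans): each span ends at or before the
    % start of the next, so the parses cannot overlap.
    ordered_disjoint([]).
    ordered_disjoint([_]).
    ordered_disjoint([span(_,E1),span(S2,E2)|Rest]) :-
        E1 =< S2,
        ordered_disjoint([span(S2,E2)|Rest]).

Thus [span(1,3),span(3,5),span(6,8)] is an acceptable sequence, whereas
[span(1,4),span(3,5)] is not, because the two spans share terminal 3.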

plhip_maxT_phrases(+Cat,+Sent,-MaxT)
-----------------------------------

MaxT is the set of parses of Sent by Cat that have the maximum threshold
(i.e. the proportion of uncovered to covered terminals in the span is
minimal). On backtracking it returns the set with the next highest
threshold.

5.2. The Chart
--------------

To improve efficiency, plhip maintains a history of rule applications that
have previously succeeded or failed.  This history is referred to as the
chart, and it prevents deep searches that have previously failed or
succeeded from being repeated. The predicate plhip_phrase (both /2 and /5)
clears the chart before applying the grammar rules to some input; none of
the other interface predicates do so. The chart may be explicitly cleared
by the predicate plhip_clear_chart/0.

The chart can be examined, as it is maintained as Prolog clauses. The
following predicates are given to make it easier to examine the chart:

plhip_success        -- lists all successful rules;
plhip_success(N)     -- lists all instances of rule N succeeding;
plhip_ms_success     -- lists the most specific successful rules (rules
                        that have succeeded, but whose results are not
                        used elsewhere);
plhip_ms_success(N)  -- lists the most specific successes of rule N.

Each rule is given an identifying number when it is compiled, and they are
listed for you at compile time. An example of using plhip_success/1 might
be as follows:

| ?- plhip_success(11).
(11) [1--2) /1 ~~> np(proper(john,_148))
(11) [2--3) /1 ~~> np(proper(mary,_89))

where rule 11 was of the form 

np(NP) ~~> propernoun(NP).

As we can see, some extra information is contained in the result.
The general form of a listing of a successful rule is:

(RN) [FS--FE) /C ~~> Rule

where RN = rule number
      FS = the start point of the found result
      FE = the end of the found result (exclusive)
      C  = the number of terminals covered between FS and FE

The maximum number of terminals that any rule can cover is the difference
between the end and the start of the successful span.  Thus, both successes
shown above had maximal coverage.

Rules that have failed can also be examined using analogues of the above:

plhip_failure/0
plhip_failure/1

an example being:

| ?- plhip_failure(24).
(24) 1-><-1 ~~> adjectives([A1000])
(24) 1-><-2 ~~> adjectives([A1000])

In the case of failure there is no need to give a cover, and the span
indicates the region over which the rule failed.  In addition, unbound
variables are represented by terms such as A1000, B1000, etc.

6. Threshold Coverage
=====================

The plhip system allows the user to define the minimum threshold of coverage
that must be obtained for a rule to succeed.  By this, we mean that the
percentage of a successful span that must be covered by a rule can be
configured by the user. 

By default in plhip the only restriction on a successful span is that at
least one terminal item in the span is covered.  In other words, islands are
only weakly constrained by relative position to each other, and moreover,
there can be uncovered items in an island.  This is illustrated in the
following diagram, where ``-'' indicates an uncovered item in an island,
``+'' a covered item in an island, and ``.'' an item between islands.

            island1            island2
               ^                 /\
              / \               /  \
             /   \             /    \
             +---+ . . . . . . +-+--+

A threshold is a minimum fraction of the span that must be covered in order
for a rule to succeed.  This can be set globally in plhip, and the global
value can be overridden within a particular rule.  The global value can be
set and interrogated by the predicate plhip_threshold/1.  If its argument is
unbound, then plhip_threshold will bind the argument to the current global
threshold value (a number between zero and one).  If the argument is bound,
then plhip_threshold will set the global threshold to this value.

So, for example,

        plhip_threshold(X)

can be used to find the current global threshold value, while

        plhip_threshold(0.55)

sets the global threshold value to 0.55.
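
The condition a threshold imposes can be written out as a small
stand-alone check.  This is a sketch of the definition given above (covered
terminals as a fraction of the span), not plhip's own code, and
meets_threshold/4 is a hypothetical name:

    % meets_threshold(+Start, +End, +Cover, +Threshold):
    % the covered fraction of the span [Start,End) is at least
    % Threshold.  Written without division: Cover >= Threshold * Span.
    meets_threshold(Start, End, Cover, Threshold) :-
        Span is End - Start,
        Span > 0,
        Cover >= Threshold * Span.

A span of four terminals with three covered (a fraction of 0.75) therefore
satisfies a threshold of 0.55 but not one of 0.9.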

The global threshold value can be overridden within a rule by setting the
rule's local threshold value explicitly. This is done by means of a special
syntax. For example, suppose that the grammar writer was satisfied that
the rules for NPs could cover any real NPs. The NP rules could then have
high thresholds set, as in the following:

np(np(N,Det,Mods)) # 0.9
~~>
    determiner(Det),
    (? adjectives(Mods) ?),
    * noun(N).

Here, the threshold value of 0.9 will be statically compiled into the
rule. It is also possible to evaluate a threshold dynamically at runtime.
However, no checking is done to see whether the value has been set, and an
error will occur if it has not. Here is the above rule with dynamic
threshold setting:

np(np(N,Det,Mods)) # Threshold
~~>
    determiner(Det),
    (? adjectives(Mods) ?),
    * noun(N),
    {set_my_dynamic_threshold(Threshold)}.

