Our first approach is based on the following observation: In any
particular sequence of words to be parsed, most of the disjuncts are
irrelevant for the simple reason that they contain a connector that
does not match any other connector on a word in the sequence. To be
more precise, suppose that a word
has a disjunct
with a
connector C in its right list. If no word to the right of
has a connector (pointing to the left) that matches C, then the
disjunct
cannot be in any linkage. This disjunct can therefore be
deleted without changing the set of linkages. Deleting such a
disjunct is called a pruning step. pruning consists
of repeating the pruning step until it can no longer be applied.
The set of disjuncts left (after pruning is complete) is independent of the order in which the steps are applied. (The pruning operation has the Church-Rosser property.) We therefore choose an ordering that can be efficiently implemented. It would be ideal if we could achieve a running time for pruning that is linear in the number of connectors. The scheme we propose satisfies no useful a-priori bound on its running time, but in practice it appears to run in linear time.
A series of sequential passes through the words is made, alternately
left-to-right and right-to-left. The two types of passes are
analogous, so it suffices to describe the left-to-right pass. The
pass processes the words sequentially, starting with word
.
Consider the situation after words
have been
processed. A set
of connectors has been computed. This is the
set of connectors that exists on the right lists of the disjuncts of
words
that have not been deleted. To process word
,
we consider each disjunct
of
in turn. For each connector
on the left list of
, we search the set
to see if it contains a
connector that matches
. If one of the connectors of
matches
nothing in
, then we apply the pruning step to
(we remove
).
Each right connector of each remaining disjunct of
is now
incorporated into the set
. This completes the processing of word
.
The function computed by this left-to-right pass is idempotent, which is another way of saying that doing the operation twice in a row will be the same as doing it once. Therefore if (as we alternate left-to-right and right-to-left passes) a pass (after the first one) does nothing, then all further passes will do nothing. This is how the algorithm decides when to stop.
The data structure used for the set
is simply a hash table, where
the hash function only uses the initial upper-case letters of the
connector name. This ensures that if two connectors get hashed to
different locations, then they definitely don't match.
Although we know of no non-trivial bound on the number of passes, we
have never seen a case requiring more than five.
Table
shows a typical example of the reduction in the
number of disjuncts achieved by pruning.
Table:
This table shows the number of disjuncts remaining on each word of the
sentence Now this vision is secular, but deteriorating economies
will favor Islamic radicalism. (The first number is for the
wall which has not been described in this paper. Of course the
comma also counts as a word.) The fourth pass of pruning has no
effect, so pruning stops. The last row in the table shows the number
of disjuncts that remain after power pruning.