Pruning

Next: The fast-match data Up: Speeding it up Previous: Speeding it up

Pruning

Our first approach is based on the following observation: In any particular sequence of words to be parsed, most of the disjuncts are irrelevant for the simple reason that they contain a connector that does not match any other connector on a word in the sequence. To be more precise, suppose that a word has a disjunct with a connector C in its right list. If no word to the right of has a connector (pointing to the left) that matches C, then the disjunct cannot be in any linkage. This disjunct can therefore be deleted without changing the set of linkages. Deleting such a disjunct is called a pruning step. pruning consists of repeating the pruning step until it can no longer be applied.

The set of disjuncts left (after pruning is complete) is independent of the order in which the steps are applied. (The pruning operation has the Church-Rosser property.) We therefore choose an ordering that can be efficiently implemented. It would be ideal if we could achieve a running time for pruning that is linear in the number of connectors. The scheme we propose satisfies no useful a-priori bound on its running time, but in practice it appears to run in linear time.

A series of sequential passes through the words is made, alternately left-to-right and right-to-left. The two types of passes are analogous, so it suffices to describe the left-to-right pass. The pass processes the words sequentially, starting with word . Consider the situation after words have been processed. A set of connectors has been computed. This is the set of connectors that exists on the right lists of the disjuncts of words that have not been deleted. To process word , we consider each disjunct of in turn. For each connector on the left list of , we search the set to see if it contains a connector that matches . If one of the connectors of matches nothing in , then we apply the pruning step to (we remove ). Each right connector of each remaining disjunct of is now incorporated into the set . This completes the processing of word .

The function computed by this left-to-right pass is idempotent, which is another way of saying that doing the operation twice in a row will be the same as doing it once. Therefore if (as we alternate left-to-right and right-to-left passes) a pass (after the first one) does nothing, then all further passes will do nothing. This is how the algorithm decides when to stop.

The data structure used for the set is simply a hash table, where the hash function only uses the initial upper-case letters of the connector name. This ensures that if two connectors get hashed to different locations, then they definitely don't match.

Although we know of no non-trivial bound on the number of passes, we have never seen a case requiring more than five. Table shows a typical example of the reduction in the number of disjuncts achieved by pruning.

Table: This table shows the number of disjuncts remaining on each word of the sentence Now this vision is secular, but deteriorating economies will favor Islamic radicalism. (The first number is for the wall which has not been described in this paper. Of course the comma also counts as a word.) The fourth pass of pruning has no effect, so pruning stops. The last row in the table shows the number of disjuncts that remain after power pruning.

Next: The fast-match data Up: Speeding it up Previous: Speeding it up

Thu Oct 12 13:01:13 EDT 1995