The algorithm

Next: Speeding it up Up: Parsing English with a Previous: Relative Clauses

The algorithm

Our algorithm for parsing link grammars is based on dynamic programming. Perhaps its closest relative in the standard literature is the dynamic programming algorithm for finding an optimal triangulation of a convex polygon [][p. 320]CLR. It tries to build up a linkage (which we'll call a solution in this section) in a top down fashion: It will never add a link (to a partial solution) that is above a link already in the partial solution.

The algorithm is most easily explained by specifying a data structure for representing disjuncts. A disjunct has pointers to two linked lists of connectors. These pointers are denoted and . If is a connector, then will denote the next connector after in its list. The next field of the last pointer of a list has the value .

For example, suppose the disjunct =13((D,O) ()) (using the notation of section 2). Then would point to the connector O, would point to the connector D, and would be . Similarly, .

To give some intuition of how the algorithm works, consider the situation after a link has been proposed between a connector on word and a connector on word . (The words of the sequence to be parsed are numbered from to .) For convenience, we define and to be and respectively. The situation is shown in the following diagram:

Here the square boxes above the words and represent a data structure node corresponding to the word. The rectangular box above each of these represents one of the (possibly many) disjuncts for the word. The small squares pointed to by the disjuncts represent connectors.

How do we go about extending the partial solution into the region strictly between and ? (This region will be denoted .) First of all, if there are no words in this region (i.e. ) then the partial solution we've built is certainly invalid if either or . If then this region is ok, and we may proceed to construct the rest of the solution.

Now suppose that the region between and contains at least one word. In order to attach the words of this region to the rest of the sentence there must be at least one link either from to some word in this region, or from to some word in this region (since no word in this region can link to a word outside of the range, and something must connect these words to the rest of the sentence).

Since the connector has already been used in the solution being constructed, this solution must use the rest of the connectors of the disjunct in which resides. The same holds for . The only connectors of these disjuncts that can be involved in the region are those in the lists beginning with and . (The use of any other connector on these disjuncts in this region would violate the ordering requirement.) In fact, all of the connectors of these lists must be used in this region in order to have a satisfactory solution.

Suppose, for the moment, that is not . We know that this connector must link to some disjunct on some word in the region . (It can't link to because of the exclusion rule.) The algorithm tries all possible such words and disjuncts. Suppose it finds a word and a disjunct on such that the connector matches . We can now add this link to our partial solution.

The situation is shown in the following diagram.

./alg-2small.ps

How do we determine if this partial solution can be extended to a full solution? We do this by solving two problems similar to the problem we started with. In particular, we ask if the solution can be extended to the word range using the connector lists beginning with and . We also ask if the solution can be extended to the word range using the connector lists beginning with and . Notice that in the latter case, the problem we are solving seems superficially different: the boundary words have not already been connected together by a link. This difference is actually of no consequence because the pair of links ( to and to ) play the role that a direct link from to would play: (1) they separate the region from all the other words, and (2) they serve to connect the words and together.

We need to consider one other possibility. That is that there might be a solution with a link between words and and a link between words and . (This results in a solution where the word/link graph is cyclic.) The algorithm handles this possibility by also attempting to form a link between and . If these two match, it does a third recursive call, solving a third problem analogous to our original problem. In this problem the word range is and the connector lists to be satisfied begin with and .

A very similar analysis suffices to handle the case when is .

The algorithm described has an exponential worst-case running time as a function of , the number of words in the sequence to be parsed. This can easily be transformed into an efficient dynamic programming algorithm by using memoization ([p. 312]CLR).

The running time is now bounded by the number of different possible recursive calls multiplied by the time used by each call. A recursive call is completely determined by specifying the pointers and . (These uniquely determine and .) The cost of a given call is bounded by the total number of disjuncts in the sequence of words.

If we let be the number of disjuncts and be the number of connectors, then the running time is . For a fixed link grammar, and , so the running time is .

Our technical reports describe this algorithm in more detail. They contain pseudo-code for the algorithm [14], an argument for it's correctness [14] and an elegant recurrence for the number of linkages of a sentence [11].

After the algorithm above was implemented, we were interested in seeing how well it would work on sentences taken from newspapers and other natural sources. It quickly became clear that something else was needed to make the algorithm run faster on long sentences.

Next: Speeding it up Up: Parsing English with a Previous: Relative Clauses

Thu Oct 12 13:01:13 EDT 1995