
### Candidate Generation Phase

Initial candidate meanings for a phrase are produced by computing the maximally common substructure(s) of sampled pairs of representations of sentences containing that phrase. We derive common substructure by computing the Largest Isomorphic Connected Subgraphs (LICS) of two labeled trees, taking labels into account in the isomorphism. The analogous Largest Common Subgraph problem [Garey & Johnson, 1979] is solvable in polynomial time if, as we assume, both inputs are trees and K, the number of edges to include, is given. We therefore start with K set equal to the largest number of edges in the two trees being compared, test for common subgraph(s), and iterate down to K=1, stopping as soon as one or more subgraphs are found for a given K.

For the Prolog query representation, the algorithm is complicated somewhat by the presence of variables. We therefore augment LICS with a step similar to computing the Least General Generalization (LGG) of first-order clauses [Plotkin, 1970]. The LGG of two sets of literals is the least general set of literals that subsumes both. We extend this so that when a term in the argument of a literal is a conjunction, the algorithm tries all orderings of the conjuncts during matching.

Overall, our algorithm for finding the LICS between two trees in the Prolog representation first finds the common labeled edges and vertices as in standard LICS, treating all variables as equivalent. It then computes the Least General Generalization, with conjunction taken into account, of the resulting trees converted back into literals. For example, given the two trees:
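As a rough illustration of the iterative-K search, the sketch below (our own simplification, not the WOLFIE implementation) represents each labeled tree as a set of labeled edges and searches for the largest connected common edge set, decrementing K until a match is found. The tree labels are illustrative, and variable handling and the LGG step are omitted:

```python
from itertools import combinations

def connected(edges):
    """Check whether a set of (parent, child) edges forms one connected subgraph."""
    if not edges:
        return False
    nodes = {n for e in edges for n in e}
    seen = {next(iter(nodes))}
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            # Grow the component across any edge touching it.
            if (a in seen) != (b in seen):
                seen.update((a, b))
                changed = True
    return seen == nodes

def lics(tree1, tree2):
    """Simplified LICS: largest connected subset(s) of the shared labeled edges.

    Iterates K downward from the size of the shared edge set, returning
    all connected common subgraphs found at the first K that succeeds.
    """
    common = tree1 & tree2
    for k in range(len(common), 0, -1):
        found = [set(s) for s in combinations(common, k) if connected(set(s))]
        if found:
            return found
    return []

# Illustrative labeled trees (edge sets); labels are hypothetical.
t1 = {("answer", "largest"), ("largest", "state"), ("largest", "capital")}
t2 = {("answer", "largest"), ("largest", "state"), ("largest", "high_point")}
print(lics(t1, t2))  # the two shared edges form a connected subgraph
```

This brute-force subset search is exponential in the size of the shared edge set; it only stands in for the polynomial-time procedure the text describes, which exploits the fact that both inputs are trees.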
The sets of initial candidate meanings for some of the phrases in the sample corpus are shown in Table 1. While this example shows the LICS for every pair of sentences in which a phrase appears, the actual algorithm randomly samples a subset of pairs for efficiency, as in GOLEM [Muggleton & Feng, 1990]. For phrases appearing in only one sentence (e.g., "encuentra"), the entire sentence representation (excluding the database constant given as background knowledge) is used as the initial candidate meaning. Such candidates are typically generalized in step 2.2 of the algorithm to only the correct portion of the representation before being added to the lexicon; we will see an example of this below.
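To make the sampling step concrete, here is a hedged sketch of candidate generation under our own simplifying assumptions: sentence representations are sets of labeled edges, the common substructure of a pair is reduced to the shared edges (standing in for the full LICS-plus-LGG computation), and the corpus, phrase, and function names are all hypothetical:

```python
import random

def candidates(phrase, corpus, n_pairs=5, seed=0):
    """Generate initial candidate meanings for a phrase.

    corpus: list of (sentence, representation) pairs, where a
    representation is a set of labeled edges.
    """
    reps = [rep for sentence, rep in corpus if phrase in sentence]
    if len(reps) == 1:
        # A phrase seen in only one sentence: use the whole representation.
        return [reps[0]]
    rng = random.Random(seed)
    out = []
    for _ in range(n_pairs):
        a, b = rng.sample(reps, 2)      # sample a pair of representations
        common = a & b                  # simplified common substructure
        if common and common not in out:
            out.append(common)
    return out

# Tiny illustrative corpus of Spanish queries with edge-set representations.
corpus = [
    ("cual es la capital del estado mas grande",
     {("answer", "largest"), ("largest", "state"), ("state", "capital")}),
    ("cual es la ciudad mas grande",
     {("answer", "largest"), ("largest", "city")}),
]
print(candidates("mas grande", corpus))  # shared substructure of the pair
print(candidates("ciudad", corpus))      # single occurrence: whole representation
```

Sampling a fixed number of pairs rather than enumerating all of them keeps the cost of this step linear in the sample size, which is the efficiency motivation the text attributes to GOLEM.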