
   
Solving the Lexicon Acquisition Problem

A first attempt to solve the Lexicon Acquisition Problem might be to examine all interpretation functions across the corpus and then choose the one(s) yielding a lexicon of minimal size. The number of possible interpretation functions for a given input pair depends on both the size of the sentence and that of its representation. A sentence with w words contains $\Theta(w^2)$ possible phrases, which by itself poses no particular challenge. However, the number of possible interpretation functions grows extremely quickly with the size of the input. For a sentence with p phrases and an associated tree with n vertices, the number of possible interpretation functions is:

\begin{displaymath}
c!\,(n-1)! \sum_{i=1}^{c} \frac{1}{(i-1)!\,(n-i)!\,(c-i)!} \qquad (1)
\end{displaymath}

where c is min(p,n). The derivation of this formula is as follows. We must choose which phrases to use in the domain of f; we can choose one phrase, or two, or any number up to min(p,n) (if n<p we can assign at most n phrases, since f is one-to-one). The number of ways to choose i of the p phrases is

\begin{displaymath}
\left( \begin{array}{c} p \\ i \end{array} \right) = \frac{p!}{i!\,(p-i)!},
\end{displaymath}

where i is the number of phrases chosen. But we can also permute these phrases, so that the ``order'' in which they are assigned to the vertices differs; there are i! such permutations. We must also choose which vertices to include in the range of the interpretation function. The root must be included each time, so if we are choosing i vertices, the remaining i-1 must be chosen from the n-1 non-root vertices, for which there are n-1 choose i-1 possibilities, or

\begin{displaymath}
\left( \begin{array}{c} n-1 \\ i-1 \end{array} \right) = \frac{(n-1)!}{(i-1)!\,(n-i)!}.
\end{displaymath}

The full number of possible interpretation functions is then:

\begin{displaymath}
\sum_{i=1}^{\min(p,n)} \frac{p!}{i!\,(p-i)!} \times i! \times \frac{(n-1)!}{(i-1)!\,(n-i)!}.
\end{displaymath}
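As an illustrative aside (not part of the original derivation), the following Python sketch brute-forces this count for toy values of p and n and checks it against the sum above; the helper names are ours, and a one-to-one map from a nonempty subset of phrases onto a set of vertices that always includes the root stands in for an interpretation function.

\begin{verbatim}
# Illustrative sketch (not from the paper): brute-force the number of
# interpretation functions for small p (phrases) and n (tree vertices),
# and check it against the sum derived above.
from itertools import combinations, permutations
from math import comb, factorial

def count_by_sum(p, n):
    # sum over i of C(p,i) * i! * C(n-1,i-1)
    return sum(comb(p, i) * factorial(i) * comb(n - 1, i - 1)
               for i in range(1, min(p, n) + 1))

def count_by_enumeration(p, n):
    # Enumerate every one-to-one assignment of a nonempty subset of the
    # p phrases to vertices 0..n-1, requiring the root (vertex 0) to be
    # in the range of the assignment.
    total = 0
    for i in range(1, min(p, n) + 1):
        for phrase_subset in combinations(range(p), i):
            for vertex_subset in combinations(range(n), i):
                if 0 not in vertex_subset:
                    continue
                for ordering in permutations(phrase_subset):
                    total += 1   # ordering paired with vertex_subset
    return total

for p, n in [(2, 3), (3, 3), (4, 4), (5, 5)]:
    assert count_by_sum(p, n) == count_by_enumeration(p, n)
    print(p, n, count_by_sum(p, n))
\end{verbatim}

Even at these toy sizes the count grows rapidly; with p = n, the i = p term alone contributes p! functions.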

This simplifies to Equation 1. When n=p, the largest term of this sum is c!=p!, which grows at least exponentially with p, so in general the number of interpretation functions is far too large to enumerate. Therefore, finding a lexicon by examining all interpretations across the corpus and then choosing the lexicon(s) of minimum size is clearly not tractable.

Instead of finding all interpretations, one could find a set of candidate meanings for each phrase, from which the final meaning(s) for that phrase could be chosen in a way that minimizes lexicon size. One way to find candidate meanings is to fracture the meanings of sentences in which a phrase appears. Siskind (1993) defined fracturing (he also calls it the UNLINK* operation) over terms such that the result includes all subterms of an expression plus $\perp$. In our representation formalism, this corresponds to finding all possible connected subgraphs of a meaning, plus the empty graph. Like the interpretation-function technique just discussed, fracturing also leads to an exponential blowup in the number of candidate meanings for a phrase: a lower bound on the number of connected subgraphs of a full binary tree with n vertices is obtained by noting that any subset of the (n+1)/2 leaves may be deleted while the remaining tree stays connected. Counting all of the ways that leaves can be deleted thus gives a lower bound of $2^{(n+1)/2}$ fractures.$^{6}$ (A small enumeration sketch at the end of this section illustrates this growth.) This does not completely rule out fracturing as part of a technique for lexicon learning, since trees do not tend to get very large, and indeed Siskind uses it in many of his systems, with other constraints to help control the search. However, we wish to avoid any chance of exponential blowup, to preserve the generality of our approach for other tasks.

Another option is to force CHILL to essentially induce a lexicon on its own. In this model, we would provide CHILL with an ambiguous lexicon in which each phrase is paired with every fracture of every sentence in which it appears. CHILL would then have to decide which set of fractures leads to the correct parse for each training sentence, and would include only those in the final learned parser-lexicon combination. Thus the search would again become exponential. Furthermore, even with small representations, it would likely lead to a system with poor generalization ability. While some of Siskind's work (e.g., Siskind, 1992) took syntactic constraints into account and did not encounter such difficulties, those versions did not handle lexical ambiguity.

If we could efficiently find some good candidates, a standard induction algorithm could then attempt to use them as a source of training examples for each phrase. However, any attempt to use the list of candidate meanings of one phrase as negative examples for another phrase would be flawed: the learner could not know in advance which phrases might be synonymous, and thus which phrase lists to use as negative examples of other phrase meanings. Also, many representation components would be present in the lists of more than one phrase, a source of conflicting evidence for a learner even without the presence of synonymy. Since only positive examples are available, one might consider most specific conjunctive learning, that is, finding the intersection of all the representations for each phrase, as proposed by Anderson (1977). However, the meanings of an ambiguous phrase are disjunctive, and this intersection would be empty. A similar difficulty would be expected with the positive-only compression of Muggleton (1995).
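To make the fracturing blowup discussed above concrete, here is a small enumeration sketch (ours, not Siskind's; the level-order tree encoding and helper names are illustrative only). It counts the connected subgraphs of small full binary trees by brute force, adds one for the empty graph, and compares the result with the $2^{(n+1)/2}$ lower bound.

\begin{verbatim}
# Illustrative sketch (not from the paper): count the fractures of small
# full binary trees and compare with the 2^((n+1)/2) lower bound.
from itertools import combinations

def full_binary_tree(depth):
    # Adjacency list of a full binary tree; vertex 0 is the root and the
    # children of vertex v are 2v+1 and 2v+2 (level-order numbering).
    n = 2 ** (depth + 1) - 1
    adj = {v: [] for v in range(n)}
    for v in range(n):
        for child in (2 * v + 1, 2 * v + 2):
            if child < n:
                adj[v].append(child)
                adj[child].append(v)
    return adj

def is_connected(vertices, adj):
    # Breadth-first search restricted to the given vertex subset.
    vertices = set(vertices)
    start = next(iter(vertices))
    seen, frontier = {start}, [start]
    while frontier:
        v = frontier.pop()
        for u in adj[v]:
            if u in vertices and u not in seen:
                seen.add(u)
                frontier.append(u)
    return seen == vertices

def count_fractures(adj):
    # Connected (nonempty) induced subgraphs, plus 1 for the empty graph.
    n = len(adj)
    total = 1
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            if is_connected(subset, adj):
                total += 1
    return total

for depth in range(1, 4):
    adj = full_binary_tree(depth)
    n = len(adj)
    print(n, count_fractures(adj), 2 ** ((n + 1) // 2))
\end{verbatim}

Already at n = 15 the brute-force count (751 fractures, including the empty graph) exceeds the lower bound of $2^8 = 256$, and both quantities grow exponentially with the depth of the tree.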