next up previous
Next: Basic model construction Up: Algorithms for Inductive Learning Previous: Algorithms for Inductive Learning


We begin by specifying a large collection $\mathcal{F}$ of candidate features. We do not require a priori that these features are actually relevant or useful. Instead, we let the pool be as large as practically possible. Only a small subset of this collection of features will eventually be employed in our final model. In short, we would like to include in the model only a subset $S$ of the full set of candidate features $\mathcal{F}$. We will call $S$ the set of active features. The choice of $S$ must capture as much information about the random process as possible, yet only include features whose expected values can be reliably estimated.

With each $f \in \mathcal{F}$ one might associate a Bernoulli random variable with the following (``objectivist'' or ``non-Bayesian'') interpretation. Imagine that the set of $(x,y)$ pairs comprising the sample $\tilde{p}$ were drawn from an imaginary, infinite population of $(x,y)$ pairs. In this imaginary population, $f$ is active ($f(x,y) = 1$) with some probability $\pi_f$, and inactive ($f(x,y) = 0$) with probability $1 - \pi_f$.

(A Bayesian would prefer to put a distribution over the possible values of $\pi_f$, informed both by empirical evidence--such as the fraction of times $f = 1$ in the sample $\tilde{p}$--and by outside knowledge of what one might expect $\pi_f$ to be. This line of reasoning forms the basis of ``fuzzy maximum entropy,'' an intriguing line of investigation we shall not pursue here.)

As a Bernoulli variable, $f$ is subject to the law of large numbers, which asserts that for any $\epsilon > 0$:

$$\lim_{N \rightarrow \infty} \; p\left(\,\left|\frac{1}{N}\sum_{i=1}^{N} f(x_i, y_i) \;-\; \pi_f\right| > \epsilon\right) = 0$$

That is, by increasing the number $N$ of examples $(x,y)$, the fraction of examples for which $f(x,y) = 1$ approaches the constant $\pi_f$. More succinctly, $\tilde{p}(f) \rightarrow \pi_f$ as $N \rightarrow \infty$. Unfortunately, the sample $\tilde{p}$ is not infinite, and the empirical estimate $\tilde{p}(f)$ for each $f \in \mathcal{F}$ is not guaranteed to lie close to its ``true'' value $\pi_f$. However, a subset of $\mathcal{F}$ will exhibit $\tilde{p}(f) \approx \pi_f$. And of these, a fraction will have a non-negligible bearing on the likelihood of the model. The goal of the inductive learning process is to identify this subset and construct a conditional model from it.
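The convergence described above is easy to see in simulation. The sketch below treats a feature as a Bernoulli variable with an assumed ``true'' activation probability (the value 0.3 is illustrative, not from the text) and shows the empirical fraction of active examples tightening around it as the sample grows:

```python
import random

# Illustrative only: a feature f is active with an assumed "true"
# probability pi_f in the imaginary infinite population.  As the sample
# size N grows, the empirical fraction of examples with f(x, y) = 1
# approaches pi_f (law of large numbers); for small N it can stray far.
random.seed(0)
PI_F = 0.3  # hypothetical "true" activation probability

def empirical_estimate(n: int) -> float:
    """Fraction of n simulated examples on which the feature fires."""
    return sum(random.random() < PI_F for _ in range(n)) / n

for n in (10, 100, 10_000):
    print(n, empirical_estimate(n))
```

With only 10 examples the estimate is unreliable, which is precisely why features whose expectations cannot be reliably estimated should be left out of $S$.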

Searching exhaustively for this subset is hopeless for all but the most restricted problems. Even fixing its size $|S|$ a priori doesn't help much: there are $\binom{|\mathcal{F}|}{|S|}$ unique outcomes of this selection process. For concreteness, in documented experiments involving maxent modeling, $|S| = 25$ active features drawn from a candidate pool orders of magnitude larger is typical, yielding an astronomical number of possible sets of 25 active features. Viewed as a pure search problem, finding the optimal set of 25 features is computationally inconceivable.
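To get a feel for the scale, the binomial coefficient can be computed directly; the pool size of 1000 below is an assumption chosen for illustration, while $|S| = 25$ comes from the text:

```python
from math import comb

# Size of the exhaustive search space: choosing a fixed-size active set S
# from a candidate pool F gives C(|F|, |S|) possibilities.  The pool size
# here is illustrative, not a figure from the text.
pool_size = 1_000    # assumed |F|
active_size = 25     # |S| = 25

n_subsets = comb(pool_size, active_size)
print(f"{n_subsets:.3e}")  # on the order of 1e49 candidate subsets
```

Even at one candidate subset per nanosecond, enumerating a space of this size would take vastly longer than the age of the universe.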

To find $S$, we adopt an incremental approach to feature selection. The idea is to build up $S$ by successively adding features. The choice of feature to add at each step is determined by the training data. Let us denote the set of models determined by the feature set $S$ as $C(S)$. ``Adding'' a feature $f$ is shorthand for requiring that the set of allowable models all satisfy the equality $p(f) = \tilde{p}(f)$; that is, the expected value of $f$ under the model must match its empirical expected value. Only some members of $C(S)$ will satisfy this equality; the ones that do we denote by $C(S \cup f)$.
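The incremental scheme amounts to a greedy loop. The sketch below assumes a scoring function `gain(S, f)` that estimates how much adding candidate `f` to the active set would improve the model (in practice this would be the increase in training-data log-likelihood); both `gain` and the stopping rule are stand-ins, not the text's exact method:

```python
# Sketch of incremental (greedy) feature selection.  `gain(active, f)` is a
# hypothetical scoring function supplied by the caller; the text's actual
# criterion is the improvement in model likelihood on the training data.

def select_features(candidates, gain, max_active=25):
    active = []                      # S, grown one feature at a time
    remaining = set(candidates)      # unused candidates from F
    while remaining and len(active) < max_active:
        best = max(remaining, key=lambda f: gain(active, f))
        if gain(active, best) <= 0:  # no candidate helps any more: stop
            break
        active.append(best)          # impose the constraint p(best) = p~(best)
        remaining.remove(best)
    return active

# Toy usage with a made-up per-feature score, ignoring interactions:
value = {"f1": 0.5, "f2": 0.1, "f3": -0.2}
print(select_features(value, lambda S, f: value[f], max_active=2))
```

Each pass costs one scoring sweep over the surviving candidates, so the total work is linear in $|\mathcal{F}|$ per added feature rather than exponential in $|S|$.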

Thus, each time a candidate feature is adjoined to $S$, another linear constraint is imposed on the space $C(S)$ of models allowed by the features in $S$. As a result, $C(S)$ shrinks; the model $p_\star$ in $C(S)$ with the greatest entropy reflects ever-increasing knowledge and thus, hopefully, becomes a more accurate representation of the process. This narrowing of the space of permissible models was represented in figure 1 by a series of intersecting lines (hyperplanes, in general) in a probability simplex. Perhaps more intuitively, we could represent it by a series of nested subsets of the space $\mathcal{P}$ of all models, as in figure 2.
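A tiny worked example makes the shrinkage concrete. Over three outcomes (the distributions and constraint values below are invented for illustration), the maximum-entropy model in the feasible set loses entropy each time a constraint is added:

```python
from math import log

def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(q * log(q) for q in p if q > 0)

# Illustrative three-outcome example, not from the text.  Each added
# constraint shrinks the feasible set; the maxent solution's entropy drops.
no_constraints  = (1/3, 1/3, 1/3)   # C(empty set): uniform is maxent
one_constraint  = (0.5, 0.25, 0.25) # pin p(a) = 0.5: remaining mass uniform
two_constraints = (0.5, 0.1, 0.4)   # also pin p(b) = 0.1: fully determined

for p in (no_constraints, one_constraint, two_constraints):
    print(p, round(entropy(p), 4))
```

The entropies decrease monotonically, mirroring the nested sets of figure 2: more constraints mean a smaller feasible set and a lower-entropy (more informed) maxent model.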

As an aside, the intractability of the ``all at once'' optimization problem is not unique to or an indictment of the exponential approach. Decision trees, for example, are typically constructed recursively using a greedy algorithm. And in designing a neural network representation of an arbitrary distribution, one typically either fixes the network topology in advance or performs a restricted search within a neighborhood of possible topologies for the optimal configuration, because the complete search over parameters and topologies is intractable.

Figure 2: A nested sequence of subsets $C(S_1) \supseteq C(S_2) \supseteq C(S_3) \supseteq \cdots$ of $\mathcal{P}$ corresponding to increasingly large sets of features $S_1 \subseteq S_2 \subseteq S_3 \subseteq \cdots$

Adam Berger
Fri Jul 5 11:43:50 EDT 1996