Next: Algorithms for Inductive Learning Up: A Brief Maxent Tutorial Previous: Outline

Computing the Parameters


For all but the simplest problems, the $\lambda^\star$ that maximize $\Psi(\lambda)$ cannot be found analytically. Instead, we must resort to numerical methods. From the perspective of numerical optimization, the function $\Psi(\lambda)$ is well behaved, since it is smooth and concave in each $\lambda_i$. Consequently, a variety of numerical methods can be used to calculate $\lambda^\star$. One simple method is coordinate-wise ascent, in which $\lambda^\star$ is computed by iteratively maximizing $\Psi(\lambda)$ one coordinate at a time. When applied to the maximum entropy problem, this technique yields the popular Brown algorithm. Other general-purpose methods that can be used to maximize $\Psi(\lambda)$ include gradient ascent and conjugate gradient.
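As a toy illustration (a sketch of mine, not the Brown algorithm itself), coordinate-wise ascent can be carried out by taking a small uphill step in one coordinate at a time; the objective below is an arbitrary smooth concave function chosen so the maximizer is known.

```python
# Illustrative sketch (not the Brown algorithm): maximize a smooth,
# concave function by cycling through the coordinates and nudging
# each one uphill in turn.

def coordinate_ascent(grad, lam, step=0.1, sweeps=200):
    """Repeatedly improve one coordinate at a time using its partial derivative."""
    lam = list(lam)
    for _ in range(sweeps):
        for i in range(len(lam)):
            lam[i] += step * grad(lam)[i]
    return lam

# Toy concave objective: -(l0 - 1)^2 - (l1 + 2)^2, maximized at (1, -2).
grad = lambda lam: [-2.0 * (lam[0] - 1.0), -2.0 * (lam[1] + 2.0)]
lam_star = coordinate_ascent(grad, [0.0, 0.0])  # approaches [1.0, -2.0]
```

Exact one-dimensional maximization in each coordinate would serve equally well; the fixed gradient step just keeps the sketch short.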

An optimization method specifically tailored to the maximum entropy problem is the iterative scaling algorithm of Darroch and Ratcliff. We present here a version of this algorithm designed for the problem at hand; the interested reader is referred to the further readings. The algorithm is applicable whenever the feature functions $f_i(x,y)$ are non-negative:

$f_i(x,y) \geq 0 \quad$ for all $i$, $x$ and $y$.
This is, of course, true for the binary-valued feature functions we are considering here. The algorithm generalizes the Darroch-Ratcliff procedure, which requires, in addition to non-negativity, that the feature functions satisfy $\sum_i f_i(x,y) = 1$ for all $x, y$.

Algorithm 1 (Improved Iterative Scaling)

Input: feature functions $f_1, f_2, \ldots, f_n$; empirical distribution $\tilde{p}(x,y)$
Output: optimal parameter values $\lambda_i^\star$; optimal model $p_{\lambda^\star}$

1. Start with $\lambda_i = 0$ for all $i \in \{1, 2, \ldots, n\}$.
2. Do for each $i \in \{1, 2, \ldots, n\}$:
   a. Let $\Delta\lambda_i$ be the solution to

      $\sum_{x,y} \tilde{p}(x)\, p_\lambda(y|x)\, f_i(x,y)\, \exp\left(\Delta\lambda_i f^{\#}(x,y)\right) = \tilde{p}(f_i)$   (19)

      where $f^{\#}(x,y) \equiv \sum_{i=1}^{n} f_i(x,y)$.
   b. Update the value of $\lambda_i$ according to $\lambda_i \leftarrow \lambda_i + \Delta\lambda_i$.
3. Go to step 2 if not all the $\lambda_i$ have converged.
The key step in the algorithm is step (2a), the computation of the increments $\Delta\lambda_i$ that solve (19). Notice that this equation contains the term $p_\lambda(y|x)$, which changes as $\lambda$ is updated in step (2b). For this reason the algorithm is iterative, requiring a pass through the entire set of $(x,y)$ pairs in the empirical sample $\tilde{p}$ for each iteration.
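To make the loop concrete, here is a sketch in Python on a toy instance of my own devising (two binary features over $X = Y = \{0, 1\}$ and a hand-picked empirical distribution); the increments of step (2a) are found with Newton's method, discussed below.

```python
import math

# Toy sketch of iterative scaling (my own instance, not from the text):
# X = Y = {0, 1}, two binary features, hand-picked empirical counts.
X, Y = (0, 1), (0, 1)
feats = [lambda x, y: 1.0 if y == x else 0.0,   # f_0: "label matches input"
         lambda x, y: 1.0 if y == 1 else 0.0]   # f_1: "label is 1"
p_emp = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.10, (1, 1): 0.40}
p_emp_x = {x: sum(p_emp[x, y] for y in Y) for x in X}

def f_sharp(x, y):
    """f#(x, y): the number of features active on the pair (x, y)."""
    return sum(f(x, y) for f in feats)

def p_model(lam):
    """Conditional model p_lam(y | x), proportional to exp(sum_i lam_i f_i(x, y))."""
    p = {}
    for x in X:
        w = {y: math.exp(sum(l * f(x, y) for l, f in zip(lam, feats))) for y in Y}
        z = sum(w.values())
        for y in Y:
            p[x, y] = w[y] / z
    return p

def iterative_scaling(sweeps=200):
    lam = [0.0 for _ in feats]                  # step (1)
    for _ in range(sweeps):
        for i, f in enumerate(feats):           # step (2)
            p = p_model(lam)
            target = sum(p_emp[x, y] * f(x, y) for x in X for y in Y)  # p~(f_i)
            delta = 0.0                         # step (2a): solve (19) by Newton
            for _ in range(30):
                terms = [(p_emp_x[x] * p[x, y] * f(x, y), f_sharp(x, y))
                         for x in X for y in Y if f(x, y) > 0.0]
                g = sum(c * math.exp(delta * s) for c, s in terms) - target
                gp = sum(c * s * math.exp(delta * s) for c, s in terms)
                delta -= g / gp
            lam[i] += delta                     # step (2b)
    return lam

lam = iterative_scaling()
```

On this instance both parameters converge to $\log 2$, at which point the model expectation of each feature matches its empirical expectation, as the constraints require.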

Equation (19) merits a brief comment. In words, $f^{\#}(x,y)$ is the number of features $f_i$ which are ``active'' ($f_i(x,y) = 1$) for the pair $x, y$. In some cases $f^{\#}$ may be a constant ($f^{\#}(x,y) = M$ for all $x, y$, say), in which case $\Delta\lambda_i$ is given explicitly as

$\Delta\lambda_i = \frac{1}{M} \log \frac{\tilde{p}(f_i)}{p_\lambda(f_i)}$

where $p_\lambda(f_i)$ denotes the expected value of $f_i$ under the model $p_\lambda$.
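To see where this closed form comes from (a short derivation of mine along the lines the text suggests): when $f^{\#}(x,y) = M$ for all pairs, the factor $\exp(\Delta\lambda_i M)$ is constant across the sum in (19), so

$\sum_{x,y} \tilde{p}(x)\, p_\lambda(y|x)\, f_i(x,y)\, e^{\Delta\lambda_i M} \;=\; e^{\Delta\lambda_i M} \, p_\lambda(f_i) \;=\; \tilde{p}(f_i),$

and taking logarithms and dividing by $M$ gives $\Delta\lambda_i = \frac{1}{M}\log\frac{\tilde{p}(f_i)}{p_\lambda(f_i)}$.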
However, in practice it is typically the case that a different number of features will apply at each pair $x, y$. If $f^{\#}(x,y)$ is not constant, then $\Delta\lambda_i$ must be computed numerically. A simple and effective way of doing this is by Newton's method. This method computes the solution $\alpha^\star$ of an equation $g(\alpha^\star) = 0$ iteratively by the recurrence

$\alpha_{n+1} = \alpha_n - \frac{g(\alpha_n)}{g'(\alpha_n)}$

with an appropriate choice for $\alpha_0$ and suitable attention paid to the domain of $g$.
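The recurrence is easy to state in code; a minimal sketch, with an illustrative function name and a toy equation of my own choosing:

```python
import math

# Minimal sketch of Newton's method for a one-dimensional root,
# following the recurrence alpha_{n+1} = alpha_n - g(alpha_n) / g'(alpha_n).
def newton(g, g_prime, alpha0, iters=50):
    alpha = alpha0
    for _ in range(iters):
        alpha -= g(alpha) / g_prime(alpha)
    return alpha

# Toy equation with a known root: g(alpha) = exp(alpha) - 2 vanishes at
# alpha = log 2. Like the left-hand side of (19) viewed as a function of
# Delta-lambda_i, this g is increasing and convex, so alpha_0 = 0 is safe.
root = newton(lambda a: math.exp(a) - 2.0, math.exp, 0.0)
```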


Adam Berger
Fri Jul 5 11:43:50 EDT 1996