The basic incremental growth procedure may be outlined as follows. Every stage of the process is characterized by a set of active features $\mathcal{F}$. These determine a space of models

$$C(\mathcal{F}) \equiv \left\{ p \in \mathcal{P} \mid p(f) = \tilde{p}(f) \text{ for all } f \in \mathcal{F} \right\}$$
The optimal model in this space, denoted by $p_{\mathcal{F}}$, is the model with the greatest entropy:

$$p_{\mathcal{F}} \equiv \mathop{\arg\max}_{p \,\in\, C(\mathcal{F})} H(p)$$
By adding feature $\hat{f}$ to $\mathcal{F}$, we obtain a new set of active features $\mathcal{F} \cup \hat{f}$. Following (22), this set of features determines a set of models

$$C(\mathcal{F} \cup \hat{f}) \equiv \left\{ p \in \mathcal{P} \mid p(f) = \tilde{p}(f) \text{ for all } f \in \mathcal{F} \cup \hat{f} \right\}$$
The optimal model in this space of models is

$$p_{\mathcal{F} \cup \hat{f}} \equiv \mathop{\arg\max}_{p \,\in\, C(\mathcal{F} \cup \hat{f})} H(p)$$
Adding the feature $\hat{f}$ allows the model to better account for the
training sample; this results in a *gain* $\Delta L(\mathcal{F}, \hat{f})$ in the log-likelihood of
the training data

$$\Delta L(\mathcal{F}, \hat{f}) \equiv L(p_{\mathcal{F} \cup \hat{f}}) - L(p_{\mathcal{F}})$$
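Writing $\mathcal{F}$ for the active feature set and $\hat{f}$ for the adjoined feature, a standard observation in this setting (its proof lies outside this excerpt) clarifies why the gain is the right quantity to track: imposing an extra constraint shrinks the feasible set, so the optimal model's entropy can only fall, while by the duality between maximum entropy and maximum likelihood its training log-likelihood can only rise:

```latex
% Imposing the extra constraint for \hat{f} shrinks the feasible set,
C(\mathcal{F} \cup \hat{f}) \subseteq C(\mathcal{F}),
% so the optimal model's entropy cannot increase,
H(p_{\mathcal{F} \cup \hat{f}}) \le H(p_{\mathcal{F}}),
% while, by maxent/maximum-likelihood duality, its training
% log-likelihood cannot decrease:
L(p_{\mathcal{F} \cup \hat{f}}) \ge L(p_{\mathcal{F}}).
```

In particular the gain is always nonnegative, so feature selection reduces to comparing the magnitudes of the gains.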
At each stage of the model-construction process, our goal is to select the candidate feature $\hat{f}$ which maximizes the gain $\Delta L(\mathcal{F}, \hat{f})$; that is, we select the candidate feature which, when adjoined to the set of active features $\mathcal{F}$, produces the greatest increase in likelihood of the training sample. This strategy is implemented in Algorithm 2.

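As a concrete illustration, the selection step can be sketched in Python. This is a minimal sketch under assumptions not made in the excerpt: an unconditional exponential-form model over a small finite event space, fit by plain gradient ascent rather than the iterative scaling methods typically used in practice; `fit_maxent`, `log_likelihood`, and `best_candidate` are hypothetical names.

```python
import math

def fit_maxent(features, sample, space, steps=2000, lr=0.5):
    """Fit weights by gradient ascent on training log-likelihood.
    For exponential-form models this coincides with the maximum-entropy
    solution subject to the feature-expectation constraints."""
    lam = [0.0] * len(features)
    n = len(sample)
    emp = [sum(f(x) for x in sample) / n for f in features]  # empirical E[f]
    for _ in range(steps):
        # Unnormalized weights and normalizer over the event space.
        w = [math.exp(sum(l * f(x) for l, f in zip(lam, features))) for x in space]
        z = sum(w)
        # Model expectation of each feature under the current parameters.
        model = [sum((wi / z) * f(x) for wi, x in zip(w, space)) for f in features]
        # Ascend the likelihood gradient: empirical minus model expectation.
        lam = [l + lr * (e - m) for l, e, m in zip(lam, emp, model)]
    return lam

def log_likelihood(lam, features, sample, space):
    """Log-likelihood of the sample under the fitted exponential model."""
    w = {x: math.exp(sum(l * f(x) for l, f in zip(lam, features))) for x in space}
    z = sum(w.values())
    return sum(math.log(w[x] / z) for x in sample)

def best_candidate(active, candidates, sample, space):
    """One stage of incremental growth: return (feature, gain) for the
    candidate whose adjunction most increases training log-likelihood."""
    base = log_likelihood(fit_maxent(active, sample, space), active, sample, space)
    best, best_gain = None, float("-inf")
    for f in candidates:
        trial = active + [f]
        ll = log_likelihood(fit_maxent(trial, sample, space), trial, sample, space)
        if ll - base > best_gain:
            best, best_gain = f, ll - base
    return best, best_gain
```

For example, with the sample `['a', 'a', 'a', 'b']` over the space `{a, b, c}` and the indicator features of `'a'` and `'b'` as candidates, the indicator of `'a'` is selected, with gain $3\ln\tfrac{3}{4} + \ln\tfrac{1}{8} - 4\ln\tfrac{1}{3} \approx 1.45$.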
One issue left unaddressed by this algorithm is the termination condition. Ideally, the condition would apply exactly when all the ``useful'' features have been selected. One reasonable stopping criterion is to subject each proposed feature to cross-validation on a held-out sample of data: if the feature does not increase the likelihood of this held-out sample, it is discarded.
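The growth loop with this held-out stopping rule might be sketched as follows. The callables `train_ll` and `heldout_ll` are hypothetical names standing in for the likelihood computations; in practice each call hides a full model fit (Algorithm 1).

```python
def grow_model(candidates, train_ll, heldout_ll):
    """Incremental growth with held-out stopping: at each stage adjoin
    the candidate with the greatest training-likelihood gain, but
    terminate once the winner fails to improve held-out likelihood.
    `train_ll` / `heldout_ll` map a feature list to a log-likelihood."""
    active, pool = [], list(candidates)
    while pool:
        # Candidate with the greatest training-data gain at this stage.
        best = max(pool, key=lambda f: train_ll(active + [f]))
        if heldout_ll(active + [best]) <= heldout_ll(active):
            break  # no held-out gain: discard the feature and stop
        active.append(best)
        pool.remove(best)
    return active
```

With stub likelihoods this returns only the features whose adjunction helps on held-out data, stopping at the first failure.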

Another outstanding issue is the efficiency of Algorithm 2. For any reasonably large set of candidate features, it is impractical to run Algorithm 1 once for each candidate $\hat{f}$ at every iteration. The interested reader is referred to the list of further readings at the end of this document for pointers to articles containing efficient algorithms for model construction.

Fri Jul 5 11:43:50 EDT 1996