In the context of active learning, we assume that the input distribution is known. With a mixture of Gaussians, one interpretation of this assumption is that we know $\mu_{x,i}$ and $\sigma^2_{x,i}$ for each Gaussian. In that case, our application of EM will estimate only $\mu_{y,i}$, $\sigma^2_{y,i}$, and $\sigma_{xy,i}$.

Generally, however, knowing the input distribution will not correspond
to knowing the actual $\mu_{x,i}$ and $\sigma^2_{x,i}$ for each
Gaussian. We may simply know, for example, that $P(x)$ is uniform, or
that it can be approximated by some set of sampled inputs. In such cases, we
must use EM to estimate $\mu_{x,i}$ and $\sigma^2_{x,i}$ in addition
to the parameters involving $y$. If we simply estimate these values
from the training data, though, we will be estimating the joint
distribution of the training sample, $\tilde P(x,y)$, instead of the true
$P(x,y)$. To obtain a proper estimate, we must correct Equation 5 as
follows:

Here, $h_i$ is computed by applying
Equation 7 given the mean and $x$ variance of the
training data, and $h_i^{\mathrm{ref}}$ is computed by applying the same equation
using the mean and $x$ variance of a set of reference data drawn
according to $P(x)$.
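Equation 7 is not reproduced in this excerpt; assuming it is the usual mixture-responsibility formula, the weights it produces can be sketched as follows (function and variable names are illustrative, not from the original):

```python
import numpy as np

def responsibilities(x, alphas, mus_x, vars_x):
    """h_i(x): posterior probability that Gaussian i generated input x,
    computed from mixing weights alphas and per-Gaussian input
    means (mus_x) and variances (vars_x)."""
    dens = alphas * np.exp(-0.5 * (x - mus_x) ** 2 / vars_x) \
           / np.sqrt(2.0 * np.pi * vars_x)
    return dens / dens.sum()

# The same formula, applied once with statistics fit to the training data
# and once with statistics fit to reference data drawn according to P(x),
# yields the two sets of weights used in the corrected variance estimate.
h_train = responsibilities(0.0, np.array([0.5, 0.5]),
                           np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
```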

If our goal in active learning is to minimize variance, we should select training examples to minimize the expected model variance. With a mixture of Gaussians, we can compute this expectation efficiently. The model's estimated distribution of $y$ given $\tilde x$ is explicit:

\[ P(y \,|\, \tilde x) = \sum_i \tilde h_i \, N\!\left( \hat y_i(\tilde x),\; \sigma^2_{y|x,i} \right) \]

where $\tilde h_i \equiv h_i(\tilde x)$, $\hat y_i(\tilde x) = \mu_{y,i} + \frac{\sigma_{xy,i}}{\sigma^2_{x,i}}(\tilde x - \mu_{x,i})$ and $\sigma^2_{y|x,i} = \sigma^2_{y,i} - \sigma^2_{xy,i}/\sigma^2_{x,i}$ are the conditional mean and variance of Gaussian $i$, and $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$. Given this, we can model the change in each Gaussian separately, calculating its expected variance given a new point sampled from $P(y \,|\, \tilde x, i)$ and weighting this change by $\tilde h_i$. The new expectations combine to form the learner's new expected variance

where the expectation can be computed exactly in closed form:

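The explicit form of the model's conditional distribution follows from standard Gaussian conditioning applied per component. A minimal sketch for one-dimensional inputs, with illustrative names:

```python
import numpy as np

def conditional_mixture(x, alphas, mus_x, vars_x, mus_y, vars_y, covs_xy):
    """Parameters of P(y | x) under a joint Gaussian mixture model.

    Returns per-Gaussian responsibilities h_i, conditional means
    y_hat_i = mu_{y,i} + (sigma_{xy,i} / sigma^2_{x,i}) (x - mu_{x,i}),
    and conditional variances
    sigma^2_{y|x,i} = sigma^2_{y,i} - sigma^2_{xy,i} / sigma^2_{x,i}.
    P(y | x) is then the mixture sum_i h_i N(y_hat_i, sigma^2_{y|x,i})."""
    dens = alphas * np.exp(-0.5 * (x - mus_x) ** 2 / vars_x) \
           / np.sqrt(2.0 * np.pi * vars_x)
    h = dens / dens.sum()
    y_hat = mus_y + covs_xy / vars_x * (x - mus_x)
    var_y_given_x = vars_y - covs_xy ** 2 / vars_x
    return h, y_hat, var_y_given_x

# Single-Gaussian sanity check: with mu_x = mu_y = 0, var_x = 1,
# var_y = 2, cov_xy = 1, conditioning on x = 1 gives mean 1, variance 1.
h, y_hat, v = conditional_mixture(1.0, np.array([1.0]), np.array([0.0]),
                                  np.array([1.0]), np.array([0.0]),
                                  np.array([2.0]), np.array([1.0]))
```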
If, as discussed earlier, we are also estimating $\mu_{x,i}$ and $\sigma^2_{x,i}$, we must take into account the effect of the new example on those estimates, and must replace $\mu_{x,i}$ and $\sigma^2_{x,i}$ in the above equations with their expected updated values:

We can use Equation 9 to guide active learning. By evaluating the expected new variance over a reference set for each candidate $\tilde x$, we can select the $\tilde x$ giving the lowest expected model variance. Note that in high-dimensional spaces, it may be necessary to evaluate an impractically large number of candidate points to get good coverage of the potential query space. In such cases, it is more efficient to differentiate Equation 9 and hillclimb on its gradient with respect to $\tilde x$ to find a locally optimal $\tilde x$. See, for example, [Cohn 1994].
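The candidate-scanning procedure described above can be sketched as follows. Since Equations 5 and 9 are not reproduced in this excerpt, the sketch uses a simplified surrogate for the expected new variance: querying at a candidate point adds its responsibility $h_i(\tilde x)$ effective examples to Gaussian $i$, while the shift in the component means and variances themselves is ignored. All names are illustrative:

```python
import numpy as np

def responsibilities(x, alphas, mus_x, vars_x):
    """h_i(x): posterior probability that Gaussian i generated input x."""
    dens = alphas * np.exp(-0.5 * (x - mus_x) ** 2 / vars_x) \
           / np.sqrt(2.0 * np.pi * vars_x)
    return dens / dens.sum()

def expected_variance_after_query(x_ref, x_cand, alphas, mus_x, vars_x,
                                  var_y_given_x, n):
    """Surrogate expected model variance at x_ref after querying at x_cand.

    Each Gaussian contributes the standard weighted-regression variance
    (sigma^2_{y|x,i} / n_i)(1 + (x - mu_{x,i})^2 / sigma^2_{x,i}),
    with its support n_i grown by the candidate's responsibility."""
    h_tilde = responsibilities(x_cand, alphas, mus_x, vars_x)
    h = responsibilities(x_ref, alphas, mus_x, vars_x)
    per_comp = (var_y_given_x / (n + h_tilde)) \
               * (1.0 + (x_ref - mus_x) ** 2 / vars_x)
    return np.sum(h ** 2 * per_comp)

def select_query(candidates, reference, **params):
    """Pick the candidate minimizing the average expected variance over a
    reference set drawn according to P(x)."""
    scores = [np.mean([expected_variance_after_query(xr, xc, **params)
                       for xr in reference]) for xc in candidates]
    return candidates[int(np.argmin(scores))]
```

As noted above, scanning a candidate pool this way scales poorly with input dimension; hillclimbing on the gradient of the same criterion is the usual alternative.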

Mon Mar 25 09:20:31 EST 1996