Four of the CPC cases have 20 or fewer positive findings (see Table 1), and for these cases it is possible to calculate the exact values of the likelihood and the posterior marginals in a reasonable amount of time. We used Heckerman's ``Quickscore'' algorithm (Heckerman 1989)--an algorithm tailored to the QMR-DT architecture--to perform these exact calculations.
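The key idea that makes these exact calculations feasible can be illustrated for a two-layer noisy-OR network: the product over positive findings is expanded by inclusion-exclusion into 2^|pos| terms, each of which factorizes over the independent diseases, so the cost is exponential only in the number of positive findings. The sketch below (function and variable names are our own, and it computes only the likelihood, not Heckerman's full algorithm) shows this factorization:

```python
import itertools
import numpy as np

def noisy_or_likelihood(prior, leak, theta, pos, neg):
    """Exact likelihood P(f_pos = 1, f_neg = 0) in a two-layer
    noisy-OR network, via Quickscore-style inclusion-exclusion
    over subsets of the positive findings.

    prior : (n_d,) prior probabilities of the diseases
    leak  : (n_f,) leak probability for each finding
    theta : (n_f, n_d) theta[i, j] = probability that disease j,
            when present, fails to cause finding i
    pos, neg : lists of positive / negative finding indices
    """
    total = 0.0
    for k in range(len(pos) + 1):
        for subset in itertools.combinations(pos, k):
            idx = list(subset) + list(neg)
            # Each inclusion-exclusion term factorizes over diseases:
            # sum_d P(d) prod_{i in idx} P(f_i = 0 | d)
            leak_term = np.prod(1.0 - leak[idx])
            per_disease = (1.0 - prior) + prior * np.prod(theta[idx], axis=0)
            total += (-1.0) ** k * leak_term * np.prod(per_disease)
    return total
```

The cost is O(2^|pos|) terms, each linear in the number of diseases, which matches the observation that only cases with few positive findings are exactly tractable.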

**Table 1:** Description of the cases for which we evaluated the exact
posterior marginals.

Figure 3 shows the log-likelihood for the four tractable CPC cases.

**Figure 3:** Exact values and variational upper and lower bounds on the
log-likelihood for the four tractable CPC cases.
In (a) 8 positive findings were treated exactly, and in (b) 12
positive findings were treated exactly.

The figure also shows the variational lower and upper bounds. We
calculated the variational bounds twice, with differing numbers of
positive findings treated exactly in the two cases (``treated exactly''
simply means that the finding is not transformed variationally). In
panel (a) 8 positive findings were treated exactly, and in
panel (b) 12 positive findings were treated exactly. As expected, the
bounds were tighter when more positive findings were treated
exactly.^{4}

The average running time across the four tractable CPC cases was 26.9 seconds for the exact method, 0.11 seconds for the variational method with 8 positive findings treated exactly, and 0.85 seconds for the variational method with 12 positive findings treated exactly. (These results were obtained on a 433 MHz DEC Alpha computer.)

Although the likelihood is an important quantity to approximate (particularly in applications in which parameters need to be estimated), of more interest in the QMR-DT setting are the posterior marginal probabilities for the individual diseases. As we discussed in the previous section, the simplest approach to obtaining variational estimates of these quantities is to define an approximate variational distribution based either on the distribution that upper-bounds the likelihood or on the distribution that lower-bounds the likelihood. For fixed values of the variational parameters (chosen to provide a tight bound to the likelihood), both distributions provide partially factorized approximations to the joint probability distribution. These factorized forms can be exploited as efficient approximate inference engines for general posterior probabilities, and in particular we can use them to provide approximations to the posterior marginals of individual diseases.

In practice we found that one of these two distributions yielded more accurate posterior marginals than the other, and we restrict our presentation to that distribution. Figure 4

**Figure 4:** Scatterplot of the variational posterior estimates and the
exact marginals. In (a) 8 positive findings were treated exactly
and in (b) 12 positive findings were treated exactly.

displays a scatterplot of these approximate posterior marginals, with panel (a) corresponding to the case in which 8 positive findings were treated exactly and panel (b) to the case in which 12 positive findings were treated exactly. The plots were obtained by first extracting the 50 highest posterior marginals from each case using exact methods and then computing the approximate posterior marginals for the corresponding diseases. If the approximate marginals are in fact correct then the points in the figures should align along the diagonals, as shown by the dotted lines. We see a reasonably good correspondence--the variational algorithm appears to provide a good approximation to the largest posterior marginals. (We quantify this correspondence with a ranking measure later in this section.)

A current state-of-the-art algorithm for the QMR-DT
is the enhanced version of likelihood-weighted sampling
proposed by Shwe and Cooper (1991). Likelihood-weighted
sampling is a stochastic sampling method proposed by
Fung and Chang (1990) and Shachter and Peot (1990).
Likelihood-weighted sampling is a simple
forward sampling method that weights samples by their
likelihoods. It can be enhanced by
utilizing ``self-importance sampling'' (see Shachter
& Peot, 1990), a version of importance sampling in
which the importance sampling distribution is continually
updated to reflect the current estimated posterior distribution.
Middleton et al. (1990) utilized likelihood-weighted sampling
with self-importance sampling (as well as a heuristic
initialization scheme known as ``iterative tabular
Bayes'') for the QMR-DT model and found that it did not
work satisfactorily. Subsequent work by Shwe and Cooper
(1991), however, used an additional enhancement to the
algorithm known as ``Markov blanket scoring'' (see Shachter
& Peot, 1990), which distributes fractions of samples
to the positive and negative values of a node in proportion
to the probability of these values conditioned on the
Markov blanket of the node. The combination of Markov blanket
scoring and self-importance sampling yielded an effective
algorithm.^{5} In particular,
with these modifications in place, Shwe and Cooper reported
reasonable accuracy for two of the difficult CPC cases.
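The basic likelihood-weighted sampler (without the self-importance or Markov blanket scoring enhancements) is straightforward: diseases are forward-sampled from their priors, and each sample is weighted by the probability it assigns to the observed findings. The following is a minimal sketch for a two-layer noisy-OR network; the function name and model setup are our own illustration, not code from the studies cited:

```python
import numpy as np

def likelihood_weighted_marginals(prior, leak, theta, pos, neg,
                                  n_samples=20000, seed=0):
    """Estimate posterior disease marginals P(d_j = 1 | findings)
    by likelihood weighting in a two-layer noisy-OR network.

    Each sample draws the diseases independently from their priors;
    the sample weight is the probability of the observed positive and
    negative findings given the sampled diseases. The marginals are
    the weight-averaged disease indicators."""
    rng = np.random.default_rng(seed)
    n_d = len(prior)
    weighted = np.zeros(n_d)
    total = 0.0
    for _ in range(n_samples):
        d = rng.random(n_d) < prior                       # forward sample
        # P(f_i = 0 | d) under the noisy-OR model
        pf0 = (1 - leak) * np.prod(np.where(d, theta, 1.0), axis=1)
        w = np.prod(1 - pf0[pos]) * np.prod(pf0[neg])     # evidence weight
        weighted += w * d
        total += w
    return weighted / total
```

Because diseases are rare and findings are many, most forward samples receive tiny weights in a QMR-scale network; self-importance sampling and Markov blanket scoring address exactly this inefficiency.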

We re-implemented the likelihood-weighted sampling algorithm of
Shwe and Cooper, incorporating the Markov blanket scoring heuristic
and self-importance sampling. (We did not use ``iterative
tabular Bayes'' but instead a related initialization scheme,
``heuristic tabular Bayes,'' also discussed by Shwe and Cooper.)
In this section we discuss the results of running this algorithm
on the four tractable CPC cases, comparing to the results of
variational inference.^{6} In the following section we present a fuller
comparative analysis of the two algorithms for all of the
CPC cases.

Likelihood-weighted sampling, and indeed any sampling algorithm, realizes a time-accuracy tradeoff--taking additional samples requires more time but improves accuracy. In comparing the sampling algorithm to the variational algorithm, we ran the sampling algorithm for several different total time periods, so that the accuracy achieved by the sampling algorithm roughly covered the range achieved by the variational algorithm. The results are shown in Figure 5, with the right-hand curve corresponding to the sampling runs. The figure displays the mean correlations between the approximate and exact posterior marginals across ten independent runs of the algorithm (for the four tractable CPC cases).

**Figure 5:**
The mean correlation between the approximate and exact posterior
marginals as a function of the execution time (in seconds). Solid line:
variational estimates; dashed line: likelihood-weighted sampling.
The lines above and below the sampling result represent standard errors
of the mean based on the ten independent runs of the sampler.

Variational algorithms are also characterized by a time-accuracy tradeoff. In particular, the accuracy of the method generally improves as more findings are treated exactly, at the cost of additional computation. Figure 5 also shows the results from the variational algorithm (the left-hand curve). The three points on the curve correspond to treating up to 8, 12, and 16 positive findings exactly. Note that the variational estimates are deterministic and thus only a single run was made.

The figure shows that to achieve roughly equivalent levels of accuracy, the sampling algorithm requires significantly more computation time than the variational method.

Although scatterplots and correlation measures provide a rough indication
of the accuracy of an approximation algorithm, they are deficient in several
respects. In particular, in diagnostic practice the interest is in
the ability of an algorithm to rank diseases correctly, and to avoid both
false positives (diseases that are not in fact significant but are included
in the set of highly ranked diseases) and false negatives (significant
diseases that are omitted from the set of highly ranked diseases). We
defined a ranking measure as follows (see also Middleton et al., 1990).
Consider a set of the *N* highest ranking disease hypotheses, where the ranking
is based on the correct posterior marginals. Corresponding to this set of
diseases we can find the smallest set of *N*' approximately ranked diseases
that includes the *N* significant ones. In other words, for any *N* ``true
positives'' an approximate method produces *N*'-*N* ``false positives.''
Plotting false positives as a function of true positives provides a meaningful
and useful measure of the accuracy of an approximation scheme. To the extent
that a method provides a nearly correct ranking of true positives the plot
increases slowly and the area under the curve is small. When a significant
disease appears late in the approximate ordering the plot increases rapidly
near the true rank of the missed disease and the area under the curve is large.

We also plot the number of ``false negatives'' in a set of top *N* highly
ranked diseases. False negatives refer to the number of diseases, out
of the *N* highest ranking diseases, that do not appear in the set
of *N* approximately ranked diseases. Note that unlike the previous
measure, this measure does not reveal the severity of the misplacements,
only their frequency.
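Both measures can be computed directly from the two orderings. The helper below is our own sketch of the computation, not code from the study:

```python
def ranking_errors(exact, approx, N):
    """False positives and false negatives of an approximate ranking.

    exact, approx : sequences of posterior marginals (higher = more
                    probable), indexed by disease
    Returns (fp, fn) where fp = N' - N, with N' the length of the
    shortest prefix of the approximate ranking containing all N
    top-ranked diseases under the exact ranking, and fn the number of
    those top-N diseases missing from the approximate top N."""
    true_order = sorted(range(len(exact)), key=lambda j: -exact[j])
    approx_order = sorted(range(len(approx)), key=lambda j: -approx[j])
    top = set(true_order[:N])
    # Deepest position any true top-N disease occupies in the
    # approximate ranking determines N'.
    n_prime = max(approx_order.index(j) for j in top) + 1
    fn = len(top - set(approx_order[:N]))
    return n_prime - N, fn
```

Plotting the first return value against N reproduces the false-positive curves described above; the second gives the false-negative counts.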

With this improved diagnostic measure in hand, let us return to the evaluation of the inference algorithms, beginning with the variational algorithm. Figure 6 provides plots of

**Figure 6:** (a) Average number of false positives as a function of true
positives for the variational method (solid lines) and the
partially-exact method (dashed line). (b) False negatives in the set of top
*N* approximately ranked diseases. In both figures 8 positive findings
were treated exactly.

**Figure 7:** (a) Average number of false positives as a function of true
positives for the variational method (solid line) and the
partially-exact method (dashed line). (b) False negatives in the set of top
*N* approximately ranked diseases. In both figures 12 positive findings were
treated exactly.

the false positives (panel a) and false negatives (panel b) against the true positives for the tractable CPC cases. Eight positive findings were treated exactly in the simulation shown in this figure. Figure 7 displays the results when 12 positive findings were treated exactly.

As we noted earlier, 8 and 12 positive findings comprise a significant fraction
of the total positive findings for the tractable CPC cases, and thus it is
important to verify that the variational transformations are in fact
contributing to the accuracy of the posterior approximations above and beyond
the exact calculations. We did this by comparing the variational method
to a method which we call the ``partially-exact'' method in which the posterior
probabilities were obtained using only those findings that were treated
exactly in the variational calculations (i.e., using only those findings
that were not transformed). If the variational transformations did not
contribute to the accuracy of the approximation, then the performance of the
partially-exact method should be comparable to that of the variational
method.^{8} Figure 6 and
Figure 7 clearly indicate that this is not the case.
The difference in accuracy between these methods is substantial while
their computational load is comparable (about 0.1 seconds on a
433 MHz DEC Alpha).
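Operationally, the partially-exact baseline amounts to computing exact posterior marginals from the subset of exactly treated findings alone, discarding the transformed findings. A brute-force sketch for a small two-layer noisy-OR network (our own illustration; feasible only when the number of diseases is small, since it enumerates all disease configurations):

```python
import itertools
import numpy as np

def partial_posterior_marginals(prior, leak, theta, pos, neg):
    """Exact posterior marginals P(d_j = 1 | findings) by brute-force
    enumeration of disease configurations, conditioning only on the
    supplied findings. Passing just the exactly treated findings gives
    the partially-exact baseline; passing all findings gives the full
    exact posterior."""
    n_d = len(prior)
    num = np.zeros(n_d)
    den = 0.0
    for bits in itertools.product([0, 1], repeat=n_d):
        d = np.array(bits, dtype=bool)
        p_d = np.prod(np.where(d, prior, 1 - prior))
        # P(f_i = 0 | d) under the noisy-OR model
        pf0 = (1 - leak) * np.prod(np.where(d, theta, 1.0), axis=1)
        lik = np.prod(1 - pf0[pos]) * np.prod(pf0[neg])
        num += p_d * lik * d
        den += p_d * lik
    return num / den
```

Comparing the output with all findings supplied against the output with only the exactly treated subset is precisely the comparison underlying Figures 6 and 7.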

We believe that the accuracy portrayed in the false positive plots provides a good indication of the potential of the variational algorithm for providing a practical solution to the approximate inference problem for the QMR-DT. As the figures show, the number of false positives grows slowly with the number of true positives. For example, as shown in Figure 6 where eight positive findings are treated exactly, to find the 20 most likely diseases we would only need to entertain the top 23 diseases in the list of approximately ranked diseases (compared to more than 50 for the partially-exact method).

The ranking plot for the likelihood-weighted sampler is shown in Figure 8, with the curve for the

**Figure 8:** Average number of false positives as a function of true
positives for the likelihood-weighted sampler (dashed line)
and the variational method (solid line) with 12 positive findings
treated exactly.

variational method from Figure 7 included for comparison. To make these plots, we ran the likelihood-weighted sampler for 6.15 seconds, an amount of time comparable to that allocated to our slowest variational method (3.17 seconds, for the case in which 16 positive findings were treated exactly; recall that the variational algorithm with 12 positive findings treated exactly required 0.85 seconds). As the plots show, for these tractable CPC cases, the variational method is significantly more accurate than the sampling algorithm for comparable computational loads.
