Comparison to Exact Marginals

Next: The Full CPC Corpus Up: Experimental Evaluation Previous: Experimental Evaluation

Comparison to Exact Marginals

Four of the CPC cases have 20 or fewer positive findings (see Table 1), and for these cases it is possible to calculate the exact values of the likelihood and the posterior marginals in a reasonable amount of time. We used Heckerman's ``Quickscore'' algorithm (Heckerman 1989)--an algorithm tailored to the QMR-DT architecture--to perform these exact calculations.

table185
Table 1: Description of the cases for which we evaluated the exact posterior marginals.

Figure 3 shows the log-likelihood for the four tractable CPC cases.

Figure 3: Exact values and variational upper and lower bounds on the log-likelihood for the four tractable CPC cases. In (a) 8 positive findings were treated exactly, and in (b) 12 positive findings were treated exactly.

The figure also shows the variational lower and upper bounds. We calculated the variational bounds twice, with differing numbers of positive findings treated exactly in the two cases (``treated exactly'' simply means that the finding is not transformed variationally). In panel (a) there were 8 positive findings treated exactly, and in (b) 12 positive findings were treated exactly. As expected, the bounds were tighter when more positive findings were treated exactly. ⁴

The average running time across the four tractable CPC cases was 26.9 seconds for the exact method, 0.11 seconds for the variational method with 8 positive findings treated exactly, and 0.85 seconds for the variational method with 12 positive findings treated exactly. (These results were obtained on a 433 MHz DEC Alpha computer).

Although the likelihood is an important quantity to approximate (particularly in applications in which parameters need to be estimated), of more interest in the QMR-DT setting are the posterior marginal probabilities for the individual diseases. As we discussed in the previous section, the simplest approach to obtaining variational estimates of these quantities is to define an approximate variational distribution based either on the distribution , which upper-bounds the likelihood, or the distribution , which lower-bounds the likelihood. For fixed values of the variational parameters (chosen to provide a tight bound to the likelihood), both distributions provide partially factorized approximations to the joint probability distribution. These factorized forms can be exploited as efficient approximate inference engines for general posterior probabilities, and in particular we can use them to provide approximations to the posterior marginals of individual diseases.

In practice we found that the distribution yielded more accurate posterior marginals than the distribution , and we restrict our presentation to . Figure 4

Figure 4: Scatterplot of the variational posterior estimates and the exact marginals. In (a) 8 positive findings were treated exactly and in (b) 12 positive findings were treated exactly.

displays a scatterplot of these approximate posterior marginals, with panel (a) corresponding to the case in which 8 positive findings were treated exactly and panel (b) the case in which 12 positive findings treated exactly. The plots were obtained by first extracting the 50 highest posterior marginals from each case using exact methods and then computing the approximate posterior marginals for the corresponding diseases. If the approximate marginals are in fact correct then the points in the figures should align along the diagonals as shown by the dotted lines. We see a reasonably good correspondence--the variational algorithm appears to provide a good approximation to the largest posterior marginals. (We quantify this correspondence with a ranking measure later in this section).

A current state-of-the-art algorithm for the QMR-DT is the enhanced version of likelihood-weighted sampling proposed by Shwe and Cooper (1991). Likelihood-weighted sampling is a stochastic sampling method proposed by Fung and Chang (1990) and Shachter and Peot (1990). Likelihood-weighted sampling is basically a simple forward sampling method that weights samples by their likelihoods. It can be enhanced and improved by utilizing ``self-importance sampling'' (see Shachter & Peot, 1990), a version of importance sampling in which the importance sampling distribution is continually updated to reflect the current estimated posterior distribution. Middleton et al. (1990) utilized likelihood-weighted sampling with self-importance sampling (as well as a heuristic initialization scheme known as ``iterative tabular Bayes'') for the QMR-DT model and found that it did not work satisfactorily. Subsequent work by Shwe and Cooper (1991), however, used an additional enhancement to the algorithm known as `Markov blanket scoring'' (see Shachter & Peot, 1990), which distributes fractions of samples to the positive and negative values of a node in proportion to the probability of these values conditioned on the Markov blanket of the node. The combination of Markov blanket scoring and self-importance sampling yielded an effective algorithm. ⁵ In particular, with these modifications in place, Shwe and Cooper reported reasonable accuracy for two of the difficult CPC cases.

We re-implemented the likelihood-weighted sampling algorithm of Shwe and Cooper, incorporating the Markov blanket scoring heuristic and self-importance sampling. (We did not utilize ``iterative tabular Bayes'' but instead utilized a related initialization scheme-``heuristic tabular Bayes''-also discussed by Shwe and Cooper). In this section we discuss the results of running this algorithm on the four tractable CPC cases, comparing to the results of variational inference. ⁶ In the following section we present a fuller comparative analysis of the two algorithms for all of the CPC cases.

Likelihood-weighting sampling, and indeed any sampling algorithm, realizes a time-accuracy tradeoff--taking additional samples requires more time but improves accuracy. In comparing the sampling algorithm to the variational algorithm, we ran the sampling algorithm for several different total time periods, so that the accuracy achieved by the sampling algorithm roughly covered the range achieved by the variational algorithm. The results are shown in Figure 5, with the right-hand curve corresponding to the sampling runs. The figure displays the mean correlations between the approximate and exact posterior marginals across ten independent runs of the algorithm

Figure 5: The mean correlation between the approximate and exact posterior marginals as a function of the execution time (in seconds). Solid line: variational estimates; dashed line: likelihood-weighting sampling. The lines above and below the sampling result represent standard errors of the mean based on the ten independent runs of the sampler.

(for the four tractable CPC cases).

Variational algorithms are also characterized by a time-accuracy tradeoff. In particular, the accuracy of the method generally improves as more findings are treated exactly, at the cost of additional computation. Figure 5 also shows the results from the variational algorithm (the left-hand curve). The three points on the curve correspond to up to 8, 12 and 16 positive findings treated exactly. Note that the variational estimates are deterministic and thus only a single run was made.

The figure shows that to achieve roughly equivalent levels of accuracy, the sampling algorithm requires significantly more computation time than the variational method.

Although scatterplots and correlation measures provide a rough indication of the accuracy of an approximation algorithm, they are deficient in several respects. In particular, in diagnostic practice the interest is in the ability of an algorithm to rank diseases correctly, and to avoid both false positives (diseases that are not in fact significant but are included in the set of highly ranked diseases) and false negatives (significant diseases that are omitted from the set of highly ranked diseases). We defined a ranking measure as follows (see also Middleton et al., 1990). Consider a set of the N highest ranking disease hypotheses, where the ranking is based on the correct posterior marginals. Corresponding to this set of diseases we can find the smallest set of N' approximately ranked diseases that includes the N significant ones. In other words, for any N ``true positives'' an approximate method produces N'-N ``false positives.'' Plotting false positives as a function of true positives provides a meaningful and useful measure of the accuracy of an approximation scheme. To the extent that a method provides a nearly correct ranking of true positives the plot increases slowly and the area under the curve is small. When a significant disease appears late in the approximate ordering the plot increases rapidly near the true rank of the missed disease and the area under the curve is large.

We also plot the number of ``false negatives'' in a set of top N highly ranked diseases. False negatives refer to the number of diseases, out of the N highest ranking diseases, that do not appear in the set of N approximately ranked diseases. Note that unlike the previous measure, this measure does not reveal the severity of the misplacements, only their frequency.

With this improved diagnostic measure in hand, let us return to the evaluation of the inference algorithms, beginning with the variational algorithm. Figure 6 provides plots of

Figure 6: (a) Average number of false positives as a function of true positives for the variational method (solid lines) and the partially-exact method (dashed line). (b) False negatives in the set of top N approximately ranked diseases. In both figures 8 positive findings were treated exactly.

Figure 7: (a) Average number of false positives as a function of true positives for the variational method (solid line) and the partially-exact method (dashed line). (b) False negatives in the set of top N approximately ranked diseases. In both figures 12 positive findings were treated exactly.

the false positives (panel a) and false negatives (panel b) against the true positives for the tractable CPC cases. Eight positive findings were treated exactly in the simulation shown in this figure. Figure 7 displays the results when 12 positive finding were treated exactly.

As we noted earlier, 8 and 12 positive findings comprise a significant fraction of the total positive findings for the tractable CPC cases, and thus it is important to verify that the variational transformations are in fact contributing to the accuracy of the posterior approximations above and beyond the exact calculations. We did this by comparing the variational method to a method which we call the ``partially-exact'' method in which the posterior probabilities were obtained using only those findings that were treated exactly in the variational calculations (i.e., using only those findings that were not transformed). If the variational transformations did not contribute to the accuracy of the approximation, then the performance of the partially-exact method should be comparable to that of the variational method. ⁸ Figure 6 and Figure 7 clearly indicate that this is not the case. The difference in accuracy between these methods is substantial while their computational load is comparable (about 0.1 seconds on a 433MHz Dec Alpha).

We believe that the accuracy portrayed in the false positive plots provides a good indication of the potential of the variational algorithm for providing a practical solution to the approximate inference problem for the QMR-DT. As the figures show, the number of false positives grows slowly with the number of true positives. For example, as shown in Figure 6 where eight positive findings are treated exactly, to find the 20 most likely diseases we would only need to entertain the top 23 diseases in the list of approximately ranked diseases (compared to more than 50 for the partially-exact method).

The ranking plot for the likelihood-weighted sampler is shown in Figure 8, with the curve for the

Figure 8: Average number of false positives as a function of true positives for the likelihood-weighted sampler (dashed line) and the variational method (solid line) with 12 positive findings treated exactly.

variational method from Figure 7 included for comparison. To make these plots, we ran the likelihood-weighted sampler for an amount of time (6.15 seconds) that was comparable to the time allocated to our slowest variational method (3.17 seconds; this was the case in which 16 positive findings were treated exactly. Recall that the time required for the variational algorithm with 12 positive findings treated exactly was 0.85 seconds.) As the plots show, for these tractable CPC cases, the variational method is significantly more accurate than the sampling algorithm for comparable computational loads.

Next: The Full CPC Corpus Up: Experimental Evaluation Previous: Experimental Evaluation

Michael Jordan
Sun May 9 16:22:01 PDT 1999