We now consider the full CPC corpus. The majority of these cases (44 of 48 cases), have more than 20 positive findings and thus appear to be beyond the reach of exact methods.

An important attraction of sampling methods is the mathematical guarantee of accurate estimates in the limit of a sufficiently large sample size (Gelfand & Smith, 1990). Thus sampling methods have the promise of providing a general methodology for approximate inference, with two caveats: (1) the number of samples that is needed can be difficult to diagnosis, and (2) very many samples may be required to obtain accurate estimates. For real-time applications, the latter issue can rule out sampling solutions. However, long-term runs of a sampler can still provide a useful baseline for the evaluation of the accuracy of faster approximation algorithms. We begin by considering this latter possibility in the context of likelihood-weighted sampling for the QMR-DT. We then turn to a comparative evaluation of likelihood-weighted sampling and variational methods in the time-limited setting.

To explore the viability of the likelihood-weighted sampler for providing a surrogate for the gold standard, we carried out two independent runs each consisting of 400,000 samples. Figure 9(a) shows the estimates of the log-likelihood

**Figure 9:** (a) Upper and lower bounds (solid lines) and the corresponding
sampling estimates (dashed line) of the log-likelihood of observed
findings for the CPC cases. (b) A correlation plot between the posterior
marginal estimates from two independent sampling runs.

from the first sampling run for all of the CPC cases. We also show the variational upper and lower bounds for these cases (the cases have been sorted according to the lower bound). Note that these bounds are rigorous bounds on the true log-likelihood, and thus they provide a direct indication of the accuracy of the sampling estimates. Although we see that many of the estimates lie between the bounds, we also see in many cases that the sampling estimates deviate substantially from the bounds. This suggests that the posterior marginal estimates obtained from these samples are likely to be unreliable as well. Indeed, Figure 9(b) presents a scatterplot of estimated posterior marginals for the two independent runs of the sampler. Although we see many cases in which the results lie on the diagonal, indicating agreement between the two runs, we also see many pairs of posterior estimates that are far from the diagonal.

These results cast some doubt on the viability of the likelihood-weighted sampler as a general approximator for the full set of CPC cases. Even more problematically we appear to be without a reliable surrogate for the gold standard for these cases, making it difficult to evaluate the accuracy of real-time approximations such as the variational method. Note, however, that the estimates in Figure 9(a) seem to fall into two classes--estimates that lie within the variational bounds and estimates that are rather far from the bounds. This suggests the possibility that the distribution being sampled from is multi-modal, with some estimates falling within the correct mode and providing good approximations and with others falling in spurious modes and providing seriously inaccurate approximations. If the situation holds, then an accurate surrogate for the gold standard might be obtained by using the variational bounds to filter the sampling results and retaining only those estimates that lie between the bounds given by the variational approach.

Figure 10 provides some evidence of the viability of

**Figure 10:** A correlation plot between the selected posterior
marginal estimates from two independent sampling runs, where
the selection was based on the variational upper and lower
bounds.

this approach. In 24 out of the 48 CPC cases both of the independent runs of the sampler resulted in estimates of the log-likelihood lying approximately within the variational bounds. We recomputed the posterior marginal estimates for these selected cases and plotted them against each other in the figure. The scatterplot shows a high degree of correspondence of the posterior estimates in these cases. We thus tentatively assume that these estimates are accurate enough to serve as a surrogate gold standard and proceed to evaluate the real-time approximations.

Figure 11 plots the false positives against

**Figure 11:** Average number of false positives as a function of true
positives for the variational method (solid line) and the likelihood-weighted
sampler (dashed line). For the variational method 12 positive
findings were treated exactly, and for the sampler the results
are averages across ten runs.

the true positives on the 24 selected CPC cases for the variational method. Twelve positive findings were treated exactly in this simulation. Obtaining the variational estimates took 0.29 seconds of computer time per case. Although the curve increases more rapidly than with the tractable CPC cases, the variational algorithm still appears to provide a reasonably accurate ranking of the posterior marginals, within a reasonable time frame.

To compare the variational algorithm to a time-limited version of the likelihood-weighted sampler we ran the latter algorithm for a period of time (8.83 seconds per case) roughly comparable to the running time of the variational algorithm (0.29 seconds per case). Figure 11 shows the corresponding plot of false positives against true positives, where we have averaged over ten independent runs. We see that the curve increases significantly more steeply than the variational curve. To find the 20 most likely diseases with the variational method we would only need to entertain the top 30 diseases in the list of approximately ranked diseases. For the sampling method we would need to entertain the top 70 approximately ranked diseases.

Sun May 9 16:22:01 PDT 1999