Probabilistic models have become increasingly prevalent in AI in recent years. Beyond the significant representational advantages of probability theory, including guarantees of consistency and a naturalness in combining diverse sources of knowledge (Pearl, 1988), the discovery of general exact inference algorithms has been principally responsible for the rapid growth in probabilistic AI (see, e.g., Lauritzen & Spiegelhalter, 1988; Pearl, 1988; Shenoy, 1992). These exact inference methods greatly expand the range of models that can be treated within the probabilistic framework and provide a unifying perspective on the general problem of probabilistic computation in graphical models.
Probability theory can be viewed as a combinatorial calculus that instructs us in how to merge the probabilities of sets of events into probabilities of composites. The key operation is that of marginalization, which involves summing (or integrating) over the values of variables. Exact inference algorithms essentially find ways to perform as few sums as possible during marginalization operations. In terms of the graphical representation of probability distributions--in which random variables correspond to nodes and conditional independencies are expressed as missing edges between nodes--exact inference algorithms define a notion of ``locality'' (for example as cliques in an appropriately defined graph), and attempt to restrict summation operators to locally defined sets of nodes.
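The payoff of restricting summations to local sets of nodes can be seen in a toy example. The sketch below (all conditional probability tables are invented for illustration) marginalizes a three-variable chain both by brute force over the full joint and by pushing each sum inward, as an exact inference algorithm would:

```python
import itertools
import numpy as np

# Toy chain A -> B -> C with binary variables; CPT entries are made-up numbers.
pA = np.array([0.6, 0.4])
pB_given_A = np.array([[0.7, 0.3], [0.2, 0.8]])   # rows: A, cols: B
pC_given_B = np.array([[0.9, 0.1], [0.4, 0.6]])   # rows: B, cols: C

# Naive marginalization: sum the full joint over A and B for each value of C.
pC_naive = np.zeros(2)
for a, b, c in itertools.product(range(2), repeat=3):
    pC_naive[c] += pA[a] * pB_given_A[a, b] * pC_given_B[b, c]

# Exploiting the chain structure: push each sum inward, so we perform
# two small local sums instead of one sum over all joint configurations.
pB = pA @ pB_given_A          # sum over A first
pC_local = pB @ pC_given_B    # then sum over B

assert np.allclose(pC_naive, pC_local)
```

In the chain the savings are modest, but for long chains the local scheme is linear in the number of variables while the naive sum is exponential.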
While this approach manages to stave off the exponential explosion of exact probabilistic computation, such an exponential explosion is inevitable for any calculus that explicitly performs summations over sets of nodes. That is, there are models of interest in which ``local'' is overly large (see Jordan, et al., in press). From this point of view, it is perhaps not surprising that exact inference is NP-hard (Cooper, 1990).
In this paper we discuss the inference problem for a particular large-scale graphical model, the Quick Medical Reference (QMR) model. 1 The QMR model consists of a combination of statistical and expert knowledge for approximately 600 significant diseases and approximately 4000 findings. In the probabilistic formulation of the model (the QMR-DT), the diseases and the findings are arranged in a bi-partite graph, and the diagnosis problem is to infer a probability distribution for the diseases given a subset of findings. Given that each finding is generally relevant to a wide variety of diseases, the graph underlying the QMR-DT is dense, reflecting high-order stochastic dependencies. The computational complexity of treating these dependencies exactly can be characterized in terms of the size of the maximal clique of the ``moralized'' graph (see, e.g., Dechter, 1998; Lauritzen & Spiegelhalter, 1988). In particular, the running time is exponential in this measure of size. For the QMR-DT, considering the standardized ``clinicopathologic conference'' (CPC) cases that we discuss below, we find that the median size of the maximal clique of the moralized graph is 151.5 nodes. This rules out the use of general exact algorithms for the QMR-DT.
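The connection between dense bipartite structure and large cliques is mechanical. In the sketch below (a hypothetical miniature disease/finding network with invented names), moralization ``marries'' all parents of each finding and drops edge directions, so a finding with many disease parents immediately induces a clique among those diseases:

```python
import itertools

# Hypothetical miniature bipartite network: diseases are parents of findings.
parents = {                      # finding -> set of disease parents (made up)
    "f1": {"d1", "d2", "d3"},
    "f2": {"d3", "d4"},
}

# Moralization: connect ("marry") all parents of each child, then drop directions.
edges = set()
for child, pa in parents.items():
    for d in pa:
        edges.add(frozenset({d, child}))          # original directed edges
    for u, v in itertools.combinations(sorted(pa), 2):
        edges.add(frozenset({u, v}))              # marriage edges among parents

# A finding with k disease parents yields a clique of k diseases in the moral graph:
clique = parents["f1"]
assert all(frozenset({u, v}) in edges
           for u, v in itertools.combinations(sorted(clique), 2))
```

Since exact algorithms run in time exponential in the maximal clique size, findings that connect to many diseases make exact inference infeasible.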
The general algorithms do not take advantage of the particular parametric form of the probability distributions at the nodes of the graph, and it is conceivable that additional factorizations might be found that take advantage of the particular choice made by the QMR-DT. Such a factorization was in fact found by Heckerman (1989); his ``Quickscore algorithm'' provides an exact inference algorithm that is tailored to the QMR-DT. Unfortunately, however, the run time of the algorithm is still exponential in the number of positive findings. For the CPC cases, we estimate that the algorithm would require an average of 50 years to solve the inference problem on current computers.
Faced with the apparent infeasibility of exact inference for large-scale models such as the QMR-DT, many researchers have investigated approximation methods. One general approach to developing approximate algorithms is to perform exact inference, but to do so partially. One can consider partial sets of node instantiations, partial sets of hypotheses, and partial sets of nodes. This point of view has led to the development of algorithms for approximate inference based on heuristic search. Another approach to developing approximation algorithms is to exploit averaging phenomena in dense graphs. In particular, laws of large numbers tell us that sums of random variables can behave simply, converging to predictable numerical results. Thus, there may be no need to perform sums explicitly, either exactly or partially. This point of view leads to the variational approach to approximate inference. Finally, yet another approach to approximate inference is based on stochastic sampling. One can sample from simplified distributions and in so doing obtain information about a more complex distribution of interest. We discuss each of these methods in turn.
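The averaging phenomenon underlying the variational approach can be illustrated with a simple simulation (the Bernoulli parameter and sample sizes are arbitrary): the average of many independent random variables concentrates tightly around its expectation, so a quantity that depends only on such an average can be treated as nearly deterministic rather than summed over exponentially many configurations.

```python
import random

random.seed(0)

# Average of n independent Bernoulli(0.3) draws; by the law of large numbers
# it concentrates around 0.3 as n grows.
def sample_mean(n):
    return sum(random.random() < 0.3 for _ in range(n)) / n

# Worst observed deviation from the expectation over 200 repetitions.
spread_small = max(abs(sample_mean(10) - 0.3) for _ in range(200))
spread_large = max(abs(sample_mean(10000) - 0.3) for _ in range(200))

assert spread_large < spread_small
```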
Horvitz, Suermondt and Cooper (1991) have developed a partial evaluation algorithm known as ``bounded conditioning'' that works by considering partial sets of node instantiations. The algorithm is based on the notion of a ``cutset''; a subset of nodes whose removal renders the remaining graph singly-connected. Efficient exact algorithms exist for singly-connected graphs (Pearl, 1988). Summing over all instantiations of the cutset, one can calculate posterior probabilities for general graphs using the efficient algorithm as a subroutine. Unfortunately, however, there are exponentially many such cutset instantiations. The bounded conditioning algorithm aims at forestalling this exponential growth by considering partial sets of instantiations. Although this algorithm has promise for graphs that are ``nearly singly-connected,'' it seems unlikely to provide a solution for dense graphs such as the QMR-DT. In particular, the median cutset size for the QMR-DT across the CPC cases is 106.5, yielding an unmanageably large number of cutset instantiations.
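The identity behind cutset conditioning is simply the law of total probability: the posterior is a mixture of conditional posteriors, one per cutset instantiation, each of which is assumed to be easy to compute. The sketch below (a tiny randomly generated joint, standing in for a network where conditioning renders inference tractable) verifies the identity; note that the loop runs once per cutset instantiation, which is the source of the exponential cost:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Tiny joint over binary (X, C1, C2, E); entries random, then normalized.
joint = rng.random((2, 2, 2, 2))
joint /= joint.sum()

e = 1  # observed evidence value

# Direct posterior P(X | E=e).
pX_direct = joint[:, :, :, e].sum(axis=(1, 2))
pX_direct /= pX_direct.sum()

# Cutset conditioning: enumerate instantiations of the "cutset" (C1, C2),
# solve the easy conditional subproblem for each, and mix with weights P(c | e).
p_e = joint[:, :, :, e].sum()
pX_cutset = np.zeros(2)
for c1, c2 in itertools.product(range(2), repeat=2):   # 2^|cutset| iterations
    p_xce = joint[:, c1, c2, e]
    p_ce = p_xce.sum()
    pX_cond = p_xce / p_ce        # P(X | c, e): the "easy" subproblem
    weight = p_ce / p_e           # P(c | e)
    pX_cutset += weight * pX_cond

assert np.allclose(pX_direct, pX_cutset)
```

With a cutset of 106 binary nodes, the loop would have on the order of 2^106 iterations, which is why bounded conditioning must truncate the set of instantiations.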
Another approach to approximate inference is provided by ``search-based'' methods, which consider node instantiations across the entire graph (Cooper, 1985; Henrion, 1991; Peng & Reggia, 1987). The general hope in these methods is that a relatively small fraction of the (exponentially many) node instantiations contains a majority of the probability mass, and that by exploring the high probability instantiations (and bounding the unexplored probability mass) one can obtain reasonable bounds on posterior probabilities. The QMR-DT search space is huge, containing approximately $2^{600}$ disease hypotheses. If, however, one only considers cases with a small number of diseases, and if the hypotheses involving a small number of diseases contain most of the high probability posteriors, then it may be possible to search a significant fraction of the relevant portions of the hypothesis space. Henrion (1991) was in fact able to run a search-based algorithm on the QMR-DT inference problem, for a set of cases characterized by a small number of diseases. These were cases, however, for which the exact Quickscore algorithm is efficient. The more general corpus of CPC cases that we discuss in the current paper is not characterized by a small number of diseases per case. In general, even if we impose the assumption that patients have a limited number N of diseases, we cannot assume a priori that the model will show a sharp cutoff in posterior probability after disease N. Finally, in high-dimensional search problems it is often necessary to allow paths that are not limited to the target hypothesis subspace; in particular, one would like to be able to arrive at a hypothesis containing few diseases by pruning hypotheses containing additional diseases (Peng & Reggia, 1987). Imposing such a limitation can lead to failure of the search.
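A quick calculation shows both why truncated search is tempting and why the full space is hopeless. With 600 binary disease variables, the hypotheses involving at most a handful of diseases number only in the billions, while the full space is astronomically larger:

```python
from math import comb

n_diseases = 600
total = 2 ** n_diseases                          # all subsets of 600 diseases
up_to_4 = sum(comb(n_diseases, k) for k in range(5))   # at most 4 diseases

assert up_to_4 < 6e9          # a few billion: conceivably enumerable
assert total > 10 ** 180      # astronomically larger than the truncated space
```

The gap between the two numbers is exactly the bet that search-based methods make: that the neglected mass outside the small-hypothesis region is negligible.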
More recent partial evaluation methods include the ``localized partial evaluation'' method of Draper and Hanks (1994), the ``incremental SPI'' algorithm of D'Ambrosio (1993), the ``probabilistic partial evaluation'' method of Poole (1997), and the ``mini-buckets'' algorithm of Dechter (1997). The former algorithm considers partial sets of nodes, and the latter three consider partial evaluations of the sums that emerge during an exact inference run. These are all promising methods, but like the other partial evaluation methods it is not yet clear whether they restrict the exponential growth in complexity in ways that yield realistic accuracy/time tradeoffs in large-scale models such as the QMR-DT. 2
Variational methods provide an alternative approach to approximate inference. They are similar in spirit to partial evaluation methods (in particular the incremental SPI and mini-buckets algorithms), in that they aim to avoid performing sums over exponentially many summands, but they come at the problem from a different point of view. From the variational point of view, a sum can be avoided if it contains a sufficient number of terms such that a law of large numbers can be invoked. A variational approach to inference replaces quantities that can be expected to be the beneficiary of such an averaging process with surrogates known as ``variational parameters.'' The inference algorithm manipulates these parameters directly in order to find a good approximation to a marginal probability of interest. The QMR-DT model turns out to be a particularly appealing architecture for the development of variational methods. As we will show, variational methods have a simple graphical interpretation in the case of the QMR-DT.
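The flavor of a variational transformation can be conveyed by a standard conjugate-duality bound on the logarithm (a minimal illustration, not the specific bound derived for the QMR-DT later in the paper): a difficult nonlinear quantity is replaced by a tractable linear surrogate indexed by a variational parameter, and optimizing that parameter recovers (or tightly approximates) the original quantity.

```python
import math

# Conjugate-duality bound:  log(x) <= lam * x - log(lam) - 1  for all lam > 0,
# with equality at lam = 1/x.  The hard quantity log(x) is replaced by a
# linear surrogate in x, and the variational parameter lam is then optimized.
def upper_bound(x, lam):
    return lam * x - math.log(lam) - 1.0

x = 3.7
for lam in (0.1, 0.27, 1.0, 2.0):
    assert upper_bound(x, lam) >= math.log(x) - 1e-12

# Optimizing the variational parameter makes the bound exact:
assert abs(upper_bound(x, 1.0 / x) - math.log(x)) < 1e-12
```

In a graphical model, such surrogates decouple the troublesome terms so that the remaining sums factorize; the inference algorithm then tunes the variational parameters to tighten the approximation.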
A final class of methods for performing approximate inference is that of stochastic sampling. Stochastic sampling is a large family, including techniques such as rejection sampling, importance sampling, and Markov chain Monte Carlo methods (MacKay, 1998). Many of these methods have been applied to the problem of approximate probabilistic inference for graphical models and analytic results are available (Dagum & Horvitz, 1993). In particular, Shwe and Cooper (1991) proposed a stochastic sampling method known as ``likelihood-weighted sampling'' for the QMR-DT model. Their results are the most promising results to date for inference for the QMR-DT--they were able to produce reasonably accurate approximations in reasonable time for two of the difficult CPC cases. We consider the Shwe and Cooper algorithm later in this paper; in particular we compare the algorithm empirically to our variational algorithm across the entire corpus of CPC cases.
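The core idea of likelihood weighting can be sketched on a two-node network (all probabilities invented): rather than rejecting samples that disagree with the evidence, each sample is weighted by the likelihood it assigns to the evidence, and posteriors are estimated as weighted averages.

```python
import random

random.seed(0)

# Minimal likelihood-weighting sketch for a two-node network D -> F with
# made-up probabilities.  Evidence: F = 1.  Sample D from its prior and
# weight each sample by P(F=1 | D) instead of rejecting mismatches.
pD = 0.2                          # prior P(D=1)
pF_given_D = {0: 0.1, 1: 0.9}     # P(F=1 | D)

num, den = 0.0, 0.0
for _ in range(100000):
    d = 1 if random.random() < pD else 0
    w = pF_given_D[d]             # evidence weight
    num += w * d
    den += w
estimate = num / den              # estimate of P(D=1 | F=1)

exact = pD * 0.9 / (pD * 0.9 + (1 - pD) * 0.1)
assert abs(estimate - exact) < 0.02
```

In the QMR-DT setting the same scheme samples diseases from their priors and weights by the likelihood of the observed findings; the difficulty is that in a large dense network the weights can vary over many orders of magnitude, degrading the effective sample size.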
Although it is important to compare approximation methods, it should be emphasized at the outset that we do not think that the goal should be to identify a single champion approximate inference technique. Rather, different methods exploit different structural features of large-scale probability models, and we expect that optimal solutions will involve a combination of methods. We return to this point in the discussion section, where we consider various promising hybrids of approximate and exact inference algorithms.
The general problem of approximate inference is NP-hard (Dagum & Luby, 1993) and this provides additional reason to doubt the existence of a single champion approximate inference technique. We think it important to stress, however, that this hardness result, together with Cooper's (1990) hardness result for exact inference cited above, should not be taken to suggest that exact inference and approximate inference are ``equally hard.'' To take an example from a related field, there exist large domains of solid and fluid mechanics in which exact solutions are infeasible but in which approximate techniques (finite element methods) work well. Similarly, in statistical physics, very few models are exactly solvable, but there exist approximate methods (mean field methods, renormalization group methods) that work well in many cases. We feel that the goal of research in probabilistic inference should similarly be that of identifying effective approximate techniques that work well in large classes of problems.