 
 
 
 
 
   
The main difference among the various stochastic sampling algorithms lies in how they implement Steps 2, 7, and 8 of the generic importance sampling algorithm of Figure 1.
Probabilistic logic sampling [Henrion1988] is the
simplest and the first proposed sampling algorithm for Bayesian
networks. The importance function is initialized in Step 2 to
Pr(X) and never updated (Step 7 is null). Without evidence,
Pr(X) is the optimal importance function for the evidence set,
which is empty anyway. It escapes most authors that Pr(X) may not
be the optimal importance function for Pr(A = a), A ∈ X, when A is
not a root node. A mismatch between the optimal and the actually
used importance function may result in a large variance. The
sampling process with evidence is the same as without evidence,
except that in Step 10 we do not count the scores of those samples
that are inconsistent with the observed evidence, which amounts to
discarding them. When the evidence is very unlikely, there is a
large difference between Pr(X) and the optimal importance
function. Effectively, most samples are discarded and the
performance of logic sampling deteriorates badly.
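The discard-on-inconsistency behaviour can be sketched as follows. This is a minimal illustration on a hypothetical two-node network Rain -> WetGrass; the network, its CPT numbers, and all identifiers are assumptions for the example, not taken from the text.

```python
import random

random.seed(0)

# Hypothetical two-node network Rain -> WetGrass; the CPT numbers
# below are illustrative only.
P_RAIN = 0.2                  # Pr(Rain = 1)
P_WET = {1: 0.9, 0: 0.1}      # Pr(WetGrass = 1 | Rain)

def logic_sampling(n_samples, wet_obs=1):
    """Estimate Pr(Rain = 1 | WetGrass = wet_obs) by sampling every node
    from its prior CPT and discarding samples that contradict the evidence."""
    kept = hits = 0
    for _ in range(n_samples):
        rain = 1 if random.random() < P_RAIN else 0
        wet = 1 if random.random() < P_WET[rain] else 0
        if wet != wet_obs:        # inconsistent with evidence: discard (Step 10)
            continue
        kept += 1
        hits += rain
    return hits / kept, kept

estimate, kept = logic_sampling(100_000)
# Exact posterior: 0.2*0.9 / (0.2*0.9 + 0.8*0.1) = 0.18/0.26;
# only about a quarter of the samples survive the evidence check here,
# and the surviving fraction shrinks as the evidence gets less likely.
```

The `kept` counter makes the deterioration visible: the effective sample size equals the number of samples consistent with the evidence, not the number drawn.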
Likelihood weighting (LW) [Fung and Chang1989, Shachter and Peot1989] enhances logic sampling in that it never discards samples. In likelihood weighting, the importance function in Step 2 is

    Pr0(X\E) = ∏_{Xi ∈ X\E} Pr(xi|Pa(Xi)) ,

and it is never updated (Step 7 is null). The algorithm can
converge slowly when there is a large mismatch between Pr0(X\E)
and the optimal importance function, again especially in situations
when evidence is very unlikely. Because of its simplicity, the
likelihood weighting algorithm has been the most commonly used
simulation method for Bayesian network inference. It often matches
the performance of other, more sophisticated schemes because it is
simple and able to increase its precision by generating more
samples than other algorithms in the same amount of time.
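A minimal sketch of likelihood weighting, on the same kind of hypothetical two-node Rain -> WetGrass network as above (all numbers and names are assumptions for the example): the evidence node is clamped to its observed value, and every sample is kept with a weight equal to the likelihood of the evidence given its sampled parents.

```python
import random

random.seed(0)

# Hypothetical Rain -> WetGrass network; CPT numbers are illustrative.
P_RAIN = 0.2                  # Pr(Rain = 1)
P_WET = {1: 0.9, 0: 0.1}      # Pr(WetGrass = 1 | Rain)

def likelihood_weighting(n_samples, wet_obs=1):
    """Estimate Pr(Rain = 1 | WetGrass = wet_obs). No sample is ever
    discarded; each contributes a weight Pr(evidence | sampled parents)."""
    num = den = 0.0
    for _ in range(n_samples):
        rain = 1 if random.random() < P_RAIN else 0   # sample non-evidence node
        w = P_WET[rain] if wet_obs == 1 else 1.0 - P_WET[rain]
        num += w * rain
        den += w
    return num / den

estimate = likelihood_weighting(100_000)
# The weighted ratio converges to the exact posterior 0.18/0.26.
```

Compared with the logic sampling sketch, every iteration contributes to the estimate, which is why LW can afford its fixed, never-updated importance function in many practical cases.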
Backward sampling [Fung and del Favero1994] changes Step 1 of our
generic algorithm and allows for generating samples from evidence
nodes in the direction that is opposite to the topological order
of nodes in the network. In Step 2, backward sampling uses the
likelihood of some of the observed evidence and some instantiated
nodes to calculate Pr0(X\E). Although
Fung and del Favero mentioned the possibility of dynamic node
ordering, they did not propose any scheme for updating the
importance function in Step 7. Backward sampling suffers from
problems similar to those of likelihood weighting, i.e.,
a possible mismatch between its importance function and the
optimal importance function can lead to poor convergence.
Importance sampling [Shachter and Peot1989] is the same as our generic sampling algorithm. Shachter and Peot introduced two variants of importance sampling: self-importance (SIS) and heuristic importance. The importance function used in the first step of the self-importance algorithm is

    Pr0(X\E) = ∏_{Xi ∈ X\E} Pr(xi|Pa(Xi)) .

This function is revised periodically in Step 7 using the scores
collected so far, so that the sampling distribution gradually
approaches the posterior distribution. The heuristic importance
algorithm first computes a likelihood function for each node of the
network; the importance function Prk(X\E) is a
combination of these likelihood functions with Pr(X\E, E).
In Step 7, heuristic importance does not update Prk(X\E). As Shachter and
Peot [1989] point out, this heuristic importance
function can still lead to a bad approximation of the optimal
importance function. There exist also other algorithms, such as a
combination of self-importance and heuristic importance
[Shachter and Peot1989, Shwe and Cooper1991]. Although some researchers
suggested that this may be a promising direction for work on
sampling algorithms, we have not seen any results that would
follow up on this.
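The periodic Step-7 revision at the heart of SIS can be illustrated with a deliberately tiny, one-variable analogue (the full algorithm revises entire conditional probability tables). The network, numbers, and the clamping of the revised probability away from 0 and 1 are all assumptions made for this sketch.

```python
import random

random.seed(0)

# Hypothetical Rain -> WetGrass network; CPT numbers are illustrative.
P_RAIN = 0.2                  # Pr(Rain = 1)
P_WET = {1: 0.9, 0: 0.1}      # Pr(WetGrass = 1 | Rain)

def self_importance(n_samples, wet_obs=1, update_every=1_000):
    """One-variable analogue of self-importance: sample Rain from an
    importance function q that starts at the prior and is periodically
    revised (Step 7) using the weighted scores collected so far."""
    q = P_RAIN                # current sampling probability for Rain = 1
    num = den = 0.0
    for i in range(1, n_samples + 1):
        rain = 1 if random.random() < q else 0
        prior = P_RAIN if rain else 1.0 - P_RAIN
        like = P_WET[rain] if wet_obs == 1 else 1.0 - P_WET[rain]
        prop = q if rain else 1.0 - q
        w = prior * like / prop               # importance weight
        num += w * rain
        den += w
        if i % update_every == 0:             # periodic Step-7 revision
            q = min(max(num / den, 0.05), 0.95)   # keep q away from 0 and 1
    return num / den, q

estimate, q_final = self_importance(100_000)
# q drifts from the prior 0.2 towards the posterior 0.18/0.26,
# so later samples come from a distribution closer to the optimal one.
```

Because each sample's weight is computed against the q in force when it was drawn, the weighted estimate remains consistent while the sampling distribution improves.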
A separate group of stochastic sampling methods is formed by
the so-called Markov Chain Monte Carlo (MCMC) methods, which are
divided into Gibbs sampling, Metropolis sampling, and Hybrid Monte
Carlo sampling [Geman and Geman1984, Gilks et al.1996, MacKay1998]. Roughly speaking,
these methods draw random samples from an unknown target
distribution f(X) by biasing the search for this
distribution towards higher probability regions. When applied to
Bayesian networks [Pearl1987, Chavez and Cooper1990], this approach determines
the sampling distribution of a variable from its previous sample
given its Markov blanket [Pearl1988]. This corresponds to
updating Prk(X\E) when sampling every
node. Prk(X\E) will converge to the
optimal importance function for Pr(E) if Pr0(X\E) satisfies some ergodic properties
[York1992]. Since the convergence to the limiting distribution
is very slow and calculating updates of the sampling distribution
is costly, these algorithms are not used in practice as often as
the simple likelihood weighting scheme.
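A minimal sketch of Gibbs sampling in a Bayesian network, using a hypothetical three-node chain Cloudy -> Rain -> WetGrass with WetGrass observed (the network and all CPT numbers are assumptions for this example): each non-evidence node is resampled from its distribution conditional on its Markov blanket.

```python
import random

random.seed(0)

# Hypothetical chain Cloudy -> Rain -> WetGrass, evidence WetGrass = 1;
# the CPT numbers are illustrative only.
P_CLOUDY = 0.5
P_RAIN = {1: 0.8, 0: 0.1}     # Pr(Rain = 1 | Cloudy)
P_WET = {1: 0.9, 0: 0.2}      # Pr(WetGrass = 1 | Rain)

def gibbs(n_samples, burn_in=1_000):
    """Estimate Pr(Rain = 1 | WetGrass = 1) by repeatedly resampling each
    non-evidence node given the current state of its Markov blanket."""
    cloudy = rain = 0
    hits = 0
    for i in range(n_samples + burn_in):
        # Rain's Markov blanket: parent Cloudy and child WetGrass (= 1).
        p1 = P_RAIN[cloudy] * P_WET[1]
        p0 = (1.0 - P_RAIN[cloudy]) * P_WET[0]
        rain = 1 if random.random() < p1 / (p1 + p0) else 0
        # Cloudy's Markov blanket: its only child, Rain.
        c1 = P_CLOUDY * (P_RAIN[1] if rain else 1.0 - P_RAIN[1])
        c0 = (1.0 - P_CLOUDY) * (P_RAIN[0] if rain else 1.0 - P_RAIN[0])
        cloudy = 1 if random.random() < c1 / (c1 + c0) else 0
        if i >= burn_in:
            hits += rain
    return hits / n_samples

estimate = gibbs(50_000)
# Converges to the exact posterior 0.405/0.515 for these CPTs.
```

The burn-in period and the per-node conditional computations hint at the costs noted above: samples are correlated, and each step requires evaluating the full Markov blanket of every node.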
There are also other simulation algorithms, such as the bounded-variance algorithm [Dagum and Luby1997] and the AA algorithm [Dagum et al.1995], which are essentially based on the LW algorithm and the Stopping-Rule Theorem [Dagum et al.1995]. Cano et al. [1996] proposed another importance sampling algorithm that performed somewhat better than LW in cases with extreme probability distributions, but, as the authors state, in general cases it "produced similar results to the likelihood weighting algorithm." Hernandez et al. [1998] also applied importance sampling and reported a moderate improvement over likelihood weighting.
 
 
 
 
