Existing Importance Sampling Algorithms for Bayesian Networks

The main difference between various stochastic sampling algorithms is in how they process Steps 2, 7, and 8 in the generic importance sampling algorithm of Figure 1.

*Probabilistic logic sampling* [Henrion1988] is the
simplest and the first proposed sampling algorithm for Bayesian
networks. The importance function is initialized in Step 2 to
Pr(**X**) and never updated (Step 7 is null). Without evidence,
Pr(**X**) is the optimal importance function for the evidence set,
which is empty anyway. It escapes most authors that Pr(**X**) may
not be the optimal importance function for Pr(*A* = *a*),
*A* ∈ **X**, when *A* is not a root node. A mismatch between the
optimal and the actually used importance function may result in a
large variance. The sampling process with evidence is the same as
without evidence, except that in Step 10 we do not count the
scores of samples that are inconsistent with the observed
evidence, which amounts to discarding them. When the evidence is
very unlikely, there is a large difference between Pr(**X**) and
the optimal importance function. Effectively, most samples are
discarded and the performance of logic sampling deteriorates
badly.
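The discard-on-mismatch behavior can be sketched on a toy two-node
network. The network (Rain → WetGrass) and its CPT values below are
illustrative assumptions, not part of the original algorithm
description:

```python
import random

random.seed(0)

# Toy network: Rain -> WetGrass, both binary. CPT values are illustrative.
P_RAIN = 0.2                        # Pr(Rain = 1)
P_WET = {0: 0.1, 1: 0.9}            # Pr(WetGrass = 1 | Rain)

def logic_sample():
    """Forward-sample all nodes in topological order from Pr(X)."""
    rain = int(random.random() < P_RAIN)
    wet = int(random.random() < P_WET[rain])
    return rain, wet

def estimate_rain_given_wet(n_samples=100_000):
    """Estimate Pr(Rain = 1 | WetGrass = 1) by logic sampling:
    samples inconsistent with the evidence WetGrass = 1 score zero,
    which amounts to discarding them."""
    kept = hits = 0
    for _ in range(n_samples):
        rain, wet = logic_sample()
        if wet != 1:
            continue                # conflicts with evidence: discard
        kept += 1
        hits += rain
    return hits / kept, kept / n_samples

estimate, acceptance = estimate_rain_given_wet()
```

Note that the acceptance rate equals Pr(**E**) in expectation (here
Pr(WetGrass = 1) = 0.26), which is why unlikely evidence causes most
samples to be wasted.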

*Likelihood weighting* (LW) [Fung and Chang1989,Shachter and Peot1989]
enhances logic sampling in that it never discards samples. In
likelihood weighting, the importance function in Step 2 is

Pr(**X** \ **E**) = ∏_{*X*_{i} ∈ **X** \ **E**} Pr(*x*_{i} | Pa(*X*_{i})).

Likelihood weighting does not update the importance function in
Step 7. Although likelihood weighting is an improvement on logic
sampling, its convergence rate can still be very slow when there
is a large difference between the optimal importance function and
Pr(**X** \ **E**), again especially in situations when evidence is
very unlikely. Because of its simplicity, the likelihood weighting
algorithm has been the most commonly used simulation method for
Bayesian network inference. It often matches the performance of
other, more sophisticated schemes because it is simple and able to
increase its precision by generating more samples than other
algorithms in the same amount of time.
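On the same kind of toy network (Rain → WetGrass with WetGrass = 1
observed; names and CPT values are illustrative assumptions),
likelihood weighting clamps the evidence node and weights each sample
by the likelihood of the evidence instead of discarding it:

```python
import random

random.seed(0)

P_RAIN = 0.2                        # Pr(Rain = 1)
P_WET = {0: 0.1, 1: 0.9}            # Pr(WetGrass = 1 | Rain)

def likelihood_weighting(n_samples=100_000):
    """Estimate Pr(Rain = 1 | WetGrass = 1). The evidence node is
    clamped to its observed value; each sample of the remaining nodes
    is weighted by Pr(evidence | sampled parents)."""
    num = den = 0.0
    for _ in range(n_samples):
        rain = int(random.random() < P_RAIN)   # sample from Pr(Rain)
        w = P_WET[rain]                        # weight = Pr(WetGrass=1 | rain)
        num += w * rain
        den += w
    return num / den

posterior = likelihood_weighting()
```

Every sample contributes, but when the weights vary wildly (very
unlikely evidence) the weighted estimator still converges slowly.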

*Backward sampling* [Fung and del Favero1994] changes Step 8 of our
generic algorithm and allows for generating samples from evidence
nodes in the direction that is opposite to the topological order
of nodes in the network. In Step 2, backward sampling uses the
likelihood of some of the observed evidence and some instantiated
nodes to calculate Pr^{0}(**X** \ **E**). Although
Fung and del Favero mentioned the possibility of dynamic node
ordering, they did not propose any scheme for updating the
importance function in Step 7. Backward sampling suffers from
problems similar to those of likelihood weighting, i.e., a
possible mismatch between its importance function and the optimal
importance function can lead to poor convergence.

*Importance sampling* [Shachter and Peot1989] is the same as
our generic sampling algorithm. Shachter and Peot introduced two
variants of importance sampling: *self-importance* (SIS) and
*heuristic importance*. The importance function used in the
first step of the self-importance algorithm is

Pr^{0}(**X** \ **E**) = ∏_{*X*_{i} ∈ **X** \ **E**} Pr(*x*_{i} | Pa(*X*_{i})).

This function is updated in Step 7. The algorithm tries to revise
the conditional probability tables (CPTs) periodically in order
to make the sampling distribution gradually approach the posterior
distribution. Since the same data are used to update the
importance function and to compute the estimator, this process
introduces bias in the estimator. Heuristic importance first
removes edges from the network until it becomes a polytree, and
then uses a modified version of the polytree algorithm
[Pearl1986] to compute the likelihood functions for each of the
unobserved nodes.
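A minimal sketch of the self-importance idea, on an assumed two-node
network (Rain → WetGrass with WetGrass = 1 observed; names and
numbers are illustrative): the sampling distribution for Rain is
periodically revised toward the running posterior estimate, and the
reuse of the same scores for both the update and the estimator is
exactly the source of bias mentioned above.

```python
import random

random.seed(0)

P_RAIN = 0.2                        # Pr(Rain = 1)
P_WET = {0: 0.1, 1: 0.9}            # Pr(WetGrass = 1 | Rain)

def self_importance(n_samples=100_000, update_every=5_000):
    """Self-importance sampling sketch: periodically revise the
    sampling distribution q(Rain) toward the running weighted
    posterior estimate of Pr(Rain = 1 | WetGrass = 1)."""
    q = P_RAIN                      # initial importance function = prior
    num = den = 0.0
    for i in range(1, n_samples + 1):
        rain = int(random.random() < q)
        # weight = target Pr(rain, WetGrass=1) / proposal q(rain)
        target = (P_RAIN if rain else 1 - P_RAIN) * P_WET[rain]
        proposal = q if rain else 1 - q
        w = target / proposal
        num += w * rain
        den += w
        if i % update_every == 0 and den > 0:
            # revise q toward the current estimate, clamped away from 0/1
            q = min(max(num / den, 0.01), 0.99)
    return num / den

posterior = self_importance()
```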
A separate group of stochastic sampling methods is formed by
so-called *Markov Chain Monte Carlo (MCMC)* methods, which are
divided into Gibbs sampling, Metropolis sampling, and Hybrid Monte
Carlo sampling [Geman and Geman1984,Gilks *et al.*1996,MacKay1998]. Roughly speaking,
these methods draw random samples from an unknown target
distribution *f*(**X**) by biasing the search for this
distribution towards higher probability regions. When applied to
Bayesian networks [Pearl1987,Chavez and Cooper1990], this approach determines
the sampling distribution of a variable from its previous sample
given its Markov blanket [Pearl1988]. This corresponds to
updating Pr^{k}(**X** \ **E**) when sampling every node.
Pr^{k}(**X** \ **E**) will converge to the optimal importance
function Pr(**X** \ **E** | **E**) if Pr^{0}(**X** \ **E**)
satisfies some ergodic properties [York1992]. Since the
convergence to the limiting distribution is very slow and
calculating updates of the sampling distribution is costly, these
algorithms are not used in practice as often as the simple
likelihood weighting scheme.
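For concreteness, here is Gibbs sampling on an assumed three-node
chain A → B → C with C = 1 observed (variable names and CPT values
are illustrative): each unobserved node is resampled from its
distribution given its Markov blanket.

```python
import random

random.seed(0)

# Chain A -> B -> C, all binary; C = 1 is observed. CPTs are illustrative.
P_A = 0.3                           # Pr(A = 1)
P_B = {0: 0.2, 1: 0.8}              # Pr(B = 1 | A)
P_C = {0: 0.1, 1: 0.7}              # Pr(C = 1 | B)

def gibbs(n_samples=200_000, burn_in=1_000):
    """Gibbs sampling: resample each unobserved node given its Markov
    blanket; estimate Pr(B = 1 | C = 1) from the chain."""
    a, b = 0, 0                     # arbitrary initial state
    hits = 0
    for i in range(n_samples + burn_in):
        # Pr(A | b) ∝ Pr(A) Pr(b | A); the Markov blanket of A is {B}
        pa1 = P_A * (P_B[1] if b else 1 - P_B[1])
        pa0 = (1 - P_A) * (P_B[0] if b else 1 - P_B[0])
        a = int(random.random() < pa1 / (pa1 + pa0))
        # Pr(B | a, C=1) ∝ Pr(B | a) Pr(C=1 | B); blanket of B is {A, C}
        pb1 = P_B[a] * P_C[1]
        pb0 = (1 - P_B[a]) * P_C[0]
        b = int(random.random() < pb1 / (pb1 + pb0))
        if i >= burn_in:
            hits += b
    return hits / n_samples

posterior = gibbs()
```

Even in this tiny example the successive states are correlated, which
illustrates why MCMC needs many steps to approach its limiting
distribution.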

There are also some other simulation algorithms, such as the
bounded variance algorithm [Dagum and Luby1997] and the AA
algorithm [Dagum *et al.*1995], which are essentially based on the
LW algorithm and the Stopping-Rule Theorem [Dagum *et al.*1995].
Cano et al. [1996] proposed another importance sampling algorithm
that performed somewhat better than LW in cases with extreme
probability distributions but, as the authors state, in general
cases it "produced similar results to the likelihood weighting
algorithm." Hernandez et al. [1998] also applied importance
sampling and reported a moderate improvement over likelihood
weighting.