The main difference among the various stochastic sampling algorithms lies in how they process Steps 2, 7, and 8 of the generic importance sampling algorithm in Figure 1.
Probabilistic logic sampling [Henrion1988] is the simplest and the first proposed sampling algorithm for Bayesian networks. The importance function is initialized in Step 2 to Pr(X) and never updated (Step 7 is null). Without evidence, Pr(X) is the optimal importance function for the evidence set, which is empty anyway. It escapes most authors that Pr(X) may not be the optimal importance function for Pr(A = a), A ∈ X, when A is not a root node. A mismatch between the optimal and the actually used importance function may result in a large variance. The sampling process with evidence is the same as without evidence, except that in Step 10 we do not count the scores of samples that are inconsistent with the observed evidence, which amounts to discarding them. When the evidence is very unlikely, there is a large difference between Pr(X) and the optimal importance function. Effectively, most samples are discarded and the performance of logic sampling deteriorates badly.
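To make the discarding behavior concrete, the sketch below runs logic sampling on a hypothetical two-node network Rain → WetGrass with evidence WetGrass = true; the network, its probabilities, and all identifiers are illustrative assumptions, not taken from the cited papers.

```python
import random

P_RAIN = 0.2                               # Pr(Rain = true), assumed for illustration
P_WET_GIVEN = {True: 0.9, False: 0.1}      # Pr(WetGrass = true | Rain)

def logic_sampling(n_samples, evidence_wet=True):
    kept, rain_true = 0, 0
    for _ in range(n_samples):
        rain = random.random() < P_RAIN            # sample the root from its prior
        wet = random.random() < P_WET_GIVEN[rain]  # sample the child given its parent
        if wet != evidence_wet:
            continue                               # discard samples inconsistent
        kept += 1                                  # with the observed evidence
        rain_true += rain
    return rain_true / kept if kept else float("nan")

print(logic_sampling(100_000))   # approximates Pr(Rain | WetGrass = true)
```

With very unlikely evidence, nearly every iteration hits the `continue` branch, which is exactly the deterioration described above.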
Likelihood weighting (LW) [Fung and Chang1989,Shachter and Peot1989] enhances logic sampling in that it never discards samples. In likelihood weighting, the importance function in Step 2 is Pr0(X\E) = ∏_{Xi ∈ X\E} Pr(Xi | Pa(Xi)), where Pa(Xi) denotes the parents of Xi: evidence nodes are fixed at their observed values, every other node is sampled from its conditional probability distribution given its parents, and each sample is scored by the likelihood of the observed evidence given the sampled values of its parents. As in logic sampling, the importance function is never updated (Step 7 is null).
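A minimal sketch of likelihood weighting on the same hypothetical Rain → WetGrass network (all names and numbers are again illustrative): the evidence node is clamped rather than sampled, and each sample carries the likelihood of the evidence as its weight.

```python
import random

P_RAIN = 0.2                               # Pr(Rain = true), assumed for illustration
P_WET_GIVEN = {True: 0.9, False: 0.1}      # Pr(WetGrass = true | Rain)

def likelihood_weighting(n_samples, evidence_wet=True):
    weighted_rain, total_weight = 0.0, 0.0
    for _ in range(n_samples):
        rain = random.random() < P_RAIN    # sample the non-evidence node from its prior
        p_e = P_WET_GIVEN[rain]            # likelihood of the evidence given the sample
        weight = p_e if evidence_wet else 1.0 - p_e
        weighted_rain += weight * rain     # every sample counts, weighted by the
        total_weight += weight             # likelihood of the evidence
    return weighted_rain / total_weight

print(likelihood_weighting(100_000))   # approximates Pr(Rain | WetGrass = true)
```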
Backward sampling [Fung and del Favero1994] changes Step 1 of our generic algorithm and allows for generating samples from evidence nodes in the direction that is opposite to the topological order of nodes in the network. In Step 2, backward sampling uses the likelihood of some of the observed evidence and some instantiated nodes to calculate Pr0(X\E). Although Fung and del Favero mentioned the possibility of dynamic node ordering, they did not propose any scheme for updating the importance function in Step 7. Backward sampling suffers from problems that are similar to those of likelihood weighting, i.e., a possible mismatch between its importance function and the optimal importance function can lead to poor convergence.
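The sketch below illustrates the backward step on the same hypothetical network: the parent of the evidence node is sampled against the arc direction, from a function proportional to the evidence likelihood, and the importance weight Pr(x, e)/q(x) compensates for that choice. This is a toy reading of the idea, not Fung and del Favero's full algorithm.

```python
import random

P_RAIN = 0.2                               # Pr(Rain = true), assumed for illustration
P_WET_GIVEN = {True: 0.9, False: 0.1}      # Pr(WetGrass = true | Rain)

def backward_sampling(n_samples, evidence_wet=True):
    # importance function q(Rain) proportional to Pr(evidence | Rain)
    lik = {r: (P_WET_GIVEN[r] if evidence_wet else 1.0 - P_WET_GIVEN[r])
           for r in (True, False)}
    z = lik[True] + lik[False]
    weighted_rain, total_weight = 0.0, 0.0
    for _ in range(n_samples):
        rain = random.random() < lik[True] / z        # backward draw from q
        prior = P_RAIN if rain else 1.0 - P_RAIN
        weight = prior * lik[rain] / (lik[rain] / z)  # Pr(x, e) / q(x)
        weighted_rain += weight * rain
        total_weight += weight
    return weighted_rain / total_weight

print(backward_sampling(100_000))   # approximates Pr(Rain | WetGrass = true)
```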
Importance sampling [Shachter and Peot1989] is the same as our generic sampling algorithm. Shachter and Peot introduced two variants of importance sampling: self-importance (SIS) and heuristic importance. The importance function used in the first step of the self-importance algorithm is the prior distribution, Pr0(X\E) = Pr(X\E); SIS then revises this function periodically in Step 7, using the scores of the samples generated so far, so that the sampling distribution gradually approaches the posterior distribution.
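A toy sketch of the self-importance idea on the same illustrative network: the sampling distribution of Rain starts at its prior and is periodically revised toward the posterior estimated from the accumulated scores. The update period and the clamping constant are arbitrary safeguards chosen here, not part of the published algorithm.

```python
import random

P_RAIN = 0.2                               # Pr(Rain = true), assumed for illustration
P_WET_GIVEN = {True: 0.9, False: 0.1}      # Pr(WetGrass = true | Rain)

def self_importance(n_samples, evidence_wet=True, update_every=1000):
    q_rain = P_RAIN                        # Pr0 = prior; revised in "Step 7"
    w_true, w_total = 0.0, 0.0
    for i in range(1, n_samples + 1):
        rain = random.random() < q_rain
        prior = P_RAIN if rain else 1.0 - P_RAIN
        lik = P_WET_GIVEN[rain] if evidence_wet else 1.0 - P_WET_GIVEN[rain]
        q = q_rain if rain else 1.0 - q_rain
        w = prior * lik / q                # importance weight Pr(x, e) / q(x)
        w_true += w * rain
        w_total += w
        if i % update_every == 0:
            # revise the importance function from the accumulated scores;
            # clamping keeps q strictly inside (0, 1) (an arbitrary safeguard)
            q_rain = min(max(w_true / w_total, 0.05), 0.95)
    return w_true / w_total

print(self_importance(100_000))   # approximates Pr(Rain | WetGrass = true)
```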
A separate group of stochastic sampling methods is formed by so-called Markov Chain Monte Carlo (MCMC) methods, which are divided into Gibbs sampling, Metropolis sampling, and Hybrid Monte Carlo sampling [Geman and Geman1984,Gilks et al.1996,MacKay1998]. Roughly speaking, these methods draw random samples from an unknown target distribution f(X) by biasing the search for this distribution towards higher probability regions. When applied to Bayesian networks [Pearl1987,Chavez and Cooper1990], this approach determines the sampling distribution of a variable from its previous sample given its Markov blanket [Pearl1988]. This corresponds to updating Prk(X\E) when sampling every node. Prk(X\E) will converge to the optimal importance function for Pr(E = e) if Pr0(X\E) satisfies some ergodic properties [York1992]. Since the convergence to the limiting distribution is very slow and calculating updates of the sampling distribution is costly, these algorithms are not used in practice as often as the simple likelihood weighting scheme.
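For contrast with the importance sampling schemes above, here is a minimal Gibbs sampling sketch on a hypothetical chain A → B → C with evidence C = true: each unobserved node is resampled from its distribution given its Markov blanket. All probabilities and names are illustrative assumptions.

```python
import random

P_A = 0.3                                  # Pr(A = true)
P_B_GIVEN_A = {True: 0.8, False: 0.2}      # Pr(B = true | A)
P_C_GIVEN_B = {True: 0.7, False: 0.1}      # Pr(C = true | B)

def bern(p):
    return random.random() < p

def gibbs(n_samples, burn_in=1000):
    a, b = bern(P_A), True                 # arbitrary initial state; C = true is fixed
    a_true = 0
    for i in range(n_samples + burn_in):
        # resample A given its Markov blanket {B}: Pr(A) * Pr(b | A)
        pa = {x: (P_A if x else 1 - P_A) *
                 (P_B_GIVEN_A[x] if b else 1 - P_B_GIVEN_A[x])
              for x in (True, False)}
        a = bern(pa[True] / (pa[True] + pa[False]))
        # resample B given its Markov blanket {A, C}: Pr(B | a) * Pr(C = true | B)
        pb = {x: (P_B_GIVEN_A[a] if x else 1 - P_B_GIVEN_A[a]) * P_C_GIVEN_B[x]
              for x in (True, False)}
        b = bern(pb[True] / (pb[True] + pb[False]))
        if i >= burn_in:                   # discard the burn-in portion of the chain
            a_true += a
    return a_true / n_samples

print(gibbs(100_000))   # approximates Pr(A | C = true)
```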
There are also some other simulation algorithms, such as the bounded variance algorithm [Dagum and Luby1997] and the AA algorithm [Dagum et al.1995], which are essentially based on the LW algorithm and the Stopping-Rule Theorem [Dagum et al.1995]. Cano et al. [1996] proposed another importance sampling algorithm that performed somewhat better than LW in cases with extreme probability distributions but, as the authors state, in general cases it "produced similar results to the likelihood weighting algorithm." Hernandez et al. [1998] also applied importance sampling and reported a moderate improvement over likelihood weighting.