 
 
 
 
 
   
The main difference among the various stochastic sampling algorithms lies in how they implement Steps 2, 7, and 8 of the generic importance sampling algorithm of Figure 1.
Probabilistic logic sampling [Henrion1988] is the
simplest and the first proposed sampling algorithm for Bayesian
networks. The importance function is initialized in Step 2 to
Pr(X) and never updated (Step 7 is null). Without evidence,
Pr(X) is the optimal importance function for the evidence set,
which is empty anyway. It escapes most authors that Pr(X) may not
be the optimal importance function for Pr(A = a), A ∈ X, when A is
not a root node. A mismatch between the optimal and the actually
used importance function may result in a large variance. The
sampling process with evidence is the same as without evidence,
except that in Step 10 we do not count the scores of those samples
that are inconsistent with the observed evidence, which amounts to
discarding them. When the evidence is very unlikely, there is a
large difference between Pr(X) and the optimal importance
function. Effectively, most samples are discarded and the
performance of logic sampling deteriorates badly.
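The discard-on-inconsistency behaviour can be sketched as follows. This is a minimal illustration on a hypothetical two-node network Rain -> WetGrass; the network, its CPT numbers, and all identifiers are assumptions for the example, not taken from the text.

```python
import random

random.seed(0)

# Hypothetical two-node network Rain -> WetGrass; the CPT numbers
# below are illustrative only.
P_RAIN = 0.2                  # Pr(Rain = 1)
P_WET = {1: 0.9, 0: 0.1}      # Pr(WetGrass = 1 | Rain)

def logic_sampling(n_samples, wet_obs=1):
    """Estimate Pr(Rain = 1 | WetGrass = wet_obs) by sampling every node
    from its prior CPT and discarding samples that contradict the evidence."""
    kept = hits = 0
    for _ in range(n_samples):
        rain = 1 if random.random() < P_RAIN else 0
        wet = 1 if random.random() < P_WET[rain] else 0
        if wet != wet_obs:        # inconsistent with evidence: discard (Step 10)
            continue
        kept += 1
        hits += rain
    return hits / kept, kept

estimate, kept = logic_sampling(100_000)
# Exact posterior: 0.2*0.9 / (0.2*0.9 + 0.8*0.1) = 0.18/0.26;
# only about a quarter of the samples survive the evidence check here,
# and the surviving fraction shrinks as the evidence gets less likely.
```

The `kept` counter makes the deterioration visible: the effective sample size equals the number of samples consistent with the evidence, not the number drawn.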
Likelihood weighting (LW) [Fung and Chang1989, Shachter and Peot1989] enhances logic sampling in that it never discards samples. In likelihood weighting, the importance function in Step 2 is

    Pr0(X\E) = ∏_{Xi ∈ X\E} Pr(xi|Pa(Xi)) ,

and it is never updated (Step 7 is null). The algorithm can
converge slowly when there is a large mismatch between Pr0(X\E)
and the optimal importance function, again especially in situations
when evidence is very unlikely. Because of its simplicity, the
likelihood weighting algorithm has been the most commonly used
simulation method for Bayesian network inference. It often matches
the performance of other, more sophisticated schemes because it is
simple and able to increase its precision by generating more
samples than other algorithms in the same amount of time.
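A minimal sketch of likelihood weighting, on the same kind of hypothetical two-node Rain -> WetGrass network as above (all numbers and names are assumptions for the example): the evidence node is clamped to its observed value, and every sample is kept with a weight equal to the likelihood of the evidence given its sampled parents.

```python
import random

random.seed(0)

# Hypothetical Rain -> WetGrass network; CPT numbers are illustrative.
P_RAIN = 0.2                  # Pr(Rain = 1)
P_WET = {1: 0.9, 0: 0.1}      # Pr(WetGrass = 1 | Rain)

def likelihood_weighting(n_samples, wet_obs=1):
    """Estimate Pr(Rain = 1 | WetGrass = wet_obs). No sample is ever
    discarded; each contributes a weight Pr(evidence | sampled parents)."""
    num = den = 0.0
    for _ in range(n_samples):
        rain = 1 if random.random() < P_RAIN else 0   # sample non-evidence node
        w = P_WET[rain] if wet_obs == 1 else 1.0 - P_WET[rain]
        num += w * rain
        den += w
    return num / den

estimate = likelihood_weighting(100_000)
# The weighted ratio converges to the exact posterior 0.18/0.26.
```

Compared with the logic sampling sketch, every iteration contributes to the estimate, which is why LW can afford its fixed, never-updated importance function in many practical cases.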
Backward sampling [Fung and del Favero1994] changes Step 1 of our
generic algorithm and allows for generating samples from evidence
nodes in the direction that is opposite to the topological order
of nodes in the network. In Step 2, backward sampling uses the
likelihood of some of the observed evidence and some instantiated
nodes to calculate Pr0(X\E). Although
Fung and del Favero mentioned the possibility of dynamic node
ordering, they did not propose any scheme for updating the
importance function in Step 7. Backward sampling suffers from
problems similar to those of likelihood weighting, i.e.,
a possible mismatch between its importance function and the
optimal importance function can lead to poor convergence.
Importance sampling [Shachter and Peot1989] is the same as our generic sampling algorithm. Shachter and Peot introduced two variants of importance sampling: self-importance (SIS) and heuristic importance. The importance function used in the first step of the self-importance algorithm is

    Pr0(X\E) = ∏_{Xi ∈ X\E} Pr(xi|Pa(Xi)) .

This function is revised periodically in Step 7 using the scores
collected so far, so that the sampling distribution gradually
approaches the posterior distribution. The heuristic importance
algorithm first computes a likelihood function for each node of the
network; the importance function Prk(X\E) is a
combination of these likelihood functions with Pr(X\E, E).
In Step 7, heuristic importance does not update Prk(X\E). As Shachter and
Peot [1989] point out, this heuristic importance
function can still lead to a bad approximation of the optimal
importance function. There exist also other algorithms, such as a
combination of self-importance and heuristic importance
[Shachter and Peot1989, Shwe and Cooper1991]. Although some researchers
suggested that this may be a promising direction for work on
sampling algorithms, we have not seen any results that would
follow up on this.
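The periodic Step-7 revision at the heart of SIS can be illustrated with a deliberately tiny, one-variable analogue (the full algorithm revises entire conditional probability tables). The network, numbers, and the clamping of the revised probability away from 0 and 1 are all assumptions made for this sketch.

```python
import random

random.seed(0)

# Hypothetical Rain -> WetGrass network; CPT numbers are illustrative.
P_RAIN = 0.2                  # Pr(Rain = 1)
P_WET = {1: 0.9, 0: 0.1}      # Pr(WetGrass = 1 | Rain)

def self_importance(n_samples, wet_obs=1, update_every=1_000):
    """One-variable analogue of self-importance: sample Rain from an
    importance function q that starts at the prior and is periodically
    revised (Step 7) using the weighted scores collected so far."""
    q = P_RAIN                # current sampling probability for Rain = 1
    num = den = 0.0
    for i in range(1, n_samples + 1):
        rain = 1 if random.random() < q else 0
        prior = P_RAIN if rain else 1.0 - P_RAIN
        like = P_WET[rain] if wet_obs == 1 else 1.0 - P_WET[rain]
        prop = q if rain else 1.0 - q
        w = prior * like / prop               # importance weight
        num += w * rain
        den += w
        if i % update_every == 0:             # periodic Step-7 revision
            q = min(max(num / den, 0.05), 0.95)   # keep q away from 0 and 1
    return num / den, q

estimate, q_final = self_importance(100_000)
# q drifts from the prior 0.2 towards the posterior 0.18/0.26,
# so later samples come from a distribution closer to the optimal one.
```

Because each sample's weight is computed against the q in force when it was drawn, the weighted estimate remains consistent while the sampling distribution improves.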
A separate group of stochastic sampling methods is formed by
the so-called Markov Chain Monte Carlo (MCMC) methods, which are
divided into Gibbs sampling, Metropolis sampling, and Hybrid Monte
Carlo sampling [Geman and Geman1984, Gilks et al.1996, MacKay1998]. Roughly speaking,
these methods draw random samples from an unknown target
distribution f(X) by biasing the search for this
distribution towards higher probability regions. When applied to
Bayesian networks [Pearl1987, Chavez and Cooper1990], this approach determines
the sampling distribution of a variable from its previous sample
given its Markov blanket [Pearl1988]. This corresponds to
updating Prk(X\E) when sampling every
node. Prk(X\E) will converge to the
optimal importance function for Pr(E) if Pr0(X\E) satisfies some ergodic properties
[York1992]. Since the convergence to the limiting distribution
is very slow and calculating updates of the sampling distribution
is costly, these algorithms are not used in practice as often as
the simple likelihood weighting scheme.
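A minimal sketch of Gibbs sampling in a Bayesian network, using a hypothetical three-node chain Cloudy -> Rain -> WetGrass with WetGrass observed (the network and all CPT numbers are assumptions for this example): each non-evidence node is resampled from its distribution conditional on its Markov blanket.

```python
import random

random.seed(0)

# Hypothetical chain Cloudy -> Rain -> WetGrass, evidence WetGrass = 1;
# the CPT numbers are illustrative only.
P_CLOUDY = 0.5
P_RAIN = {1: 0.8, 0: 0.1}     # Pr(Rain = 1 | Cloudy)
P_WET = {1: 0.9, 0: 0.2}      # Pr(WetGrass = 1 | Rain)

def gibbs(n_samples, burn_in=1_000):
    """Estimate Pr(Rain = 1 | WetGrass = 1) by repeatedly resampling each
    non-evidence node given the current state of its Markov blanket."""
    cloudy = rain = 0
    hits = 0
    for i in range(n_samples + burn_in):
        # Rain's Markov blanket: parent Cloudy and child WetGrass (= 1).
        p1 = P_RAIN[cloudy] * P_WET[1]
        p0 = (1.0 - P_RAIN[cloudy]) * P_WET[0]
        rain = 1 if random.random() < p1 / (p1 + p0) else 0
        # Cloudy's Markov blanket: its only child, Rain.
        c1 = P_CLOUDY * (P_RAIN[1] if rain else 1.0 - P_RAIN[1])
        c0 = (1.0 - P_CLOUDY) * (P_RAIN[0] if rain else 1.0 - P_RAIN[0])
        cloudy = 1 if random.random() < c1 / (c1 + c0) else 0
        if i >= burn_in:
            hits += rain
    return hits / n_samples

estimate = gibbs(50_000)
# Converges to the exact posterior 0.405/0.515 for these CPTs.
```

The burn-in period and the per-node conditional computations hint at the costs noted above: samples are correlated, and each step requires evaluating the full Markov blanket of every node.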
There are also other simulation algorithms, such as the bounded-variance algorithm [Dagum and Luby1997] and the AA algorithm [Dagum et al.1995], which are essentially based on the LW algorithm and the Stopping-Rule Theorem [Dagum et al.1995]. Cano et al. [1996] proposed another importance sampling algorithm that performed somewhat better than LW in cases with extreme probability distributions, but, as the authors state, in general cases it "produced similar results to the likelihood weighting algorithm." Hernandez et al. [1998] also applied importance sampling and reported a moderate improvement over likelihood weighting.
 
 
 
 
