A Generic Importance Sampling Algorithm for Bayesian Networks

In the following discussion, all random variables used are
multiple-valued, discrete variables. Capital letters, such as *A*,
*B*, or *C*, denote random variables. Bold capital letters, such
as , , or , denote sets of variables.
Bold capital letter will usually be used to denote the
set of evidence variables. Lower case letters *a*, *b*, *c* denote
particular instantiations of variables *A*, *B*, and *C*
respectively. Bold lower case letters, such as , , , denote particular instantiations of sets , , and respectively. Bold lower case letter
, in particular, will be used to denote the observations,
i.e., instantiations of the set of evidence variables .
Anc(*A*) denotes the set of ancestors of node *A*.
Pa(*A*)
denotes the set of parents (direct ancestors) of node *A*.
pa(*A*) denotes a particular instantiation of
Pa(*A*).
\ denotes set difference.
Pa(**A**) denotes that we
use the extended vertical bar to indicate substitution of
**e** for
**E** in
**A**.

We know that the joint probability distribution over all variables of a Bayesian network model, Pr(), is the product of the probability distributions over each of the nodes conditional on their parents, i.e.,

In order to calculate Pr( = ), we need to sum over all Pr(\, = ). We can see that Equation 6 is almost identical to Equation 1 except that integration is replaced by summation and the domain is replaced by \. The theoretical results derived for the importance sampling that we reviewed in the previous section can thus be directly applied to computing probabilities in Bayesian networks.While there has been previous work on importance sampling-based algorithms for Bayesian networks, we will postpone the discussion of this work until the next section. Here we will present a generic stochastic sampling algorithm that will help us in both reviewing the prior work and in presenting our algorithm.

The posterior probability Pr(|) can be obtained by first computing Pr(,) and Pr() and then combining these based on the definition of conditional probability

In order to increase the accuracy of results of importance sampling in computing the posterior probabilities over different network variables given evidence, we should in general use different importance functions for Pr(,) and for Pr(). Doing so increases the computation time only linearly while the gain in accuracy may be significant given that obtaining a desired accuracy is exponential in nature. Very often, it is a common practice to use the same importance function (usually for Pr()) to sample both probabilities. If the difference between the optimal importance functions for these two quantities is large, the performance may deteriorate significantly. Although () and () are unbiased estimators according to Property 1 (Section 2.1), (|) obtained by means of Equation 7 is not an unbiased estimator. However, as the number of samples increases, the bias decreases and can be ignored altogether when the sample size is large enough [Fishman1995].
Figure 1 presents a generic stochastic sampling
algorithm that captures most of the existing sampling algorithms.
Without the loss of generality, we restrict ourselves in our
description to so-called *forward sampling*, i.e., generation
of samples in the topological order of the nodes in the network.
The forward sampling order is accomplished by the initialization
performed in Step 1, where parents of each node are placed before
the node itself. In forward sampling, Step 8 of the algorithm, the
actual generation of samples, works as follows. (*i*) each
evidence node is instantiated to its observed state and is further
omitted from sample generation; (*ii*) each root node is
randomly instantiated to one of its possible states, according to
the importance prior probability of this node, which can be
derived from
Pr^{k}(\); (*iii*) each
node whose parents are instantiated is randomly instantiated to
one of its possible states, according to the importance
conditional probability distribution of this node given the values
of the parents, which can also be derived from
Pr^{k}(\); (*iv*) this procedure is followed until
all nodes are instantiated. A complete instantiation
of the network based on this method is one sample of the joint
importance probability distribution
Pr^{k}(\) over all variables of the network. The scoring of Step 10
amounts to calculating
Pr(,)/Pr^{k}(), as required by Equation 2. The ratio
between the total score sum and the number of samples is an
unbiased estimator of
Pr(). In Step 10, if we also
count the score sum under the condition
= , i.e.,
that some unobserved variables have the values , the ratio between this score sum and the number of samples is
an unbiased estimator of
Pr(,).

Most existing algorithms focus on the posterior probability
distributions of individual nodes. As we mentioned above, for the
sake of efficiency they count the score sum corresponding to
Pr(*A* = *a*,),
*A* \, and record
it in an score array for node *A*. Each entry of this array
corresponds to a specified state of *A*. This method introduces
additional variance, as opposed to using the importance function
derived from
Pr^{k}(\) to sample
Pr(*A* = *a*,), *A*
\,
directly.