A Generic Importance Sampling Algorithm for Bayesian Networks

In the following discussion, all random variables are multiple-valued, discrete variables. Capital letters, such as A, B, or C, denote random variables. Bold capital letters, such as $\mathbf{A}$, $\mathbf{B}$, or $\mathbf{C}$, denote sets of variables. The bold capital letter $\mathbf{E}$ will usually denote the set of evidence variables. Lower case letters a, b, c denote particular instantiations of the variables A, B, and C, respectively. Bold lower case letters, such as $\mathbf{a}$, $\mathbf{b}$, $\mathbf{c}$, denote particular instantiations of the sets $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$, respectively. The bold lower case letter $\mathbf{e}$, in particular, will denote the observations, i.e., the instantiation of the set of evidence variables $\mathbf{E}$. Anc(A) denotes the set of ancestors of node A. Pa(A) denotes the set of parents (direct ancestors) of node A. pa(A) denotes a particular instantiation of Pa(A). \ denotes set difference. The extended vertical bar in $\left.\mathrm{Pa}(\mathbf{A})\right|_{\mathbf{E}=\mathbf{e}}$ indicates substitution of the observed values $\mathbf{e}$ for the evidence variables $\mathbf{E}$ wherever they appear in Pa($\mathbf{A}$).

We know that the joint probability distribution over all variables of a Bayesian network model, Pr($ \bf X$), is the product of the probability distributions over each of the nodes conditional on their parents, i.e.,

$$\Pr(\mathbf{X}) = \prod_{i=1}^{n} \Pr(X_i \mid \mathrm{Pa}(X_i))\,. \qquad (5)$$
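Equation 5 can be illustrated concretely. The following sketch uses a hypothetical two-node network A → B, with invented CPT values, to show the joint distribution as a product of per-node conditional distributions:

```python
# A minimal sketch of Equation 5 on a hypothetical two-node network
# A -> B; all names and CPT values are invented for illustration.

# Pr(A): prior of the root node A over states 0/1
pr_A = {0: 0.6, 1: 0.4}
# Pr(B | A): conditional distribution of B given its parent A
pr_B_given_A = {0: {0: 0.9, 1: 0.1},
                1: {0: 0.3, 1: 0.7}}

def joint(a, b):
    """Pr(A=a, B=b) as the product of per-node conditionals."""
    return pr_A[a] * pr_B_given_A[a][b]

# The joint distribution sums to 1 over all instantiations.
total = sum(joint(a, b) for a in (0, 1) for b in (0, 1))
```

In a larger network the product simply gains one factor per node, each conditioned on that node's parents.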

In order to calculate Pr($\mathbf{E} = \mathbf{e}$), we need to sum Pr($\mathbf{X}\backslash\mathbf{E}$, $\mathbf{E} = \mathbf{e}$) over all instantiations of $\mathbf{X}\backslash\mathbf{E}$:

$$\Pr(\mathbf{E} = \mathbf{e}) = \sum_{\mathbf{X}\backslash\mathbf{E}} \Pr(\mathbf{X}\backslash\mathbf{E},\, \mathbf{E} = \mathbf{e})\,. \qquad (6)$$

We can see that Equation 6 is almost identical to Equation 1, except that integration is replaced by summation and the domain $\Omega$ is replaced by $\mathbf{X}\backslash\mathbf{E}$. The theoretical results for importance sampling that we reviewed in the previous section can thus be directly applied to computing probabilities in Bayesian networks.
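As a concrete instance of Equation 6, the sketch below reuses the same hypothetical A → B network, treats B as the evidence variable with observed state b = 1, and sums the joint distribution over the single unobserved variable A:

```python
# Sketch of Equation 6 on a hypothetical A -> B network: treat B as
# evidence with observed state b = 1 and sum the joint distribution
# over the unobserved variable A. CPT values are invented.
pr_A = {0: 0.6, 1: 0.4}
pr_B_given_A = {0: {0: 0.9, 1: 0.1},
                1: {0: 0.3, 1: 0.7}}

def prob_evidence(b):
    """Pr(B=b) = sum over a of Pr(A=a) * Pr(B=b | A=a)."""
    return sum(pr_A[a] * pr_B_given_A[a][b] for a in (0, 1))

p_e = prob_evidence(1)   # exact summation over X \ E
```

For this toy network the sum has only two terms; in general it is exponential in the number of unobserved variables, which is exactly why sampling-based approximations are of interest.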

While there has been previous work on importance sampling-based algorithms for Bayesian networks, we will postpone the discussion of this work until the next section. Here we will present a generic stochastic sampling algorithm that will help us in both reviewing the prior work and in presenting our algorithm.

Figure 1: A generic importance sampling algorithm. [Algorithm listing not reproduced; its final step normalizes the score arrays for every node.]

The posterior probability Pr($\mathbf{a}$ | $\mathbf{e}$) can be obtained by first computing Pr($\mathbf{a}$, $\mathbf{e}$) and Pr($\mathbf{e}$) and then combining them according to the definition of conditional probability:

$$\Pr(\mathbf{a} \mid \mathbf{e}) = \frac{\Pr(\mathbf{a}, \mathbf{e})}{\Pr(\mathbf{e})}\,. \qquad (7)$$

In order to increase the accuracy of importance sampling in computing the posterior probabilities over different network variables given evidence, we should in general use different importance functions for Pr($\mathbf{a}$, $\mathbf{e}$) and for Pr($\mathbf{e}$). Doing so increases the computation time only linearly, while the gain in accuracy may be significant, given how quickly the cost of sampling grows with the desired accuracy. Nevertheless, it is common practice to sample both probabilities with the same importance function (usually the one designed for Pr($\mathbf{e}$)). If the difference between the optimal importance functions for these two quantities is large, the performance may deteriorate significantly. Although $\widehat{\Pr}(\mathbf{a}, \mathbf{e})$ and $\widehat{\Pr}(\mathbf{e})$ are unbiased estimators according to Property 1 (Section 2.1), the estimator $\widehat{\Pr}(\mathbf{a} \mid \mathbf{e})$ obtained by means of Equation 7 is not unbiased. However, as the number of samples increases, the bias decreases and can be ignored altogether when the sample size is large enough [Fishman1995].
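The ratio construction of Equation 7, applied to sample-based estimates, can be sketched as follows. The scores and the event indicator below are invented for illustration; each score stands for $\Pr(\mathbf{s}_i, \mathbf{e}) / \Pr_k(\mathbf{s}_i)$:

```python
# Sketch of Equation 7 applied to importance-sampling estimates:
# hat-Pr(a, e) and hat-Pr(e) are each unbiased sample averages, but
# their ratio is only asymptotically unbiased. All values invented.
scores = [0.8, 0.1, 0.8, 0.1, 0.8]          # w_i = Pr(s_i, e) / Pr_k(s_i)
a_holds = [True, False, True, True, False]  # whether A = a in sample s_i

n = len(scores)
pr_e_hat = sum(scores) / n                                   # hat-Pr(e)
pr_ae_hat = sum(w for w, ok in zip(scores, a_holds) if ok) / n  # hat-Pr(a, e)
pr_a_given_e_hat = pr_ae_hat / pr_e_hat                      # Equation 7
```

Note that the sample size n cancels in the ratio: the estimator reduces to the score sum over samples where A = a divided by the total score sum.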

Figure 1 presents a generic stochastic sampling algorithm that captures most of the existing sampling algorithms. Without loss of generality, we restrict our description to so-called forward sampling, i.e., generation of samples in the topological order of the nodes in the network. The forward sampling order is accomplished by the initialization performed in Step 1, where the parents of each node are placed before the node itself. In forward sampling, Step 8 of the algorithm, the actual generation of samples, works as follows: (i) each evidence node is instantiated to its observed state and is excluded from further sample generation; (ii) each root node is randomly instantiated to one of its possible states, according to the importance prior probability of this node, which can be derived from $\Pr_k(\mathbf{X}\backslash\mathbf{E})$; (iii) each node whose parents are instantiated is randomly instantiated to one of its possible states, according to the importance conditional probability distribution of this node given the values of its parents, which can also be derived from $\Pr_k(\mathbf{X}\backslash\mathbf{E})$; (iv) this procedure is repeated until all nodes are instantiated. A complete instantiation $\mathbf{s}_i$ of the network obtained in this way is one sample from the joint importance probability distribution $\Pr_k(\mathbf{X}\backslash\mathbf{E})$ over all variables of the network. The scoring of Step 10 amounts to calculating $\Pr(\mathbf{s}_i, \mathbf{e}) / \Pr_k(\mathbf{s}_i)$, as required by Equation 2. The ratio between the total score sum and the number of samples is an unbiased estimator of Pr($\mathbf{e}$). If in Step 10 we also accumulate the score sum under the condition $\mathbf{A} = \mathbf{a}$, i.e., that some unobserved variables $\mathbf{A}$ take the values $\mathbf{a}$, the ratio between this score sum and the number of samples is an unbiased estimator of Pr($\mathbf{a}$, $\mathbf{e}$).
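The forward-sampling loop described above can be sketched on a small hypothetical chain A → B → C with evidence C = 1. All CPT values are invented, and the importance function $\Pr_k$ is taken to be the network's own prior restricted to the unobserved nodes (i.e., likelihood weighting), which is one concrete choice among many:

```python
import random

# Sketch of the generic forward-sampling loop of Figure 1 on a
# hypothetical chain A -> B -> C with evidence C = 1. CPT values are
# invented; the importance function is the prior restricted to the
# unobserved nodes (likelihood weighting), one choice among many.
cpt = {
    "A": {(): {0: 0.6, 1: 0.4}},
    "B": {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.3, 1: 0.7}},
    "C": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}},
}
parents = {"A": (), "B": ("A",), "C": ("B",)}
order = ["A", "B", "C"]          # topological order (Step 1)
evidence = {"C": 1}

def sample_once(rng):
    """Generate one sample s_i and its score Pr(s_i, e) / Pr_k(s_i)."""
    state, score = {}, 1.0
    for node in order:
        pa = tuple(state[p] for p in parents[node])
        dist = cpt[node][pa]
        if node in evidence:
            # (i) evidence nodes are clamped, not sampled; their
            # conditional probability enters the score.
            state[node] = evidence[node]
            score *= dist[evidence[node]]
        else:
            # (ii)/(iii) sample the node from the importance
            # distribution (here: its own conditional distribution,
            # so these factors cancel out of the score).
            r, acc = rng.random(), 0.0
            for value, p in dist.items():
                acc += p
                if r <= acc:
                    break
            state[node] = value
    return state, score

rng = random.Random(0)
n = 50_000
total = sum(sample_once(rng)[1] for _ in range(n))
p_e_hat = total / n   # unbiased estimator of Pr(C = 1); exact: 0.38
```

Because the importance function here equals the sampled conditionals, each score reduces to the probability of the clamped evidence given its sampled parents; a better-tailored $\Pr_k$ would change both the sampling distributions and the score denominator.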

Most existing algorithms focus on the posterior probability distributions of individual nodes. As we mentioned above, for the sake of efficiency they accumulate the score sum corresponding to Pr($A = a$, $\mathbf{e}$), $A \in \mathbf{X}\backslash\mathbf{E}$, and record it in a score array for node A, with one entry per state of A. This method introduces additional variance, as opposed to using an importance function derived from $\Pr_k(\mathbf{X}\backslash\mathbf{E})$ to sample Pr($A = a$, $\mathbf{e}$), $A \in \mathbf{X}\backslash\mathbf{E}$, directly.
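The score-array bookkeeping can be sketched as follows. The samples and scores below are hypothetical; the point is that a single set of samples updates the array of every unobserved node at once, and normalizing each array (the final step of Figure 1) yields the estimated posterior marginals:

```python
# Sketch of per-node score arrays: each sample's score is added to
# the entry of every unobserved node's array that matches the
# sampled state. Samples and scores are invented for illustration.
samples = [({"A": 0, "B": 1}, 0.8),
           ({"A": 1, "B": 1}, 0.1),
           ({"A": 0, "B": 0}, 0.8)]

score_array = {"A": {0: 0.0, 1: 0.0}, "B": {0: 0.0, 1: 0.0}}
total = 0.0
for state, score in samples:
    total += score
    for node, value in state.items():
        score_array[node][value] += score

# Normalizing each array yields the estimated posterior marginals;
# the common factor Pr(e)-hat cancels in the normalization.
posterior = {node: {v: s / total for v, s in arr.items()}
             for node, arr in score_array.items()}
```

This is cheap because one pass over the samples serves all nodes, but, as noted above, it carries more variance than sampling each Pr($A = a$, $\mathbf{e}$) with its own importance function.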

Jian Cheng 2000-10-01