In the following discussion, all random variables used are multiple-valued, discrete variables. Capital letters, such as A, B, or C, denote random variables. Bold capital letters, such as , , or , denote sets of variables. Bold capital letter will usually be used to denote the set of evidence variables. Lower case letters a, b, c denote particular instantiations of variables A, B, and C respectively. Bold lower case letters, such as , , , denote particular instantiations of sets , , and respectively. Bold lower case letter , in particular, will be used to denote the observations, i.e., instantiations of the set of evidence variables . Anc(A) denotes the set of ancestors of node A. Pa(A) denotes the set of parents (direct ancestors) of node A. pa(A) denotes a particular instantiation of Pa(A). \ denotes set difference. Pa(A) denotes that we use the extended vertical bar to indicate substitution of e for E in A.
We know that the joint probability distribution over all variables of a Bayesian network model, Pr(), is the product of the probability distributions over each of the nodes conditional on their parents, i.e.,
While there has been previous work on importance sampling-based algorithms for Bayesian networks, we will postpone the discussion of this work until the next section. Here we will present a generic stochastic sampling algorithm that will help us in both reviewing the prior work and in presenting our algorithm.
The posterior probability Pr(|) can be obtained by first computing Pr(,) and Pr() and then combining these based on the definition of conditional probability
Figure 1 presents a generic stochastic sampling algorithm that captures most of the existing sampling algorithms. Without the loss of generality, we restrict ourselves in our description to so-called forward sampling, i.e., generation of samples in the topological order of the nodes in the network. The forward sampling order is accomplished by the initialization performed in Step 1, where parents of each node are placed before the node itself. In forward sampling, Step 8 of the algorithm, the actual generation of samples, works as follows. (i) each evidence node is instantiated to its observed state and is further omitted from sample generation; (ii) each root node is randomly instantiated to one of its possible states, according to the importance prior probability of this node, which can be derived from Prk(\); (iii) each node whose parents are instantiated is randomly instantiated to one of its possible states, according to the importance conditional probability distribution of this node given the values of the parents, which can also be derived from Prk(\); (iv) this procedure is followed until all nodes are instantiated. A complete instantiation of the network based on this method is one sample of the joint importance probability distribution Prk(\) over all variables of the network. The scoring of Step 10 amounts to calculating Pr(,)/Prk(), as required by Equation 2. The ratio between the total score sum and the number of samples is an unbiased estimator of Pr(). In Step 10, if we also count the score sum under the condition = , i.e., that some unobserved variables have the values , the ratio between this score sum and the number of samples is an unbiased estimator of Pr(,).
Most existing algorithms focus on the posterior probability distributions of individual nodes. As we mentioned above, for the sake of efficiency they count the score sum corresponding to Pr(A = a,), A \, and record it in an score array for node A. Each entry of this array corresponds to a specified state of A. This method introduces additional variance, as opposed to using the importance function derived from Prk(\) to sample Pr(A = a,), A \, directly.