The dimensionality of the problem of Bayesian network inference is equal to the number of variables in a network, which in the networks considered in this paper can be very high. As a result, the learning space of the optimal importance function is very large. The choice of the initial importance function $\Pr^0(\mathbf{X} \setminus \mathbf{E})$ is an important factor affecting the learning: an initial importance function that is close to the optimal importance function can greatly speed up convergence. In this section, we present two heuristics that help to achieve this goal.
Due to their explicit encoding of the structure of a decomposable joint probability distribution, Bayesian networks offer computational advantages compared to finite-dimensional integrals. A possible first approximation of the optimal importance function is the prior probability distribution over the network variables, $\Pr(\mathbf{X})$. We propose an improvement on this initialization. We know that the effect of evidence nodes on a node is attenuated as the path length from that node to the evidence nodes increases [Henrion 1989], and that the most affected nodes are the direct ancestors of the evidence nodes. In our experience, initializing the ICPT tables of the parents of the evidence nodes to uniform distributions improves the convergence rate. Furthermore, the CPT tables of the parents of an evidence node $E$ may not be favorable to the observed state $e$ if the probability of $E = e$ without any condition is less than a small value, such as $\Pr(E = e) < 1/(2 \cdot n_E)$, where $n_E$ is the number of outcomes of node $E$. Based on this observation, in our experiments we change the CPT tables of the parents of an evidence node $E$ to uniform distributions only when $\Pr(E = e) < 1/(2 \cdot n_E)$; otherwise we leave them unchanged. This kind of initialization requires knowledge of $\Pr(E = e)$, the marginal probability without evidence. Probabilistic logic sampling [Henrion 1988] enhanced by Latin hypercube sampling [Cheng and Druzdzel 2000b] or quasi-Monte Carlo methods [Cheng and Druzdzel 2000a] will produce a very good estimate of $\Pr(E = e)$. This is a one-time effort that can be made at the model building stage and is worth pursuing to any desired precision.
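The first heuristic can be illustrated with a minimal sketch. The data layout (dictionaries of row-wise tables), the function names `uniform_like` and `init_icpts`, and the choice to skip parents that are themselves evidence nodes are all assumptions made for illustration and are not taken from the paper. The estimates of $\Pr(E = e)$ are assumed to have been computed beforehand, for example by probabilistic logic sampling.

```python
from copy import deepcopy


def uniform_like(table):
    """Return a table of the same shape whose rows are uniform distributions."""
    return [[1.0 / len(row)] * len(row) for row in table]


def init_icpts(cpts, parents, evidence, marginal_of_evidence):
    """Build initial ICPTs from the CPTs (hypothetical helper).

    cpts: dict node -> list of rows; each row is Pr(node | one parent configuration).
    parents: dict node -> list of parent node names.
    evidence: dict evidence node -> index of the observed state.
    marginal_of_evidence: dict evidence node -> estimate of Pr(E = e) without evidence.
    """
    icpts = deepcopy(cpts)  # start from the prior; leave the input untouched
    for e_node in evidence:
        n_e = len(cpts[e_node][0])  # number of outcomes of E
        # Flatten the parents only when the observed state is a priori unlikely.
        if marginal_of_evidence[e_node] < 1.0 / (2 * n_e):
            for p in parents[e_node]:
                # Assumption: parents that are themselves observed are not sampled.
                if p not in evidence:
                    icpts[p] = uniform_like(cpts[p])
    return icpts


# Two-node example A -> E, with E observed in its second state.
cpts = {"A": [[0.99, 0.01]], "E": [[0.99, 0.01], [0.1, 0.9]]}
parents = {"A": [], "E": ["A"]}
evidence = {"E": 1}
# Pr(E = 1) = 0.99 * 0.01 + 0.01 * 0.9 = 0.0189, below 1 / (2 * 2) = 0.25.
icpts = init_icpts(cpts, parents, evidence, {"E": 0.0189})
```

In this example the table of A becomes uniform because the observed state of E is unlikely a priori; with a marginal of 0.25 or more, the prior table of A would be kept.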
Another serious problem related to sampling is posed by extremely small probabilities. Suppose there exists a root node with a state $s$ that has the prior probability $\Pr(s) = 0.0001$. Let the posterior probability of this state given evidence be $\Pr(s \mid \mathbf{E}) = 0.8$. A simple calculation shows that if we update the importance function every 1,000 samples, we can expect to hit $s$ only once every 10 updates. Thus the convergence rate for $s$ will be very slow. We can overcome this problem by setting a threshold $\theta$ and replacing every probability $p < \theta$ in the network by $\theta$. At the same time, we subtract $(\theta - p)$ from the largest probability in the same conditional probability distribution. For example, the value $\theta = 10/l$, where $l$ is the updating interval, will allow us to sample 10 times more often in the first stage of the algorithm. If this state turns out to be more likely (having a large weight), we can increase its probability even more in order to converge to the correct answer faster. Considering that we should avoid $f(\mathbf{X}) \ll |g(\mathbf{X}) - I \cdot f(\mathbf{X})|$ in an unimportant region, as discussed in Section 2.1, we need to make this threshold larger. We have found that the convergence rate is quite sensitive to this threshold. Based on our empirical tests, we suggest using $\theta = 0.04$ in networks whose maximum number of outcomes per node does not exceed five. A smaller threshold might lead to fast convergence in some cases but slow convergence in others. If one threshold does not work, changing it for the specific network will usually improve the convergence rate.
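The threshold adjustment can be sketched as follows. The threshold is called `theta` in the code, and the function name `apply_threshold` is hypothetical. When several entries fall below the threshold, the text does not say which entry pays for each of them; the sketch assumes that all of the added mass is taken from the entry that was largest before the adjustment.

```python
def apply_threshold(dist, theta):
    """Raise every probability below theta to theta (hypothetical helper).

    The mass added to small entries is subtracted from the entry that was
    largest in the original distribution, so the result still sums to one.
    Assumes theta is small enough that the largest entry stays above it.
    """
    out = list(dist)
    largest = max(range(len(dist)), key=dist.__getitem__)
    for i, p in enumerate(dist):
        if i != largest and p < theta:
            out[largest] -= theta - p
            out[i] = theta
    return out


# Root node from the example in the text: Pr(s) = 0.0001, updating interval l = 1000.
l = 1000
prior = [0.0001, 0.9999]
hits_before = l * prior[0]            # about 0.1 expected hits of s per update
adjusted = apply_threshold(prior, 0.04)
hits_after = l * adjusted[0]          # about 40 expected hits of s per update
```

With the suggested threshold of 0.04, the state s is expected to be sampled about 40 times per update instead of once every 10 updates, which gives the learning step enough samples to notice that s carries a large weight.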