Introduction

Bayesian networks [Pearl1988] are increasingly popular tools
for modeling uncertainty in intelligent systems. With practical
models reaching the size of several hundred variables (e.g.,
[Pradhan *et al.*1994,Conati *et al.*1997]), it becomes increasingly important
to address the problem of feasibility of probabilistic inference.
Even though several ingenious exact algorithms have been proposed,
in very large models they all stumble on the theoretically
demonstrated NP-hardness of inference [Cooper1990]. The
significance of this result can be observed in practice -- exact
algorithms applied to large, densely connected practical networks
require either a prohibitive amount of memory or a prohibitive
amount of computation and are unable to complete. While
approximating inference to any desired precision has been shown to
be NP-hard as well [Dagum and Luby1993], for very complex networks it is
the only alternative that will produce any result at all.
Furthermore, while obtaining the result is crucial in all
applications, precision guarantees may not be critical for some
types of problems and can be traded off against the speed of
computation.

A prominent subclass of approximate algorithms is the family of
stochastic sampling algorithms (also called *stochastic
simulation* or *Monte Carlo* algorithms). The precision
obtained by stochastic sampling generally increases with the
number of samples generated and is largely unaffected by the
network size. Execution time is fairly independent of the topology
of the network and is linear in the number of samples. Computation
can be interrupted at any time, giving the algorithms an anytime
property that is important in time-critical applications.
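These properties can be illustrated with a minimal sketch of forward sampling on a hypothetical two-node binary network A → B (the network, its parameters, and the function names below are illustrative, not taken from the paper): cost is linear in the number of samples, and an estimate can be read off whenever computation is interrupted.

```python
import random

# Hypothetical two-node network A -> B with binary variables.
P_A = 0.3                       # P(A = 1)
P_B_GIVEN_A = {1: 0.8, 0: 0.1}  # P(B = 1 | A = a)

def forward_sample():
    """Sample the network in topological order (A before B)."""
    a = 1 if random.random() < P_A else 0
    b = 1 if random.random() < P_B_GIVEN_A[a] else 0
    return a, b

def estimate_p_b(n_samples):
    """Monte Carlo estimate of P(B = 1); cost is linear in n_samples,
    and a partial sum yields a valid (anytime) estimate."""
    hits = sum(forward_sample()[1] for _ in range(n_samples))
    return hits / n_samples

random.seed(0)
# Exact value: P(B = 1) = 0.3 * 0.8 + 0.7 * 0.1 = 0.31
print(estimate_p_b(100_000))
```

The estimate converges toward 0.31 as the sample count grows, independently of how the (here trivial) network is structured.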

While stochastic sampling performs very well in predictive
inference, diagnostic reasoning, i.e., reasoning from observed
evidence nodes to their ancestors in the network, often exhibits
poor convergence. When the number of observations increases,
especially if these observations are a priori unlikely, stochastic
sampling often fails to converge to reasonable estimates of the
posterior probabilities. Although this problem has been known
since the very first sampling algorithm was proposed by Henrion
[1988], little has been done to address it
effectively. Furthermore, the various sampling algorithms proposed
have been tested on small and simple networks, or on networks with
special topology, without extremely unlikely evidence, and the
practical significance of this problem has been
underestimated. Given a number of samples that is feasible in
real time on today's hardware, say 10^{6}
samples, the behavior of a stochastic sampling algorithm will be
drastically different for different size networks. While in a
network consisting of 10 nodes and a few observations, it may be
possible to converge to exact probabilities, in very large
networks only a negligibly small fraction of the total sample
space will be probed. One of the practical Bayesian network models
that we used in our tests, a subset of the CPCS network
[Pradhan *et al.*1994], consists of 179 nodes. Its total sample
space is larger than 10^{61}. With 10^{6} samples, we can sample
only a 10^{-55} fraction of the sample space.
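The fraction follows directly from the numbers quoted above; a quick arithmetic check:

```python
# Numbers from the text: the CPCS subset's sample space exceeds 10^61,
# and roughly 10^6 samples are feasible on today's hardware.
sample_space = 10**61
n_samples = 10**6

# Fraction of the sample space that 10^6 samples can probe.
fraction = n_samples / sample_space
print(fraction)  # 1e-55
```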

We believe that it is crucial (1) to study the feasibility and
convergence properties of sampling algorithms on very large
practical networks, and (2) to develop sampling algorithms that
will show good convergence under extreme, yet practical
conditions, such as evidential reasoning given extremely unlikely
evidence. After all, small networks can be updated using any of
the existing exact algorithms -- it is precisely the very large
networks where stochastic sampling can be most useful. As to the
likelihood of evidence, we know that stochastic sampling will
generally perform well when it is high [Henrion1988]. So,
it is important to look at those cases in which evidence is very
unlikely. In this paper, we test two existing state-of-the-art
stochastic sampling algorithms for Bayesian networks, likelihood
weighting [Fung and Chang1989,Shachter and Peot1989] and self-importance
sampling [Shachter and Peot1989], on a subset of the CPCS
network with extremely unlikely evidence. We show that they both
exhibit similarly poor convergence rates. We propose a new
sampling algorithm, which we call *adaptive importance
sampling* for Bayesian networks (AIS-BN), suitable for
evidential reasoning in large multiply-connected Bayesian
networks. The AIS-BN algorithm is based on importance sampling,
which is a widely applied method for variance reduction in
simulation that has also been applied in Bayesian networks (e.g.,
[Shachter and Peot1989]). We demonstrate empirically on three large
practical Bayesian network models that the AIS-BN algorithm
consistently outperforms the other two algorithms. In the majority
of the test cases, it achieved over two orders of magnitude
improvement in convergence. Improvement in speed given a desired
precision is even more dramatic, although we are unable to report
numerical results here, as the other algorithms never achieved the
precision reached even by the first few iterations of the AIS-BN
algorithm. The main sources of improvement are: (1) two heuristics
for the initialization of the importance function that are based
on the theoretical properties of importance sampling in
finite-dimensional integrals and the structural advantages of
Bayesian networks, (2) a smooth learning method for updating the
importance function, and (3) a dynamic weighting function for
combining samples from different stages of the algorithm. We study
the value of the two heuristics used in the AIS-BN algorithm: (1)
initialization of the probability distributions of parents of
evidence nodes to the uniform distribution and (2) adjusting very
small probabilities in the conditional probability tables, and
show that they both play an important role in the AIS-BN algorithm
but only a moderate role in the existing algorithms.
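The common core of these algorithms is the importance-sampling estimate itself: samples are drawn from a proposal distribution Q and reweighted by P/Q, with the algorithms differing in how Q is chosen and updated. A minimal sketch on the same hypothetical two-node network A → B as before, with evidence B = 1 clamped and A sampled from a proposal (all names and parameters here are illustrative, not the paper's implementation):

```python
import random

# Hypothetical network A -> B, binary variables; evidence: B = 1.
P_A = 0.3                       # prior P(A = 1)
P_B_GIVEN_A = {1: 0.8, 0: 0.1}  # P(B = 1 | A = a)

def importance_estimate(q_a, n_samples):
    """Estimate P(A = 1 | B = 1) by sampling A from a proposal with
    Q(A = 1) = q_a. Each sample is weighted by P(a, e) / Q(a), and the
    posterior is the normalized weighted frequency (a self-normalized
    importance-sampling estimate)."""
    num = den = 0.0
    for _ in range(n_samples):
        a = 1 if random.random() < q_a else 0
        p_joint = (P_A if a == 1 else 1 - P_A) * P_B_GIVEN_A[a]
        q = q_a if a == 1 else 1 - q_a
        w = p_joint / q
        num += w * a
        den += w
    return num / den

random.seed(0)
# Exact posterior: P(A=1 | B=1) = 0.24 / 0.31 ≈ 0.774
print(importance_estimate(0.5, 100_000))
```

The closer Q is to the true posterior, the lower the variance of the weights; adapting Q toward that posterior during sampling is the idea behind the AIS-BN algorithm's learning stage.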

The remainder of this paper is structured as follows. Section 2 first gives a general introduction to importance sampling in the domain of finite-dimensional integrals, where it was originally proposed. We show how importance sampling can be used to compute probabilities in Bayesian networks and how it can draw additional benefits from the graphical structure of the network. Then we develop a generalized sampling scheme that will aid us in reviewing the previously proposed sampling algorithms and in describing the AIS-BN algorithm. Section 3 describes the AIS-BN algorithm. We propose two heuristics for initialization of the importance function and discuss their theoretical foundations. We describe a smooth learning method for the importance function and a dynamic weighting function for combining samples from different stages of the algorithm. Section 4 describes the empirical evaluation of the AIS-BN algorithm. Finally, Section 5 suggests several possible improvements to the AIS-BN algorithm, possible applications of our learning scheme, and directions for future work.