In this section we recast the robust inference problem as a parameter estimation problem. Consider a Quasi-Bayesian network and the transformed Bayesian network with artificial variables $\{z'_i\}$. Each artificial variable $z'_i$ takes values in $\{1, 2, \ldots, |z'_i|\}$. Assume $z'_i$ is a random variable with distribution $\theta_{ij} = p(z'_i = j)$, and call $\Theta$ the vector of all $\theta_{ij}$.
Suppose $x_q$ is a queried variable and $e$ denotes the evidence. The objective is to find the maximum (and minimum) over $\Theta$ of:
$$
p(x_q = a \mid e) = \frac{p(x_q = a, e)}{p(e)} ,
$$
where both numerator and denominator are functions of $\Theta$.
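As a small worked identity (using only the definition $\theta_{ij} = p(z'_i = j)$ above; it is not stated explicitly in this section), conditioning on any single artificial variable $z'_i$ shows that the numerator and denominator are each linear in that variable's parameters:
$$
p(x_q = a, e) = \sum_{j} p(x_q = a, e \mid z'_i = j)\,\theta_{ij},
\qquad
p(e) = \sum_{j} p(e \mid z'_i = j)\,\theta_{ij} .
$$
This linearity is what makes the gradient computed below easy to obtain with standard inference in the transformed network.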
Optimization through gradient descent techniques, such as the conjugate gradient algorithm, has been used for learning in Bayesian networks. These algorithms benefit from the fact that the necessary gradients can be computed with standard Bayesian network algorithms [Russell et al. 1995].
To solve the robust inference problem, we must maximize the posterior log-likelihood for $\Theta$ (minimization is accomplished by maximizing the negative log-likelihood):
$$
L(\Theta) = \log p(x_q = a \mid e) = \log p(x_q = a, e) - \log p(e) .
$$
The gradient of $L(\Theta)$ is obtained by computing, for each $\theta_{ij}$:
$$
\frac{\partial L(\Theta)}{\partial \theta_{ij}}
= \frac{\partial \log p(x_q = a, e)}{\partial \theta_{ij}}
- \frac{\partial \log p(e)}{\partial \theta_{ij}}
= \frac{1}{p(x_q = a, e)} \frac{\partial p(x_q = a, e)}{\partial \theta_{ij}}
- \frac{1}{p(e)} \frac{\partial p(e)}{\partial \theta_{ij}} .
$$
Since $p(x_q = a, e)$ and $p(e)$ are linear in $\theta_{ij}$ (as noted above), the partial derivatives are $p(x_q = a, e \mid z'_i = j)$ and $p(e \mid z'_i = j)$ respectively, and Bayes rule gives $p(x_q = a, e \mid z'_i = j)/p(x_q = a, e) = p(z'_i = j \mid x_q = a, e)/\theta_{ij}$ (similarly for $e$ alone), so that
$$
\frac{\partial L(\Theta)}{\partial \theta_{ij}}
= \frac{p(z'_i = j \mid x_q = a, e)}{\theta_{ij}}
- \frac{p(z'_i = j \mid e)}{\theta_{ij}} ,
$$
which can be obtained through standard Bayesian network algorithms using local computations. A conjugate gradient descent can be constructed by selecting an initial value for $\Theta$ and, at each step, normalizing the values of $\Theta$ to ensure they represent proper distributions [Russell et al. 1995].
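As a concrete illustration only (this sketch is not from the original text), the Python function below implements a plain projected gradient step (ascent on $L(\Theta)$; the negative gradient would be used for the minimization). The posterior arrays are assumed to be produced by a standard Bayesian network inference engine run on the transformed network, and the clip-and-renormalize projection is a crude stand-in for the normalization mentioned above; a full conjugate gradient implementation would add conjugate directions and a line search.

```python
import numpy as np

def gradient_step(theta, query_post, evid_post, rate=0.1):
    """One plain gradient ascent step on L(Theta) = log p(x_q = a | e).

    theta[i][j]      -- current estimate of theta_ij = p(z'_i = j)
    query_post[i][j] -- p(z'_i = j | x_q = a, e), from standard BN inference
    evid_post[i][j]  -- p(z'_i = j | e),          from standard BN inference
    Returns the updated parameter vectors, renormalized to proper distributions.
    """
    updated = []
    for t, q, r in zip(theta, query_post, evid_post):
        t = np.asarray(t, dtype=float)
        # dL/dtheta_ij = p(z'_i = j | x_q = a, e)/theta_ij - p(z'_i = j | e)/theta_ij
        grad = (np.asarray(q) - np.asarray(r)) / t
        t = t + rate * grad
        # Crude projection back onto the probability simplex.
        t = np.clip(t, 1e-6, None)
        updated.append(t / t.sum())
    return updated
```

Both posterior vectors must be recomputed in the transformed network after every step, since they depend on the current value of $\Theta$.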
The EM algorithm produces a maximum likelihood estimate by maximizing the log-likelihood expression [Dempster, Laird, & Rubin 1977]. To solve the robust inference problem, we must maximize the posterior log-likelihood $L(\Theta)$. We show how the original EM algorithm can be extended to a Quasi-Bayesian Expectation Maximization (QEM) algorithm and prove its convergence properties.
The algorithm begins by assuming that the artificial variables are actual random quantities with distributions specified by $\Theta$, and that those variables could, in principle, be observed as evidence. An initial estimate $\Theta^0$ is chosen for $\Theta$.
Suppose we had $l$ sets of complete data for the transformed network, i.e., we had observed $l$ trials of all variables in the network, including the artificial variables. The log-likelihood for this complete data would be:
$$
L(\Theta) = \sum_{ijk} l_i(j,k) \log p(x_i = k \mid \mathrm{pa}(x_i) = j),
$$
where the sum runs over all variables $x_i$ of the transformed network (including the artificial variables) and $l_i(j,k)$ denotes the number of trials in which $x_i$ assumes its $k$-th value and the parents of $x_i$ assume their $j$-th configuration.
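For the artificial variables these terms are particularly simple. Under the assumption, implicit in $\theta_{ij} = p(z'_i = j)$, that each $z'_i$ has no parents in the transformed network, the only part of this complete-data log-likelihood that involves $\Theta$ is
$$
\sum_{i} \sum_{j} l_i(j) \log \theta_{ij} ,
$$
where $l_i(j)$ (a count written this way only for illustration) is the number of trials with $z'_i = j$. Since the $z'_i$ are never actually observed, the expectation step below replaces these counts by posterior probabilities computed in the transformed network.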
The first step of the QEM algorithm is to obtain the expected value of the log-likelihood of the complete data, conditional on the evidence and on $x_q = a$, assuming the current estimate $\Theta^k$ (initially $\Theta^0$) is correct:
$$
Q(\Theta \mid \Theta^k) = E\!\left[\, \log p(\tilde{x}, x_q = a \mid e, \Theta) \;\middle|\; x_q = a, e, \Theta^k \,\right],
$$
where $\tilde{x}$ denotes a configuration of the variables outside $\{x_q\} \cup e$ (including the artificial variables).
The second step of the QEM algorithm is to maximize $Q(\Theta \mid \Theta^k)$ with respect to $\Theta$. Only a few terms in the expression for $Q(\Theta \mid \Theta^k)$ are free, since only the $\theta_{ij}$ for the $z'_i$ are estimated. Collecting these terms we obtain:
$$
\sum_{ij} p(z'_i = j \mid x_q = a, e, \Theta^k) \log \theta_{ij} \;-\; \log p(e \mid \Theta),
$$
whose derivative with respect to $\theta_{ij}$ is
$$
\frac{p(z'_i = j \mid x_q = a, e, \Theta^k)}{\theta_{ij}} - \frac{p(z'_i = j \mid e, \Theta)}{\theta_{ij}} ,
$$
which can be obtained through standard Bayesian network algorithms. Note that, evaluated at $\Theta = \Theta^k$, this is exactly the expression used in the previous section to perform gradient descent optimization. Because of the $\log p(e \mid \Theta)$ term, the maximization cannot be performed in closed form; in the QEM algorithm we need only ensure that $Q(\Theta^{k+1} \mid \Theta^k) > Q(\Theta^k \mid \Theta^k)$ to obtain convergence. Set $\Theta^{k+1}$ to the values produced by this (possibly partial) maximization and go to the next iteration.
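Purely as an illustration, here is a minimal Python sketch of one QEM iteration under the formulation above; `posteriors` is a hypothetical callback wrapping standard Bayesian network inference on the transformed network, and the inner loop takes a few gradient steps on the free terms of $Q$, one simple way of pursuing the requirement $Q(\Theta^{k+1} \mid \Theta^k) > Q(\Theta^k \mid \Theta^k)$.

```python
import numpy as np

def qem_iteration(theta, posteriors, rate=0.1, inner_steps=5):
    """One generalized QEM iteration for the artificial-variable parameters.

    theta[i]        -- current distribution of z'_i (1-D array summing to 1)
    posteriors(th)  -- hypothetical callback running standard BN inference with
                       parameters th; returns (q, r), where
                       q[i][j] = p(z'_i = j | x_q = a, e) and
                       r[i][j] = p(z'_i = j | e).
    """
    q, _ = posteriors(theta)            # E-step quantities, fixed at Theta^k
    new_theta = [np.asarray(t, dtype=float).copy() for t in theta]
    for _ in range(inner_steps):        # partial M-step (generalized EM)
        _, r = posteriors(new_theta)    # p(z'_i = j | e) at the current parameters
        for i, t in enumerate(new_theta):
            # Gradient of sum_j q_ij log(theta_ij) - log p(e | Theta) w.r.t. theta_ij.
            grad = (np.asarray(q[i]) - np.asarray(r[i])) / t
            t += rate * grad
            np.clip(t, 1e-6, None, out=t)
            t /= t.sum()                # keep each distribution proper
    return new_theta
```

A careful implementation would explicitly check that $Q$ increased (reducing the step size otherwise), since that is the condition used in the convergence proof below.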
We now prove that each iteration of the QEM algorithm increases the posterior log-likelihood, that is, $L(\Theta^{k+1}) > L(\Theta^k)$.

Proof. We have:
$$
H(\Theta' \mid \Theta) = - \sum_{\tilde{x}} p(\tilde{x} \mid x_q = a, e, \Theta) \, \log p(\tilde{x} \mid x_q = a, e, \Theta'),
$$
$$
Q(\Theta' \mid \Theta) = \sum_{\tilde{x}} p(\tilde{x} \mid x_q = a, e, \Theta) \, \log p(\tilde{x}, x_q = a \mid e, \Theta'),
$$
where the sums run over the configurations $\tilde{x}$ of the variables outside $\{x_q\} \cup e$, so that
$$
L(\Theta') = Q(\Theta' \mid \Theta) + H(\Theta' \mid \Theta),
$$
and, by [Dempster, Laird, & Rubin 1977, Lemma 1], $H(\Theta^{k+1} \mid \Theta^k) \ge H(\Theta^k \mid \Theta^k)$.
Since by construction we have $Q(\Theta^{k+1} \mid \Theta^k) > Q(\Theta^k \mid \Theta^k)$, we obtain:
$$
L(\Theta^{k+1}) = Q(\Theta^{k+1} \mid \Theta^k) + H(\Theta^{k+1} \mid \Theta^k)
> Q(\Theta^k \mid \Theta^k) + H(\Theta^k \mid \Theta^k) = L(\Theta^k),
$$
so the posterior log-likelihood strictly increases at every iteration.
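For completeness, the inequality borrowed from [Dempster, Laird, & Rubin 1977, Lemma 1] is, with the sign convention used here, an instance of Gibbs' inequality. Writing $p_{\Theta}(\tilde{x})$ for $p(\tilde{x} \mid x_q = a, e, \Theta)$ (a shorthand introduced only for this sketch):
$$
H(\Theta' \mid \Theta) - H(\Theta \mid \Theta)
= \sum_{\tilde{x}} p_{\Theta}(\tilde{x}) \log \frac{p_{\Theta}(\tilde{x})}{p_{\Theta'}(\tilde{x})}
\;\ge\; 0,
$$
since the middle expression is a Kullback-Leibler divergence, which is always nonnegative.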