Next: Approximate inferences through Lavine's Up: Robustness Analysis of Bayesian Previous: Complexity and generic approximations

# Approximate inferences through parameter estimation

In this section we recast the robust inference problem as a parameter estimation problem. Consider a Quasi-Bayesian network and the transformed Bayesian network with artificial variables { z'i }. Each artificial variable has values { 1, 2, ..., |z'i| }. Assume z'i is a random variable with distribution thetaij = p(z'i = j). Call Theta the vector of all thetaij.

Suppose xq is a queried variable. The objective is to find:

p(xq = a | e) = maxTheta { sumx is in {e, xq = a} prodi pi{ e, xq = a}}/ { sumx is in e prodi pie }.

This problem is very similar to the problem of estimating the vector of parameters Theta given data e by maximum likelihood. Note that the optimizing vector must have, for each i, thetaij = 1 for some j. The difference between this and maximum likelihood parameter estimation is that the distribution to be optimized involves the posterior distribution instead of the prior. We can use learning techniques for this problem [Buntine1994]. Notice that the optimization procedure has to be repeated for each of the values of the queried variable xq.

Optimization through gradient descent techniques, such as the conjugate gradient algorithm, have been used for learning in Bayesian networks. These algorithms benefit from the fact that the necessary gradient calculations can be computed with standard Bayesian network algorithms [Russell et al. 1995].

To solve the robust inference problem, we must maximize the posterior log-likelihood for Theta (minimization is accomplished by maximizing the negative log-likelihood):

L(Theta) = logp(xq = a | e) = log{ p(xq = a, e) }/{ p(e) } ,

which is:

L(Theta) = logp(xq = a, e) - logp(e).

The gradient of L(Theta) is obtained by computing, for each thetaij:

{ pdL(Theta) }/{ pdthetaij } = { pdlogp(xq = a, e) }/{ pdthetaij } - { pdlogp(e) }/{ pdthetaij } .

Each term in the right-hand side of this expression can be written as { pdlogp(e') }/{ pdthetaij } where e' is some subset of the variables x. We have:

{ pdlogp(e') }/{ pdthetaij } = { pdsumj p(e'|z'i = j) p(z'i = j) }/{ pdthetaij } { 1 }/{ p(e') } .

Since p(z'i = j) = thetaij and this term appears only once in the summation, we have:

{ pdlogp(e') }/{ pdthetaij } = { p(e'|z'i = j) }/{ p(e') } .

and using Bayes rule:

{ pdlogp(e') }/{ pdthetaij } = { p(z'i = j|e') p(e') }/{ p(z'i = j) } { 1 }/{ p(e') } = { p(z'i = j|e') }/{ p(z'i = j) } .

The final expression for the gradient of L(Theta) with respect to thetaij is:

{ pdL(Theta) }/{ pdthetaij } = { p(z'i = j|xq = a, e) }/{ thetaij } - { p(z'i = j|e) }/{ thetaij } ,

which can be obtained through standard Bayesian network algorithms using local computations. A conjugate gradient descent can be constructed by selecting an initial value for Theta and, at each step, normalizing the values of Theta to ensure they represent proper distributions [Russell et al. 1995].

## Expectation Maximization (EM) algorithm

The EM algorithm produces a maximum likelihood estimate by maximizing the log-likelihood expression [Dempster, Laird, & Rubin1977]. To solve the robust inference problem, we must maximize the posterior log-likelihood L(Theta). We show how the original EM algorithm can be extended to a Quasi-Bayesian Expectation Maximization (QEM) algorithm and prove convergence properties.

The algorithm begins by assuming that the artificial variables are actual random quantities with distributions specified by Theta, and that we could even observe those variables as evidence. An initial estimate Theta0 is assumed for Theta.

Suppose we had l sets of complete data for the transformed network, i.e., we had observed l trials for all variables in the network, including the artificial variables. The log-likelihood for this complete data would be:

L(Theta) = sumijk li(j,k) logthetaijk,

where li(j,k) indicates the number of data points when the variable xi is instantiated in its j value with its parents instantiated in their k value.

The first step of the QEM algorithm is to obtain the expected value of the log-likelihood given the evidence and assuming Theta0 is correct:

Q(Theta|Thetak) = E{[ log{( p(xq = a, e) } - log{( p(e) } }.

Suppose we wanted the expected value of L(Theta) for given evidence e. We have:

E{[ L(Theta) } = E{[ sumijk li(j,k) logthetaij } = sumijk {( suml p(xi, pa(xi) | e) } logthetaijk.

Based on this expression, we have:

Q(Theta|Thetak) = sumijk p(xi, pa(xi) | xq = a, e) logthetaijk - sumijk p(xi, pa(xi) | e) logthetaijk.

The second step of the QEM algorithm is to maximize Q(Theta|Thetak) for Theta. Only a few terms in the expression for Q(Theta|Thetak) will be free, since only the thetaij for z'i are estimated. Collecting these terms we obtain:

sumij p(z'i = j | xq = a, e) logthetaij - sumij p(z'i = j| e) logthetaij,

To perform maximization, use gradient descent with Thetak as a starting point and ensure that at the end of the process we have Q(Thetak+1|Thetak) > Q(Thetak|Thetak). The gradient is:

{ p(z'i = j|xq = a, e) }/{ thetaij } - { p(z'i = j|e) }/{ thetaij } ,

which can be obtained through standard Bayesian network algorithms. Note this is the expression used in the previous section to perform gradient descent optimization; in the QEM algorithm, we must ensure that Q(Thetak+1|Thetak) > Q(Thetak|Thetak) to obtain convergence. Now set Thetak+1 to these values go to the next iteration.

Proof. We have:

H(Theta'|Theta) = - sum1370x is in { xq, e } log{( { p(x, xq|e, Theta') }/{ p(xq | e, Theta') } } p(x | xq, e, Theta)

Q(Theta'|Theta) = sum1371x is in { xq, e } logp(x, xq|e, Theta') p(x | xq, e, Theta).

Consequently:

L(Theta') = Q(Theta'|Theta) + H(Theta'|Theta)

and (by [Dempster, Laird, & Rubin1977, Lemma 1,]):

H(Thetak+1|Thetak) geH(Thetak|Thetak).

Since by construction we have Q(Thetak+1|Thetak) > Q(Thetak|Thetak), we obtain:

L(Thetak+1) = Q(Thetak+1|Thetak) + H(Thetak+1|Thetak) > Q(Thetak|Thetak) + H(Thetak|Thetak) = L(Thetak).

The algorithm produces a monotonically increasing and bounded sequence, so the sequence converges to a local maximum of L(Theta).

Next: Approximate inferences through Lavine's Up: Robustness Analysis of Bayesian Previous: Complexity and generic approximations