In this section we recast the robust inference problem as a parameter estimation problem. Consider a Quasi-Bayesian network and the transformed Bayesian network with artificial variables $\{z'_i\}$. Each artificial variable $z'_i$ takes values in $\{1, 2, \ldots, |z'_i|\}$. Assume $z'_i$ is a random variable with distribution $\theta_{ij} = p(z'_i = j)$, and call $\Theta$ the vector of all $\theta_{ij}$.
Suppose $x_q$ is a queried variable and $e$ denotes the evidence. The objective is to find the maximum (and minimum) over $\Theta$ of:
$$
p(x_q = a \mid e) = \frac{p(x_q = a, e)}{p(e)} ,
$$
where both numerator and denominator are functions of $\Theta$.
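As a small worked identity (using only the definition $\theta_{ij} = p(z'_i = j)$ above; it is not stated explicitly in this section), conditioning on any single artificial variable $z'_i$ shows that the numerator and denominator are each linear in that variable's parameters:
$$
p(x_q = a, e) = \sum_{j} p(x_q = a, e \mid z'_i = j)\,\theta_{ij},
\qquad
p(e) = \sum_{j} p(e \mid z'_i = j)\,\theta_{ij} .
$$
This linearity is what makes the gradient computed below easy to obtain with standard inference in the transformed network.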
Optimization through gradient descent techniques, such as the conjugate gradient algorithm, has been used for learning in Bayesian networks. These algorithms benefit from the fact that the necessary gradients can be computed with standard Bayesian network algorithms [Russell et al. 1995].
To solve the robust inference problem, we must maximize the posterior log-likelihood for $\Theta$ (minimization is accomplished by maximizing the negative log-likelihood):
$$
L(\Theta) = \log p(x_q = a \mid e) = \log p(x_q = a, e) - \log p(e) .
$$
The gradient of $L(\Theta)$ is obtained by computing, for each $\theta_{ij}$:
$$
\frac{\partial L(\Theta)}{\partial \theta_{ij}}
= \frac{\partial \log p(x_q = a, e)}{\partial \theta_{ij}}
- \frac{\partial \log p(e)}{\partial \theta_{ij}}
= \frac{1}{p(x_q = a, e)} \frac{\partial p(x_q = a, e)}{\partial \theta_{ij}}
- \frac{1}{p(e)} \frac{\partial p(e)}{\partial \theta_{ij}} .
$$
Since $p(x_q = a, e)$ and $p(e)$ are linear in $\theta_{ij}$ (as noted above), the partial derivatives are $p(x_q = a, e \mid z'_i = j)$ and $p(e \mid z'_i = j)$ respectively, and Bayes rule gives $p(x_q = a, e \mid z'_i = j)/p(x_q = a, e) = p(z'_i = j \mid x_q = a, e)/\theta_{ij}$ (similarly for $e$ alone), so that
$$
\frac{\partial L(\Theta)}{\partial \theta_{ij}}
= \frac{p(z'_i = j \mid x_q = a, e)}{\theta_{ij}}
- \frac{p(z'_i = j \mid e)}{\theta_{ij}} ,
$$
which can be obtained through standard Bayesian network algorithms using local computations. A conjugate gradient descent can be constructed by selecting an initial value for $\Theta$ and, at each step, normalizing the values of $\Theta$ to ensure they represent proper distributions [Russell et al. 1995].
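As a concrete illustration only (this sketch is not from the original text), the Python function below implements a plain projected gradient step (ascent on $L(\Theta)$; the negative gradient would be used for the minimization). The posterior arrays are assumed to be produced by a standard Bayesian network inference engine run on the transformed network, and the clip-and-renormalize projection is a crude stand-in for the normalization mentioned above; a full conjugate gradient implementation would add conjugate directions and a line search.

```python
import numpy as np

def gradient_step(theta, query_post, evid_post, rate=0.1):
    """One plain gradient ascent step on L(Theta) = log p(x_q = a | e).

    theta[i][j]      -- current estimate of theta_ij = p(z'_i = j)
    query_post[i][j] -- p(z'_i = j | x_q = a, e), from standard BN inference
    evid_post[i][j]  -- p(z'_i = j | e),          from standard BN inference
    Returns the updated parameter vectors, renormalized to proper distributions.
    """
    updated = []
    for t, q, r in zip(theta, query_post, evid_post):
        t = np.asarray(t, dtype=float)
        # dL/dtheta_ij = p(z'_i = j | x_q = a, e)/theta_ij - p(z'_i = j | e)/theta_ij
        grad = (np.asarray(q) - np.asarray(r)) / t
        t = t + rate * grad
        # Crude projection back onto the probability simplex.
        t = np.clip(t, 1e-6, None)
        updated.append(t / t.sum())
    return updated
```

Both posterior vectors must be recomputed in the transformed network after every step, since they depend on the current value of $\Theta$.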
The EM algorithm produces a maximum likelihood estimate by maximizing the log-likelihood expression [Dempster, Laird, & Rubin 1977]. To solve the robust inference problem, we must maximize the posterior log-likelihood $L(\Theta)$. We show how the original EM algorithm can be extended to a Quasi-Bayesian Expectation Maximization (QEM) algorithm and prove its convergence properties.
The algorithm begins by assuming that the artificial variables are actual random quantities with distributions specified by $\Theta$, and that those variables could, in principle, be observed as evidence. An initial estimate $\Theta^0$ is chosen for $\Theta$.
Suppose we had $l$ sets of complete data for the transformed network, i.e., we had observed $l$ trials of all variables in the network, including the artificial variables. The log-likelihood for this complete data would be:
$$
L(\Theta) = \sum_{ijk} l_i(j,k) \log p(x_i = k \mid \mathrm{pa}(x_i) = j),
$$
where the sum runs over all variables $x_i$ of the transformed network (including the artificial variables) and $l_i(j,k)$ denotes the number of trials in which $x_i$ assumes its $k$-th value and the parents of $x_i$ assume their $j$-th configuration.
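For the artificial variables these terms are particularly simple. Under the assumption, implicit in $\theta_{ij} = p(z'_i = j)$, that each $z'_i$ has no parents in the transformed network, the only part of this complete-data log-likelihood that involves $\Theta$ is
$$
\sum_{i} \sum_{j} l_i(j) \log \theta_{ij} ,
$$
where $l_i(j)$ (a count written this way only for illustration) is the number of trials with $z'_i = j$. Since the $z'_i$ are never actually observed, the expectation step below replaces these counts by posterior probabilities computed in the transformed network.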
The first step of the QEM algorithm is to obtain the expected value of the log-likelihood of the complete data, conditional on the evidence and on $x_q = a$, assuming the current estimate $\Theta^k$ (initially $\Theta^0$) is correct:
$$
Q(\Theta \mid \Theta^k) = E\!\left[\, \log p(\tilde{x}, x_q = a \mid e, \Theta) \;\middle|\; x_q = a, e, \Theta^k \,\right],
$$
where $\tilde{x}$ denotes a configuration of the variables outside $\{x_q\} \cup e$ (including the artificial variables).
The second step of the QEM algorithm is to maximize $Q(\Theta \mid \Theta^k)$ with respect to $\Theta$. Only a few terms in the expression for $Q(\Theta \mid \Theta^k)$ are free, since only the $\theta_{ij}$ for the $z'_i$ are estimated. Collecting these terms we obtain:
$$
\sum_{ij} p(z'_i = j \mid x_q = a, e, \Theta^k) \log \theta_{ij} \;-\; \log p(e \mid \Theta),
$$
whose derivative with respect to $\theta_{ij}$ is
$$
\frac{p(z'_i = j \mid x_q = a, e, \Theta^k)}{\theta_{ij}} - \frac{p(z'_i = j \mid e, \Theta)}{\theta_{ij}} ,
$$
which can be obtained through standard Bayesian network algorithms. Note that, evaluated at $\Theta = \Theta^k$, this is exactly the expression used in the previous section to perform gradient descent optimization. Because of the $\log p(e \mid \Theta)$ term, the maximization cannot be performed in closed form; in the QEM algorithm we need only ensure that $Q(\Theta^{k+1} \mid \Theta^k) > Q(\Theta^k \mid \Theta^k)$ to obtain convergence. Set $\Theta^{k+1}$ to the values produced by this (possibly partial) maximization and go to the next iteration.
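Purely as an illustration, here is a minimal Python sketch of one QEM iteration under the formulation above; `posteriors` is a hypothetical callback wrapping standard Bayesian network inference on the transformed network, and the inner loop takes a few gradient steps on the free terms of $Q$, one simple way of pursuing the requirement $Q(\Theta^{k+1} \mid \Theta^k) > Q(\Theta^k \mid \Theta^k)$.

```python
import numpy as np

def qem_iteration(theta, posteriors, rate=0.1, inner_steps=5):
    """One generalized QEM iteration for the artificial-variable parameters.

    theta[i]        -- current distribution of z'_i (1-D array summing to 1)
    posteriors(th)  -- hypothetical callback running standard BN inference with
                       parameters th; returns (q, r), where
                       q[i][j] = p(z'_i = j | x_q = a, e) and
                       r[i][j] = p(z'_i = j | e).
    """
    q, _ = posteriors(theta)            # E-step quantities, fixed at Theta^k
    new_theta = [np.asarray(t, dtype=float).copy() for t in theta]
    for _ in range(inner_steps):        # partial M-step (generalized EM)
        _, r = posteriors(new_theta)    # p(z'_i = j | e) at the current parameters
        for i, t in enumerate(new_theta):
            # Gradient of sum_j q_ij log(theta_ij) - log p(e | Theta) w.r.t. theta_ij.
            grad = (np.asarray(q[i]) - np.asarray(r[i])) / t
            t += rate * grad
            np.clip(t, 1e-6, None, out=t)
            t /= t.sum()                # keep each distribution proper
    return new_theta
```

A careful implementation would explicitly check that $Q$ increased (reducing the step size otherwise), since that is the condition used in the convergence proof below.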
We now prove that each iteration of the QEM algorithm increases the posterior log-likelihood, that is, $L(\Theta^{k+1}) > L(\Theta^k)$.

Proof. We have:
$$
H(\Theta' \mid \Theta) = - \sum_{\tilde{x}} p(\tilde{x} \mid x_q = a, e, \Theta) \, \log p(\tilde{x} \mid x_q = a, e, \Theta'),
$$
$$
Q(\Theta' \mid \Theta) = \sum_{\tilde{x}} p(\tilde{x} \mid x_q = a, e, \Theta) \, \log p(\tilde{x}, x_q = a \mid e, \Theta'),
$$
where the sums run over the configurations $\tilde{x}$ of the variables outside $\{x_q\} \cup e$, so that
$$
L(\Theta') = Q(\Theta' \mid \Theta) + H(\Theta' \mid \Theta),
$$
and, by [Dempster, Laird, & Rubin 1977, Lemma 1], $H(\Theta^{k+1} \mid \Theta^k) \ge H(\Theta^k \mid \Theta^k)$.
Since by construction we have $Q(\Theta^{k+1} \mid \Theta^k) > Q(\Theta^k \mid \Theta^k)$, we obtain:
$$
L(\Theta^{k+1}) = Q(\Theta^{k+1} \mid \Theta^k) + H(\Theta^{k+1} \mid \Theta^k)
> Q(\Theta^k \mid \Theta^k) + H(\Theta^k \mid \Theta^k) = L(\Theta^k),
$$
so the posterior log-likelihood strictly increases at every iteration.
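For completeness, the inequality borrowed from [Dempster, Laird, & Rubin 1977, Lemma 1] is, with the sign convention used here, an instance of Gibbs' inequality. Writing $p_{\Theta}(\tilde{x})$ for $p(\tilde{x} \mid x_q = a, e, \Theta)$ (a shorthand introduced only for this sketch):
$$
H(\Theta' \mid \Theta) - H(\Theta \mid \Theta)
= \sum_{\tilde{x}} p_{\Theta}(\tilde{x}) \log \frac{p_{\Theta}(\tilde{x})}{p_{\Theta'}(\tilde{x})}
\;\ge\; 0,
$$
since the middle expression is a Kullback-Leibler divergence, which is always nonnegative.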