Lecture 5: Parameter Estimation in Fully Observed Bayesian Networks

Introduction to the problem of Parameter Estimation in fully observed Bayesian Networks


Last class, we presented two major types of graphical model task, Inference and Learning, and we mainly discussed about inference. In this lecture, we start to introduce the learning task, and explore some learning techniques used in fully observable Bayesian Networks.

Learning Graphical Models

Before we get started, it’s better for us to get some intuition about what learning is and which goal learning is achieving.

The goal

The goal: Given a set of independent samples (assignments of random variables), find the best (or the most likely) Bayesian Network (both DAG and CPDs).

As shown in the figure below, we are given a set of independent samples (assignments of binary random variables). Assume it’s a DAG, we are going to learn the directed links (causality relationships) between nodes, this process is called Structural Learning. Learning the conditional possibility is another task called Parameter learning.

The goal


As listed below, there’re several learning scenarios we are interested in:

  1. Completely observed GMs:
    • 1.1. directed
    • 1.2. undirected
  2. Partially or unobserved GMs:
    • 1.1. directed
    • 1.2. undirected (an open research topic)

Some useful estimation principles are also listed here:

  1. Maximal likelihood estimation (MLE)
  2. Bayesian estimation
  3. Maximal conditional likelihood
  4. Maximal “Margin”
  5. Maximum entropy

We use learning as a name for the process of estimating the parameters, and in some cases, the topology of the network, from data.

Parameter Learning

In this section, we are going to introduce Parameter Learning, and some useful related models.

When doing parameter learning, we assume that the Graphical model G itself is known and fixed, G could come from either expert design or an intermediate outcome of iterative structure learning.

The goal of parameter learning is to estimate parameters from a dataset of independent, identically distributed (i.i.d.) training samples $D = {x_1, \cdots, x_N}$.

In general, each training case is a vector of values, one per node. Depending on the model, elements in could be all known (no missing values, no hidden variables) if the model is completely observable, or partially known ($\exists i, x_{n,i}$ is not observed) if the model is partially observable.

In this class, we mainly consider learning parameters for a BN that is completely observable with a given structure.

Exponential Family Distributions

There are various density estimation tasks that can be viewed as single-node GMs (see the supplementary section for examples), and they are in fact instances of Exponential Family Distributions.

Exponential Family Distributions are building blocks for general GMs, and have nice properties allowing us to easily do MLE and Bayesian estimation.


For a numeric random variable $X$,

Function $T(x)$ is a sufficient statistic (will be explained in sufficiency section).

Function $A(\eta) = \log{Z(\eta)}$ is the log normalizer.


We can see lots of probability distributions actually belonging to this family (including Bernoulli, Multinomial, Gaussian, Poisson, Gamma).

Example 1: Multivariate Gaussian Distribution

For a continuous vector random variable $X \in R^k$:

\begin{aligned} p_X(x|\mu,\Sigma) &= \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right\} \\ &= \frac{1}{(2\pi)^{k/2}} \exp\left\{-\frac{1}{2}tr(\Sigma^{-1}xx^T)+\mu^T\Sigma^{-1}x-\frac{1}{2}\mu^T\Sigma^{-1}\mu-\log{|\Sigma|}\right\} \end{aligned}

Exponential family representation:

\begin{aligned} \eta &= [\Sigma^{-1}\mu;-\frac{1}{2}vec(\Sigma^{-1})] = [\eta_1,vec(\eta_2)], \eta_1 = \Sigma^{-1}\mu, \eta_2 = -\frac{1}{2}\Sigma^{-1} \\ T(x) &= [x;vec(xx^T)] \\ A(\eta) &= \frac{1}{2}\mu^T\Sigma^{-1}\mu+\log{|\Sigma|}=-\frac{1}{2}tr(\eta_2\eta_1\eta_1^T)-\frac{1}{2}\log{(-2\eta_2)} \\ h(x) &= (2\pi)^{-k/2} \end{aligned}

After this conversion, we see that the Multivariate Gaussian Distribution indeed belongs to exponential family.

Note that a $k$-dimension Multivariate Gaussian Distribution has a $k + k^2$-dimensional natural parameter (and sufficient statistic). However, as has to be symmetric and PSD, parameters actually have a lower degree of freedom.

Example 2: Multinomial Distribution

For a binary vector random variable $x\sim multinomial(x \vert \pi)$,

Exponential family representation:

Properties of exponential family

Moment generating property

The derivative of exponential family gives the centered moment.

So we can take this advantage to compute moments of any exponential family distribution by taking the derivatives of the log normalizer .

Note: when the sufficient statistic is a stacked vector, partial derivatives need to be considered.

Moment vs. canonical parameters

Applying the moment generating property, the moment parameter can be derived from the natural (canonical) parameter by:

Note that is convex since

Thus we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1):

So we can say that a distribution in the exponential family can be parameterized not only by - the canonical parameterization, but also by - the moment parameterization.

MLE for Exponential Family

For i.i.d. data, we have the log-likelihood:

Take the derivatives and set to zero:


This amounts to moment matching. Also, we could infer the canonical parameters using .


We can see from above that for , is sufficient for if there is no information in regarding beyond that in . Then, We can throw away for the purpose of inference w.r.t . These could be shown by both Bayesian view and Frequentist view.

Bayesian view:

Bayesian view

Frequentist view:

Frequentist view

The Neyman factorization theorem:

is sufficient for $\theta$ if





Generalized Linear Models

There are various conditional density estimation tasks that can be viewed as two-node GMs (see supplementary section for examples). Many are instances of Generalized Linear Models.

Generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows the linear model to be related to response variables via a link function, that have error distribution models other than a normal distribution. For example both linear regression and logistic regression can be unified by generalized linear model.



We model the expecatation of :

The observed input is assumed to enter into the model via a linear combination of its elements, and the conditional mean is represented as a function of , where f is known as the link function of . The observed output is assumed to be characterized by an exponential family distribution with conditional mean .

Example 1: Linear Regression

Let’s take a quick recap on Linear Regression.

Assume that the target variable and the inputs are related by:

where is an error term of unmodeled effects or random noise.

Now assume that $\epsilon$ follows a Gaussian distribution , then

We can use te LMS algorithm, which is a gradient ascent/descent approach, to estimate the parameter.

Example 2: Logistic Regression (sigmoid classifier, perceptron, etc.)

The condition distribution is a Bernoulli:

where is a logistic function

We can use the brute force gradient method as in LR. But we can also apply generic laws by observing the is an exponential family function - a generalized linear model.

More examples: parameterizing graphical models

Markov random fields

Restricted Boltzmann Machines

Conditional Random Fields


Example canonical response functions


Derivative is

which is a fixed point function since is a function of .

Iteratively Reweighted Least Squares (IRLS)

Recall that the Hessian matrix , and in least mean square optimization, we use Newton-Raphson method with cost function :

where the response is:

Hence, this can be understood as solving the Iteratively reweighted least squares problem.

Global and local parameter independence

Simple graphical models can be viewed as building blocks of complex graphical models. With the same concept, if we assume the parameters for each local conditional probabilistic distribution to be globally independent, and all nodes are fully observed, then the log-likelihood function can be decomposed into a sum of local terms, one per node:


A plate is a macro that allows subgraphs to be replicated. Conventionally, instead of drawing each repeated variable individually, a plate is used to group these variables into a subgraph that repeat together, and a number is drawn on the plate to represent the number of repetitions of the subgraph in the plate.

Rules for plates: Repeat every structure in a box a number of times given by the integer in the corner of the box (e.g. ), updating the plate index variable (e.g. ) as you go; Duplicate every arrow going into the plate and every arrow leaving the plate by connecting the arrows to each copy of the structure.

For example, in the directed acyclic network, it can be decomposed as This is exactly like learning four separate small BNs, each of which consists of a node and its parents, as shown by the following graph:

Global parameter independence: For every DAG model, .

Local parameter independence: For every node, .

Global parameter sharing

Consider a network structure over a set of variables , parameterized by a set of parameters . Each variable is associated with a CPD . Now, rather than assuming that each such CPD has its own parameterization , we assume that we have a certain set of shared parameters that are used by multiple variables in the network. That is the global parameter sharing. Assuming is partitioned into disjoint subsets , and with each subset we assign a disjoint set of variables . For , and for ,

Supervised ML estimation

Hidden Markov Model (HMM)

We are given a HMM, and let hidden states be $y_1, \cdots, y_N$, our observations be $x_1, \cdots, x_N$, which are all known to us.

In the training time, we can record the frequency of each transition from a hidden state to another, or from a hidden state to an observation state.

Let $A_{ij}$ be a transition from a hidden state $i$ to a hidden state $j$, and let $B_{ik}$ be a transition from a hidden state to an observation state $k$. We calculate our parameters using maximum likelihood estimation by the following:

If what we observe are continuous, we can treat them as Gaussian, and apply corresponding learning rules for Gaussian.


This method gives us the parameters that “best fit”, or maximizes the likelihood of seeing the training data. Therefore for the test data, intuitively it should be at least close to the truth. MLE tends to give good performance in the reality.


We just show that the parameters for calculation of probability upon seeing a test data depend on the frequency of seeing the same pattern in the training time. This leads to a problem. If there’s a test data case that’s never seen in the training data, then we will auto assign a zero probability to that test data, which is “infinitely wrong” because it will lead to a infinity cross entropy from real world to our model.

This also shows the overfitting problem, that we fit our training data too well that we cannot generalize our model to the real world well.

Example on the slide

because there’s no casino roll such that and in the training data. However we all know this is not true!


To solve this problem, we can add “hallucinated counts” to all the cases, so that even the cases which are never seen in the training time will get some counts. Therefore at test time there will be no zero probability assigned to any case.

How many Pseudocounts do we add

Imagine if the total frequency of the cases in training is 100, and we add 10000 counts to one case, then that case will be assigned a very high probability in the test time. What we just have done is equal to saying that “I strongly believe this will happen”. In another word, we put lots of belief in that case.

However, if we just add one Pseudocount to each case, then the probability of frequent cases will decrease, but not too much. Their rankings in probability won’t change, because it’s just their denominators in MLE calculation have changed, by the same amount. The spared probability will be distributed to cases never seen in the training. They will be very small, but this eliminates the problem of zero probability in testing. This is what we call smoothing.

We can also see this as Bayesian Estimation under a uniform prior with “parameter strength”, while we add pseudocounts to cases.


Density Estimation

Density estimation can be viewed as single-node graphical models. It is the building block of general GM.

For density estimation we have:

Discrete Distributions

Bernoulli distribution:

Multinomial distribution:

It’s generally similar to Binomial. However, the “1” in parameter indicates there will be only one trial. Therefore it’s similar to Bernoulli, except now we have instances that could happen, each with a probability , where $\sum_{i=1}^{k}{\theta_i}=1$.

Suppose $n$ is the observed vector with:

MLE: constrained optimization with Lagrange multipliers

Objective Function:


Constrained cost function with a Lagrange multiplier:

Bayesian estimation

Dirichlet prior:

Posterior distribution of :

Posterior mean estimation:

Sequential Bayesian updating

Start with Dirichlet prior:

Observe N’ samples with sufficient statistics . Posterior becomes:

Hierarchical Bayesian Models

are the parameters for the likelihood

are the parameters for the prior

We can have hyper-parameters, etc.

We stop when the choice of hyper-parameters makes no difference to the marginal likelihood; typically make hyper-parameters constants.

Limitation of Dirichlet Prior

Dirichlet prior could only put emphasis/bias on all coordinates or one single coordinate; it cannot, for example, emphasize two coordinates simultaneorly. The prior corresponding to different hyperparamter is shown in the following picture:

The Logistic Normal Prior

Pro: co-variance structure
Con: non-conjugate

Continuous Distributions

Uniform Probability Density Function

Normal (Gaussian) Probability Density Function

Multivariate Gaussian

MLE for a multivariate-Gaussian:

It can be shown that the MLE for and are:

where the scatter matrix is:

Note that may not be full rank (eg. if ), in which case is not invertible

Bayesian Parameter Estimation for a Gaussian

Reasons for Bayesian Approach:

Now that we only consider conjugate priors, and consider various cases of increasing complexity:

Unknown , known

Known , unkown

Unkown , unkown

Two nodes fully observed BNs

Two nodes fully observed BNs


Goal: wish to learn

Conditional Gaussian

The data:

is a class indicator vector (one-hot encoding):

is a conditional Gaussian variable with a class-specific mean:

MLE of Conditional Gaussian

MLE of Conditional Gaussian.

Bayesian Estimation of Conditional Gaussian

Bayesian Estimation of Conditional Gaussian

Gausian Distriminative Analysis:

Linear Regression

The data: , where is an input vector and is a response vector(either continuous or discrete)

A regression scheme can be used to model directly rather than

A discrimintative probabilitstic model

Assume , where is an error term of unmodeled effects or random noise. Assume , then

And by that each are i.i.d, we have:


we notice that maximizing this term w.r.t $\theta$ is equivalent to minimizing the MSE(mean square error):

ML Structure Learning for Completely Observed GMs

Two “Optimal” Approaches

“Optimal” means the employed algorithms are guaranteed to return a structure that maximizes the objective. Some popluar heuristics, however, provide no gurantee on attaining optimality, interpretability or even do not have an explicit objective. e.g. structured EM, module network, greedy structural search, deep learning via auto-encoders, etc.

Will learn two classes of algorithms for guaranteed structure learning, albeit only applying to certain families of graphs:

But we can find exact solution of an optimal tree under MLE:

Lead to the Chow-Liu Algorithm

Chow-Liu Algorithm

\begin{aligned} l(\theta_G, G; D) &= \log P(D | \theta_G, G)\\ &= \log \prod_n (\prod_i p(x_{n,i} | x_{n, \pi_i(G)}, \theta_{i | \pi_i(G)} ) )\\ &= \sum_{i} (\sum_{n}\log p(x_{n,i} | x_{n, \pi_i(G), \theta_{i | \pi_i(G) }}) )\\ &= M\sum_i (\sum_{x_i, x_{\pi_i(G)}} \hat p(x_i, x_{\pi_i(G)}\log p(x_i | x_{\pi_i(G)}, \theta_{i | \pi_i(G)} ))\\ &= M\sum_i(\sum_{x_i, x_{\pi_i(G)}} \hat{p}(x_i, x_{\pi_i(G)}) \log \frac{\hat p(x_i, x_{\pi_i(G)}, \theta_{i | \pi_i(G)})}{\hat p(x_{\pi_i(G)}) \hat p(x_i) } ) - M \sum_i (\sum_{x_i} \hat p(x_i) \log \hat p(x_i))\\ &= M\sum_i \hat I(x_i, x_{\pi_i(G)}) - M \sum_i \hat H(x_i) \end{aligned}

Structure Learning for General Graphs

Theorem: The problem of learning a BN structure with at most parents is NP-hard for any (fixed) .

Most structure learning approaches use heuristics