Lecture 5: Parameter Estimation in Fully Observed Bayesian Networks
Introduction to the problem of Parameter Estimation in fully observed Bayesian Networks
Introduction
Last class, we presented two major types of graphical model task, Inference and Learning, and we mainly discussed about inference. In this lecture, we start to introduce the learning task, and explore some learning techniques used in fully observable Bayesian Networks.
Learning Graphical Models
Before we get started, it’s better for us to get some intuition about what learning is and which goal learning is achieving.
The goal
The goal: Given a set of independent samples (assignments of random variables), find the best (or the most likely) Bayesian Network (both DAG and CPDs).
As shown in the figure below, we are given a set of independent samples (assignments of binary random variables). Assume it’s a DAG, we are going to learn the directed links (causality relationships) between nodes, this process is called Structural Learning. Learning the conditional possibility is another task called Parameter learning.
Overview
As listed below, there’re several learning scenarios we are interested in:
 Completely observed GMs:
 1.1. directed
 1.2. undirected
 Partially or unobserved GMs:
 1.1. directed
 1.2. undirected (an open research topic)
Some useful estimation principles are also listed here:
 Maximal likelihood estimation (MLE)
 Bayesian estimation
 Maximal conditional likelihood
 Maximal “Margin”
 Maximum entropy
We use learning as a name for the process of estimating the parameters, and in some cases, the topology of the network, from data.
Parameter Learning
In this section, we are going to introduce Parameter Learning, and some useful related models.
When doing parameter learning, we assume that the Graphical model G itself is known and fixed, G could come from either expert design or an intermediate outcome of iterative structure learning.
The goal of parameter learning is to estimate parameters from a dataset of independent, identically distributed (i.i.d.) training samples $D = {x_1, \cdots, x_N}$.
In general, each training case is a vector of values, one per node. Depending on the model, elements in could be all known (no missing values, no hidden variables) if the model is completely observable, or partially known ($\exists i, x_{n,i}$ is not observed) if the model is partially observable.
In this class, we mainly consider learning parameters for a BN that is completely observable with a given structure.
Exponential Family Distributions
There are various density estimation tasks that can be viewed as singlenode GMs (see the supplementary section for examples), and they are in fact instances of Exponential Family Distributions.
Exponential Family Distributions are building blocks for general GMs, and have nice properties allowing us to easily do MLE and Bayesian estimation.
Definition
For a numeric random variable $X$,
Function $T(x)$ is a sufficient statistic (will be explained in sufficiency section).
Function $A(\eta) = \log{Z(\eta)}$ is the log normalizer.
Examples
We can see lots of probability distributions actually belonging to this family (including Bernoulli, Multinomial, Gaussian, Poisson, Gamma).
Example 1: Multivariate Gaussian Distribution
For a continuous vector random variable $X \in R^k$:
Exponential family representation:
After this conversion, we see that the Multivariate Gaussian Distribution indeed belongs to exponential family.
Note that a $k$dimension Multivariate Gaussian Distribution has a $k + k^2$dimensional natural parameter (and sufficient statistic). However, as has to be symmetric and PSD, parameters actually have a lower degree of freedom.
Example 2: Multinomial Distribution
For a binary vector random variable $x\sim multinomial(x \vert \pi)$,
Exponential family representation:
Properties of exponential family
Moment generating property
The derivative of exponential family gives the centered moment.
So we can take this advantage to compute moments of any exponential family distribution by taking the derivatives of the log normalizer .
Note: when the sufficient statistic is a stacked vector, partial derivatives need to be considered.
Moment vs. canonical parameters
Applying the moment generating property, the moment parameter can be derived from the natural (canonical) parameter by:
Note that is convex since
Thus we can invert the relationship and infer the canonical parameter from the moment parameter (1to1):
So we can say that a distribution in the exponential family can be parameterized not only by  the canonical parameterization, but also by  the moment parameterization.
MLE for Exponential Family
For i.i.d. data, we have the loglikelihood:
Take the derivatives and set to zero:
and
This amounts to moment matching. Also, we could infer the canonical parameters using .
Sufficiency
We can see from above that for , is sufficient for if there is no information in regarding beyond that in . Then, We can throw away for the purpose of inference w.r.t . These could be shown by both Bayesian view and Frequentist view.
Bayesian view:
Frequentist view:
The Neyman factorization theorem:
is sufficient for $\theta$ if
Examples
Gaussian:
Multinomial:
Poisson:
Generalized Linear Models
There are various conditional density estimation tasks that can be viewed as twonode GMs (see supplementary section for examples). Many are instances of Generalized Linear Models.
Generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows the linear model to be related to response variables via a link function, that have error distribution models other than a normal distribution. For example both linear regression and logistic regression can be unified by generalized linear model.
Commonality
We model the expecatation of :
 is the conditional distribution of
 is the response function
The observed input is assumed to enter into the model via a linear combination of its elements, and the conditional mean is represented as a function of , where f is known as the link function of . The observed output is assumed to be characterized by an exponential family distribution with conditional mean .
Example 1: Linear Regression
Let’s take a quick recap on Linear Regression.
Assume that the target variable and the inputs are related by:
where is an error term of unmodeled effects or random noise.
Now assume that $\epsilon$ follows a Gaussian distribution , then
We can use te LMS algorithm, which is a gradient ascent/descent approach, to estimate the parameter.
Example 2: Logistic Regression (sigmoid classifier, perceptron, etc.)
The condition distribution is a Bernoulli:
where is a logistic function
We can use the brute force gradient method as in LR. But we can also apply generic laws by observing the is an exponential family function  a generalized linear model.
More examples: parameterizing graphical models
Markov random fields
Restricted Boltzmann Machines
Conditional Random Fields
 Discriminative model, modeling
 s are assumed as features that are interdependent
 When labeling , future observations are taken into account.
MLE for GLIMs
Loglikelihood
Derivative is
which is a fixed point function since is a function of .
Iteratively Reweighted Least Squares (IRLS)
Recall that the Hessian matrix , and in least mean square optimization, we use NewtonRaphson method with cost function :
where the response is:
Hence, this can be understood as solving the Iteratively reweighted least squares problem.
Global and local parameter independence
Simple graphical models can be viewed as building blocks of complex graphical models. With the same concept, if we assume the parameters for each local conditional probabilistic distribution to be globally independent, and all nodes are fully observed, then the loglikelihood function can be decomposed into a sum of local terms, one per node:
Plate
A plate is a macro that allows subgraphs to be replicated. Conventionally, instead of drawing each repeated variable individually, a plate is used to group these variables into a subgraph that repeat together, and a number is drawn on the plate to represent the number of repetitions of the subgraph in the plate.
Rules for plates: Repeat every structure in a box a number of times given by the integer in the corner of the box (e.g. ), updating the plate index variable (e.g. ) as you go; Duplicate every arrow going into the plate and every arrow leaving the plate by connecting the arrows to each copy of the structure.
For example, in the directed acyclic network, it can be decomposed as This is exactly like learning four separate small BNs, each of which consists of a node and its parents, as shown by the following graph:
Global parameter independence: For every DAG model, .
Local parameter independence: For every node, .
Global parameter sharing
Consider a network structure over a set of variables , parameterized by a set of parameters . Each variable is associated with a CPD . Now, rather than assuming that each such CPD has its own parameterization , we assume that we have a certain set of shared parameters that are used by multiple variables in the network. That is the global parameter sharing. Assuming is partitioned into disjoint subsets , and with each subset we assign a disjoint set of variables . For , and for ,
Supervised ML estimation
Hidden Markov Model (HMM)
We are given a HMM, and let hidden states be $y_1, \cdots, y_N$, our observations be $x_1, \cdots, x_N$, which are all known to us.
In the training time, we can record the frequency of each transition from a hidden state to another, or from a hidden state to an observation state.
Let $A_{ij}$ be a transition from a hidden state $i$ to a hidden state $j$, and let $B_{ik}$ be a transition from a hidden state to an observation state $k$. We calculate our parameters using maximum likelihood estimation by the following:
If what we observe are continuous, we can treat them as Gaussian, and apply corresponding learning rules for Gaussian.
Pros
This method gives us the parameters that “best fit”, or maximizes the likelihood of seeing the training data. Therefore for the test data, intuitively it should be at least close to the truth. MLE tends to give good performance in the reality.
Cons
We just show that the parameters for calculation of probability upon seeing a test data depend on the frequency of seeing the same pattern in the training time. This leads to a problem. If there’s a test data case that’s never seen in the training data, then we will auto assign a zero probability to that test data, which is “infinitely wrong” because it will lead to a infinity cross entropy from real world to our model.
This also shows the overfitting problem, that we fit our training data too well that we cannot generalize our model to the real world well.
Example on the slide
because there’s no casino roll such that and in the training data. However we all know this is not true!
Pseudocounts
To solve this problem, we can add “hallucinated counts” to all the cases, so that even the cases which are never seen in the training time will get some counts. Therefore at test time there will be no zero probability assigned to any case.
How many Pseudocounts do we add
Imagine if the total frequency of the cases in training is 100, and we add 10000 counts to one case, then that case will be assigned a very high probability in the test time. What we just have done is equal to saying that “I strongly believe this will happen”. In another word, we put lots of belief in that case.
However, if we just add one Pseudocount to each case, then the probability of frequent cases will decrease, but not too much. Their rankings in probability won’t change, because it’s just their denominators in MLE calculation have changed, by the same amount. The spared probability will be distributed to cases never seen in the training. They will be very small, but this eliminates the problem of zero probability in testing. This is what we call smoothing.
We can also see this as Bayesian Estimation under a uniform prior with “parameter strength”, while we add pseudocounts to cases.
Supplementary
Density Estimation
Density estimation can be viewed as singlenode graphical models. It is the building block of general GM.
For density estimation we have:
 MLE(maximum likelihood estimate)
 Bayesian estimate
Discrete Distributions
Bernoulli distribution:
Multinomial distribution:
It’s generally similar to Binomial. However, the “1” in parameter indicates there will be only one trial. Therefore it’s similar to Bernoulli, except now we have instances that could happen, each with a probability , where $\sum_{i=1}^{k}{\theta_i}=1$.
Suppose $n$ is the observed vector with:
MLE: constrained optimization with Lagrange multipliers
Objective Function:
constraint:
Constrained cost function with a Lagrange multiplier:
Bayesian estimation
Dirichlet prior:
Posterior distribution of :
Posterior mean estimation:
Sequential Bayesian updating
Start with Dirichlet prior:
Observe N’ samples with sufficient statistics . Posterior becomes:
Hierarchical Bayesian Models
are the parameters for the likelihood
are the parameters for the prior
We can have hyperparameters, etc.
We stop when the choice of hyperparameters makes no difference to the marginal likelihood; typically make hyperparameters constants.
Limitation of Dirichlet Prior
Dirichlet prior could only put emphasis/bias on all coordinates or one single coordinate; it cannot, for example, emphasize two coordinates simultaneorly. The prior corresponding to different hyperparamter is shown in the following picture:
The Logistic Normal Prior
Pro: covariance structure
Con: nonconjugate
Continuous Distributions
Uniform Probability Density Function
Normal (Gaussian) Probability Density Function
Multivariate Gaussian
MLE for a multivariateGaussian:
It can be shown that the MLE for and are:
where the scatter matrix is:
Note that may not be full rank (eg. if ), in which case is not invertible
Bayesian Parameter Estimation for a Gaussian
Reasons for Bayesian Approach:
 Update estimate sequentially over time
 We may have prior knowledge about the expected magnitude of parameters
 The MLE for $\Sigma$ may not be full rank if we don’t have enough data
Now that we only consider conjugate priors, and consider various cases of increasing complexity:
 Known , unknown
 Known , unknown
 Unknown and
Unknown , known
 Normal Prior:
 Joint Probability:
 Posterior:
where:
The posterior mean is a convex combination of sample mean and prior mean, and the posterior precision is the precision of the prior plus of contribution for each observed data point.
Known , unkown
 Conjugate prior for :

Joint Probability:
 Therefore Posterior: where and
Unkown , unkown
 Univariate Case

Conjugate prior for and : (NormalInverseGamma)

 Multivariate Case

Conjugate prior for and : (NormalInverseWishart)

Two nodes fully observed BNs
 Conditional Mixtures
 Linear/Logistic Regression
 Classification
 Generative and discriminative approaches
Classification
Goal: wish to learn
 Generative: Modeling the joint distribution of all data, i.e.:
 Discriminative: Modeling only the conditional distribution, i.e.:
Conditional Gaussian
The data:
is a class indicator vector (onehot encoding):
is a conditional Gaussian variable with a classspecific mean:
MLE of Conditional Gaussian
 Data loglikelihood

MLE:
, the fraction of sample of class
, the average of samples of class
Bayesian Estimation of Conditional Gaussian

Prior:

Posterior mean:
where
,
Gausian Distriminative Analysis:

Joint probability of a datum and its label is:

Predict the conditional probability of label given the datum:
 Frequentist Approach: fit , and from data first
 Bayesian Approach: compute the posterior dist. of the parameters first
Linear Regression
The data: , where is an input vector and is a response vector(either continuous or discrete)
A regression scheme can be used to model directly rather than
A discrimintative probabilitstic model
Assume , where is an error term of unmodeled effects or random noise. Assume , then
And by that each are i.i.d, we have:
hence:
we notice that maximizing this term w.r.t $\theta$ is equivalent to minimizing the MSE(mean square error):
ML Structure Learning for Completely Observed GMs
Two “Optimal” Approaches
“Optimal” means the employed algorithms are guaranteed to return a structure that maximizes the objective. Some popluar heuristics, however, provide no gurantee on attaining optimality, interpretability or even do not have an explicit objective. e.g. structured EM, module network, greedy structural search, deep learning via autoencoders, etc.
Will learn two classes of algorithms for guaranteed structure learning, albeit only applying to certain families of graphs:
 Trees: The ChowLiu Algorithm
 Pairwise MRFs: covariance selection, neighborhoodselection
Structural Search
 graphs over nodes
 trees over nodes
But we can find exact solution of an optimal tree under MLE:
 MLE score decomposable to edgerelated elements
 In a tree each node has only one parent
Lead to the ChowLiu Algorithm
ChowLiu Algorithm
 For each pair of variable and :
 Compute empirical distribution:
 Compute mutual information:
 Define a graph with node :
 Edge get weight:
 Compute maximum weight spanning tree
 Direction: pick any node as root, do BFS(breadth first search) do define directions, i.e.: we define an edge from node to node if we reach from in running BFS.
Structure Learning for General Graphs
Theorem: The problem of learning a BN structure with at most parents is NPhard for any (fixed) .
Most structure learning approaches use heuristics
 Exploit score decomposition:
 e.g: Greedy search through space of nodeorders
 Local search of graph structures