Lecture 3: Undirected Graphical Models
An introduction to undirected graphical models
In addition to the I-map concept introduced in the last lecture, today’s lecture also covers the minimal I-map.
A DAG $\mathcal{G}$ is a minimal I-map if it is an I-map for a distribution $P$, and if the removal of even a single edge from $\mathcal{G}$ renders it not an I-map.
A distribution may have several minimal I-maps, each corresponding to a specific node-ordering.
The fact that $\mathcal{G}$ is a minimal I-map for $P$ is far from a guarantee that $\mathcal{G}$ captures the independence structure in $P$.
The “Bayes-ball” algorithm lets us read independencies directly off a graphical model. We say $X$ is d-separated from $Z$ given $Y$ if we cannot send a ball from any node in $X$ to any node in $Z$. The conditioning (“given $Y$”) is represented by shading the nodes of $Y$ in the graph. Examples of three basic directed graphical structures are shown below.
In (a) and (b), the shaded node $Y$ blocks the ball from going between nodes $X$ and $Z$. This gives the independence relation that was introduced in the last lecture: $X \perp Z \mid Y$.
(c), also called the “V-structure”, is a special case. In contrast to the first two examples, the ball can go between $X$ and $Z$ if the node $Y$ is shaded, and is blocked otherwise. Therefore, the graph in (c) yields the marginal independence $X \perp Z$, but not $X \perp Z \mid Y$.
With these basic structures, we can apply the rules on a DAG. For example, let us try to find whether and are independent given and .
After shading and , the ball cannot go from to through because it is blocked; however, , , and forms a “V-structure”, so the ball can go along the path , , , . Therefore, the independence statement is invalid.
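The three basic structures can be checked mechanically. The following is a minimal sketch, not the Bayes-ball procedure itself: it uses the equivalent criterion of testing ordinary graph separation in the moralized ancestral graph. The function names and the `parents` dictionary encoding (child mapped to its list of parents) are assumptions made for illustration.

```python
from collections import deque

def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, xs, zs, given):
    """True if xs and zs are d-separated given `given` in the DAG."""
    xs, zs, given = set(xs), set(zs), set(given)
    keep = ancestors(parents, xs | zs | given)
    # Moralize the ancestral subgraph: connect each node to its parents,
    # and connect every pair of parents that share a child.
    nbrs = {v: set() for v in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                nbrs[p].add(q)
                nbrs[q].add(p)
    # Breadth-first search from xs that never enters a conditioned node.
    frontier = deque(xs - given)
    reached = set(frontier)
    while frontier:
        v = frontier.popleft()
        if v in zs:
            return False
        for w in nbrs[v] - given - reached:
            reached.add(w)
            frontier.append(w)
    return True

# The three basic structures, encoded as child -> parents.
chain = {"Y": ["X"], "Z": ["Y"]}      # X -> Y -> Z
fork = {"X": ["Y"], "Z": ["Y"]}       # X <- Y -> Z
collider = {"Y": ["X", "Z"]}          # X -> Y <- Z (V-structure)

print(d_separated(chain, {"X"}, {"Z"}, {"Y"}))
print(d_separated(collider, {"X"}, {"Z"}, set()))
print(d_separated(collider, {"X"}, {"Z"}, {"Y"}))
```

Shading $Y$ blocks the chain and the fork, while in the V-structure shading $Y$ is exactly what opens the path.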
Limits of Directed and Undirected GMs
From a representational perspective, we aim to find a graph $\mathcal{G}$ that precisely captures the independencies in a given distribution $P$. This goal of learning GMs motivates the following definition.
We say that a graph $\mathcal{G}$ is a perfect map (P-map) for a set of independencies $\mathcal{I}$ if $\mathcal{I}(\mathcal{G}) = \mathcal{I}$. We say that $\mathcal{G}$ is a perfect map for $P$ if $\mathcal{I}(\mathcal{G}) = \mathcal{I}(P)$. That is, an independence statement holds in $P$ if and only if it can be read off $\mathcal{G}$.
- The P-map of a distribution is unique up to I-equivalence between networks. That is, a distribution $P$ can have many P-maps, but all of them are I-equivalent.
An arbitrary distribution $P$, however, does not necessarily have a perfect map as either an undirected or a directed GM. Two such examples are shown below.
Left: A distribution with no possible DGM representation, which entails and . Right: The v-structure is a distribution with no UGM representation.
Undirected Graphical Models - Overview
- There can only be symmetric relationships between a pair of nodes (random variables). In other words, there is no causal effect from one random variable to another.
- The model can represent properties and configurations of a distribution, but it cannot generate samples explicitly.
- Each node has strong correlations with its neighbors.
Let each node represent an image patch. It is impossible to tell what is inside an image patch by isolating it from the others. However, when we look at its neighboring image patches, we can see that it is an image patch of water. Because the relationships between neighboring image patches should be symmetric, an image is best represented by an undirected graphical model. This particular undirected graphical model is also known as the grid model.
- Cliques are subgraphs that are fully connected.
- A maximal clique is a clique that cannot be extended: adding any other node of the graph yields a subgraph that is no longer fully connected.
- A sub-clique is a not-necessarily-maximal clique.
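To make these definitions concrete, here is a minimal sketch that enumerates the maximal cliques of a small undirected graph with the basic Bron-Kerbosch recursion. The four-node graph with nodes `A`, `B`, `C`, `D` and five edges is a hypothetical example chosen for illustration.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion.

    adj maps each node to the set of its neighbors.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend r, x: already explored.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

def is_clique(adj, nodes):
    """True if every pair of distinct nodes in `nodes` is connected."""
    nodes = list(nodes)
    return all(v in adj[u] for i, u in enumerate(nodes) for v in nodes[i + 1:])

# Hypothetical graph: all pairs connected except A and C.
edges = [("A", "B"), ("A", "D"), ("B", "D"), ("B", "C"), ("C", "D")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

for clique in maximal_cliques(adj):
    print(sorted(clique))
```

Every edge, such as `{"A", "B"}`, is a sub-clique here, but it is not maximal because it can be extended by `D`.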
Each clique $c$ can be associated with a potential function $\psi_c(\mathbf{x}_c)$, which can be understood as a provisional function of its arguments that assigns a pre-probabilistic score to their joint configuration. This potential function can be somewhat arbitrary, but must be non-negative.
Why cliques? Each component of the clique contributes to the overall potential function.
Potential functions are not necessarily probabilistic:
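This model, the chain $X - Y - Z$, implies that $X \perp Z \mid Y$. This independence statement implies (by definition) that the joint must factorize as:
$$p(x, y, z) = p(y)\, p(x \mid y)\, p(z \mid y)$$
Probability distributions can be used as potential functions: the joint can be written as $p(x, y)\, p(z \mid y)$ or as $p(x \mid y)\, p(z, y)$. However, in this case, we cannot let all potentials be marginal probabilities, nor can we let all of them be conditional probabilities. So in general the potential functions for a graph cannot all be interpreted as probability distributions.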
Gibbs Distribution and Undirected Graphical Model Definition
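Given an undirected graph $H$ and clique potential functions $\Psi = \{\psi_c\}$ associated with the cliques of $H$, we say $P$ is a Gibbs distribution over $H$ if it can be represented as
$$P(x_1, \dots, x_n) = \frac{1}{Z} \prod_{c \in C} \psi_c(\mathbf{x}_c)$$
where $Z = \sum_{x_1, \dots, x_n} \prod_{c \in C} \psi_c(\mathbf{x}_c)$ is also known as the partition function. Upper case $C$ denotes the set of all cliques, and lower case $c$ denotes a clique associated with a set of random variables $\mathbf{x}_c$.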
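An undirected graphical model represents a distribution $P(x_1, \dots, x_n)$ defined by an undirected graph $H$, a set of positive potential functions $\psi_c$ and the associated cliques $c \in C$ of $H$, such that
$$P(x_1, \dots, x_n) = \frac{1}{Z} \prod_{c \in C} \psi_c(\mathbf{x}_c)$$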
Note that this distribution is the Gibbs distribution.
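As a minimal sketch of this definition, the following brute-force routine multiplies the clique potentials for every joint assignment and normalizes by the partition function. The function name `gibbs_distribution`, the chain $X - Y - Z$, and the potential values are assumptions chosen for illustration; enumeration like this is only feasible for very small models.

```python
from itertools import product

def gibbs_distribution(cards, factors):
    """Normalize a product of clique potentials by brute-force enumeration.

    cards:   {variable name: number of states}
    factors: list of (scope, potential) pairs, where scope is a tuple of
             variable names and potential is a non-negative function.
    Returns (Z, table); table maps assignments (tuples ordered like `cards`)
    to probabilities.
    """
    names = list(cards)
    unnormalized = {}
    for values in product(*(range(cards[v]) for v in names)):
        assignment = dict(zip(names, values))
        score = 1.0
        for scope, psi in factors:
            score *= psi(*(assignment[v] for v in scope))
        unnormalized[values] = score
    Z = sum(unnormalized.values())
    return Z, {k: s / Z for k, s in unnormalized.items()}

# Hypothetical potentials on the two cliques of the chain X - Y - Z.
cards = {"X": 2, "Y": 2, "Z": 2}
factors = [
    (("X", "Y"), lambda x, y: 3.0 if x == y else 1.0),
    (("Y", "Z"), lambda y, z: 2.0 if y == z else 0.5),
]
Z, p = gibbs_distribution(cards, factors)
print(Z, p[(0, 0, 0)])
```

Note that neither potential is a probability table on its own; only the normalized product is a distribution.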
Example UGM Models
Depending on the question of interest, different representations may be more appropriate.
Using Max Cliques
We only need to represent the four discrete nodes with two 3D tables instead of one 4D table.
Using Pairwise Cliques
We only need to represent the four discrete nodes with five 2D tables instead of one 4D table.
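The table counts above translate directly into numbers of entries. The sketch below assumes, for illustration only, that every node takes the same number of states `k`; the function name is hypothetical.

```python
def table_entries(k):
    """Entries needed for four k-state nodes under each representation."""
    full_joint = k ** 4        # one 4D table
    max_cliques = 2 * k ** 3   # two 3D tables
    pairwise = 5 * k ** 2      # five 2D tables
    return full_joint, max_cliques, pairwise

for k in (2, 3, 10):
    print(k, table_entries(k))
```

Under this assumption the savings only appear once the nodes have more than two states: for binary nodes the two 3D tables already hold as many entries as the full 4D table.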
Using Canonical Representation
Even if we use a fine-grained representation, the Markov network is often overparameterized. For any given distribution, there are multiple choices of parameters that describe it in the model. As shown above, we can choose either max cliques or pairwise cliques to represent this model. Furthermore, ambiguities can arise in clique structures. For example, given a pair of cliques $\{A, B\}$ and $\{B, C\}$, the information about $B$ can be placed in either of the two cliques, resulting in many ways to specify the same distribution.
The canonical representation provides a natural approach to avoid this problem. It is defined over all non-empty cliques as shown below.
Global Markov Independency
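Suppose we are given the following UGM, denoted by $H$:
$Y$ separates $X$ and $Z$ if every path from a node in $X$ to a node in $Z$ passes through a node in $Y$:
$$\text{sep}_H(X; Z \mid Y)$$
A probability distribution satisfies the global Markov property if for any disjoint $X$, $Y$, $Z$ such that $Y$ separates $X$ and $Z$, $X$ is independent of $Z$ given $Y$:
$$\mathcal{I}(H) = \{(X \perp Z \mid Y) : \text{sep}_H(X; Z \mid Y)\}$$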
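Separation in an undirected graph is a plain reachability question, so it can be sketched as a breadth-first search that is never allowed to step onto the conditioning set. The function name `separated` and the four-node cycle are assumptions for illustration.

```python
from collections import deque

def separated(adj, xs, zs, ys):
    """True if every path from xs to zs in the undirected graph hits ys."""
    xs, zs, ys = set(xs), set(zs), set(ys)
    frontier = deque(xs - ys)
    seen = set(frontier)
    while frontier:
        v = frontier.popleft()
        if v in zs:
            return False  # found a path that avoids ys
        for w in adj[v] - ys - seen:
            seen.add(w)
            frontier.append(w)
    return True

# Hypothetical 4-cycle A - B - C - D - A.
adj = {
    "A": {"B", "D"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"A", "C"},
}
print(separated(adj, {"A"}, {"C"}, {"B", "D"}))
print(separated(adj, {"A"}, {"C"}, {"B"}))
```

Conditioning on `B` alone is not enough in this cycle, because the path through `D` remains open.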
Local Markov Independency
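For each node $X_i \in V$, there is a unique Markov blanket of $X_i$, denoted $MB_{X_i}$, which is the set of neighbors of $X_i$ in the graph.
The local Markov independencies ($\mathcal{I}_l(H)$) associated with $H$ are:
$$\mathcal{I}_l(H) = \{(X_i \perp V - \{X_i\} - MB_{X_i} \mid MB_{X_i}) : \forall i\}$$
In other words, $X_i$ is independent of the rest of the nodes given its immediate neighbors $MB_{X_i}$.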
Soundness and Completeness of Global Markov Property
The global Markov property for UGMs is similar to its variant for DGMs, in the sense that they both attain similar soundness and completeness results.
Theorem: Let $P$ be a distribution over $\mathcal{X}$, and $H$ a Markov network structure over $\mathcal{X}$. If $P$ is a Gibbs distribution that factorizes over $H$, then $H$ is an I-map for $P$.
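Proof: Let $X$, $Y$, $Z$ be three disjoint subsets in $\mathcal{X}$ such that $Y$ separates $X$ and $Z$ in $H$. We will show that $P \models (X \perp Z \mid Y)$.
First, we observe that there is no direct edge from $X$ to $Z$. Assuming that $X \cup Y \cup Z = \mathcal{X}$, we know that any clique in $H$ is fully contained in either $X \cup Y$ or $Z \cup Y$. Let $I_X$ be the indices of the set of cliques that are contained in $X \cup Y$, and $I_Z$ be the indices of the remaining cliques. We know that
$$P(x_1, \dots, x_n) \propto \prod_{i \in I_X} \psi_i(D_i) \cdot \prod_{i \in I_Z} \psi_i(D_i)$$
where $D_i$ is the set of variables in clique $i$ and the constant of proportionality is the inverse of the partition function.
None of the terms in the first product contains a variable from $Z$, and none of the terms in the second contains a variable from $X$. Hence, we can rewrite this product in the form:
$$P(x_1, \dots, x_n) \propto f(X, Y)\, g(Z, Y)$$
and we observe that independence follows.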
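Now suppose $X \cup Y \cup Z$ is a strict subset of $\mathcal{X}$. Let $U = \mathcal{X} - (X \cup Y \cup Z)$. We can partition $U$ into two disjoint sets $U_1$ and $U_2$ such that $Y$ separates $X \cup U_1$ from $Z \cup U_2$ in $H$. Using our argument from the partition case, we have that $P \models (X \cup U_1 \perp Z \cup U_2 \mid Y)$. Applying the decomposition property of conditional independence, we obtain $P \models (X \perp Z \mid Y)$.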
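The soundness result can be checked numerically on a small example. The sketch below is an illustration rather than a proof: it builds a Gibbs distribution on a hypothetical 4-cycle $A - B - C - D - A$ from random positive pairwise potentials, then measures how far $P(a, c \mid b, d)$ is from $P(a \mid b, d)\, P(c \mid b, d)$. All names and potential values are assumptions.

```python
import random
from itertools import product

random.seed(0)
names = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
# One random positive 2x2 table per edge (hypothetical potentials).
psi = {e: [[random.uniform(0.1, 2.0) for _ in range(2)] for _ in range(2)]
       for e in edges}

def unnormalized(assign):
    """Product of the edge potentials for one full assignment."""
    score = 1.0
    for u, v in edges:
        score *= psi[(u, v)][assign[u]][assign[v]]
    return score

joint = {vals: unnormalized(dict(zip(names, vals)))
         for vals in product(range(2), repeat=4)}
Z = sum(joint.values())
joint = {k: v / Z for k, v in joint.items()}  # keys are (a, b, c, d)

def max_independence_gap():
    """Largest |P(a,c|b,d) - P(a|b,d) P(c|b,d)| over all assignments."""
    gap = 0.0
    for b, d in product(range(2), repeat=2):
        p_bd = sum(joint[(a, b, c, d)] for a in range(2) for c in range(2))
        for a, c in product(range(2), repeat=2):
            p_ac = joint[(a, b, c, d)] / p_bd
            p_a = sum(joint[(a, b, c2, d)] for c2 in range(2)) / p_bd
            p_c = sum(joint[(a2, b, c, d)] for a2 in range(2)) / p_bd
            gap = max(gap, abs(p_ac - p_a * p_c))
    return gap

print(max_independence_gap())
```

Because $\{B, D\}$ separates $A$ from $C$ in the cycle, the gap is zero up to floating-point error, whatever positive potentials are drawn.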
Completeness (Hammersley-Clifford theorem)
Theorem: Let $P$ be a positive distribution over $\mathcal{X}$, and $H$ a Markov network graph over $\mathcal{X}$. If $H$ is an I-map for $P$, then $P$ is a Gibbs distribution that factorizes over $H$.
This result shows that, for positive distributions, the global independencies imply that the distribution factorizes according to the network structure. Thus, for this class of distributions, we have that a distribution $P$ factorizes over a Markov network $H$ if and only if $H$ is an I-map for $P$.
Other Markov Properties
For UGMs, we defined I-maps in terms of global Markov properties. We will now define local independence. Intuitively, when two variables are not directly linked, there must be some way of rendering them conditionally independent. Specifically, we can require that $X$ and $Y$ be independent given all other nodes in the graph.
Let $H$ be a Markov network. We define the pairwise independencies associated with $H$ to be
$$\mathcal{I}_p(H) = \{(X \perp Y \mid \mathcal{X} - \{X, Y\}) : X - Y \notin H\}$$
To illustrate this idea, observe that in the figure above, the variables of interests, and , are conditionally independent given all other nodes in the graph, .
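Pairwise and local independencies are also related, both to each other and to the global independencies (written $\mathcal{I}_p(H)$, $\mathcal{I}_l(H)$, and $\mathcal{I}(H)$ respectively). Their relationships are described in the following propositions and theorem.
1. For any Markov network $H$ and any distribution $P$, we have that if $P \models \mathcal{I}_l(H)$ then $P \models \mathcal{I}_p(H)$.
2. For any Markov network $H$ and any distribution $P$, we have that if $P \models \mathcal{I}(H)$ then $P \models \mathcal{I}_l(H)$.
3. Let $P$ be a positive distribution. If $P$ satisfies $\mathcal{I}_p(H)$, then $P$ satisfies $\mathcal{I}(H)$.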
The following are equivalent for a positive distribution $P$:
- $P \models \mathcal{I}_l(H)$
- $P \models \mathcal{I}_p(H)$
- $P \models \mathcal{I}(H)$
Since enforcing positivity directly on the clique potentials is inconvenient, the exponential form is used to represent a clique potential through an unconstrained real-valued “energy” function $\phi_c$:
$$\psi_c(\mathbf{x}_c) = \exp\{-\phi_c(\mathbf{x}_c)\}$$
This then gives the joint probability a nice additive structure
$$p(\mathbf{x}) = \frac{1}{Z} \exp\Big\{-\sum_{c \in C} \phi_c(\mathbf{x}_c)\Big\}$$
where the sum in the exponent is called the “free energy”:
$$\sum_{c \in C} \phi_c(\mathbf{x}_c)$$
This form of representation is called the “Boltzmann distribution” in physics, and a log-linear model in statistics.
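A small numerical sketch of the energy form follows. It uses hypothetical energies on the two cliques of a chain $X - Y - Z$ and checks that normalizing $\exp$ of the negated summed energy gives the same distribution as normalizing the product of potentials $\exp\{-\phi_c\}$. The energy values, including a negative one, are assumptions chosen for illustration.

```python
import math
from itertools import product

# Hypothetical real-valued energies; they may be negative.
def phi_xy(x, y):
    return -1.0 if x == y else 0.5

def phi_yz(y, z):
    return 0.3 * (y - z) ** 2

def free_energy(x, y, z):
    """Sum of the clique energies for one joint configuration."""
    return phi_xy(x, y) + phi_yz(y, z)

states = list(product(range(2), repeat=3))

# Additive form: exponentiate the negated free energy, then normalize.
Z_energy = sum(math.exp(-free_energy(*s)) for s in states)
p_energy = {s: math.exp(-free_energy(*s)) / Z_energy for s in states}

# Product form: multiply the potentials psi_c = exp(-phi_c), then normalize.
def psi_product(x, y, z):
    return math.exp(-phi_xy(x, y)) * math.exp(-phi_yz(y, z))

Z_product = sum(psi_product(*s) for s in states)
p_product = {s: psi_product(*s) / Z_product for s in states}

print(max(abs(p_energy[s] - p_product[s]) for s in states))
```

The energies are unconstrained, yet every induced potential is strictly positive, which is what makes this parameterization convenient for learning.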
Undirected Graph Examples
In this section, we cover several well-known undirected graphical models: Boltzmann Machine (BM), Ising model, Restricted Boltzmann Machine (RBM), and Conditional Random Field (CRF).
Boltzmann Machine (BM)
A Boltzmann Machine is a fully connected graph with pairwise (edge) potentials on binary-valued nodes. One example is shown in the following figure:
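Its probability distribution can be written as:
$$P(x_1, \dots, x_n) = \frac{1}{Z} \exp\Big\{ \sum_{i<j} \theta_{ij} x_i x_j + \sum_i \alpha_i x_i \Big\}$$
where $\theta_{ij}$ is the weight on edge $(i, j)$ and $\alpha_i$ is the bias on node $i$.
It could also be written in a quadratic way, with $\Theta$ the symmetric, zero-diagonal matrix of weights $\theta_{ij}$:
$$P(\mathbf{x}) = \frac{1}{Z} \exp\Big\{ \tfrac{1}{2} \mathbf{x}^\top \Theta \mathbf{x} + \boldsymbol{\alpha}^\top \mathbf{x} \Big\}$$
Hence the overall free energy function has the form:
$$-\tfrac{1}{2} \mathbf{x}^\top \Theta \mathbf{x} - \boldsymbol{\alpha}^\top \mathbf{x}$$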
which can then be solved using quadratic programming.
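For a handful of nodes the Boltzmann Machine distribution can be computed exactly by enumeration. The sketch below assumes the common parameterization with binary states $x_i \in \{0, 1\}$, symmetric edge weights `theta`, and node biases `alpha`; these names and the example values are assumptions, and the notation in the lecture's figure may differ.

```python
import math
from itertools import product

def boltzmann_machine(theta, alpha):
    """Exact distribution of a small Boltzmann Machine by enumeration.

    theta: symmetric weight matrix (list of lists) with zero diagonal.
    alpha: list of node biases. States are x_i in {0, 1}.
    """
    n = len(alpha)

    def score(x):
        # Each edge (i, j) with i < j is counted once.
        pair = sum(theta[i][j] * x[i] * x[j]
                   for i in range(n) for j in range(i + 1, n))
        bias = sum(alpha[i] * x[i] for i in range(n))
        return pair + bias

    states = list(product((0, 1), repeat=n))
    Z = sum(math.exp(score(x)) for x in states)
    return {x: math.exp(score(x)) / Z for x in states}

# Hypothetical three-node machine with equal positive couplings.
theta = [[0.0, 1.0, 1.0],
         [1.0, 0.0, 1.0],
         [1.0, 1.0, 0.0]]
alpha = [0.0, 0.0, 0.0]
p = boltzmann_machine(theta, alpha)
print(max(p, key=p.get), p[(1, 1, 1)])
```

The number of states doubles with every added node, so exact enumeration is only practical for toy models; with all parameters set to zero the distribution is uniform.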
Ising Model
In the Ising model, nodes are arranged in a regular topology (often a regular packing grid) and connected only to their geometric neighbors. It is like a sparse Boltzmann Machine. There is also the multi-state Ising model (also called the Potts model), in which nodes can take multiple values instead of just binary values. One example of an Ising model is shown in the following figure:
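Its probability distribution can be written as
$$P(\mathbf{x}) = \frac{1}{Z} \exp\Big\{ \sum_{(i,j) \in E} \theta_{ij} x_i x_j + \sum_i \theta_{i0} x_i \Big\}$$
where $E$ is the set of edges between neighboring nodes.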
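The sparsity of the Ising model means the conditional distribution of one node depends only on its grid neighbors. The following sketch assumes spins in $\{-1, +1\}$, a single shared coupling `theta`, and a single shared field `theta0`; these simplifications and all names are assumptions made for illustration.

```python
import math

def grid_neighbors(rows, cols):
    """Map each grid site (r, c) to its up/down/left/right neighbors."""
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            nbrs[(r, c)] = [(a, b) for a, b in cand
                            if 0 <= a < rows and 0 <= b < cols]
    return nbrs

def log_score(x, nbrs, theta, theta0):
    """Unnormalized log-probability; each edge is counted once."""
    pair = sum(theta * x[u] * x[v] for u in nbrs for v in nbrs[u] if u < v)
    return pair + sum(theta0 * x[u] for u in nbrs)

def conditional_up(x, site, nbrs, theta, theta0):
    """P(x_site = +1 | all other spins), computed from the neighbors only."""
    field = theta * sum(x[v] for v in nbrs[site]) + theta0
    return 1.0 / (1.0 + math.exp(-2.0 * field))

# Hypothetical 2x2 grid and spin configuration.
nbrs = grid_neighbors(2, 2)
x = {(0, 0): 1, (0, 1): -1, (1, 0): -1, (1, 1): 1}
print(conditional_up(x, (0, 0), nbrs, theta=0.7, theta0=0.2))
```

A Gibbs sampler for this model would repeatedly pick a site and resample its spin from `conditional_up`, never touching the partition function.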
Restricted Boltzmann Machine (RBM)
The Restricted Boltzmann Machine is a bipartite graph with connections between one layer of hidden units and one layer of visible units. One example is shown in the following figure:
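Its probability distribution can be written as
$$P(\mathbf{x}, \mathbf{h}) = \frac{1}{Z} \exp\Big\{ \sum_i b_i x_i + \sum_j c_j h_j + \sum_{i,j} x_i W_{ij} h_j \Big\}$$
where $\mathbf{x}$ are the visible units, $\mathbf{h}$ the hidden units, $b_i$ and $c_j$ their biases, and $W_{ij}$ the weight on the edge between $x_i$ and $h_j$.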
The RBM has some appealing properties. For example, the hidden factors are marginally dependent but conditionally independent given observations on the visible nodes. This enables iterative Gibbs sampling for inference and learning on the RBM. If the edges in the RBM were directed, there would be plenty of V-structures in the graph (lots of dependencies) that increase the inference difficulty.
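The conditional independence of the hidden units can be verified on a toy model. This sketch assumes a binary RBM with visible biases `b`, hidden biases `c`, and weights `W` (all hypothetical names and values), and compares the brute-force posterior over hidden configurations with the product of per-unit sigmoids.

```python
import math
from itertools import product

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def score(x, h, W, b, c):
    """Unnormalized log-probability of a joint (visible, hidden) state."""
    s = sum(b[i] * x[i] for i in range(len(x)))
    s += sum(c[j] * h[j] for j in range(len(h)))
    s += sum(x[i] * W[i][j] * h[j]
             for i in range(len(x)) for j in range(len(h)))
    return s

def hidden_posterior_bruteforce(x, W, b, c):
    """P(h | x) by enumerating every hidden configuration."""
    hs = list(product((0, 1), repeat=len(c)))
    weights = [math.exp(score(x, h, W, b, c)) for h in hs]
    total = sum(weights)
    return {h: w / total for h, w in zip(hs, weights)}

def hidden_posterior_factorized(x, W, b, c):
    """P(h | x) as a product of independent per-unit sigmoids."""
    on = [sigmoid(c[j] + sum(x[i] * W[i][j] for i in range(len(x))))
          for j in range(len(c))]
    out = {}
    for h in product((0, 1), repeat=len(c)):
        prob = 1.0
        for j, hj in enumerate(h):
            prob *= on[j] if hj else 1.0 - on[j]
        out[h] = prob
    return out

# Hypothetical RBM with 3 visible and 2 hidden units.
W = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
b = [0.1, -0.2, 0.0]
c = [0.4, -0.6]
x = (1, 0, 1)
exact = hidden_posterior_bruteforce(x, W, b, c)
fast = hidden_posterior_factorized(x, W, b, c)
print(max(abs(exact[h] - fast[h]) for h in exact))
```

The factorized form costs one pass over the weights instead of a sum over all hidden configurations, which is why block Gibbs sampling alternates between the two layers.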
Conditional Random Field (CRF)
A conditional random field is the undirected analogue of an HMM. It allows arbitrary dependencies on the input. For example, when labeling one position in the sequence, future observations can be taken into account. An example of a CRF is shown in the figure:
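The probability distribution could be written as
$$P_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\theta, \mathbf{x})} \exp\Big\{ \sum_c \theta_c f_c(\mathbf{x}, \mathbf{y}_c) \Big\}$$
where $\mathbf{y}$ are the labels, $\mathbf{x}$ is the input, $f_c$ is a feature function on clique $c$, and $Z(\theta, \mathbf{x})$ normalizes over all labelings for the given input.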
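To illustrate how a CRF conditions on the whole input, here is a minimal linear-chain sketch that normalizes over all label sequences by brute force. The label set, the scoring functions, and the toy sentence are hypothetical; a practical implementation would use dynamic programming (forward-backward) rather than enumeration.

```python
import math
from itertools import product

def crf_conditional(x, labels, node_score, edge_score):
    """P(y | x) for a linear chain, enumerating every label sequence."""
    def total(y):
        s = sum(node_score(y[t], x, t) for t in range(len(x)))
        s += sum(edge_score(y[t - 1], y[t]) for t in range(1, len(x)))
        return s

    seqs = list(product(labels, repeat=len(x)))
    Zx = sum(math.exp(total(y)) for y in seqs)  # depends on the input x
    return {y: math.exp(total(y)) / Zx for y in seqs}

def node_score(label, x, t):
    """Hypothetical features: capitalization, plus a look at the NEXT word."""
    s = 1.0 if (label == "N") == x[t][0].isupper() else 0.0
    if t + 1 < len(x) and x[t + 1] == "said" and label == "N":
        s += 0.5  # future observation influences the current label
    return s

def edge_score(prev, cur):
    """Hypothetical transition score that mildly favors label changes."""
    return 0.3 if prev != cur else 0.0

x = ["Alice", "said", "hi"]
p = crf_conditional(x, ("N", "O"), node_score, edge_score)
print(max(p, key=p.get))
```

The normalizer is recomputed for each input sequence, which is what distinguishes the conditional model from a joint model over labels and observations.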