Exponential Family PCA

The conventional view of PCA is a geometric one, finding a low-dimensional projection that minimizes the squared-error loss. An alternate view is a probabilistic one: if the data consist of samples drawn from a probability distribution, then PCA is an algorithm for finding the parameters of the generative distribution that maximize the likelihood of the data. The squared-error loss function corresponds to an assumption that the data is generated from a Gaussian distribution. Collins, Dasgupta & Schapire [CDS02] demonstrated that PCA can be generalized to a range of loss functions by modeling the data with different exponential families of probability distributions such as Gaussian, binomial, or Poisson. Each such exponential family distribution corresponds to a different loss function for a variant of PCA, and Collins, Dasgupta & Schapire [CDS02] refer to the generalization of PCA to arbitrary exponential family data-likelihood models as ``Exponential family PCA'' or E-PCA.

An E-PCA model represents the reconstructed data using a
low-dimensional weight vector , a basis matrix , and a
*link function* :

(4) |

Each E-PCA model uses a different link function, which can be derived from its data likelihood model (and its corresponding error distribution and loss function). The link function is a mapping from the data space to another space in which the data can be linearly represented.

The link function is the mechanism through which E-PCA generalizes dimensionality reduction to non-linear models. For example, the identity link function corresponds to Gaussian errors and reduces E-PCA to regular PCA, while the sigmoid link function corresponds to Bernoulli errors and produces a kind of ``logistic PCA'' for 0-1 valued data. Other nonlinear link functions correspond to other non-Gaussian exponential families of distributions.

We can find the parameters of an E-PCA model by maximizing the
log-likelihood of the data under the model, which has been
shown [CDS02] to be equivalent to minimizing a generalized
Bregman divergence

between the low-dimensional and high-dimensional representations, which we can solve using convex optimization techniques. (Here is a convex function whose derivative is , while is the convex dual of . We can ignore for the purpose of minimizing equation 5 since the value of is fixed.) The relationship between PCA and E-PCA through link functions is reminiscent of the relationship between linear regression and Generalized Linear Models [MN83].

To apply E-PCA to belief compression, we need to choose a link function which accurately reflects the fact that our beliefs are probability distributions. If we choose the link function

then it is not hard to verify that and . So, equation 5 becomes

If we write , then equation 7 becomes

Our choice of loss and link functions has two advantages: first, the exponential link function constrains our low-dimensional representation to be positive. Second, our error model predicts that the variance of each belief component is proportional to its expected value. Since PCA makes significant errors close to 0, we wish to increase the penalty for errors in small probabilities, and this error model accomplishes that.

If we compute the loss for all , ignoring terms that depend only on the data , then^{6}

The introduction of the link function raises a question: instead of
using the complex machinery of E-PCA, could we just choose some
non-linear function to project the data into a space where it is linear, and
then use conventional PCA? The difficulty with this approach is of course
identifying that function; in general, good link functions for E-PCA
are not related to good nonlinear functions for application before
regular PCA. So, while it might appear reasonable to use PCA to find a
low-dimensional representation of the *log* beliefs, rather than use
E-PCA with an *exponential* link function to find a representation of the
beliefs directly, this approach performs poorly because the surface
is only locally well-approximated by a log projection. E-PCA can be
viewed as minimizing a weighted least-squares that chooses the distance metric
to be appropriately local. Using conventional PCA over log beliefs also
performs poorly in situations where the beliefs contain extremely small or
zero probability entries.

Finding the E-PCA Parameters

Algorithms for conventional PCA are guaranteed to converge to a unique answer independent of initialization. In general, E-PCA does not have this property: the loss function (8) may have multiple distinct local minima. However, the problem of finding the best given and is convex; convex optimization problems are well studied and have unique global solutions [Roc70]. Similarly, the problem of finding the best given and is convex. So, the possible local minima in the joint space of and are highly constrained, and finding and does not require solving a general non-convex optimization problem.

Gordon [Gor03] describes a fast, Newton's Method approach for
computing and which we summarize here. This algorithm
is related to Iteratively Reweighted Least Squares, a popular
algorithm for generalized linear regression [MN83]. In
order to use Newton's Method to minimize equation (8), we need its
derivative with respect to and :

and

If we set the right hand side of equation (14) to zero, we can
iteratively compute
, the column of
, by Newton's method. Let us set
, and linearize about
to find roots of . This gives

Note that equation 15 is a formulation of Newton's method for finding roots of , typically written as

(17) |

We need an expression for :

We define in terms of the operator that returns a diagonal matrix:

(21) |

where

(22) |

Combining equation (15) and equation (20), we get

(23) | |||

(24) | |||

(25) |

which is a weighted least-squares problem that can be solved with standard linear algebra techniques. In order to ensure that the solution is numerically well-conditioned, we typically add a regularizer to the divisor, as in

where is the identity matrix. Similarly, we can compute a new by computing , the row of , as

We now have an algorithm for automatically finding a good low-dimensional representation for the high-dimensional belief set . This algorithm is given in Table 1; the optimization is iterated until some termination condition is reached, such as a finite number of iterations, or when some minimum error is achieved.

- Collect a set of sample beliefs from the high-dimensional belief space
- Assemble the samples into the data matrix
- Choose an appropriate loss function,
- Fix an initial estimate for and randomly
- do
- For each column ,
- Compute using current estimate from equation (26)
- For each row ,
- Compute using new estimate from equation (27)
- while

The steps 7 and 9 raise one issue. Although solving for each row of or column of separately is a convex optimization problem, solving for the two matrices simultaneously is not. We are therefore subject to potential local minima; in our experiments we did not find this to be a problem, but we expect that we will need to find ways to address the local minimum problem in order to scale to even more complicated domains.

Once the bases are found, finding the low-dimensional representation of a high-dimensional belief is a convex problem; we can compute the best answer by iterating equation (26). Recovering a full-dimensional belief from the low-dimensional representation is also very straightforward:

Our definition of PCA does not explicitly factor the data into , and as many presentations do. In this three-part representation of PCA, contains the singular values of the decomposition, and and are orthonormal. We use the two-part representation because there is no quantity in the E-PCA decomposition which corresponds to the singular values in PCA. As a result, and will not in general be orthonormal. If desired, though, it is possible to orthonormalize as an additional step after optimization using conventional PCA and adjust accordingly.

- ... distribution.
^{5} - If we had chosen the link function we would have arrived at the normalized KL divergence, which is perhaps an even more intuitively reasonable way to measure the error in reconstructing a probability distribution. This more-complicated link function would have made it more difficult to derive the Newton equations in the following pages, but not impossible; we have experimented with the resulting algorithm and found that it produces qualitatively similar results to the algorithm described here. Using the normalized KL divergence does have one advantage: it can allow us to get away with one fewer basis function during planning, since for unnormalized KL divergence the E-PCA optimization must learn a basis which can explicitly represent the normalization constant.
- ... then
^{6} - E-PCA is related to Lee and Seung's [LS99] non-negative matrix factorization. One of the NMF loss functions presented by Lee & Seung [LS99] penalizes the KL-divergence between a matrix and its reconstruction, as we do in equation 8; but, the NMF loss does not incorporate a link function and so is not an E-PCA loss. Another NMF loss function presented by Lee & Seung [LS99] penalizes squared error but constrains the factors to be nonnegative; the resulting model is an example of a (GL)M, a generalization of E-PCA described by Gordon [Gor03].