Hoffman, UAI 1999

Probabilistic Latent Sematic Analysis

This paper introduces the Probabilistic latent semantic analysis (PLSA) technique. As its name suggests, PLSA evolves from a statistical view of Latent Semantic Analysis (LSA), and both of them aim to discover hidden semantic relationships between two co-occurring objects.

LSA is a well known technique that performs Singular Value Decomposition to downsize the co-occurrence table (matrix). The underlying idea is to map high-dimensional observations (co-occurrence counts) to a lower dimensional representation in a so-called latent semantic space. Due to its generality, LSA has been widely used since it was proposed; however, it has several limitations. The resulting dimensions generated from LSA are sometimes difficult to interpret; LSA assumes a join Gaussian distribution of co-occurring data, which does not match observed data (Poisson distribution). Also, theoretical foundation for LSA is still lacking.

In contrast, PLSA explicitly defines a principled data-generation process, lending itself solid statistics foundations. It is based on a latent variable model for co-occurrence data (called the aspect model), which associates an unobserved class variable with each observation. In the setting of text documents analysis, an observation corresponds to a word w occurring in a particular document d, and the unobserved class variable z can be thought of as the associated topic. Then the joint probability model of w and d is defined by the mixture:

$P(d,w) = P(d)P(w|d), where P(w|d) = \sum_{z \in Z} P(w|z)P(z|d)$

According to this definition, PLSA defines a generative model for word/document co-occurrences by first selecting a document di with probability P(di), then picking a latent class zk with probability P(zk | di ), and finally generating a word wj with probability P(wj | zk). The underlying assumption is that d and w re conditionally independent given the latent variable z.

To balance the number of free parameters for both word and document distributions, the model can be equivalently expressed in a symmetric formulation, where w and d are both generated from the latent class z in similar ways:

$P(d,w) = \sum_{z \in Z} P(z)P(w|z)P(d|z)$

To estimate the parameters (P(z), P(w | z), P(d | z)) in this latent variable mixture model, a standard procedure is Expectation Maximization (EM). EM tries to maximize likelihood of the training data, and hence prone to over-fitting. To address this problem, the paper proposes a generalization of EM called Tempered Expectation Maximization (TEM). TEM is essentially EM combined with deterministic annealing, with annealing parameters tuned on a held-out dataset.

The paper evaluates the model performance in two tasks: perplexity minimization for different types of linguistic data collections, and automated document indexing. The results demonstrate the advantages of explicitly minimizing perplexity by TEM, and indicate substantial and consistent improvements of PLSA over standard LSA, even in applications which superficially do not seem to be directly related to perplexity reduction.

Though the method is introduced in the context of text document analysis, the co-occurrence of any couple of discrete variables may be modeled by PSLA in exactly the same way. Due to its great flexibility, PLSA has been widely and successfully used in variety of application domain, including information retrieval, text learning ,and co-citation analysis etc.

On the other hand, PLSA also has its drawbacks (not in the paper). For one thing, the number of parameters grows linearly with the number of documents. In addition, although PLSA is a generative model of the documents in the collection it is estimated on, it is not a proper generative model for new documents. That's where Latent Dirichlet Allocation, Blei et. al. 2003 comes to help.