Algorithms for Generalized Topic Modeling
October 26, 2016

Recently there has been significant activity in developing algorithms with provable guarantees for topic modeling [Arora et al.'12, Bansal et al.'14, Anandkumar et al.'14]. In standard topic models, a topic (such as sports, business, or politics) is viewed as a probability distribution \vec{a}_i over words, and a document is assumed to be generated by first selecting a mixture \vec{w} over topics and then generating words i.i.d. from the associated mixture distribution \vec{w}^T A. Given a large collection of such documents, the goal is to recover the topic vectors and then to correctly classify new documents according to their topic mixtures. In this work we consider a generalization of this framework in which words are no longer assumed to be drawn i.i.d.; instead, a topic is a complex distribution over sequences of paragraphs. Since one could not hope even to represent such a distribution in general (even if paragraphs are given in some natural feature representation), we aim instead to learn a document classifier directly. That is, we efficiently learn a predictor that, given a new document, accurately predicts its topic mixture, without learning the distributions explicitly. More generally, our model can be viewed as a generalization of the multi-view or co-training setting in machine learning. We use tools from learning theory, perturbation theory, and matrix concentration to achieve our results. Based on joint work with Avrim Blum.
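
As a point of reference, the following is a minimal sketch in Python of the standard generative process described above: rows of A are the topic distributions \vec{a}_i, a mixture \vec{w} is drawn per document, and words are sampled i.i.d. from \vec{w}^T A. The Dirichlet prior over mixtures is an illustrative choice only; the model above assumes just some distribution over \vec{w}.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(A, doc_length, alpha=0.1):
    """Generate one document under the standard topic model.

    A : (k, V) array whose rows are the topic distributions a_i over words.
    The mixture w is drawn from a Dirichlet prior (illustrative; the model
    only assumes *some* distribution over mixtures), and the document's
    words are then drawn i.i.d. from the mixed distribution w^T A.
    """
    k, V = A.shape
    w = rng.dirichlet(alpha * np.ones(k))   # topic mixture for this document
    word_dist = w @ A                        # w^T A: a distribution over words
    words = rng.choice(V, size=doc_length, p=word_dist)
    return w, words

# Toy example: 3 topics over a 1000-word vocabulary.
A = rng.dirichlet(np.ones(1000), size=3)    # each row sums to 1
w, doc = generate_document(A, doc_length=200)
```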
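To make concrete what "predicting the topic mixture" of a new document means in the classical i.i.d. setting, here is a crude baseline estimator, assuming the topic matrix A is known: solve a nonnegative least-squares fit of the document's empirical word frequencies against the topic distributions and renormalize. This is only an illustration, not the recovery algorithm of the cited works or of this paper.

```python
import numpy as np
from scipy.optimize import nnls

def estimate_mixture(A, words, V):
    """Estimate a document's topic mixture, given known topics A (k x V).

    Compute the empirical word frequencies f of the document, solve
    min_{w >= 0} ||A^T w - f||_2, and renormalize so the estimate sums
    to 1. A crude baseline for the i.i.d. setting only.
    """
    f = np.bincount(words, minlength=V) / len(words)  # empirical word freqs
    w_hat, _ = nnls(A.T, f)                           # nonnegative least squares
    return w_hat / w_hat.sum()
```

Combined with the generator above, `estimate_mixture(A, doc, 1000)` approximately recovers the true mixture \vec{w} as documents grow long, since the empirical frequencies concentrate around \vec{w}^T A.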