Recently there has been significant activity in developing algorithms with provable guarantees for topic modeling. In standard topic models, a topic (such as sports, business, or politics) is viewed as a probability distribution \vec a_i over words, and a document is assumed to be generated by first selecting a mixture \vec w over topics, and then generating words i.i.d. from the associated mixture \vec w^T A. Given a large collection of such documents, the goal is to recover the topic vectors and then to correctly classify new documents according to their topic mixture.
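The standard generative process described above can be sketched as follows; all sizes (number of topics, vocabulary, document length) are hypothetical and chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: k topics, vocabulary of V words, documents of doc_len words.
k, V, doc_len = 3, 10, 20

# A: each row a_i is a topic, i.e. a probability distribution over words.
A = rng.dirichlet(np.ones(V), size=k)   # shape (k, V)

# Generate one document: first draw a topic mixture w,
# then draw words i.i.d. from the mixed word distribution w^T A.
w = rng.dirichlet(np.ones(k))           # topic mixture over the k topics
word_dist = w @ A                       # resulting distribution over words
doc = rng.choice(V, size=doc_len, p=word_dist)
```

Given many documents generated this way, the recovery problem is to estimate the rows of `A` and, for a new document, its mixture `w`.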
In this work we consider a generalization of this framework in which words are no longer assumed to be drawn i.i.d., and instead a topic is a complex distribution over sequences of paragraphs. Since one could not hope to even represent such a distribution in general (even if paragraphs are given using some natural feature representation), we aim instead to directly learn a document classifier. That is, we efficiently learn a predictor that, given a new document, accurately predicts its topic mixture, without learning the distributions explicitly.
More generally, our model can be viewed as a generalization of the multi-view or co-training setting in machine learning.
We use tools from learning theory, perturbation theory, and matrix concentration to achieve our results.
Based on joint work with Avrim Blum.
Nika Haghtalab is a Ph.D. student in the Computer Science Department at Carnegie Mellon University, co-advised by Avrim Blum and Ariel Procaccia. Her research interests are in learning theory, economics, and algorithms. Nika is a recipient of the IBM and Microsoft Research Ph.D. fellowships.