The success of many recent methods in machine learning, including deep learning, relies upon finding a succinct encoding of the "semantic content" of the input. In this talk we focus on methods for extracting semantic content in natural language, such as word embeddings and sentence/paragraph embeddings, which represent semantic content as vectors. (However, our techniques seem relevant to other settings.)
We give a new theoretical model for language generation, in which a text corpus is imagined as being generated by a random walk in a latent variable space, with word production governed by a loglinear distribution. This model is shown to imply several empirically discovered past methods for word embedding, such as word2vec, GloVe, and PMI. It also casts new light on the structure of word embeddings, such as how word embeddings behave for words with multiple meanings (polysemy), and how to extract those "meanings" out of the embedding.
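The generative process described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact model: the dimensions, step size, and normalization below are hypothetical choices, and the only structure taken from the abstract is a slowly drifting latent "discourse" vector c_t with words emitted loglinearly, Pr[w | c] proportional to exp(<v_w, c>).

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 10, 50                         # latent dimension and vocab size (hypothetical)
word_vecs = rng.normal(size=(V, d))   # word vectors v_w

def loglinear_probs(c):
    # Loglinear word production: Pr[w | c] proportional to exp(<v_w, c>)
    logits = word_vecs @ c
    logits -= logits.max()            # numerical stability before exponentiating
    p = np.exp(logits)
    return p / p.sum()

# Slow random walk of the discourse vector c_t on the unit sphere
c = rng.normal(size=d)
c /= np.linalg.norm(c)
corpus = []
for t in range(20):
    corpus.append(int(rng.choice(V, p=loglinear_probs(c))))
    c = c + 0.1 * rng.normal(size=d)  # small random step: nearby words share context
    c /= np.linalg.norm(c)
```

Because the walk drifts slowly, words emitted close together in the corpus are drawn from similar distributions, which is what connects this model to co-occurrence-based embedding objectives.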
I also sketch more recent improvements to this model (inspired by Wieting et al. 2016) which lead to simple embeddings for longer units of text such as sentences and paragraphs. These are completely unsupervised (i.e., they need no labeled data) and yet work better in several applications than embeddings based on methods such as recurrent neural nets, which are more complicated and less transparent.
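One plausible instance of such a simple unsupervised sentence embedding is a frequency-weighted average of word vectors with a shared component removed; the weighting a/(a + p(w)) and the removal of the top singular direction below follow this general recipe, but the abstract does not spell out the method, so treat every constant and helper name here as illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, V = 8, 30
word_vecs = rng.normal(size=(V, d))       # pretrained word vectors (toy stand-in)
word_freq = rng.dirichlet(np.ones(V))     # unigram probabilities p(w) (toy stand-in)
sentences = [[1, 5, 7], [2, 5, 9, 9], [0, 3]]  # sentences as word-index lists

def sentence_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    # Step 1: weighted average, down-weighting frequent words by a/(a + p(w))
    embs = []
    for sent in sentences:
        w = np.array([a / (a + word_freq[i]) for i in sent])
        embs.append((w[:, None] * word_vecs[sent]).sum(0) / w.sum())
    X = np.stack(embs)
    # Step 2: remove the projection onto the top singular direction,
    # a "common component" shared across all sentences
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    v = vt[0]
    return X - np.outer(X @ v, v)

E = sentence_embeddings(sentences, word_vecs, word_freq)
```

No labels are used anywhere: the only inputs are word vectors and corpus word frequencies, which matches the "completely unsupervised" claim in the abstract.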
Based upon joint work with Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski. (The first paper is in TACL 2016, and the others are in review.)
Sanjeev Arora is the Charles C. Fitzmorris Professor of Computer Science at Princeton University. His research spans several areas of theoretical computer science, including computational complexity, algorithm design, and theoretical problems in machine learning. He has received the ACM-EATCS Gödel Prize (2001 and 2010), a Packard Fellowship (1997), the ACM Infosys Foundation Award in the Computing Sciences (2012), the Fulkerson Prize (2012), and a Simons Investigator Award (2012).
Faculty Host: Gary Miller