We meet every Tuesday from 4:15 to 5:15 pm to discuss recent research in the field of learning from text (and other related areas).
All are welcome.
You can subscribe to our mailing list to get announcements of upcoming talks, discussions, etc.
Please do send us a note if you'd like to give a talk.
Discussion led by Frank Lin
Title: Learning Bigrams from Unigrams
Authors: Xiaojin Zhu, Andrew B. Goldberg, Michael Rabbat, Robert Nowak
Abstract: Traditional wisdom holds that once documents are turned into bag-of-words (unigram count) vectors, word orders are completely lost. We introduce an approach that, perhaps surprisingly, is able to learn a bigram language model from a set of bag-of-words documents. At its heart, our approach is an EM algorithm that seeks a model which maximizes the regularized marginal likelihood of the bag-of-words documents. In experiments on seven corpora, we observed that our learned bigram language models: i) achieve better test set perplexity than unigram models trained on the same bag-of-words documents, and are not far behind “oracle bigram models” trained on the corresponding ordered documents; ii) assign higher probabilities to sensible bigram word pairs; iii) improve the accuracy of ordered document recovery from a bag-of-words. Our approach opens the door to novel phenomena, for example, privacy leakage from index files.
Visit our friends: