Welcome to the Spring 2008 CMU Text Learning Group

We meet every Tuesday from 4:15 - 5:15 pm to discuss recent research in the field of learning from text (and other related areas).
All are welcome.

You can subscribe to our mailing list to get announcements of upcoming talks, discussions, etc.
Please send us a note if you'd like to give a talk.


Schedule
Date | Room | Speaker | Topic | Slides
Jan 22nd | NSH 3001 | Frank Lin | Classifying Political Blogs using Link structures |
Jan 29th | Wean 8220 | Vitor Carvalho | Fine-tuning ranking models: a two-step optimization approach. |
Feb 5th | Wean 8220 | Yifen Huang | Exploring hierarchical user feedback in email clustering |
Feb 12th | Cancelled | - | - |
Feb 19th | Wean 8220 | Richard Wang | Language-Independent Set Expansion of Named Entities using the Web |
Feb 26th | Wean 8220 | Einat Minkov | Learning to Walk Structured Text Networks |
Mar 4th | Wean 8220 | Mehrbod Sharifi | Finding Domain Specific Polar Words for Sentiment Classification |
Mar 11th | Cancelled | - | - |
Mar 11th | Wean 7220 | - | Round table discussion |
Mar 25th | Wean 4616 | Andrew Arnold | Combining lexical and structure-based frequency features for protein-name extraction in biological publications |
Apr 1st | Cancelled | - | - |
Apr 8th | Wean 7220 | - | Roundup of ICWSM |
Apr 15th | Wean 7220 | Purnamrita Sarkar | Fast Algorithms for Proximity Search on Large Graphs |
Apr 22nd | Wean 7220 | Andy Carlson | Bootstrapping Information Extraction from Semi-structured Web Pages |
Apr 29th | Wean 7220 | - | Group discussion |
May 6th | Wean 7220 | Einat Minkov | Path constrained graph walks |
May 13th | NSH 3001 | Richard Wang | Paper discussion |
May 20th | NSH 3001 | Frank Lin | Paper discussion - semi-supervised graph learning algorithms |
May 27th | NSH 3305 | Einat Minkov | Paper discussion |
Jun 24th | Google | Frank Lin | Paper discussion |
Paper to be discussed in the upcoming talk


Discussion led by Frank Lin
Title: Learning Bigrams from Unigrams
Authors: Xiaojin Zhu, Andrew B. Goldberg, Michael Rabbat, Robert Nowak
Abstract: Traditional wisdom holds that once documents are turned into bag-of-words (unigram count) vectors, word orders are completely lost. We introduce an approach that, perhaps surprisingly, is able to learn a bigram language model from a set of bag-of-words documents. At its heart, our approach is an EM algorithm that seeks a model which maximizes the regularized marginal likelihood of the bag-of-words documents. In experiments on seven corpora, we observed that our learned bigram language models: i) achieve better test set perplexity than unigram models trained on the same bag-of-words documents, and are not far behind “oracle bigram models” trained on the corresponding ordered documents; ii) assign higher probabilities to sensible bigram word pairs; iii) improve the accuracy of ordered document recovery from a bag-of-words. Our approach opens the door to novel phenomena, for example, privacy leakage from index files.

Visit our friends: