Jan 15, 2010

Jaime Carbonell, LTI

Active and Proactive Learning Methods with Applications to MT and CompBio


Whereas active learning is deeply studied, proactive learning from multiple, potentially unreliable, variable-cost sources has only recently begun receiving significant attention. With web-scale labeling games and Amazon's Mechanical Turk, learning from multiple unreliable sources becomes a practical necessity, especially for language-related tasks. The talk will cover proactive learning, touch upon applications in MT and proteomics, and discuss extensions of the work to rare categories.

Jan 22, 2010

Bhiksha Raj, LTI

Topic Models for Sound Processing


Topic models are usually applied to counts data to extract underlying patterns and clusters. When applied to text, the resulting analyses give rise to clusterings of words that are often analogized to topics. However, these models are equally applicable to any other form of multinomial data and can be used to address several problems in speech and audio processing, especially those involving mixtures.


Sounds, particularly speech, are typically characterized through spectro-temporal representations such as short-time Fourier transforms.  These representations naturally lend themselves to a histogram-based interpretation: the energy in any time-frequency bin for the signal is a scaled count of the number of quanta of energy in that frequency at that time.  When so abstracted, such a quanta-based representation instantly becomes indistinguishable from the histogram-based characterizations of other forms of counts data.
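The quanta-count interpretation above can be made concrete with a small sketch: treat a (toy, synthetic) magnitude spectrogram as a matrix of scaled counts and factorize it with KL-divergence NMF, which is equivalent up to normalization to a probabilistic latent component ("topic-like") decomposition. All dimensions and data here are arbitrary placeholders, not those used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "spectrogram": 64 frequency bins x 100 frames, generated from 3
# spectral patterns (analogous to topics) mixed with time-varying gains.
F, T, K = 64, 100, 3
true_W = rng.gamma(2.0, 1.0, size=(F, K))
true_H = rng.gamma(2.0, 1.0, size=(K, T))
V = true_W @ true_H  # energy per time-frequency bin ~ scaled quanta counts

# KL-divergence NMF via multiplicative updates: recover K spectral
# "topics" (columns of W) and their activations over time (rows of H).
W = rng.random((F, K)) + 1e-3
H = rng.random((K, T)) + 1e-3
for _ in range(200):
    WH = W @ H + 1e-12
    W *= (V / WH) @ H.T / H.sum(axis=1)          # update spectral patterns
    WH = W @ H + 1e-12
    H *= W.T @ (V / WH) / W.sum(axis=0)[:, None]  # update activations
```

Each column of W plays the role of a "topic" over frequency bins; de-noising or source separation then amounts to keeping or discarding individual components before reconstructing.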


In this talk we will show how topic-like models can be used for the analysis of audio data. We will show how the basic model, extensions that employ sparsity priors, and convolutive versions of the model can be used to tackle various previously difficult-to-handle problems such as signal de-noising, bandwidth expansion, analysis of mixed signals, signal prediction, pitch tracking, de-reverberation etc.

Feb 5, 2010

Jason Baldridge, Univ. of Texas at Austin

We don't know what we don't know: important, but often-overlooked, considerations for active learning


Active learning seeks to utilize human annotators and machine learners to maximal effect to create accurate classifiers and informative labeled data sets given a limited budget. The basic idea is quite straightforward: use a machine-learned classifier to create or identify the best examples for a human expert to label, retrain the classifier on those examples, and repeat. In principle, this should avoid wasting human effort on trivial, less informative examples.

Indeed, active learning experiments that use an already annotated corpus to simulate the annotator typically show substantial annotation cost reductions over naive data point selection.


In practice, however, there are a number of important considerations that can undermine the utility of active learning; they in fact have as much or more impact than the selection strategy itself. I will discuss some such issues that arose during a pilot active learning experiment for speeding up the creation of interlinear glossed texts for endangered languages (in our case, the Mayan language Uspanteko). These include annotator expertise and confidence, measuring annotation cost and evaluating different methods, user interface design, starting conditions for active example selection and machine label suggestions, granularity of data point selection, task scope, and shifting analysis. These factors also interact with the linguistic analysis being performed (e.g., morpheme segmentation and gloss labeling) and the typical workflow linguists use to analyze a new language. Based on these considerations, I'll suggest some directions for developing experiments that explore the scenarios and parameters under which annotation projects can reasonably expect to obtain an actual benefit from active learning.


[This talk discusses joint work with Katrin Erk, Taesun Moon, and Alexis Palmer as part of the EARL project.]


Bio: Jason Baldridge is an assistant professor in the Department of Linguistics at the University of Texas at Austin. He received his Ph.D. from the University of Edinburgh in 2002 and was then a post-doctoral researcher there on the ROSIE project until 2005. His main research interests include categorial grammars, active learning, discourse structure, coreference resolution, and georeferencing. He is one of the co-creators of OpenNLP and has been active for many years in the creation and promotion of open source software for natural language processing.

Feb 12, 2010

Alan W Black, LTI

Speech Synthesis: past, present and future and its relation to Speech Technology


This talk will look at the past, present and future of speech synthesis and how it relates to speech processing development in general.

Specifically, I will outline the advances in synthesis technology, drawing analogies to developments in other speech and language processing fields (e.g. ASR and SMT), where knowledge-based techniques gave way to data-driven techniques, which in turn have both pushed machine learning technologies forward and later re-introduced techniques that incorporate higher-level knowledge into our data-driven approaches.


We will give overviews of diphone, unit selection, statistical parametric synthesis, and voice morphing technologies, and of how synthesis can be optimized for the desired task. We will also address issues of evaluation, both in isolation and when embedded in real tasks. While widening our view of speech processing we will also present the publicly used Let's Go Spoken Dialog System (and its evaluation platform Let's Go Lab), our rapid language adaptation system (CMUSPICE) allowing construction of ASR and TTS support in new languages by non-speech experts, and our hands-free real-time two-way speech-to-speech translation system, showing how system integration can drive cross-technology innovation.


Bio: Alan W Black is an Associate Professor in the Language Technologies Institute at Carnegie Mellon University.  He previously worked in the Centre for Speech Technology Research at the University of Edinburgh, and before that at ATR in Japan.  He is one of the principal authors of the free software Festival Speech Synthesis System, the FestVox voice building tools and CMU Flite, a small footprint speech synthesis engine.

  He received his PhD in Computational Linguistics from Edinburgh University in 1993, his MSc in Knowledge Based Systems also from Edinburgh in 1986, and a BSc (Hons) in Computer Science from Coventry University in 1984.


Although much of his core research focuses on speech synthesis, he also works in real-time hands-free speech-to-speech translation systems (Croatian, Arabic and Thai), spoken dialog systems, and rapid language adaptation for support of new languages.  Alan W Black was an elected member of the IEEE Speech Technical Committee (2003-2007).  He is currently on the board of ISCA and on the editorial board of Speech Communications. He was program chair of the ISCA Speech Synthesis Workshop 2004, and was general co-chair of Interspeech 2006 -- ICSLP. In 2004, with Prof Keiichi Tokuda, he initiated the now annual Blizzard Challenge, the largest multi-site evaluation of corpus-based speech synthesis techniques.

Feb 19, 2010

Ani Nenkova, UPenn

Fully automatic evaluation for text summarization


In this talk I will present some of our recent results on automatic evaluation of news summaries using little or no human involvement. In particular I will present a fully automatic method for content selection evaluation in summarization that does not require the creation of human model summaries. Our work exploits the assumption that the distribution of words in the input and in a good summary of that input should be similar. Results on a large-scale evaluation from the Text Analysis Conference show that input-summary comparisons are very effective for the evaluation of content selection. Our automatic methods rank participating systems similarly to manual model-based pyramid evaluation and to manual human judgments of summary responsiveness, with correlations of 0.88 and 0.73, respectively. I will also talk about our promising results on automatic evaluation of the linguistic quality of summaries, an area of research that had received little attention until recently.
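The core assumption, that the word distributions of the input and of a good summary should be similar, can be sketched with a simple divergence measure. Jensen-Shannon divergence is one natural input-summary comparison; the function below is a minimal illustration, not the feature set or experimental setup used in the actual work.

```python
from collections import Counter
import math

def js_divergence(text_a, text_b):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between the
    unigram word distributions of two whitespace-tokenized texts."""
    ca, cb = Counter(text_a.split()), Counter(text_b.split())
    na, nb = sum(ca.values()), sum(cb.values())
    jsd = 0.0
    for w in set(ca) | set(cb):
        pa, pb = ca[w] / na, cb[w] / nb
        m = 0.5 * (pa + pb)  # mixture distribution
        if pa:
            jsd += 0.5 * pa * math.log2(pa / m)
        if pb:
            jsd += 0.5 * pb * math.log2(pb / m)
    return jsd
```

Under the distribution-similarity assumption, a summary with lower divergence from its input would be scored as better at content selection.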


This is joint work with Annie Louis and Emily Pitler.


Bio: Ani Nenkova is an assistant professor at the University of Pennsylvania. Her main areas of research are automatic summarization, discourse and text quality. She obtained her PhD degree in computer science from Columbia University in 2006. She also spent a year and a half as a postdoctoral fellow at Stanford University before joining Penn in Fall 2007.

Mar 5, 2010

Jacob Eisenstein, MLD


Putting language in context with hierarchical Bayesian models


Language is shaped by a network of preferences and constraints that reflect semantic, discourse, and social phenomena. Hierarchical Bayesian models offer a principled methodology for incorporating such high-level and extra-linguistic context, based on a "generative story" of how each document or utterance was produced. This permits the incorporation of linguistic insight at the modeling level, while letting the data speak through learned parameters.  In this talk I will describe applications of this idea in syntax, semantics, discourse, and sentiment analysis.  The resulting systems learn rich linguistic structures with minimal supervision by exploiting visual communication, cross-lingual patterns, and unconstrained free-text annotations.

Mar 19, 2010

Alexander I. Rudnicky, LTI

Language Based Communication between Humans and Robots


Robots are on their way to becoming a ubiquitous part of human life as companions and workmates. Integration with human activities requires effective communication between humans and robots. Humans need to be able to explain their intentions, and robots need to be able to share information about themselves and ask humans for guidance. Language-based interaction (in particular spoken language) offers significant advantages for efficient communication, particularly in groups. We have been focusing on three aspects of the problem: (a) managing multi-party dialogs (defining the mechanisms that regulate an agent's participation in a conversation); (b) effective coordination and sharing of information between humans and robots (such as mechanisms for grounding descriptions of the world in order to support a common frame of reference); and (c) instruction-based learning (to support dynamic definition of new behavior patterns through spoken as well as multi-modal descriptions provided by the human). This talk describes the TeamTalk system, our framework for exploring these issues.



Dr. Rudnicky's research has spanned many aspects of spoken language, including knowledge-based recognition systems, language modeling, architectures for spoken language systems, multi-modal interaction, the design of speech interfaces and the rapid prototyping of speech-to-speech translation systems. Dr. Rudnicky has been active in research into spoken dialog, and has made contributions to dialog management, language generation and the computation of confidence metrics for recognition and understanding. His recent interests include the automatic creation of summaries from event streams, automated meeting understanding and summarization, and language-based human-robot communication. Dr. Rudnicky is currently a Principal Systems Scientist in the Computer Science Department at Carnegie Mellon University and is on the faculty of its Language Technologies Institute.

Mar 26, 2010

Sanjeev Khudanpur, JHU

Discovering the Language of Surgery: Automatic Gesture Induction for Manipulative Tasks


We describe a framework for modeling and recognition of gestures used in manipulative tasks such as robot assisted minimally invasive surgery.  The key ingredient of our framework is a hidden Markov model (HMM) of the kinematic signal [alternatively, the endoscopic video] based on which the recognition must be performed: with the states of the HMM corresponding to gestures or sub-gestures, recognition reduces to a standard inference problem.  The topology and transition probabilities of the HMM capture gesture dynamics and the compositional structure of the task being performed, while the emission probabilities of the HMM capture the stochastic variability between different realizations of the same gesture.
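Once gestures correspond to HMM states, recognition reduces to finding the most likely state sequence, e.g. by Viterbi decoding. The sketch below assumes discrete (vector-quantized) observations; the state and symbol inventories are toy placeholders, not those of the surgical task described in the talk.

```python
import numpy as np

def viterbi(obs, log_pi, log_A, log_B):
    """Most likely state (gesture) sequence for a discrete observation sequence.

    obs: list of observation-symbol indices.
    log_pi: (S,) initial state log-probabilities.
    log_A:  (S, S) transition log-probabilities, log_A[i, j] = log P(j | i).
    log_B:  (S, V) emission log-probabilities over V symbols.
    """
    S, T = log_pi.shape[0], len(obs)
    delta = np.full((T, S), -np.inf)   # best log-score ending in each state
    back = np.zeros((T, S), dtype=int) # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: prev i -> cur j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    # Trace back from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

With a learned topology, the decoded state sequence directly yields the gesture segmentation of a trial.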


Two important design considerations in using HMMs for gesture recognition are addressed in this talk: how to automatically learn the inventory of gestures or sub-gestures needed to model the manipulative task, and how to select kinematic [video] features that carry the most information for discriminating between gestures.  A modified procedure for successive refinement of HMM topology is developed to address the former, while an iterative application of heteroscedastic LDA is found to be quite successful for the latter.


HMMs estimated using these techniques are used to recognize suturing trials performed by a number of surgeons with different levels of expertise using the da Vinci surgical robot. We demonstrate gesture recognition accuracies of over 80%, the ability to automatically discover key gestures and sub-gestures, and the ability to automatically align trials of two different surgeons for comparison.

Apr 2, 2010

Philip Resnik, UMCP

Translation as a Collaborative Activity


Although machine translation has made a great deal of recent progress, fully automatic high-quality translation remains far out of reach for the vast majority of the world’s languages. A variety of projects are now emerging that tap into Web-based communities of people willing to help in the translation process, but bilingual expertise is quite sparse compared to the availability of monolingual volunteers.  In this talk, I'll discuss a new approach to the problem of achieving cost-effective translation with high quality, in which monolingual participants collaborate via an iterative protocol.  Motivated by concepts in information theory and discourse analysis, the approach brings together elements of machine translation, linguistic annotation, and human-computer interaction.


This is joint work with Ben Bederson, Chang Hu, and Olivia Buzek.



Philip Resnik is an associate professor at the University of Maryland, with joint appointments in the Department of Linguistics and at the Institute for Advanced Computer Studies.  He received his Ph.D. in Computer and Information Science from the University of Pennsylvania in 1993, and has held research positions at Bolt Beranek and Newman, IBM TJ Watson Research Center, and Sun Microsystems Laboratories.  His research interests include the combination of knowledge-based and statistical methods in NLP, machine translation, and computational social science.

Apr 9, 2010

Tom M. Mitchell, MLD

Read the Web


We describe research to develop a never-ending language learner that runs 24 hours per day, forever, and that each day has two goals.  The first is to extract more information from the web to populate its growing knowledge base of structured knowledge.  The second is to learn to read better than yesterday, as evidenced by its ability to go back to the same web pages it read yesterday, and extract more facts more accurately today.   This research project is both an attempt at a new approach to natural language processing, and a case study in how we might design an architecture for never-ending learning.  This talk will describe our approach, and experimental results from our NELL system which has been running nonstop since January 2, 2010, and which has already extracted a structured knowledge base containing approximately a quarter of a million beliefs from a corpus containing half a billion web pages.



Tom M. Mitchell is the E. Fredkin University Professor and head of the Machine Learning Department at Carnegie Mellon University. His research interests lie in machine learning, natural language processing, artificial intelligence, and cognitive neuroscience.

Mitchell believes the field of machine learning will be the fastest-growing branch of computer science during the 21st century.  His home page is www.cs.cmu.edu/~tom.

Apr 23, 2010

Dan Roth, UIUC

Constraints Driven Structured Learning with Indirect Supervision



Making decisions in natural language understanding tasks often involves assigning values to sets of interdependent variables. Supporting good performance in these cases (sometimes called "structured tasks") frequently necessitates performing global inference that accounts for these interdependencies. This talk will focus on training global models and propose new learning algorithms that significantly reduce the need for supervision in this process. Our learning framework is "constraints driven" in the sense that it allows and even gains from global inference that combines statistical models with expressive declarative knowledge (encoded as constraints).

      We consider both structured output prediction problems and cases where the goal is to make decisions that crucially depend on a latent structure, and present a unified and principled learning framework that encompasses both notions of structure. While obtaining direct supervision for structures is difficult, we show that it is often easy to obtain a related binary indirect supervision signal, and discuss several options for deriving this supervision signal, including inducing it from the world's response to the model's actions. We introduce a learning framework that jointly learns from direct and indirect forms of supervision, and show the significant contribution of easy-to-get indirect binary supervision on several important NLP tasks.


Short Bio:

Dan Roth is a Professor in the Department of Computer Science and the Beckman Institute at the University of Illinois at Urbana-Champaign. He is the director of a DHS Center for Multimodal Information Access & Synthesis (MIAS) and has faculty positions also at the Statistics and Linguistics Departments and the School of Library and Information Sciences.

       Roth is a Fellow of AAAI for his contributions to the foundations of machine learning and inference and for developing learning centered solutions for natural language processing problems. He has published broadly in machine learning, natural language processing, knowledge representation and reasoning and learning theory, and has developed advanced machine learning based tools for natural language applications that are being used widely by the research community.

Prof. Roth has given keynote talks in major conferences, including AAAI, EMNLP, ICMLA and presented several tutorials in universities and conferences including at ACL and EACL. Roth was the program chair of CoNLL'02 and of ACL'03, and is or has been on the editorial board of several journals in his research areas; he is currently an associate editor for the Journal of Artificial Intelligence Research and the Machine Learning Journal.

Prof. Roth received his B.A. summa cum laude in Mathematics from the Technion, Israel, and his Ph.D. in Computer Science from Harvard University in 1995.

Apr 30, 2010

Rebecca Hwa, UPitt

Applications of Information Visualization for Natural Language Processing



In this talk, I present two interactive NLP applications that use information visualization to help users explore and analyze text. The first system, named Pictor, is a browser designed to facilitate the analysis of quotations in a collection of news text. Based on user queries, it groups relevant quotes into "threads" to illustrate the development of subtopics over time. We will present case studies to demonstrate how the system supports a richer understanding of news events. The second system, called The Chinese Room, helps users who do not understand the source language to explore imperfect outputs from MT systems. Through a visualization of multiple linguistic resources, our system enables users to identify potential translation mistakes and make educated guesses as to how to correct them. Our experimental results suggest that users of our prototype are able to correct some difficult translation errors that they would otherwise have found baffling.

     *Pictor is a collaborative project with Noah Smith, Alan Black, Ric Crabbe, Nathan Schneider, Philip Gianfortoni, Dipanjan Das, and Michael Heilman. The Chinese Room is a collaborative project with Josh Albrecht and Liz Marai.


Short Bio:

Rebecca Hwa is an Associate Professor in the Department of Computer Science at the University of Pittsburgh. Before joining Pitt, she was a postdoc at the University of Maryland. She received her PhD in Computer Science from Harvard University in 2001 and her B.S. in Computer Science and Engineering from UCLA in 1993. Dr. Hwa's primary research interests include multilingual processing, machine translation, and semi-supervised learning methods. Additionally, she has collaborated with colleagues on information visualization, sentiment analysis, and bioinformatics. She is a recipient of the NSF CAREER Award. Her work has also been supported by NIH and DARPA. Dr. Hwa currently serves as chairperson of the executive board of the North American Chapter of the Association for Computational Linguistics.



Fall 2010

Sept 3, 2010

Noah Smith, LTI

Text-Driven Forecasting:  Meaning as a Real Number



Text-driven forecasting is the challenge of making concrete, testable predictions about future events and trends from publicly available text data.  This talk considers a few recent success stories that use various kinds of text (expert-written analysis, blog posts, tweets) to predict interesting things about the future in various domains (finance, political discourse, and public opinion polls). Forecasting challenges much of the standard methodology in NLP while opening up a new driving force for useful models of real-world text that are grounded in real-world events.



Noah Smith is an assistant professor in the School of Computer Science at Carnegie Mellon University. He received his Ph.D. in Computer Science, as a Hertz Foundation Fellow, from Johns Hopkins University in 2006 and his B.S. in Computer Science and B.A. in Linguistics from the University of Maryland in 2001. His research interests include statistical natural language processing, especially unsupervised methods, machine learning for structured data, and applications of natural language processing. He serves on the editorial board of the journal Computational Linguistics and received a best paper award at the ACL 2009 conference. His research group, Noah's ARK, is supported by the NSF, DARPA, Qatar NRF, IARPA, Portugal FCT, and gifts from Google, HP Labs, IBM Research, and Yahoo Research.

Sept 10, 2010

Carolyn Rose, LTI

Displayed Bias as a Reflection of Both Speaker and Intended Hearer in Conversational Settings



A variety of recent text mining techniques have been developed to detect the bias or stance of an author or speaker based on linguistic properties of their contributions to a discourse.  As one example, work on sentiment analysis typically seeks to detect an author's opinion of a product based on characteristics of a posted product review, often taken out of context. Insights from the fields of rhetoric, sociolinguistics, and discourse analysis argue that the way contributions to a discourse are formulated communicates not only personal characteristics of the projected source of the contribution but also reflects assumed characteristics of the intended recipient of the message. Approaches that neglect the influence of the intended recipient may be vulnerable to making attributions to the source of a contribution that are not an accurate reflection of that author or speaker's personal stance.  Thus, an opportunity to strengthen computational work on bias detection is to factor out these influences when making attributions to the source of the message.  In this talk I will apply variations of Latent Dirichlet Allocation to the problem of modeling displayed bias in two conversational settings, namely a chat corpus where pairs of participants with competing design goals work towards a consensus on a power plant design task, and a political newsgroup forum where participants who self-identify as politically left or politically right discuss and debate a variety of political issues with each other.  Using this technology as a lens, I will present analyses that demonstrate the joint influence of the source and recipient's respective stance on the formulation of the contributions to the discourse.  The picture is further enriched when considering the way in which the rhetorical style of interaction potentially mediates these effects.  
Finally, in connection with the chat corpus, I will present analyses that suggest potential mediating effects of a construct from systemic functional linguistics associated with the projected authoritativeness of the speaker in relation to that of the intended recipient.



Carolyn Rose is an Assistant Professor in the School of Computer Science at Carnegie Mellon University, with a joint appointment between the Language Technologies Institute and the Human-Computer Interaction Institute.  She serves as a member of the Executive Committee of the Pittsburgh Science of Learning Center and as Co-Leader of the Social and Communicative Factors in Learning thrust within that center.  She earned a Master's degree in Computational Linguistics from the Department of Philosophy at Carnegie Mellon in Spring of 1994 and then a Ph.D. in Language and Information Technologies from the Language Technologies Institute in Fall of 1997. Her research integrates perspectives on conversation analysis from machine learning, sociolinguistics, discourse analysis, education, and psychology.  She ranks in the top 20 in the Microsoft Academic Search Computers and Education list under both the past 5 year and 10 year categories.  She serves on the editorial boards of the International Journal of Human-Computer Studies and the Journal of Educational Data Mining in addition to serving as the Secretary/Treasurer of the International Society of the Learning Sciences.  Her research group is currently supported by the National Science Foundation and the Office of Naval Research and has also received gifts from Worth Publishing, Inc., Verilogue, Inc., and Microsoft Research India.

Sept 17, 2010

Justine Cassell, HCII

Understanding and Modeling Dialogue among Peers and its Role in Language-Learning



It is well documented that children learn a tremendous amount from interactions with their peers -- including skills that adults have a hard time teaching.  In this talk I focus on how peer interaction scaffolds children's learning of linguistic pragmatics (with whom to use which dialect or register, how to be contingent on the previous utterance, how to make conversation reciprocal), and the relationship between these pragmatics skills and issues of identity and culture (how one self-identifies, how one is identified by others). My data come both from the study of child-child dialogues and from the study of child-virtual peer dialogues (where virtual peers are autonomous or semi-autonomous life-size virtual children with the ability to engage real children in interaction).  But while these virtual peers are valuable tools for the study of pragmatics, and for scaffolding the use of advanced pragmatics skills, they present particular and interesting challenges to building embodied dialogue systems and modeling multimodal interaction.

              Examples will be drawn from my work with a number of different populations, including children who grow up speaking a non-mainstream dialect of English, and children with autism spectrum disorder.



Justine Cassell is Department Head of the Human-Computer Interaction Institute of the School of Computer Science at Carnegie Mellon University. Justine came to CMU from Northwestern, where she was founding director of the Center for Technology & Social Behavior and of the Technology & Social Behavior Joint Ph.D. in Communication & Computer Science, with positions in the Departments of Communication Studies and Electrical Engineering & Computer Science, and courtesy appointments in Education, Psychology, and Linguistics. Prior to her time at Northwestern, Justine was a tenured faculty member at the MIT Media Lab. Cassell's research builds on her multidisciplinary background: she holds undergraduate degrees in Comparative Literature from Dartmouth and in Lettres Modernes from the Université de Besançon (France). She holds an M.Litt. in Linguistics from the University of Edinburgh, and a double Ph.D. from the University of Chicago in Linguistics and Psychology.

Sept 24, 2010

Abdur Chowdhury, Twitter, Inc.

Discovery &  Emergence



Often as computer scientists we focus on faster algorithms, such as approximating solutions in linear time over large data sets. Rather than focus on algorithms in this talk, we ask the question "What possibilities emerge from surfacing the world's conversations to others?" Specifically, we explore Twitter Trends as a discovery tool and show how awareness of the thoughts of others can cause the emergence of new behaviors.



Dr. Abdur Chowdhury serves as Twitter's Chief Scientist. Prior to that, Dr. Chowdhury co-founded Summize, a real-time search engine sold to Twitter in 2008. Dr. Chowdhury has held positions at AOL as Chief Architect for Search, in Georgetown's Computer Science Department, and at the University of Maryland's Institute for Systems Research. His research interests lie in information retrieval, focusing on making information accessible.

Oct 1, 2010

Bob Murphy, Computational Biology, CMU

Representation and Learning of Protein Distributions and Cellular Organization



Systems biology seeks to build detailed, mechanistic, predictive models of the behavior of biological systems, and to use them to detect and treat disease. Since many diseases are associated with changes in the distribution of proteins within cells, and since there are tens of thousands of different proteins expressed in a typical cell, automated methods for interpreting microscope images to determine the distribution of all proteins within cells will be essential for building predictive models.  We have extensively demonstrated the feasibility of using machine learning methods to recognize major subcellular patterns. However, such supervised learning methods are confounded by proteins that are found in more than one cellular compartment.  We have therefore developed methods for unmixing location patterns into combinations of fundamental patterns, so that each protein can be represented by the amounts that are found in distinct structures.  We have also developed the first approach for learning generative models of protein subcellular patterns from microscope images.  The combination of these methods permits subcellular pattern information from large and diverse image collections to be integrated into cellular systems simulations.
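The unmixing idea, representing a protein's pattern by the amounts found in distinct fundamental structures, can be illustrated as a nonnegative decomposition. Below is a toy sketch using nonnegative least squares; the actual unmixing method in this work is more sophisticated, and the feature vectors here are synthetic placeholders rather than real image features.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)

# Hypothetical "fundamental patterns": each column is the image-feature
# vector of one pure subcellular pattern (e.g. nuclear, lysosomal).
patterns = rng.random((10, 2))  # 10 image features x 2 fundamental patterns

# A protein found in both compartments: 30% pattern 0, 70% pattern 1.
mixed = patterns @ np.array([0.3, 0.7])

# Unmixing: find nonnegative mixture fractions that best reconstruct
# the observed feature vector.
fractions, residual = nnls(patterns, mixed)
```

On this exactly consistent toy data, the recovered fractions match the generating mixture; real image features would be noisy, leaving a nonzero residual.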



 Robert F. Murphy is the Ray and Stephanie Lane Professor of Computational Biology and Director of the Lane Center for Computational Biology in the School of Computer Science at Carnegie Mellon University. He is also Professor of Biological Sciences, Biomedical Engineering, and Machine Learning and was founding director (with Ivet Bahar) of the joint CMU-Pitt Ph.D. Program in Computational Biology. He served as the first full-term chair of NIH’s Biodata Management and Analysis Study Section, was named a Fellow of the American Institute for Medical and Biological Engineering in 2006, and received an Alexander von Humboldt Foundation Senior Research Award in 2008. Dr. Murphy has co-edited two books and published over 170 research papers. He is Past-President of the International Society for Advancement of Cytometry, was named as the first External Senior Fellow of the School of Life Sciences in the Freiburg (Germany) Institute for Advanced Studies, and is a member of the National Advisory General Medical Sciences Council.


Dr. Murphy’s group pioneered the application of machine learning methods to high-resolution fluorescence microscope images depicting subcellular location patterns in the mid-1990s. He leads an NIH-funded project for proteome-wide determination of subcellular location in 3T3 cells, and his current research interests include image-derived models of cell organization and active machine learning approaches to experimental biology.

Oct 8, 2010

Ben Carterette, University of Delaware

Measuring Search Engine Utility



Information retrieval systems are evaluated by their ability to find and rank relevant material in large collections of semi-structured data.  The dominant method for evaluation is the use of static, portable test collections consisting of full-text documents, model information needs, and judgments of the relevance of documents to those needs.  Modern test collections are invaluable tools, but nevertheless are lacking for the purpose of evaluating the utility of a system to its users:  they strip away almost all information about the user in favor of a relatively simple notion of individual document relevance.  I will present an overview of some ongoing work on the development of test collections that allow more precise measurements of the utility of a search engine and discuss difficulties in escaping the standard evaluation methodology.



Ben Carterette is an Assistant Professor of Computer and Information Sciences at the University of Delaware.  He has published extensively on constructing and using test collections at low cost, as well as on experimental design methodology and analysis for IR.  In addition to co-organizing two ACM SIGIR workshops on test collections that go beyond binary independent relevance judgments, he has co-coordinated evaluation competitions/workshops for TREC (the Text REtrieval Conference):  the Million Query track from 2007–2009 and the new Session track in 2010.  He completed his Ph.D. in 2008 at the University of Massachusetts Amherst.

Oct 22, 2010

Paul Bennett, Microsoft

Class-Based Contextualized Search



Information retrieval has made significant progress in returning relevant results for a single query.  However, much search activity is conducted within a much richer context: a current task focus, recent search activities, and longer-term preferences.  For example, our ability to interpret the current query accurately can be informed by knowledge of the web pages a searcher was viewing when initiating the search, or by recent actions of the searcher such as queries issued, results clicked, and pages viewed.  We develop a classification-based framework that represents a broad variety of context, including the searcher's long-term interests, recent activity, and current focus, as a class-intent distribution.  We then demonstrate how that distribution can be used to improve the quality of search results.  To make such an approach feasible, we need reasonably accurate classification into a taxonomy, a method of extracting and representing a user's query and context as a distribution over classes, and a method of using this distribution to improve the retrieval of relevant results.  We describe recent work addressing each of these challenges.
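A minimal sketch of the class-intent idea (the distributions, class names, and scoring rule below are hypothetical, not the actual Microsoft system): the query's class distribution is blended with a distribution derived from recent context, and candidate results are re-scored by how well their own class distributions match the blended intent.

```python
# Sketch of class-based contextualized ranking with invented data.

def combine_context(query_dist, context_dist, alpha=0.7):
    """Blend the query's class distribution with the context's."""
    classes = set(query_dist) | set(context_dist)
    blended = {c: alpha * query_dist.get(c, 0.0)
                  + (1 - alpha) * context_dist.get(c, 0.0)
               for c in classes}
    total = sum(blended.values())
    return {c: p / total for c, p in blended.items()}

def rerank(results, intent_dist):
    """Re-score results by base relevance times class-intent overlap."""
    def score(result):
        overlap = sum(min(result["classes"].get(c, 0.0), p)
                      for c, p in intent_dist.items())
        return result["relevance"] * (0.5 + overlap)
    return sorted(results, key=score, reverse=True)

# Illustrative example: the query "jaguar" is ambiguous on its own,
# but recent activity suggests an automotive intent.
query = {"animals": 0.5, "autos": 0.5}
context = {"autos": 0.9, "shopping": 0.1}
intent = combine_context(query, context)

results = [
    {"url": "wildcat-facts", "relevance": 0.8,
     "classes": {"animals": 1.0}},
    {"url": "jaguar-dealers", "relevance": 0.7,
     "classes": {"autos": 0.9, "shopping": 0.1}},
]
ranked = rerank(results, intent)
print([r["url"] for r in ranked])
```

With the automotive context, the dealer page overtakes the initially higher-relevance wildlife page.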


This talk presents joint work with Nam Nguyen, Krysta Svore, Susan Dumais, and Ryen White.



Paul Bennett is a Researcher in the Context, Learning & User Experience for Search (CLUES) group at Microsoft Research where he works on using machine learning technology to improve information access and retrieval.   His recent research has focused on classification-enhanced information retrieval, pairwise preferences, human computation, and text classification while his previous work focused primarily on ensemble methods, active learning, and obtaining reliable probability estimates, but also extended to machine translation, recommender systems, and knowledge bases.  He completed his dissertation on combining text classifiers using reliability indicators in 2006 at Carnegie Mellon where he was advised by Profs. Jaime Carbonell and John Lafferty.

Oct 29, 2010

Slav Petrov, Google Inc

Coarse-to-Fine Inference in Natural Language Processing



State-of-the-art NLP models are anything but compact. Syntactic parsers have huge grammars, machine translation systems have huge transfer tables, and so on across a range of tasks. Exhaustive inference becomes prohibitive with such complexity, requiring efficient approximations to infer optimal structures. Hierarchical coarse-to-fine methods address this challenge by exploiting a sequence of models which introduce complexity gradually. At the top of the sequence is a trivial model in which learning and inference are both cheap. Each subsequent model refines the previous one, until a final, full-complexity model is reached. Each refinement introduces only limited complexity, making inference very efficient. In this talk, I describe two coarse-to-fine systems. In the domain of syntactic parsing, complexity is in the grammar. I will present a latent-variable approach in which an X-bar grammar is iteratively refined. The final grammars produce the best parsing accuracies across an array of languages, but are impractical to work with because of their size. We therefore introduce a coarse-to-fine inference scheme, in which the final grammar is projected onto a hierarchy of coarser grammars. This hierarchy admits an efficient incremental inference scheme and reduces parsing times by orders of magnitude. In the domain of machine translation, complexity arises because there are too many target language word types. To manage this complexity, we translate into target language clusterings of increasing vocabulary size. This approach gives dramatic speed-ups while actually increasing final translation quality.
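The core pruning step can be sketched as follows (the labels, scores, and threshold are invented for illustration, not the actual grammar projections from the talk): a cheap coarse model scores all candidates, and only candidates above a threshold are passed to the expensive fine model.

```python
# Minimal coarse-to-fine pruning sketch with hypothetical scores.

def coarse_to_fine(candidates, coarse_score, fine_score, threshold):
    """Prune with the cheap model, then decide with the expensive one."""
    surviving = [c for c in candidates if coarse_score(c) >= threshold]
    # The fine model is evaluated only on the survivors.
    return max(surviving, key=fine_score)

# Hypothetical example: choose a refined syntactic label for a span.
# The coarse model collapses refined labels (NP-1, NP-2, ...) to their
# base category, so one cheap lookup prunes many refined candidates.
coarse = {"NP": 0.6, "VP": 0.3, "PP": 0.05}
fine = {"NP-1": 0.55, "NP-2": 0.25, "VP-1": 0.15, "PP-1": 0.05}

best = coarse_to_fine(
    candidates=list(fine),
    coarse_score=lambda c: coarse.get(c.split("-")[0], 0.0),
    fine_score=lambda c: fine[c],
    threshold=0.1,  # PP-projected candidates are pruned here
)
print(best)  # the highest-scoring surviving refined label
```

In the real parser this pruning is applied chart cell by chart cell across a whole hierarchy of grammar projections, which is where the orders-of-magnitude speed-up comes from.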



Slav Petrov is a Research Scientist at Google and works on problems at the intersection of natural language processing and machine learning. He also teaches Statistical Natural Language Processing at New York University. Slav did his PhD at UC Berkeley, where he worked on syntactic parsing, speech recognition, and machine translation with Dan Klein. Prior to Berkeley, he spent a year as an exchange student at Duke University, working with Carlo Tomasi on gesture recognition. Slav holds a Master's degree from the Free University of Berlin, where he won the RoboCup world championship in robotic soccer under the supervision of Raul Rojas.

Nov 5, 2010

Katharina Morik, Technical University Dortmund


Data Mining – Learning under Resource Constraints



Data Mining started in the nineties with the claim that real-world data collections as they are stored in databases require less sophisticated and more scalable algorithms than the then-dominant statistical routines. New tasks such as frequent-set mining emerged. At the same time, sophisticated pre-processing and sampling methods allowed data analysis to cope with large data sets. 

Currently, we are again challenged by data masses at an even larger scale, collected at distributed sites, in heterogeneous formats and by applications that demand real-time response. Storage, runtime, and execution time for real-time behavior are the constrained resources, which need to be handled by new learning methods.

The talk will give an overview of learning under resource constraints and present, in more detail, applications that illustrate the new challenge.

           The overwhelming dimensionality of genomic data (about 200,000 features) demands fast and robust methods of stable feature selection. The small number of observations (about 100 patients) demands the integration of different populations.  Both problems must be solved if we aim at personalized medicine.

           The new challenge is well illustrated by data analysis for ubiquitous systems.   Logged data from a mobile device can be compressed by a data-streaming algorithm such that further learning uses only the aggregated data.  Predicting file accesses makes it possible to decrease upload time and to tailor the operating system’s services; in sum, this could save energy and extend battery life.
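The streaming-aggregation idea can be sketched as follows (the file names and decay scheme are hypothetical, chosen only to mirror the file-access example): the device keeps exponentially decayed per-file counts instead of the raw access log, so later learning works on a compact aggregate.

```python
# Sketch: compress a stream of file accesses into decayed counts,
# then predict the most likely next access (invented example data).

class DecayedCounter:
    def __init__(self, decay=0.9):
        self.decay = decay
        self.counts = {}

    def observe(self, item):
        # Decay all existing counts, then credit the new observation,
        # so recent accesses weigh more than old ones.
        for key in self.counts:
            self.counts[key] *= self.decay
        self.counts[item] = self.counts.get(item, 0.0) + 1.0

    def predict(self):
        # Predict the item most likely to be accessed next.
        return max(self.counts, key=self.counts.get)

counter = DecayedCounter(decay=0.9)
for access in ["mail.db", "photos/1.jpg", "mail.db", "mail.db", "cal.db"]:
    counter.observe(access)
print(counter.predict())
```

The memory footprint is bounded by the number of distinct files rather than the length of the access log, which is the kind of resource constraint the talk addresses.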

Implementing such algorithms on GPGPUs will also be discussed briefly.



Katharina Morik is one of the pioneers of Artificial Intelligence in Germany. She received her Ph.D. in Hamburg and worked there in the group of Wolfgang Wahlster that developed the HAMburg Application-oriented Natural language System.  She then moved to Berlin, where she started the first German project on Machine Learning. Since 1991 she has been a full professor at the Technical University Dortmund, Germany. The first efficient implementation of the Support Vector Machine, by Thorsten Joachims, was developed in her lab, as was the open-source tool RapidMiner, which for the third time has been named the most-used open-source data mining tool worldwide (KDnuggets). Her current interests are in information extraction from texts as well as in mining very large and high-dimensional data.

Nov 12, 2010

Julia Hirschberg, Columbia University

WordsEye:  Creating 3D Scenes from Natural Language Text



3D graphics scenes are difficult to create, requiring users to learn a series of complex menus, dialog boxes, and often tedious direct-manipulation techniques. By giving up some of the control afforded by such interfaces, users can instead create 3D scenes simply by describing the picture they want.  WordsEye is a program we are building in conjunction with collaborators at the Oregon Health and Science University to perform such “text to scene” conversion.  We will describe the current version of WordsEye; the enhancements we are implementing, based upon a Scenario-Based Lexical Knowledge Resource (SBLR) which we are creating; some Amazon Mechanical Turk annotations we are gathering to help us populate the SBLR; and a trial we conducted this past summer at the Harlem Educational Activities Fund (HEAF), testing the value of WordsEye as an alternative literacy approach.



Julia Hirschberg is Professor of Computer Science at Columbia University.  Her research focuses on prosody in speech generation and understanding, on speech summarization, emotional speech, and interaction in spoken dialogue systems.  She has served as President of the International Speech Communication Association (ISCA), co-editor-in-chief of Speech Communication, and editor-in-chief of Computational Linguistics.  She is a fellow of the American Association for Artificial Intelligence and an ISCA Fellow.

Nov 19, 2010

Nigel G. Ward, University of Texas at El Paso

Prosody and Prediction for Dialog Systems, and in particular for language modeling, adaptation and turn-taking



Humans in dialog have a remarkable ability to predict, moment-by-moment, what the interlocutor is likely to do next, due in part to the information available in various prosodic signals and markers.  This talk gives three illustrations of how this can be modeled and used.  First, for language modeling, we show how prosodic features such as speaking rate and pitch height, computed over small fixed-width windows, are predictive of the upcoming word.  Second, for dialog management, we find that turn-by-turn responsiveness on the three "emotional" dimensions of activation, valence, and power can give a sense of rapport.  Third, for turn-taking, attention to prosodic cues can make interactions more efficient and more natural.  Thus, predictions can be made and exploited, but so far only in certain specific ways, and only after labor-intensive analysis and development.  We are working towards a general framework for such modeling, ultimately to include algorithms for discovering models from data, but numerous challenges arise, including some intrinsic to the nature of prosody, some due to the distance between the surface manifestations and the underlying multi-dimensional cognitive states, and some reflecting the time course of these cognitive states in dialog.
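The first illustration can be sketched in miniature (the features, buckets, and counts below are fabricated for illustration, not the talk's actual model): prosodic features over a fixed window are quantized into a coarse prosodic context, and word probabilities are then looked up per context.

```python
# Hypothetical sketch of prosody-conditioned word prediction.

def prosody_bucket(pitch_height, speaking_rate):
    """Quantize window features into a coarse prosodic context."""
    pitch = "high" if pitch_height > 0.5 else "low"
    rate = "fast" if speaking_rate > 4.0 else "slow"  # syllables/sec
    return (pitch, rate)

# Per-context word counts (invented): e.g. backchannels like "uh-huh"
# might be more likely after low, slow regions of the interlocutor.
counts = {
    ("low", "slow"): {"uh-huh": 30, "the": 10},
    ("high", "fast"): {"the": 40, "uh-huh": 5},
}

def word_prob(word, pitch_height, speaking_rate):
    """Probability of the upcoming word given the prosodic context."""
    ctx = counts[prosody_bucket(pitch_height, speaking_rate)]
    return ctx.get(word, 0) / sum(ctx.values())

print(word_prob("uh-huh", pitch_height=0.2, speaking_rate=3.0))  # 0.75
```

A real language model would interpolate these prosody-conditioned estimates with standard n-gram probabilities rather than use them alone.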



Nigel G. Ward received his Ph.D. in Computer Science from the University of California at Berkeley in 1991. After ten years on the faculty of the University of Tokyo he joined the University of Texas at El Paso in 2002.  Ward's research areas lie at the intersection of spoken language and human-computer interaction. One focus is improving the usability of today's spoken dialog systems, another is the study of fundamental issues in dialog modeling using a variety of methods: statistical, linguistic, systems-building, and experimental.

Dec 3, 2010

Regina Barzilay, MIT

Learning to Behave by Reading



In this talk, I will address the problem of grounding linguistic analysis in control applications, such as game playing. We assume access to natural language documents that describe the desired behavior of a control algorithm (e.g., game strategy guides).  Our goal is to demonstrate that knowledge automatically extracted from such documents can improve performance of the target application.


First, I will present a reinforcement learning algorithm for learning to map natural language instructions to executable actions.  This technique has enabled automation of tasks that until now have required human participation --- for example, automatically configuring software by consulting how-to guides. Next, I will present a Monte-Carlo search algorithm for game playing that incorporates information from game strategy guides. In this framework, the task of text interpretation is formulated as a probabilistic model that is trained based on feedback from Monte-Carlo search. When applied to the Civilization strategy game, a language-empowered player outperforms its traditional counterpart by a significant margin.


This is joint work with Branavan, Harr Chen, David Silver and Luke Zettlemoyer.



Regina Barzilay is an associate professor in the Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory.  Her research interests are in natural language processing. She is a recipient of various awards, including the NSF CAREER Award, the MIT Technology Review TR-35 Award, and Best Paper Awards at top NLP conferences. She is a PC co-chair for EMNLP 2011.