As the amount of available speech data increases rapidly, so does the need for efficient search and understanding. Techniques such as Spoken Term Detection (STD), which focuses on finding instances of a particular spoken word or phrase in a corpus, try to address this problem by locating the query word with the desired meaning. However, STD may not provide the desired result if the Automatic Speech Recognition (ASR) system in the STD pipeline has limited performance, or if the meaning of the retrieved item is not the one intended. In this thesis, we propose several features that can improve search and understanding of noisy conversational speech.

First, we describe the Word Burst phenomenon, which reflects a structural property of conversational speech. Word Burst refers to the tendency of particular content words to occur in close proximity to each other as a byproduct of the topic under discussion. We design a decoder output rescoring algorithm based on the Word Burst phenomenon to refine our recognition results for better STD performance. Our rescoring algorithm significantly reduced the false alarms produced by the STD system. We also leverage Word Burst as a feature for identifying recognition errors in conversational speech. Our experiments show that including the Word Burst feature can provide significant improvement. With this feature, we demonstrate that higher-level information, such as structural properties, can improve search and understanding without the need for language-specific resources or external knowledge.
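The idea of rescoring by word proximity can be pictured with a small sketch. This is not the thesis's algorithm, only an illustration under stated assumptions: the `Hit` record, the 30-second `window`, and the fixed `boost` and `penalty` values are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    word: str     # hypothesized word from the decoder output
    time: float   # start time in seconds
    score: float  # decoder confidence in [0, 1]


def rescore_with_bursts(hits, window=30.0, boost=0.1, penalty=0.1):
    """Raise the score of words that recur nearby, lower the score of isolated ones.

    All parameter values are illustrative assumptions, not tuned settings.
    """
    rescored = []
    for i, h in enumerate(hits):
        # A hit is "in a burst" if the same word occurs elsewhere within the window.
        in_burst = any(
            j != i and o.word == h.word and abs(o.time - h.time) <= window
            for j, o in enumerate(hits)
        )
        delta = boost if in_burst else -penalty
        # Clamp the adjusted score back into [0, 1].
        new_score = min(1.0, max(0.0, h.score + delta))
        rescored.append(Hit(h.word, h.time, new_score))
    return rescored


hits = [
    Hit("budget", 10.0, 0.5),
    Hit("budget", 25.0, 0.4),
    Hit("falcon", 300.0, 0.5),
]
for h in rescore_with_bursts(hits):
    print(h.word, h.time, round(h.score, 2))
```

In this toy input, the two nearby occurrences of "budget" support each other, while the isolated "falcon" hit is treated as a more likely false alarm.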

Second, we show that the mismatch between different decoder outputs created by the same ASR system can be leveraged as a feature for better STD performance. After the decoding process of an ASR system, the result can be stored as a lattice or as a confusion network. The lattice has richer historical information for each word, while the confusion network maintains a simpler and more compact format. Each of these formats contains unique information that is not present in the other. By combining the STD results generated from these two decoder outputs, we can improve STD systems as well. This feature shows that unexplored information can be stored in different outputs generated by the identical ASR system.

Last but not least, we present a feature based on distributed representations of spoken utterances. Distributed representations place similar words close together in a vector space according to their contexts: words that appear in similar contexts are projected close to each other. As a feature space, we project into it not only individual words but also utterances that contain multiple words. We apply this feature to the Spoken Word Sense Induction (SWSI) task, which differentiates target keyword instances by clustering them according to context. We compare this approach with several existing approaches and show that it achieves the best performance, regardless of the ASR quality.
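As a rough illustration of projecting utterances into the same space as words, the sketch below averages word vectors and compares utterances by cosine similarity. The abstract does not say how utterances are projected, so averaging is an assumption made here for simplicity, and the two-dimensional `word_vectors` table is invented toy data.

```python
import math


def embed_utterance(words, word_vectors):
    """Average the vectors of the known words in an utterance (assumed method)."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        raise ValueError("utterance contains no known words")
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
word_vectors = {
    "bank": [1.0, 1.0],
    "river": [0.0, 1.0],
    "water": [0.0, 0.9],
    "money": [1.0, 0.0],
    "loan": [0.9, 0.0],
}

river_sense = embed_utterance(["river", "bank", "water"], word_vectors)
money_sense = embed_utterance(["bank", "money", "loan"], word_vectors)
probe = embed_utterance(["water", "river"], word_vectors)

# The probe should sit closer to the river-related utterance.
print(cosine(river_sense, probe) > cosine(money_sense, probe))
```

A clustering step over such utterance vectors would then group instances of a keyword such as "bank" by the contexts they appear in.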

Thesis Committee:
Alexander Rudnicky (Chair)
Alan W Black
Alexander G Hauptmann
Gareth J.F. Jones (Dublin City University)

Copy of Thesis Document

Intelligent communication requires reading between the lines, which, in turn, requires rich background knowledge about how the world works. However, learning unspoken commonsense knowledge from language is nontrivial, as people rarely state the obvious, e.g., "my house is bigger than me." In this talk, I will discuss how we can recover the trivial everyday knowledge just from language without an embodied agent. A key insight is this: the implicit knowledge people share and assume systematically influences the way people use language, which provides indirect clues to reason about the world. For example, if "Jen entered her house", it must be that her house is bigger than her.

In this talk, I will first present how we can organize various aspects of commonsense — ranging from naive physics knowledge to more pragmatic connotations — by adapting representations of frame semantics. I will then discuss neural network approaches that complement the frame-centric approaches. I will conclude the talk by discussing the challenges in current models and formalisms, pointing to avenues for future research.

Yejin Choi’s primary research interests are in the fields of Natural Language Processing, Machine Learning, Artificial Intelligence, with broader interests in Computer Vision and Digital Humanities.

Language and X {vision, mind, society...}: Intelligent communication requires the ability to read between the lines and to reason beyond what is said explicitly. Her recent research has been under two broad themes: (i) learning the contextual, grounded meaning of language from various contexts in which language is used, both physical (e.g., visual) and abstract (e.g., social, cognitive), and (ii) learning the background knowledge about how the world works, latent in large-scale multimodal data. More specifically, her research interests include:

  • Language Grounding with Vision: Learning semantic correspondences between language and vision at a very large scale, addressing tasks such as image captioning, multimodal knowledge learning, and reasoning.
  • Procedural Language: Learning to interpret instructional language (e.g., cooking recipes) as action diagrams, and learning to compose a coherent natural language instruction that accomplishes a given goal and an agenda.
  • Knowledge and Reasoning: Statistical learning of commonsense knowledge from large-scale multimodal data, for example, learning physical properties (e.g., size) of common objects.
  • Language Generation: Situated language generation, conversation, storytelling, integrating multimodality and stochastic knowledge about actions, events, and affects.
  • Connotation and Intention: Statistical models to infer the communicative goals and the (hidden) intent of the author, e.g., deceptive intent, by learning statistical regularities in how something is said (form & style) in addition to what is said (content).

Instructor: Graham Neubig

Discourse relations such as ‘contrast’, ‘cause’ or ‘background’ are often postulated to explain our ability to construct coherence in discourse. Within discourse analysis frameworks such as Rhetorical Structure Theory (RST), it is assumed that discourse relations can be structured hierarchically, forming a graph or tree of discourse units. In this talk I will empirically examine properties of discourse graphs using multifactorial methods. Taking advantage of the richly annotated GUM corpus (Zeldes 2017) with 64,000 tokens annotated for 4,700 instances of 20 discourse relations in four English genres, I will suggest refinements to proposed constraints on discourse structures. Using ensemble methods and RNNs trained on multiple annotation layers in the corpus, we can visualize ‘heat maps’ for areas of referential accessibility in discourse graphs, and identify and disambiguate discourse markers in a manner that is sensitive to utterance-level context.

Amir Zeldes is a computational linguist specializing in corpus linguistics, the extraction and analysis of linguistic structures in digital text collections. His main areas of interest are at the syntax-semantics interface: He is interested in how we say what we want to say, and especially in the kinds of discourse models we retain across sentences. This includes representing entity models of who or what has been mentioned, how they are introduced and referred back to, but also relationships between utterances as a complex discourse is constructed, such as expressing causality, signaling support for arguments and opinions with evidence, contrasts and more.

He is also very interested in how we learn to be productive in our first, second and subsequent languages, producing some (but not only, and not just any) utterances and combinations we have never heard before. He believes that very many factors constantly and concurrently influence the choice between competing constructions, which means that we need multifactorial methods and multilayer corpus data in order to understand what it is that we do when we produce and understand language.



Instructor: Graham Neubig

We believe that, to be more effective, Personalized Recommender Systems should not only produce good recommendations that suit the taste of each user but also provide an explanation that shows why each recommendation would be interesting or useful to the user. Explanations may serve many different purposes. They can show how the system works (transparency) or help users make an informed choice (effectiveness). They may be evaluated on whether they convince the user to make a purchase (persuasiveness) or whether they help the user make a decision quickly (efficiency). In general, providing an explanation has been shown to build users’ trust in the recommender system [42].

Most often, the type of explanation that can be generated is constrained by the type of the model. In this thesis, we focus on generating recommendations and explanations using knowledge graphs as well as neural networks.

Knowledge graphs (KGs) show how the content associated with users and items is interlinked. Using KGs has been shown to improve recommender accuracy in the past. In the first part of this thesis, we show how recommendation accuracy can be improved using a logic programming approach on KGs. Additionally, we propose how explanations could be produced in such a setting by jointly ranking KG entities and items.

KGs, however, operate in the domain of discrete entities, and are therefore limited in their ability to deal with natural language content. Free-form text such as reviews is a good source of information about both the user and the item. In the second part of this thesis, we shift our focus to neural models that are more amenable to natural language inputs, and we show how a teacher-student-like architecture could be used to transform latent representations of user and item into that of their joint review to improve recommendation performance. We also show how such a framework could be used to select or predict a candidate review that would be most similar to the joint review. Such a review could possibly serve as an explanation of why the user would potentially like the item.

Different users are interested in different aspects of the same item. Therefore, it is often impossible to find a single review that would reflect all the interests of a user. A succinct explanation shown to a user for an item is ideally a personalized summary of all relevant reviews for that item. In the final part of this thesis, we propose a neural model that can generate a personalized abstractive summary as an explanation and describe how such a model could be evaluated.

Thesis Committee:
William W. Cohen (Chair)
Maxine Eskenazi
Ruslan Salakhutdinov
Jure Leskovec (Stanford University)

Copy of Proposal Document

An agent following instructions requires a robust understanding of language and its environment. In this talk, I will propose a model for mapping instructions to actions that jointly reasons about natural language and raw visual input obtained from a camera sensor. Unlike existing approaches that decompose the problem to separately built models, our approach does not require intermediate representations, planning procedures, or training different models for visual and language reasoning. To train, we design a reinforcement learning algorithm to address key problems in learning for natural language understanding, including learning in a few-sample regime and exploiting annotated training demonstrations. Our approach significantly outperforms supervised learning and common reinforcement learning methods.

Yoav Artzi is an Assistant Professor in the Department of Computer Science and Cornell Tech at Cornell University. His research focuses on learning expressive models for natural language understanding, most recently in situated interactive scenarios. He received best paper awards in EMNLP 2015 and ACL 2017 and a Google faculty award. Yoav holds a B.Sc. summa cum laude from Tel Aviv University and a Ph.D. from the University of Washington.

Instructor: Graham Neubig

While search engines are widely used to find educational material, current search technology is optimized to provide information of generic relevance, not results that are oriented toward a specific user's background and learning goals. As a result, users often do not get effective access to the materials best suited for their learning needs.  Moreover, little is known about the relationship between search interaction over time and actual learning outcomes. With collaborators, I have been exploring new content representations and interaction features, implicit assessment methods, and retrieval algorithms for search engines for better understanding and support of human learning, broadly defined. This talk will summarize progress from recent projects toward that goal, including new types of retrieval models that try to directly optimize expected learning gains, and user studies exploring the relationship between search quality, interaction patterns, and learning outcomes.

Kevyn Collins-Thompson is an Associate Professor of Information and Computer Science at the University of Michigan. His research explores theoretical models, algorithms, and software systems for optimally connecting people with information, especially toward educational goals. His research on personalization has been applied to real-world systems ranging from intelligent tutoring systems to commercial Web search engines. Kevyn has also pioneered techniques for modeling the reading difficulty of text, creating risk-averse search engines that maximize effectiveness while minimizing worst-case errors, and understanding and supporting how people learn language. He received his Ph.D. from the Language Technologies Institute at Carnegie Mellon University, where his advisor was Jamie Callan. Before joining the University of Michigan in 2013, he was a researcher for five years in the Context, Learning, and User Experience for Search (CLUES) group at Microsoft Research.

Instructor: Graham Neubig

Sound event detection (SED) is the task of detecting the type and the onset and offset times of sound events in audio streams. It is useful for purposes such as multimedia retrieval and surveillance. Sound event detection is difficult in several respects when compared with speech recognition: first, sound events are much more variable than phonemes, notably in terms of duration but also in terms of spectral characteristics; second, sound events often overlap with each other, which does not happen with phonemes.

To train a system for sound event detection, it is conventionally necessary to know the type, onset time and offset time of each occurrence of a sound event. We call this type of annotation strong labeling. However, such annotation is not available in amounts large enough to support deep learning. This is due to multiple reasons: first, it is tedious to manually label each sound event with exact timing information; second, the onsets and offsets of long-lasting sound events (e.g. car passing by) and repeating sound events (e.g. footsteps) may not be well-defined.

In reality, annotation of sound events often comes without exact timing information. We call such annotation weak labeling. Even though it contains incomplete information compared to strong labeling, weak labeling may come in larger amounts and is well worth exploiting. In this thesis, we propose to train deep learning models for SED using various levels of weak labeling. We start with sequential labeling, i.e. we know the sequences of sound events occurring in the training data, but without the onset and offset times. We show that the sound events can be learned and localized by a recurrent neural network (RNN) with a connectionist temporal classification (CTC) output layer, which is well suited for sequential supervision. Then we relax the supervision to presence/absence labeling, i.e. we only know whether each sound event is present or absent in each training recording. We solve SED with presence/absence labeling in the multiple instance learning (MIL) framework, and propose to analyze the network's behavior on transient, continuous and intermittent sound events.
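To make the presence/absence setting concrete, the sketch below shows one common MIL formulation: frame-level probabilities are max-pooled over time to produce a recording-level prediction that can be compared with a weak label, and the same frame-level probabilities are thresholded to localize events. The proposal does not specify its pooling function, so max pooling, the 0.5 threshold, and the toy probabilities are assumptions made for illustration.

```python
def recording_level_probs(frame_probs):
    """Max-pool frame-level probabilities over time, one value per event class.

    frame_probs is a list of frames; each frame is a list of per-class probabilities.
    """
    num_classes = len(frame_probs[0])
    return [max(frame[c] for frame in frame_probs) for c in range(num_classes)]


def localize(frame_probs, event, threshold=0.5):
    """Return (onset_frame, offset_frame) pairs where the event's probability
    stays at or above the threshold (threshold value is an assumption)."""
    segments = []
    start = None
    for t, frame in enumerate(frame_probs):
        active = frame[event] >= threshold
        if active and start is None:
            start = t  # event switches on
        elif not active and start is not None:
            segments.append((start, t - 1))  # event switched off at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(frame_probs) - 1))
    return segments


# Toy frame-level outputs for two event classes over four frames.
frame_probs = [
    [0.1, 0.9],
    [0.8, 0.2],
    [0.7, 0.1],
    [0.2, 0.6],
]
print(recording_level_probs(frame_probs))
print(localize(frame_probs, event=0))
print(localize(frame_probs, event=1))
```

During training, only the pooled recording-level values would be compared against the presence/absence labels; the frame-level values are what make localization possible at test time.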

As we explore the possibility of learning to detect sound events with weak labeling, we are often faced with the problem of data scarcity. To overcome this difficulty, we resort to transfer learning, in which we train neural networks for out-of-domain tasks on large data, and use the trained networks to extract features for SED. We make special effort to ensure the temporal resolution of such transfer learning feature extractors.

Thesis Committee:
Florian Metze (Chair)
Alexander Waibel
Alexander Hauptmann
Aren Jansen (Google)

Copy of Proposal Document

This talk will discuss two lines of work involving general-purpose neural network sentence encoders: learned functions that map natural language sentences to vectors (or sets of vectors) that are meant to capture their meanings in machine-readable form.

The bulk of the talk will focus on SNLI and MultiNLI, two new datasets for the task of natural language inference (aka recognizing textual entailment), in which a model must read two sentences and evaluate whether the first sentence entails or contradicts the second. These datasets make it possible to evaluate in a uniquely direct way the degree to which sentence encoding-based models understand language, and also, with nearly one million examples, offer a valuable data source for pretraining. The talk will close with some discussion of another open problem in sentence understanding, the role of syntactic parse trees in neural network-based modeling, and will present some results on models that attempt to learn a parser using only the supervision signal supplied by a downstream semantic task, and with no access to parsed training data.

Sam Bowman is a second year assistant professor at New York University, appointed in the Center for Data Science and the Department of Linguistics. He is the co-director of the Machine Learning for Language group and the CILVR applied machine learning lab. He completed a PhD in Linguistics in 2016 at Stanford University with Chris Manning and Chris Potts, and undergraduate and master's degrees in Linguistics at the University of Chicago. Sam has also spent time at Google Brain, TTI-Chicago, and Johns Hopkins University, and received a 2017 Google Faculty Research Award.

Sam's research focuses on the goal of building artificial neural network models for problems in natural language understanding, and includes work on model design, data collection, evaluation, unsupervised learning, and transfer learning.

Dependencies among texts arise when speakers and writers copy manuscripts, cite the scholarly literature, speak from talking points, repost content on social networking platforms, or in other ways transform earlier texts.  While in some cases these dependencies are observable—e.g., by citations or other links—we often need to infer them from the text alone.  In our Viral Texts project, for example, we have built models of reprinting for noisily-OCR'd nineteenth-century newspapers to trace the flow of news, literature, jokes, and anecdotes throughout the United States.  Our Oceanic Exchanges project is now extending that work to information propagation across language boundaries.  Other projects in our group involve inferring and exploiting text dependencies to model the writing of legislation, the impact of scientific press releases, and changes in the syntax of language.


In this talk, I will discuss methods both for inferring these dependency structures and for exploiting them to improve other tasks.  First, I will describe a new directed spanning tree model of information cascades and a new contrastive training procedure that exploits partial temporal ordering in lieu of labeled link data.  This model outperforms previous approaches to network inference on blog datasets and, unlike those approaches, can evaluate individual links and cascades.  Then, I will describe methods for extracting parallel passages from large multilingual, but not parallel, corpora by performing efficient search in the continuous document-topic simplex of a polylingual topic model.  These extracted bilingual passages are sufficient to train translation systems with greater accuracy than some standard, smaller clean datasets.  Finally, I will describe methods for automatically detecting multiple transcriptions of the same passage in a large corpus of noisy OCR and for exploiting these multiple witnesses to correct noisy text.  These multi-input encoders provide an efficient and effective approximation to the intractable multi-sequence alignment approach to collation and allow us to produce transcripts with more than 75% reductions in error.

David Smith is an assistant professor in the College of Computer and Information Science at Northeastern University. He is also a founding member of the NULab for Texts, Maps, and Networks, Northeastern's center for digital humanities and computational social sciences. His work on natural language processing focuses on applications to information retrieval, the social sciences, and humanities, on inferring network structures, and on computational linguistic models of structure learning and historical change.

Instructor: Graham Neubig

Good speech recognition systems are vitally useful to many businesses - be it in the form of a virtual assistant taking commands, understanding user feedback in the form of video reviews or improved customer service. However, world class speech recognition systems are to be had only by sharing intimate user data with third party providers, or by recruiting from among the tens of graduates of the world’s top speech and language technology programs. 

At Baidu SVAIL, we have been working on developing speech recognition systems that can be built, debugged and improved by a team with little to no experience in speech recognition (but with a solid understanding of machine learning). We believe that a highly simplified speech recognition pipeline should democratize speech recognition research, just like CNNs revolutionized computer vision. In this endeavor, we developed Deep Speech 1 as a proof-of-concept to show that such a model could be highly competitive with state-of-the-art models. With Deep Speech 2 we showed that such models generalize well to different languages, and we even deployed it for serious applications used by millions of people daily. This talk presents Deep Speech 3 - the next (and hopefully, the final) generation of speech recognition models, which further simplifies the model and enables end-to-end training while using a pre-trained language model.

Sanjeev Satheesh has been a deep learning researcher and is currently leading the speech team at the Silicon Valley AI Lab at Baidu USA.  SVAIL has been focused on the mission of using hard AI technologies to impact hundreds of millions of users.  Sanjeev has a master’s degree from Stanford, where he worked with Fei-Fei Li and Andrew Ng.

Faculty Host: Graham Neubig

