Natural Language Understanding (NLU) systems need to encode human-generated text (or speech) and reason over it at a deep semantic level. A typical NLU system involves two main components: the first is an encoder, which composes the words (or other basic linguistic units) in the input utterances into encoded representations that are then used as features by the second component, a predictor, which reasons over the encoded inputs to produce the desired output. We argue that performing these two steps over the utterances alone is seldom sufficient for understanding language, as the utterances themselves do not contain all the information needed to understand them. We identify two kinds of additional knowledge needed to fill the gaps: background knowledge and contextual knowledge. The goal of this thesis is to build end-to-end NLU systems that encode inputs along with relevant background knowledge, and reason about them in the presence of contextual knowledge.

The first part of the thesis deals with background knowledge. While distributional methods for encoding inputs have been used to represent the meaning of words in the context of other words in the input, other aspects of semantics are out of their reach. These are related to commonsense or real-world information that is part of shared human knowledge but is not explicitly present in the input. We address this limitation by having the encoders also encode background knowledge, and present two approaches for doing so. The first is to model the selectional restrictions verbs place on their semantic role fillers. We use this model to encode events, and show that these event representations are useful in detecting anomalies in newswire text. Our second approach to augmenting distributional methods is to use external knowledge bases like WordNet. We compute ontology-grounded token-level representations of words and show that they are useful in predicting prepositional phrase attachments and textual entailment.
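The selectional-restriction idea can be illustrated with a toy count-based model (a minimal sketch; the class, the add-alpha smoothing, and all names below are illustrative stand-ins for the neural event encoder described in the abstract):

```python
import math
from collections import defaultdict

class SelectionalPreferenceModel:
    """Toy model of p(filler | verb, role); an event is anomalous
    when its role fillers are unlikely under the verb's preferences."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def observe(self, verb, role, filler):
        # accumulate (verb, role) -> filler co-occurrence counts
        self.counts[(verb, role)][filler] += 1
        self.totals[(verb, role)] += 1

    def logprob(self, verb, role, filler, alpha=1.0, vocab=1000):
        # add-alpha smoothing so unseen fillers get low but nonzero probability
        c = self.counts[(verb, role)][filler]
        return math.log((c + alpha) / (self.totals[(verb, role)] + alpha * vocab))

    def anomaly_score(self, verb, roles):
        # higher score = less plausible event under the verb's preferences
        return -sum(self.logprob(verb, r, f) for r, f in roles.items())
```

After observing many "eat apple" events, the model scores "eat car" as more anomalous than "eat apple", which is the intuition behind using selectional preferences for newswire anomaly detection.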

The second part of the thesis focuses on contextual knowledge. Machine comprehension tasks require interpreting input utterances in the context of other structured or unstructured information. This can be challenging for multiple reasons. First, given some task-specific data, retrieving the relevant contextual knowledge from it can be a serious challenge. Second, even when the relevant contextual knowledge is provided, reasoning over it may require executing a complex series of operations, depending on the structure of the context and the compositionality of the input language. To handle reasoning over contexts, we first describe a type-constrained neural semantic parsing framework for question answering (QA). We achieve state-of-the-art performance on WIKITABLEQUESTIONS, a dataset with highly compositional questions over semi-structured tables. Proposed work in this area includes applying this framework to QA in other domains with weaker supervision. To address the challenge of retrieval, we propose to build neural network models with explicit memory components that can adaptively reason and learn to retrieve relevant context given a question.

Thesis Committee:
Eduard Hovy (Chair)
Chris Dyer
William Cohen
Luke Zettlemoyer (University of Washington)

Copy of Proposal Document

Event extraction has been well studied for more than two decades, primarily through the lens of the Message Understanding Conferences (MUC) and Automatic Content Extraction (ACE) programs. However, event extraction methods to date do not offer a satisfactory solution for providing concise, structured, document-level summaries of events in news articles. Prior work in ACE focuses on fine-grained sentence-level events, which do not make good document-level summaries. Previous work under MUC relied heavily on handcrafted rules for highly specific domains, resulting in models that do not generalize easily to new domains.

In this thesis, we propose a new framework for extracting document-level event summaries called macro-events, unifying aspects of both information extraction and text summarization. The goal of this work is to extract concise, structured representations of documents that clearly outline the main event of interest and all the argument fillers necessary to describe it. Unlike work in abstractive and extractive summarization, we seek to create template-based, structured summaries rather than plain-text summaries.

We propose two novel methods to address this problem. First, we introduce a structured prediction model based on the Learning to Search framework for jointly learning argument fillers both across and within event argument slots. Second, we propose a deep neural model that treats the problem as machine comprehension, which does not require training with any on-domain macro-event labeled data. Our initial experiments on filling macro-event templates for two domains (attacks and elections) show strong performance under both models compared to existing baselines.

Thesis Committee:
Yiming Yang (Co-Chair)
Jaime Carbonell (Co-Chair)
Alexander Hauptmann
Michael Mauldin

Copy of Thesis Document

Recurrent neural networks have proven to be an extremely effective tool for language modeling and other sequence prediction tasks. These models typically predict the next element in the sequence token by token, with tokens expressed at the word or character level and each token represented by a single dense embedding vector. However, this sort of fixed granularity may be suboptimal: there are cases where it is more natural to process multi-token chunks such as multi-word phrases, or multi-character morphemes. Additionally, it may be useful to express individual tokens through multiple embeddings, as in the case of words with more than one meaning.

In this work, we propose a new language modeling paradigm that can perform both prediction and moderation of information flow at multiple granularities: neural lattice language models. These models work by constructing a lattice of possible multi-token segmentations and marginalizing across all possible paths through this lattice to calculate the probability of a sequence or to optimize parameters. We evaluate our approach on a large-scale language modeling task and show that a neural lattice model over English that handles multi-word phrases improves perplexity by 5.13 points relative to a word-level baseline, and a Chinese neural lattice language model that handles multi-character tokens improves perplexity by 6.21 points relative to a character-level baseline.
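The marginalization over segmentations can be sketched as a small dynamic program (a toy sketch: the segment scorer below is an illustrative stand-in for the neural model, and `max_len` bounds segment length as in a lattice with bounded chunks):

```python
import math

def logsumexp(xs):
    # numerically stable log of a sum of exponentials
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def segment_logprob(tokens):
    # hypothetical stand-in for a neural segment scorer:
    # here, a fixed cost per token (toy assumption)
    return -1.5 * len(tokens)

def sequence_logprob(tokens, max_len=3):
    # alpha[i] = log-marginal probability of tokens[:i], summed
    # over every path (segmentation) through the lattice ending at i
    alpha = [float("-inf")] * (len(tokens) + 1)
    alpha[0] = 0.0
    for i in range(1, len(tokens) + 1):
        terms = [alpha[i - l] + segment_logprob(tokens[i - l:i])
                 for l in range(1, min(max_len, i) + 1)]
        alpha[i] = logsumexp(terms)
    return alpha[len(tokens)]
```

With this toy scorer every segmentation of a three-token sequence has the same score, so the dynamic program recovers `-4.5 + log 4`, matching brute-force enumeration of the four possible segmentations.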

MS Thesis Committee:
Graham Neubig (Chair)
Florian Metze
Taylor Berg-Kirkpatrick

Copy of MS Thesis Document

The 2017 Annual Jelinek Memorial Workshop on Speech and Language Technology will be held at Carnegie Mellon University Language Technologies Institute.

It is a continuation of the Johns Hopkins University CLSP summer workshop series that ran from 1995 to 2016. It consists of a two-week summer school followed by a six-week workshop, in which notable researchers and students come together to collaborate on selected research topics. The workshop is named after the late Fred Jelinek, its former director and head of the Center for Speech and Language Processing.

The summer school is an introduction to state-of-the-art research in speech and language technology for graduate and undergraduate students. It also includes an introduction to this year's workshop research topics.

View the program details and register.

In this talk, I will describe the research I have carried out in the Speech Processing and Transmission Laboratory (LPTV, Laboratorio de Procesamiento y Transmision de Voz) over the last 17 years. The LPTV is located at the Universidad de Chile and was founded in 2000. I will discuss our seminal work on uncertainty and how the first results were achieved, which we believe to be the first use of uncertainty modeling in HMMs. I will also talk about our experience with speech technology for telephone applications and second-language learning, and discuss some relevant work on stochastic weighted Viterbi, multi-classifier fusion, computer-aided pronunciation training (CAPT), and voice over IP (VoIP). I will then describe the state-of-the-art robotic platform we have implemented to pursue our research on voice-based human-robot interaction, and in this context describe locally-normalized features that address the time-varying channel problem. I will show demos and discuss ideas on voice-based human-robot interaction. Finally, I will summarize our results on multidisciplinary research in signal processing.

Nestor Becerra Yoma received his Ph.D. degree from the University of Edinburgh, UK, and his M.Sc. and B.Sc. degrees from UNICAMP (Campinas State University), São Paulo, Brazil, all in Electrical Engineering, in 1998, 1993 and 1986, respectively. Since 2000 he has been a professor in the Department of Electrical Engineering at the Universidad de Chile in Santiago, where he is currently a Full Professor lecturing on telecommunications and speech processing. During the 2016-2017 academic year he has been a visiting professor at Carnegie Mellon University. At the Universidad de Chile, he launched the Speech Processing and Transmission Laboratory to carry out research on speech technology applications in human-robot interaction, language learning, and Internet and telephone services. His research interests also include multidisciplinary signal processing in fields such as astronomy, mining, and volcanology. He is the author of more than 40 journal articles, 40 conference papers, and three patents. Professor Becerra Yoma is a former Associate Editor of the IEEE Transactions on Speech and Audio Processing.


11-364: An Introduction to Deep Learning, the first undergraduate course in deep learning at Carnegie Mellon, will be hosting a poster presentation session on May 3. Topics will include deep reinforcement learning, autoencoders, deep style transfer, and generative adversarial networks, among others.

Students and faculty of SCS and other members of the Carnegie Mellon community are invited to attend.

Faculty: James Baker

The Internet has been witnessing an explosion of video content. According to a Cisco study, video content accounted for 64% of the entire world's internet traffic in 2014, and this percentage is estimated to reach 80% by 2019. However, existing video search solutions are still based on text matching, and could fail for the huge volumes of videos that have little relevant metadata or no metadata at all.

In this thesis, we propose an accurate, efficient and scalable search method for video content. As opposed to text matching, the proposed method relies on automatic video content understanding and allows for intelligent and flexible search paradigms over the video content. To achieve this ambitious goal, we propose several novel methods addressing accuracy, efficiency, and scalability in this new search paradigm. First, we introduce a novel self-paced curriculum learning theory that allows for training more accurate semantic concepts. Second, we propose a novel and scalable approach to indexing semantic concepts that can significantly improve search efficiency with minimal accuracy loss. Third, we design a novel video reranking algorithm that can boost accuracy for video retrieval. Finally, we apply the proposed video engine to tackle a text-and-visual question answering problem called MemexQA.
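The self-paced idea can be sketched with the classic binary self-paced regularizer (a minimal sketch assuming a simple alternating scheme; the thesis's self-paced curriculum learning theory is more general, and all names and parameters here are illustrative):

```python
def self_paced_weights(losses, lam):
    # binary self-paced regularizer: admit sample i only if its loss < lam,
    # so the model trains on currently "easy" samples first
    return [1.0 if loss < lam else 0.0 for loss in losses]

def self_paced_train(samples, loss_fn, update_fn, lam=0.5, growth=1.3, rounds=5):
    """Alternate between (a) selecting easy samples under the current
    model and (b) updating the model on them, raising the pace
    parameter so harder samples are admitted over time."""
    for _ in range(rounds):
        losses = [loss_fn(s) for s in samples]
        weights = self_paced_weights(losses, lam)
        update_fn(samples, weights)   # weighted model update
        lam *= growth                 # curriculum: loosen the threshold
    return lam
```

Early rounds exclude high-loss samples entirely; as `lam` grows, the full training set is gradually admitted, which is the curriculum behavior the abstract refers to.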

Extensive experiments demonstrate that the proposed methods surpass state-of-the-art accuracy on multiple datasets. We implement E-Lamp Lite, the first large-scale semantic search engine of its kind for Internet videos. According to the National Institute of Standards and Technology (NIST), it achieved the best accuracy in the TRECVID Multimedia Event Detection (MED) task in 2013, 2014 and 2015, the most representative task for content-based video search. To the best of our knowledge, E-Lamp Lite is the first content-based semantic search system capable of indexing and searching a collection of 100 million videos.

Thesis Committee:
Alex Hauptmann
Teruko Mitamura
Louis-Philippe Morency
Tat-Seng Chua (National University of Singapore)

Copy of Thesis Document

Understanding images requires rich commonsense knowledge that is rarely written down and hard for computers to acquire. The traditional approach to overcoming this lack of knowledge in computer vision has been to manually summarize it in structured databases. While such efforts are impressive, they suffer from two critical issues when applied to practical tasks like visual recognition: scalability and usability.

This Ph.D. thesis has made progress toward solving both issues. First, instead of manually labeling everything, we developed a system that lets computers learn visual knowledge in a more automatic way; specifically, we let computers learn by looking at photos on the Internet. We show that even with traditional, imperfect vision and natural language technologies, the system is still able to acquire various types of explicit visual knowledge at a large scale, and can potentially improve as it learns from previous iterations.

Second, for usability, we explore end-to-end approaches that directly attempt to solve specific vision problems in the hope of obtaining useful implicit knowledge, or visual commonsense. We show that 1) it is indeed possible to obtain generalizable vector representations of visual commonsense directly from noisy web image-query pairs without extra manual clean-up; and 2) such implicit knowledge is useful for related tasks such as object detection, or for more structured tasks like image caption generation.

To conclude the thesis work, we note the mutually beneficial, mutually dependent nature of explicit and implicit knowledge, and propose a unified framework as a first step toward joint learning and reasoning with visual knowledge bases. We hope to 1) design a generic representation that encodes both types of knowledge; 2) develop proper algorithms to optimize that representation; and 3) showcase the efficiency and effectiveness of the framework on downstream tasks that require holistic image understanding.

Thesis Committee:
Abhinav Gupta (Chair)
Tom Mitchell
Martial Hebert
Fei-Fei Li (Stanford University)
Andrew Zisserman (University of Oxford)

Copy of Proposal Document

We often come across events on our daily commute, such as a traffic jam, a person running a red light, or an ambulance approaching. These are complex events that humans can effortlessly recognize and react to appropriately. Machines capable of recognizing complex events as reliably as humans would facilitate many important applications such as self-driving cars, smart security systems, and elderly care. Nonetheless, existing computer vision and multimedia research focuses mainly on detecting elementary visual concepts (for example, actions, objects, and scenes), and such detections alone are generally insufficient for decision making. There is therefore a pressing need for complex event detection systems, and much research emphasis should be laid on developing them.

Compared to elementary visual concept detection, complex event detection is much more difficult in terms of both the task itself and the data that describe it. Unlike elementary visual concepts, complex events are higher-level abstractions with longer temporal spans, and they have richer content with more dramatic variations. The web videos that describe these events are generally much larger in size, noisier in content, and sparser in labels than the images used in concept detection research. Complex event detection thus introduces several novel research challenges that have not been sufficiently studied in the literature. In this dissertation, we propose a set of algorithms to address these challenges. These algorithms enable us to build a multimedia event detection (MED) system that is practically useful for complex event detection.

The proposed algorithms significantly improve the accuracy and speed of our MED system by addressing the aforementioned challenges. For example, our new data augmentation step and our new way of integrating multi-modal information significantly reduce the impact of large event variation; our two-stage Convolutional Neural Network (CNN) training method allows us to obtain in-domain CNN features from noisy labels; and our new feature smoothing technique offers a thorough solution to the problem of noisy, uninformative background content dominating the video representations.

We have implemented most of the proposed methods in the CMU-Elamp system, where they have been one of the major reasons for its leading performance in the TRECVID MED competitions from 2011 to 2015, the most representative task for MED. Our governing aim, however, has been to deliver enduring lessons that can be widely used. Given the complexity of our task and the significance of these improvements, we believe that our algorithms and the lessons derived from them can generalize to other tasks. Indeed, our methods have already been used by other researchers on tasks such as medical video analysis and image segmentation.

Thesis Committee:
Alexander G. Hauptmann (Chair)
Bhiksha Raj Ramakrishnan
Louis-Philippe Morency
Leonid Sigal (Disney Research)

Copy of Draft Thesis Document

