11-364: An Introduction to Deep Learning, the first undergraduate course in deep learning at Carnegie Mellon, will be hosting a poster presentation session on May 3. Topics will include deep reinforcement learning, autoencoders, deep style transfer, and generative adversarial networks, among others.

Students and faculty of SCS and other members of the Carnegie Mellon community are invited to attend.

Faculty: James Baker

The Internet has been witnessing an explosion of video content. According to a Cisco study, video accounted for 64% of the world's internet traffic in 2014, and this share is estimated to reach 80% by 2019. However, existing video search solutions are still based on text matching, and they can fail for the huge volume of videos that have little relevant metadata or no metadata at all.

In this thesis, we propose an accurate, efficient, and scalable search method for video content. As opposed to text matching, the proposed method relies on automatic video content understanding and allows for intelligent and flexible search paradigms over the video content. To achieve this ambitious goal, we propose several novel methods addressing accuracy, efficiency, and scalability in the new search paradigm. First, we introduce a novel self-paced curriculum learning theory that allows for training more accurate semantic concepts. Second, we propose a novel, scalable approach to indexing semantic concepts that can significantly improve search efficiency with minimal accuracy loss. Third, we design a novel video reranking algorithm that can boost accuracy for video retrieval. Finally, we apply the proposed video engine to tackle a text-and-visual question answering task called MemexQA.
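To make the self-paced curriculum idea concrete, here is a toy sketch of self-paced sample selection (a generic illustration only, not the thesis's actual algorithm; the data, one-parameter model, and pace schedule are all invented):

```python
# A toy sketch of self-paced learning: alternate between (a) keeping only
# samples whose current loss falls under a pace threshold and (b) refitting
# on that easy subset, then relax the threshold so harder samples are
# admitted in later rounds.

def fit_mean(xs):
    return sum(xs) / len(xs)

def self_paced_fit(samples, pace=1.0, growth=2.0, rounds=4):
    model = sorted(samples)[len(samples) // 2]   # robust starting estimate
    for _ in range(rounds):
        losses = [(x - model) ** 2 for x in samples]
        easy = [x for x, l in zip(samples, losses) if l <= pace]
        if easy:                    # (a) keep samples the model finds easy
            model = fit_mean(easy)  # (b) refit on the easy subset only
        pace *= growth              # admit harder samples next round
    return model

# Inliers near 2.0 plus one gross outlier: the outlier's loss never falls
# under the pace threshold, so it never contaminates the fit.
data = [1.8, 2.0, 2.2, 1.9, 2.1, 50.0]
print(round(self_paced_fit(data), 2))
```

The same schedule, applied to concept classifiers instead of this toy estimator, lets training start from easy examples and gradually incorporate harder ones.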

The extensive experiments demonstrate that the proposed methods are able to surpass state-of-the-art accuracy on multiple datasets. We implement E-Lamp Lite, the first large-scale semantic search engine of its kind for Internet videos. According to the National Institute of Standards and Technology (NIST), it achieved the best accuracy in the TRECVID Multimedia Event Detection (MED) task in 2013, 2014, and 2015, the most representative task for content-based video search. To the best of our knowledge, E-Lamp Lite is the first content-based semantic search system capable of indexing and searching a collection of 100 million videos.

Thesis Committee:
Alex Hauptmann
Teruko Mitamura
Louis-Philippe Morency
Tat-Seng Chua (National University of Singapore)

Copy of Thesis Document

Understanding images requires rich commonsense knowledge that is rarely written down and is hard for computers to acquire. The traditional approach to overcoming this lack of knowledge in computer vision has been to summarize it manually in the form of structured knowledge bases. While such efforts are impressive, they suffer from two critical issues when applied to practical tasks like visual recognition: scalability and usability.

This Ph.D. thesis has made progress toward solving both issues. First, instead of manually labeling everything, we developed a system that can let computers learn visual knowledge in a more automatic way. More specifically, we let computers learn by looking at photos on the Internet. We show that even with traditional, imperfect vision and natural language technologies, the system is still able to acquire various types of explicit visual knowledge at a large scale, and potentially become better as the system learns from previous iterations.

Second, for usability, we explore end-to-end approaches that directly attempt to solve specific vision problems in the hope of obtaining useful implicit knowledge, or visual commonsense. We show that 1) it is indeed possible to obtain generalizable vector representations of visual commonsense from noisy web image-query pairs directly without extra manual clean-up; 2) such implicit knowledge can be useful for related tasks such as object detection, or more structured tasks like image caption generation, etc.

To conclude the thesis work, we note the mutually beneficial, mutually dependent aspects of explicit and implicit knowledge, and propose a unified framework as our first step towards joint learning and reasoning with visual knowledge bases. We hope to 1) design a generic representation that encodes both types of knowledge; 2) develop proper algorithms to optimize that representation; and 3) showcase the efficiency and effectiveness of the framework on downstream tasks that require holistic image understanding.

Thesis Committee:
Abhinav Gupta (Chair)
Tom Mitchell
Martial Hebert
Fei-Fei Li (Stanford University)
Andrew Zisserman (University of Oxford)

Copy of Proposal Document

We often come across events on our daily commute, such as a traffic jam, a person running a red light, or an ambulance approaching. These are complex events that humans can effortlessly recognize and react to appropriately. Systems capable of recognizing complex events reliably, as humans can, would facilitate many important applications such as self-driving cars, smart security systems, and elderly care systems. Nonetheless, existing computer vision and multimedia research focuses mainly on detecting elementary visual concepts (for example, actions, objects, and scenes). Such detections alone are generally insufficient for decision making. Hence there is a pressing need for complex event detection systems, and much research emphasis should be laid on developing them.

Compared to elementary visual concept detection, complex event detection is much more difficult in terms of both the task and the data that describe it. Unlike elementary visual concepts, complex events are higher-level abstractions over longer temporal spans, and they have richer content with more dramatic variations. The web videos that depict such events are generally much larger in size, noisier in content, and sparser in labels than the images used in concept detection research. Thus, complex event detection introduces several novel research challenges that have not been sufficiently studied in the literature. In this dissertation, we propose a set of algorithms to address these challenges. These algorithms enable us to build a multimedia event detection (MED) system that is practically useful for complex event detection.

The suggested algorithms significantly improve the accuracy and speed of our MED system by addressing the aforementioned challenges. For example, our new data augmentation step and new way of integrating multi-modal information significantly reduce the impact of the large event variation problem; our two-stage Convolutional Neural Network (CNN) training method allows us to get in-domain CNN features using noisy labels; our new feature smoothing technique is a thorough solution to the problem that noisy and uninformative background contents dominate the video representations; and so forth.
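As a generic illustration of the multi-modal integration idea (the modality names and weights below are hypothetical, not the system's actual configuration), a weighted late-fusion step that renormalizes over the modalities present might look like:

```python
# Hypothetical late fusion of per-modality detection scores: each modality
# scores a video independently in [0, 1], and the fused score is a weighted
# average whose weights are renormalized over the modalities actually
# available for that video (e.g., a video may lack a speech transcript).

def fuse(scores, weights):
    """scores: {modality: score}; missing modalities are simply skipped."""
    present = [m for m in weights if m in scores]
    total = sum(weights[m] for m in present)
    return sum(weights[m] * scores[m] for m in present) / total

weights = {"visual": 0.5, "audio": 0.3, "text": 0.2}

# A video with no text modality: the visual and audio weights renormalize.
print(fuse({"visual": 0.9, "audio": 0.6}, weights))
```

Renormalizing over the available modalities keeps fused scores comparable across videos with different missing channels, one of the practical issues any multi-modal MED pipeline has to handle.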

We have implemented most of the proposed methods in the CMU-Elamp system. They have been one of the major reasons for its leading performance in the TRECVID MED competitions from 2011 to 2015, the most representative task for MED. Our governing aim, however, has been to deliver enduring lessons that can be widely used. Given the complexity of our task and the significance of these improvements, we believe that our algorithms and the lessons derived from them can generalize to other tasks. Indeed, our methods have already been used by other researchers on tasks such as medical video analysis and image segmentation.

Thesis Committee:
Alexander G. Hauptmann (Chair)
Bhiksha Raj Ramakrishnan
Louis-Philippe Morency
Leonid Sigal (Disney Research)

Copy of Draft Thesis Document

During social interactions we express ourselves not only through words but also through facial expressions, tone of voice, gesture and body posture.  However, today’s computers are still unable to reliably read and understand such nonverbal language.  Current home and mobile assistants such as Apple’s Siri, Google’s Home, and Amazon’s Echo are limited to only listening to their users.  I envision a future where a personalised robotic assistant will complement the speech signal by reading the user’s facial expressions, body gestures, emotions and intentions.  In addition, better automatic analysis of human behavior has numerous applications in the fields of human computer interaction, education and healthcare.

In this talk I will provide a brief history of work on nonverbal behavior analysis and explore the challenges that we still face.  I will particularly focus on my work on facial behavior analysis, including facial expression recognition, eye gaze estimation, and emotion recognition.  I will also discuss the applications of such technologies in healthcare settings.

Tadas Baltrušaitis is a post-doctoral associate at the Language Technologies Institute, Carnegie Mellon University working with Prof. Louis-Philippe Morency.  His primary research interests lie in the automatic understanding of nonverbal human behavior, computer vision, and multimodal machine learning.  In particular, he is interested in the application of such technologies to healthcare settings, with a focus on mental health.  Before joining CMU, he was a post-doctoral researcher at the University of Cambridge, where he also received his Ph.D. and Bachelor’s degrees in Computer Science.  His Ph.D. research focused on automatic facial expression analysis in especially difficult real world settings.

Instructor: Alex Hauptmann

When a disaster occurs, online posts in text and video, phone messages, and even newscasts express distress, fear, and anger toward the disaster itself or toward those who might address its consequences, such as local and national governments or foreign aid workers. These messages represent an important source of information about where the most urgent issues are occurring and what those issues are. However, such information sources are often difficult to triage, due to their volume and lack of specificity. They present a special challenge for aid efforts by those who do not speak the language of those who need help, especially when bilingual informants are few and when the language of those in distress is one with few computational resources. We are working in a large DARPA effort which is attempting to develop tools and techniques to support the efforts of such aid workers very quickly, by leveraging methods and resources that have already been collected for use with other, High Resource Languages. Our particular goal is to develop methods to identify sentiment and emotion in spoken language for Low Resource Languages.

Our effort to date involves two basic approaches: 1) training classifiers to detect sentiment and emotion in High Resource Languages (HRLs) such as English and Mandarin, which have relatively large amounts of data labeled with emotions such as anger, fear, and stress, and using these classifiers either directly or adapted with a small amount of labeled data in the Low Resource Language (LRL) of interest; and 2) employing a sentiment detection system trained on HRL text and adapted to the LRL using a bilingual lexicon to label transcripts of LRL speech. These labels are then applied to the aligned speech and used to train a speech classifier for positive/negative sentiment. We will describe experiments using both approaches, as well as experiments classifying news broadcasts that contain information about disasters.
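The label-projection step in the second approach can be sketched as follows (a toy illustration; every lexicon entry and token below is invented, and real systems score full transcripts with a trained classifier rather than a two-word polarity lexicon):

```python
# Toy sketch of lexicon-based cross-lingual label projection: translate each
# low-resource-language (LRL) token through a bilingual lexicon, score the
# resulting high-resource-language (HRL) words with an HRL sentiment lexicon,
# and use the sign of the summed score as the projected label for the
# aligned speech segment.

bilingual_lex = {"bon": "good", "mauvais": "bad", "tres": "very"}  # LRL -> HRL
hrl_sentiment = {"good": 1.0, "bad": -1.0}                         # HRL polarity

def project_label(lrl_tokens):
    score = sum(hrl_sentiment.get(bilingual_lex.get(t, ""), 0.0)
                for t in lrl_tokens)
    return "positive" if score >= 0 else "negative"

print(project_label(["tres", "bon"]))      # translates to "very good"
print(project_label(["tres", "mauvais"]))  # translates to "very bad"
```

The projected labels then serve as (noisy) supervision for training a classifier directly on the aligned LRL speech.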

Julia Hirschberg is the Percy K. and Vida L. W. Hudson Professor and Chair of Computer Science at Columbia University. She previously worked at Bell Laboratories and AT&T Labs, where she created the HCI Research Department.  She served on the Association for Computational Linguistics executive board (1993-2003) and the International Speech Communication Association board (1999-2007; 2005-7 as president), and has served on the International Conference on Spoken Language Processing board since 1996.  She has been editor of Computational Linguistics and Speech Communication, is a fellow of AAAI, ISCA, ACL, ACM, and IEEE, and is a member of the National Academy of Engineering.  She has received the IEEE James L. Flanagan Speech and Audio Processing Award and the ISCA Medal for Scientific Achievement.  She currently serves on the IEEE Speech and Language Processing Technical Committee, is co-chair of the CRA-W Board, and has worked for diversity for many years at AT&T and Columbia.  She works on spoken language processing and NLP, studying text-to-speech synthesis, spoken dialogue systems, entrainment in conversation, detection of deceptive and emotional speech, hedging behavior, and linguistic code-switching (language mixing).

Faculty Host: Carolyn Rosé

The field of Artificial Intelligence and Law studies how legal reasoning can be formalized in order, eventually, to develop systems that assist lawyers in researching, drafting, and evaluating arguments in a professional setting. To further this goal, researchers have been developing systems which, to a limited extent, autonomously engage in legal reasoning and argumentation on closed domains. However, populating such systems with formalized domain knowledge is the main bottleneck preventing them from making real contributions to legal practice. Given recent advances in natural language processing, the field has begun to apply more sophisticated methods to legal document analysis and to tackle more complex tasks. Meanwhile, the LegalTech sector is thriving, and companies and startups have been trying to tap into the legal industry's need to make large-scale document analysis tasks more efficient and to use predictive analytics for better decision making. This talk will present an overview of the history and state of the art in academic AI & Law, as well as selected examples of current developments in the private sector. Aspects in focus are case-based reasoning, legal text analytics, and the collaborative LUIMA project conducted by CMU, the University of Pittsburgh, and Hofstra University.

Matthias Grabmair is a postdoctoral associate in the Language Technologies Institute at Carnegie Mellon University, working with Prof. Eric Nyberg on problems in intelligent legal information management and intelligent natural language dialogue systems while also teaching at the institute.  His work is best described as (Legal) Knowledge Engineering or (Legal) Data Science.  It draws from artificial intelligence & law, knowledge representation & reasoning, natural language processing, applied machine learning, and information retrieval, as well as computational models of argument. He obtained a diploma in law from the University of Augsburg, Germany, as well as a Master of Laws (LLM) and a Ph.D. in Intelligent Systems specializing in AI & Law under Prof. Kevin Ashley at the University of Pittsburgh.

Recurrent neural networks such as LSTMs have become an indispensable tool for building probabilistic sequence models.  After discussing the statistical motivations, I'll present some not-so-obvious ways that expressive LSTMs can be harnessed to help model sequential data:

1. To score chunks of candidate latent structures in their fully observed context.  The chunks can be assembled by dynamic programming, which preserves tractable marginal inference.  (Applications: string transduction, parsing, ...)
2. To predict sequences of events in real time.  This resembles neural language modeling, but the real-time setting means that you are predicting each event jointly with the entire preceding interval of non-events.  (Applications: social media, patient histories, consumer actions, ...)
3. To classify latent syntactic properties of a language from its observed surface ordering.  This essentially converts a hard and misspecified unsupervised learning problem to a simpler supervised one.  To deal with the shortage of supervised languages to train on, we manufacture new synthetic languages.  (Applications: grammar induction, etc.)
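As background for the models above, the recurrence inside a single LSTM cell can be written out in a few lines (a minimal textbook formulation with scalar, invented weights; none of the talk's specific architectures):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step; W maps gate name -> (w_x, w_h, bias), all scalars."""
    def gate(g):
        return W[g][0] * x + W[g][1] * h_prev + W[g][2]
    i = sigmoid(gate("i"))      # input gate: how much new content to write
    f = sigmoid(gate("f"))      # forget gate: how much old cell state to keep
    o = sigmoid(gate("o"))      # output gate: how much cell state to expose
    g = math.tanh(gate("g"))    # candidate cell update
    c = f * c_prev + i * g      # new cell state
    h = o * math.tanh(c)        # new hidden state
    return h, c

# Run the recurrence over a short sequence with fixed illustrative weights.
W = {name: (0.5, 0.5, 0.0) for name in ("i", "f", "o", "g")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, W)
print(h, c)
```

In practice x, h, and c are vectors and W holds learned matrices, but the gated update of c is exactly what lets the hidden state carry information across the long spans that items 1-3 rely on.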

Jason Eisner is Professor of Computer Science at Johns Hopkins University, where he is also affiliated with the Center for Language and Speech Processing, the Machine Learning Group, the Cognitive Science Department, and the national Center of Excellence in Human Language Technology.  His goal is to develop the probabilistic modeling, inference, and learning techniques needed for a unified model of all kinds of linguistic structure.  His 100+ papers have presented various algorithms for parsing, machine translation, and weighted finite-state machines; formalizations, algorithms, theorems, and empirical results in computational phonology; and unsupervised or semi-supervised learning methods for syntax, morphology, and word-sense disambiguation.  He is also the lead designer of Dyna, a new declarative programming language that provides an infrastructure for AI research. He has received two school-wide awards for excellence in teaching.

To effectively sort and present relevant information pieces (e.g., answers, passages, documents) to human users, information systems rely on ranking models. Existing ranking models are typically designed for a specific task and therefore are not effective for complex information systems that require component changes or domain adaptations. For example, in the final stage of question answering, information systems such as IBM Watson DeepQA rank all results according to their evidence scores and judge the likelihood that each is correct or relevant. However, as information systems become more complex, determining effective ranking approaches becomes much more challenging.

Prior work includes heuristic ranking models that focus on a particular type of information object (e.g., a retrieved document or a factoid answer) using manually designed features specific to that information type. These models, however, do not use other, non-local features (e.g., features of the upstream/downstream information sources) to locate relevant information. To address this gap, my research seeks to define a ranking approach that can easily and rapidly adapt to any version of a system pipeline with an arbitrary number of phases.

We describe a general ranking approach for multi-phase, multi-strategy information systems, which produce and rank significantly more candidate results than single-phase, single-strategy systems in order to achieve acceptable robustness and overall performance. Our approach allows each phase in a system to leverage information propagated from preceding phases to inform its ranking decision. By collecting ranking features from the derivation paths that generate candidate results, the particular derivation path chosen can be used to predict result correctness or relevance. These ranking features can be extracted from an abstracted system object graph, which represents all of the objects created during system execution (e.g., provenance) and their dependencies. This ranking approach has been applied to different domains, including question answering and biomedical information retrieval. Experimental results show that our proposed approach significantly outperforms comparable answer ranking models in both domains.
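The core idea of scoring candidates by their derivation path can be sketched in miniature (the phase names, weights, and candidates below are hypothetical, and a real system would learn the weights rather than hard-code them):

```python
# Simplified sketch of ranking with derivation-path features: each candidate
# records the pipeline phases that produced it, and the ranker combines
# local evidence with features of that derivation path.

candidates = [
    {"answer": "A", "evidence": 0.70, "path": ["retrieve", "extract"]},
    {"answer": "B", "evidence": 0.65, "path": ["retrieve", "expand", "extract"]},
    {"answer": "C", "evidence": 0.72, "path": ["fallback", "extract"]},
]

# Phase-reliability weights stand in for features mined from the system
# object graph; a learned ranker would estimate such weights from data.
phase_weight = {"retrieve": 0.2, "expand": 0.1, "extract": 0.2, "fallback": -0.3}

def score(cand):
    path_feature = sum(phase_weight.get(p, 0.0) for p in cand["path"])
    return 0.7 * cand["evidence"] + 0.3 * path_feature

ranked = sorted(candidates, key=score, reverse=True)
print([c["answer"] for c in ranked])  # C is demoted by its unreliable path
```

Note that candidate C has the highest local evidence score yet ranks last: the path feature encodes that results produced by the fallback strategy tend to be less reliable, which is precisely the non-local signal a single-phase ranker cannot see.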

Thesis Committee:
Eric Nyberg (Chair)
Teruko Mitamura
Jaime Carbonell
Bowen Zhou (IBM T.J. Watson Research Center)

Copy of Thesis Document

In this talk, I will give an overview of some research projects at MSR aimed at building an open-domain neural dialogue system. We group dialogue bots by users' goals into three categories: task completion bots, information access bots, and social bots. We explore different neural network models and deep reinforcement learning techniques to build response generation engines for all three kinds of bots. I will review our experimental settings and recent results on both simulated and real users, share the lessons we have learned, and discuss future work.

Jianfeng Gao is a Partner Research Manager in the Deep Learning Technology Center (DLTC) at Microsoft Research, Redmond.  He works on deep learning for text and image processing and leads the development of AI systems for dialogue, machine reading comprehension, question answering, and enterprise applications.  He and his colleagues have developed a series of deep semantic similarity models (DSSM, also known as Sent2Vec), which have been used for a wide range of text and image processing tasks.

From 2006 to 2014, he was a Principal Researcher in the Natural Language Processing Group at Microsoft Research, Redmond, where he worked on Web search, query understanding and reformulation, ads prediction, and statistical machine translation.  From 2005 to 2006, he was a research lead in the Natural Interactive Services Division at Microsoft, where he worked on Project X, an effort to develop a natural user interface for Windows.  From 1999 to 2005, he was a research lead in the Natural Language Computing Group at Microsoft Research Asia, where, together with his colleagues, he developed the first Chinese speech recognition system released with Microsoft Office, the Chinese/Japanese Input Method Editors (IME), which were the leading products in the market, and the natural language platform for Windows Vista.

