The 2017 Annual Jelinek Memorial Workshop on Speech and Language Technology will be held at Carnegie Mellon University Language Technologies Institute.

It is a continuation of the Johns Hopkins University CLSP summer workshop series from 1995-2016. It consists of a two-week summer school, followed by a six-week workshop. Notable researchers and students come together to collaborate on selected research topics. The Workshop is named after the late Fred Jelinek, its former director and head of the Center for Speech and Language Processing.

The summer school is meant to be an introduction to the state-of-the-art research in the speech and language technology area for graduate and undergraduate students. It also contains an introduction to this year's workshop research topics.

View the program details and register.

In this talk, I will describe the research I have carried out in the Speech Processing and Transmission Laboratory (LPTV, Laboratorio de Procesamiento y Transmision de Voz) over the last 17 years. The LPTV is located at the Universidad de Chile and was founded in 2000. I will discuss our seminal work on uncertainty and how the first results were achieved, which we believe to be the first use of uncertainty modeling in HMMs. I will also talk about our experience with speech technology for telephone applications and second-language learning, and discuss some relevant work on stochastic weighted Viterbi, multi-classifier fusion, computer-aided pronunciation training (CAPT), and voice over IP (VoIP). I will then describe the state-of-the-art robotic platform that we have implemented to pursue our research on voice-based human-robot interaction. In this context, I will describe locally-normalized features that address the time-varying channel problem, show demos, and discuss ideas on voice-based human-robot interaction. Finally, I will summarize our results on multidisciplinary research in signal processing.

Nestor Becerra Yoma received his Ph.D. degree from the University of Edinburgh, UK, and his M.Sc. and B.Sc. degrees from UNICAMP (Campinas State University), São Paulo, Brazil, all in Electrical Engineering, in 1998, 1993, and 1986, respectively.  Since 2000 he has been a Professor in the Department of Electrical Engineering at the Universidad de Chile in Santiago, where he is currently a Full Professor lecturing on telecommunications and speech processing. During the 2016-2017 academic year he has been a visiting professor at Carnegie Mellon University. At the Universidad de Chile, he launched the Speech Processing and Transmission Laboratory to carry out research on speech technology applications in human-robot interaction, language learning, the Internet, and telephone lines. His research interests also include multidisciplinary research on signal processing in fields such as astronomy, mining, and volcanology. He is the author of more than 40 journal articles, 40 conference papers, and three patents. Professor Becerra Yoma is a former Associate Editor of the IEEE Transactions on Speech and Audio Processing.


11-364: An Introduction to Deep Learning, the first undergraduate course in deep learning at Carnegie Mellon, will be hosting a poster presentation session on May 3. Topics will include deep reinforcement learning, autoencoders, deep style transfer, and generative adversarial networks, among others.

Students and faculty of SCS and other members of the Carnegie Mellon community are invited to attend.

Faculty: James Baker

The Internet has been witnessing an explosion of video content. According to a Cisco study, video content accounted for 64% of the entire world's internet traffic in 2014, and this percentage is estimated to reach 80% by 2019. However, existing video search solutions are still based on text matching, and could fail for the huge volumes of videos that have little relevant metadata or no metadata at all.

In this thesis, we propose an accurate, efficient, and scalable search method for video content. As opposed to text matching, the proposed method relies on automatic video content understanding, and allows for intelligent and flexible search paradigms over the video content. To achieve this ambitious goal, we propose several novel methods focusing on accuracy, efficiency, and scalability in the novel search paradigm. First, we introduce a novel self-paced curriculum learning theory that allows for training more accurate semantic concepts. Second, we propose a novel and scalable approach to index semantic concepts that can significantly improve search efficiency with minimal accuracy loss. Third, we design a novel video reranking algorithm that can boost accuracy for video retrieval. Finally, we apply the proposed video engine to tackle a text-and-visual question answering problem called MemexQA.
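The self-paced idea underlying the first contribution can be illustrated with a short sketch. This is not the thesis's algorithm; it shows only the basic self-paced learning loop on which self-paced curriculum learning builds: alternately admit "easy" samples whose loss falls below a pace threshold, refit on them, then grow the threshold so harder samples enter training. All names and values here are hypothetical.

```python
def select_easy_samples(losses, lam):
    """Binary self-paced weights: v_i = 1 if loss_i < lam, else 0."""
    return [1 if loss < lam else 0 for loss in losses]

def self_paced_schedule(losses_per_round, lam=0.5, growth=2.0):
    """Grow the pace parameter each round, admitting harder samples.

    losses_per_round stands in for the per-sample losses the model
    would produce after each refit; a real system would retrain the
    model on the selected samples between rounds.
    """
    selected = []
    for losses in losses_per_round:
        selected.append(select_easy_samples(losses, lam))
        lam *= growth  # loosen the pace: harder samples join next round
    return selected
```

With a threshold of 0.5 that doubles each round, a sample with loss 0.9 is excluded at first and admitted once the threshold reaches 1.0, which is the intended easy-to-hard progression.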

The extensive experiments demonstrate that the proposed methods are able to surpass state-of-the-art accuracy on multiple datasets. We implement E-Lamp Lite, the first large-scale semantic search engine of its kind for Internet videos. According to the National Institute of Standards and Technology (NIST), it achieved the best accuracy in the TRECVID Multimedia Event Detection (MED) task in 2013, 2014, and 2015, the most representative task for content-based video search. To the best of our knowledge, E-Lamp Lite is the first content-based semantic search system that is capable of indexing and searching a collection of 100 million videos.

Thesis Committee:
Alex Hauptmann
Teruko Mitamura
Louis-Philippe Morency
Tat-Seng Chua (National University of Singapore)

Copy of Thesis Document

Understanding images requires rich commonsense knowledge that is rarely written down and hard for computers to acquire. The traditional approach to overcoming this lack of knowledge in computer vision has been to manually summarize it in the form of structured databases. While such efforts are impressive, they suffer from two critical issues when applied to practical tasks like visual recognition: scalability and usability.

This Ph.D. thesis has made progress toward solving both issues. First, instead of manually labeling everything, we developed a system that lets computers acquire visual knowledge more automatically. More specifically, we let computers learn by looking at photos on the Internet. We show that even with traditional, imperfect vision and natural language technologies, the system is still able to acquire various types of explicit visual knowledge at large scale, and can potentially improve as it learns from previous iterations.

Second, for usability, we explore end-to-end approaches that directly attempt to solve specific vision problems in the hope of obtaining useful implicit knowledge, or visual commonsense. We show that 1) it is indeed possible to obtain generalizable vector representations of visual commonsense from noisy web image-query pairs directly without extra manual clean-up; 2) such implicit knowledge can be useful for related tasks such as object detection, or more structured tasks like image caption generation, etc.

To conclude the thesis work, we note the mutually-beneficial, mutually-dependent aspects of explicit and implicit knowledge, and propose a unified framework as our first step towards joint learning and reasoning with visual knowledge bases. We hope to 1) design a generic representation to encode both types of knowledge; 2) develop proper algorithms to optimize the representation; and 3) showcase the efficiency and effectiveness of the framework on downstream tasks that require holistic image understanding.

Thesis Committee:
Abhinav Gupta (Chair)
Tom Mitchell
Martial Hebert
Fei-Fei Li (Stanford University)
Andrew Zisserman (University of Oxford)

Copy of Proposal Document

We often come across events on our daily commute such as a traffic jam, a person running a red light, or an ambulance approaching. These are complex events that humans can effortlessly recognize and react to. Being capable of recognizing complex events reliably, as humans can, will facilitate many important applications such as self-driving cars, smart security systems, and elderly care systems. Nonetheless, existing computer vision and multimedia research focuses mainly on detecting elementary visual concepts (for example, actions, objects, and scenes). Such detections alone are generally insufficient for decision making, so there is a pressing need for research on complex event detection systems.

Compared to elementary visual concept detection, complex event detection is much more difficult in representing both the task and the data that describe it. Unlike elementary visual concepts, complex events are higher-level abstractions with longer temporal spans, and they have richer content with more dramatic variations. The web videos that depict these events are generally much larger in size, noisier in content, and sparser in labels than the images used for concept detection research. Thus, complex event detection introduces several novel research challenges that have not been sufficiently studied in the literature. In this dissertation, we propose a set of algorithms to address these challenges. These algorithms enable us to build a multimedia event detection (MED) system that is practically useful for complex event detection.

The suggested algorithms significantly improve the accuracy and speed of our MED system by addressing the aforementioned challenges. For example, our new data augmentation step and new way of integrating multi-modal information significantly reduce the impact of the large event variation problem; our two-stage Convolutional Neural Network (CNN) training method allows us to get in-domain CNN features using noisy labels; our new feature smoothing technique is a thorough solution to the problem that noisy and uninformative background contents dominate the video representations; and so forth.

We have implemented most of the proposed methods in the CMU-Elamp system. They have been one of the major reasons for its leading performance in the TRECVID MED competitions from 2011 to 2015, the most representative task for MED. Our governing aim, however, has been to deliver enduring lessons that can be widely used. Given the complexity of our task and the significance of these improvements, we believe that our algorithms and the lessons derived from them can generalize to other tasks. Indeed, our methods have been used by other researchers on tasks such as medical video analysis and image segmentation.

Thesis Committee:
Alexander G. Hauptmann (Chair)
Bhiksha Raj Ramakrishnan
Louis-Philippe Morency
Leonid Sigal (Disney Research)

Copy of Draft Thesis Document

During social interactions we express ourselves not only through words but also through facial expressions, tone of voice, gesture, and body posture.  However, today’s computers are still unable to reliably read and understand such nonverbal language.  Current home and mobile assistants such as Apple’s Siri, Google Home, and Amazon Echo are limited to only listening to their users.  I envision a future where a personalised robotic assistant will complement the speech signal by reading the user’s facial expressions, body gestures, emotions, and intentions.  In addition, better automatic analysis of human behavior has numerous applications in the fields of human-computer interaction, education, and healthcare.

In this talk I will provide a brief history of work on nonverbal behavior analysis and explore the challenges that we still face.  I will particularly focus on my work on facial behavior analysis, including facial expression recognition, eye gaze estimation, and emotion recognition.  I will also discuss the applications of such technologies in healthcare settings.

Tadas Baltrušaitis is a post-doctoral associate at the Language Technologies Institute, Carnegie Mellon University working with Prof. Louis-Philippe Morency.  His primary research interests lie in the automatic understanding of nonverbal human behavior, computer vision, and multimodal machine learning.  In particular, he is interested in the application of such technologies to healthcare settings, with a focus on mental health.  Before joining CMU, he was a post-doctoral researcher at the University of Cambridge, where he also received his Ph.D. and Bachelor’s degrees in Computer Science.  His Ph.D. research focused on automatic facial expression analysis in especially difficult real world settings.

Instructor: Alex Hauptmann

When disaster occurs, online posts in text and video, phone messages, and even newscasts expressing distress, fear, and anger, whether toward the disaster itself or toward those who might address its consequences, such as local and national governments or foreign aid workers, represent an important source of information about where the most urgent issues are occurring and what these issues are.  However, these information sources are often difficult to triage, due to their volume and lack of specificity.  They represent a special challenge for aid efforts by those who do not speak the language of those who need help, especially when bilingual informants are few and when the language of those in distress has few computational resources.  We are working in a large DARPA effort which is attempting to develop tools and techniques to support the efforts of such aid workers very quickly, by leveraging methods and resources which have already been collected for use with other, High Resource Languages.  Our particular goal is to develop methods to identify sentiment and emotion in spoken language for Low Resource Languages.

Our effort to date involves two basic approaches: 1) training classifiers to detect sentiment and emotion in High Resource Languages (HRLs) such as English and Mandarin, which have relatively large amounts of data labeled with emotions such as anger, fear, and stress, and using these classifiers directly or adapting them with a small amount of labeled data in the Low Resource Language (LRL) of interest; and 2) employing a sentiment detection system trained on HRL text and adapted to the LRL using a bilingual lexicon to label transcripts of LRL speech.  These labels are then used as labels for the aligned speech to train a speech classifier for positive/negative sentiment.  We will describe experiments using both approaches, as well as experiments classifying news broadcasts that contain information about disasters.
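The second approach can be sketched in miniature. This is not the project's actual system: the bilingual lexicon, the word lists, and the trivial word-counting scorer below are all invented stand-ins for a trained HRL sentiment model, shown only to make the label-projection pipeline concrete.

```python
# Toy pipeline: project LRL transcript tokens into the HRL via a
# bilingual lexicon, score sentiment with an HRL "model" (here a
# word-list scorer), and reuse the label for the aligned speech.
BILINGUAL_LEXICON = {"bon": "good", "mauvais": "bad", "jour": "day"}
HRL_POSITIVE = {"good"}
HRL_NEGATIVE = {"bad"}

def project_to_hrl(lrl_tokens):
    """Translate each LRL token if the lexicon covers it; else keep it."""
    return [BILINGUAL_LEXICON.get(tok, tok) for tok in lrl_tokens]

def sentiment_label(lrl_tokens):
    """Label an LRL transcript segment by scoring its HRL projection."""
    hrl = project_to_hrl(lrl_tokens)
    score = sum((t in HRL_POSITIVE) - (t in HRL_NEGATIVE) for t in hrl)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

In the real setting, the resulting positive/negative labels would be attached to the speech segments aligned with each transcript and used to train an acoustic sentiment classifier.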

Julia Hirschberg is the Percy K. and Vida L. W. Hudson Professor and Chair of Computer Science at Columbia University. She previously worked at Bell Laboratories and AT&T Labs, where she created the HCI Research Department.  She served on the Association for Computational Linguistics executive board (1993-2003) and the International Speech Communication Association board (1999-2007; 2005-7 as president), and has served on the International Conference on Spoken Language Processing board since 1996.  She has been editor of Computational Linguistics and Speech Communication, is a fellow of AAAI, ISCA, ACL, ACM, and IEEE, and is a member of the National Academy of Engineering.  She has received the IEEE James L. Flanagan Speech and Audio Processing Award and the ISCA Medal for Scientific Achievement.  She currently serves on the IEEE Speech and Language Processing Technical Committee, is co-chair of the CRA-W Board, and has worked for diversity for many years at AT&T and Columbia.  She works on spoken language processing and NLP, studying text-to-speech synthesis, spoken dialogue systems, entrainment in conversation, detection of deceptive and emotional speech, hedging behavior, and linguistic code-switching (language mixing).

Faculty Host: Carolyn Rosé

The field of Artificial Intelligence and Law studies how legal reasoning can be formalized in order, eventually, to develop systems that assist lawyers in the tasks of researching, drafting, and evaluating arguments in a professional setting. To further this goal, researchers have been developing systems that, to a limited extent, autonomously engage in legal reasoning and argumentation on closed domains. However, populating such systems with formalized domain knowledge is the main bottleneck preventing them from making real contributions to legal practice. Given the recent advances in natural language processing, the field has begun to apply more sophisticated methods to legal document analysis and to tackle more complex tasks. Meanwhile, the LegalTech sector is thriving, and companies and startups have been trying to tap into the legal industry’s need to make large-scale document analysis tasks more efficient and to use predictive analytics for better decision making. This talk will present an overview of the history and state of the art in academic AI & Law, as well as selected examples of current developments in the private sector. Aspects in focus are case-based reasoning, legal text analytics, and the collaborative LUIMA project conducted by CMU, the University of Pittsburgh, and Hofstra University.

Mattias Grabmair is a postdoctoral associate in the Language Technologies Institute at Carnegie Mellon University, working with Prof. Eric Nyberg on problems in intelligent legal information management and intelligent natural language dialogue systems, while also teaching at the institute.  His work is best described as (Legal) Knowledge Engineering or (Legal) Data Science.  It draws from artificial intelligence & law, knowledge representation & reasoning, natural language processing, applied machine learning, and information retrieval, as well as computational models of argument. He obtained a diploma in law from the University of Augsburg, Germany, as well as a Master of Laws (LLM) and a Ph.D. in Intelligent Systems specializing in AI & Law under Prof. Kevin Ashley at the University of Pittsburgh.

Recurrent neural networks such as LSTMs have become an indispensable tool for building probabilistic sequence models.  With discussion of the statistical motivations, I'll give some not-so-obvious ways that expressive LSTMs can be harnessed to help model sequential data:

1. To score chunks of candidate latent structures in their fully observed context.  The chunks can be assembled by dynamic programming, which preserves tractable marginal inference.  (Applications: string transduction, parsing, ...)
2. To predict sequences of events in real time.  This resembles neural language modeling, but the real-time setting means that you are predicting each event jointly with the entire preceding interval of non-events.  (Applications: social media, patient histories, consumer actions, ...)
3. To classify latent syntactic properties of a language from its observed surface ordering.  This essentially converts a hard and misspecified unsupervised learning problem to a simpler supervised one.  To deal with the shortage of supervised languages to train on, we manufacture new synthetic languages.  (Applications: grammar induction, etc.)
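Point 1 can be made concrete with a small sketch. The talk uses an LSTM to score each candidate chunk in context; here `chunk_score` is a placeholder function standing in for that scorer, and the dynamic program below finds the best-scoring segmentation of a sequence into chunks, the kind of exact marginal/Viterbi inference that chunk-level scoring keeps tractable.

```python
def best_segmentation(seq, chunk_score, max_len=3):
    """Viterbi-style DP over chunkings: best[j] is the best score of
    any segmentation of seq[:j] into chunks of length <= max_len,
    where each chunk is scored independently by chunk_score."""
    n = len(seq)
    best = [float("-inf")] * (n + 1)
    best[0] = 0.0
    back = [0] * (n + 1)  # back[j]: start index of the last chunk ending at j
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            score = best[i] + chunk_score(seq[i:j])
            if score > best[j]:
                best[j], back[j] = score, i
    # Recover the chunk boundaries by following back-pointers.
    chunks, j = [], n
    while j > 0:
        chunks.append((back[j], j))
        j = back[j]
    return best[n], chunks[::-1]
```

For example, with a scorer that rewards long chunks, `chunk_score = lambda c: float(len(c) ** 2)`, the sequence `"abcdef"` is split into two length-3 chunks for a total score of 18.0. Replacing the toy scorer with an LSTM that reads the chunk together with its observed context recovers the setting described in the talk.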

Jason Eisner is Professor of Computer Science at Johns Hopkins University, where he is also affiliated with the Center for Language and Speech Processing, the Machine Learning Group, the Cognitive Science Department, and the national Center of Excellence in Human Language Technology.  His goal is to develop the probabilistic modeling, inference, and learning techniques needed for a unified model of all kinds of linguistic structure.  His 100+ papers have presented various algorithms for parsing, machine translation, and weighted finite-state machines; formalizations, algorithms, theorems, and empirical results in computational phonology; and unsupervised or semi-supervised learning methods for syntax, morphology, and word-sense disambiguation.  He is also the lead designer of Dyna, a new declarative programming language that provides an infrastructure for AI research. He has received two school-wide awards for excellence in teaching.

