Deep learning has proven very successful in many applications that require advanced pattern matching, including computer vision. However, it remains unclear how deep learning can contribute to other tasks, such as logical reasoning. In this talk, I introduce two of our recent works in this direction: Visual Question Answering and Computer Go. We show that, with different architectures, we achieve state-of-the-art performance compared to existing approaches.

Yuandong Tian is a Research Scientist at Facebook AI Research, working on Deep Learning and Computer Vision. Prior to that, he was a Software Engineer on the Google Self-Driving Car team from 2013 to 2014. He received his Ph.D. from the Robotics Institute at Carnegie Mellon University in 2013, and his Bachelor's and Master's degrees in Computer Science from Shanghai Jiao Tong University. He is the recipient of a 2013 ICCV Marr Prize Honorable Mention for his work on globally optimal solutions to nonconvex optimization in image alignment.

Modern intelligent agents will need to learn the manipulation actions that humans perform. They will need to recognize these actions when they see them, and they will need to perform these actions themselves. The lesson from the findings on mirror neurons is that the two processes of interpreting visually observed actions and generating actions should share the same underlying cognitive process. The talk will present a cognitive system that interprets human manipulation actions from perceptual information (image and depth data) and consists of perceptual modules and reasoning modules that interact with each other.

The talk focuses on two core problems at the heart of manipulation action understanding: a) the grounding of relevant information about actions in perception (the perception-action integration problem), and b) the organization of perceptual and high-level symbolic information for interpreting the actions (the sequencing problem). At the high level, actions are represented with the Manipulation Action Context-Free Grammar (MACFG), a syntactic grammar with associated parsing algorithms, which organizes actions as a sequence of sub-events. Each sub-event is described by the hand (including grasp type), the movements (actions), and the objects and tools involved; the relevant information about these quantities is obtained from biologically inspired perception modules. These modules recognize the hand grasp, the consequences of manipulation actions, and object-wise spatial primitives. Furthermore, a probabilistic semantic parsing framework based on Combinatory Categorial Grammar (CCG) theory is adopted to model the semantic meaning of human manipulation actions.

By analogy, understanding manipulation actions is like understanding language, while executing them is like generating language. Experiments on two tasks are conducted to validate the formalism: 1) a robot observing people performing manipulation actions, and 2) a robot then executing manipulation actions accordingly.
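To give a flavor of the grammar-based view of actions described above, here is a toy context-free grammar and parser for sequences of manipulation sub-events. The productions and vocabulary are invented for illustration; they are not the actual MACFG rules or the CCG semantic parser.

```python
# A toy context-free grammar for sequences of manipulation sub-events.
# The productions and vocabulary are illustrative, not the actual MACFG.
GRAMMAR = {
    "ACTIONS": [["ACTION", "ACTIONS"], ["ACTION"]],  # sequence of sub-events
    "ACTION":  [["HAND", "MOVE", "OBJECT"]],         # hand + movement + object
    "HAND":    [["lefthand"], ["righthand"]],
    "MOVE":    [["grasp"], ["cut"], ["pour"]],
    "OBJECT":  [["knife"], ["bread"], ["cup"]],
}

def parse(symbol, tokens, pos=0):
    """Recursive-descent parse; returns (tree, next_pos) or None on failure."""
    for production in GRAMMAR[symbol]:
        children, p, ok = [], pos, True
        for part in production:
            if part in GRAMMAR:                          # non-terminal: recurse
                sub = parse(part, tokens, p)
                if sub is None:
                    ok = False
                    break
                tree, p = sub
                children.append(tree)
            elif p < len(tokens) and tokens[p] == part:  # terminal: match token
                children.append(part)
                p += 1
            else:
                ok = False
                break
        if ok:
            return (symbol, children), p
    return None

sentence = "righthand grasp knife righthand cut bread".split()
result = parse("ACTIONS", sentence)   # parses as two sub-events
```

The parse tree organizes the observed token stream into sub-events, which is the role the grammar plays in the sequencing problem above.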

Dr. Yezhou Yang is a Postdoctoral Research Associate at the Computer Vision Lab and the Automation, Robotics and Cognition (ARC) Lab at the University of Maryland Institute for Advanced Computer Studies, working with his Ph.D. advisors, Prof. Yiannis Aloimonos and Dr. Cornelia Fermuller. His main interests lie in Cognitive Robotics, Computer Vision, and Robot Vision, especially exploring visual primitives in human action understanding from visual input, grounding them in natural language, and high-level reasoning over the primitives for intelligent robots. He was a recipient of the Qualcomm Innovation Fellowship in 2011, the UMD CS Department Dean's Fellowship award, and the Microsoft Research Asia Young Researcher Scholarship in 2009. He received a B.A. in Computer Science from Zhejiang University in 2010, and a Ph.D. in Computer Science from the University of Maryland, College Park in 2015.

From the clink of a mug placed onto a saucer to the bustle of a busy café, our days are filled with visual experiences that are accompanied by characteristic sounds.  These sounds, when paired with their corresponding videos, can provide a rich training signal that allows us to learn visual representations of objects, materials, and scenes.  In this talk, I'll first address the material-understanding task of predicting what sound an object makes when it is hit or scratched.  I'll present an algorithm that learns to predict plausible soundtracks for silent videos of people striking objects with a drumstick. The sounds predicted by this model convey information about materials and physical interactions, and they frequently fool human subjects in "real or fake" psychophysical studies. I will then apply similar ideas to show that ambient audio -- e.g., crashing waves, people speaking in a crowd -- can be used to learn about objects and scenes.  By training a convolutional network to predict held-out sound for internet videos, we can learn image representations that perform well on object recognition tasks, and which contain units that are selective for sound-producing objects.
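The self-supervised recipe sketched above (using co-occurring sound as a training signal for visual learning) can be illustrated in miniature: cluster audio features into pseudo-labels, then train a visual model to predict them. Everything below is a synthetic stand-in; a real system would use audio textures from video soundtracks and a convolutional network on the frames.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 300 video clips drawn from 3 latent "sound types",
# each with a paired audio feature and a correlated image feature.
K, N, D = 3, 300, 6
types = rng.integers(K, size=N)
audio_proto = rng.normal(size=(K, D))
image_proto = rng.normal(size=(K, D))
audio = audio_proto[types] + 0.1 * rng.normal(size=(N, D))
image = image_proto[types] + 0.1 * rng.normal(size=(N, D))

# Step 1: cluster audio features into pseudo-labels (k-means with
# farthest-point initialization; no human annotation is involved).
centers = [audio[0]]
for _ in range(K - 1):
    d2 = ((audio[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
    centers.append(audio[np.argmax(d2)])
centers = np.array(centers)
for _ in range(10):
    assign = ((audio[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    centers = np.stack([audio[assign == k].mean(0) for k in range(K)])

# Step 2: train a visual model to predict the audio-derived pseudo-labels.
# A nearest-centroid classifier in image space stands in for a convnet.
proto = np.stack([image[assign == k].mean(0) for k in range(K)])
pred = ((image[:, None] - proto[None]) ** 2).sum(-1).argmin(1)
agreement = (pred == assign).mean()   # visual model recovers the sound clusters
```

The point of the sketch is that the labels come from the audio channel for free, yet the model trained on them ends up grouping images by their sound-producing content.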

Andrew Owens is a graduate student at the MIT Computer Science and Artificial Intelligence Laboratory, working under the supervision of Bill Freeman and Antonio Torralba.  Before that, he obtained his B.A. in Computer Science at Cornell University in 2010.  He is a recipient of a Microsoft Research PhD Fellowship, an NDSEG Fellowship, and a Best Paper Honorable Mention Award at CVPR 2011.

A first-person camera placed on a person's head captures candid moments of our lives, providing detailed visual data of how we interact with people, objects, and scenes. It reveals our future intentions and momentary visual sensorimotor behaviors. With first-person vision, can we build a computational model of personalized intelligence that predicts what we see and how we act by "putting yourself in her/his shoes"?

We provide three examples. (1) At the physical level, we predict the wearer's intent in the form of the forces and torques that control the movements. Our model integrates visual scene semantics, 3D reconstruction, and inverse optimal control to compute the active forces (pedaling and braking while biking) and the experienced passive forces (gravity, air drag, and friction) in a first-person sport video. (2) At the spatial scene level, we predict plausible future trajectories of ego-motion. The predicted paths avoid obstacles, move between objects, and even turn around a corner into the unseen space behind objects. (3) At the object level, we study the holistic correlation of visual attention with motor action by introducing "action-objects" associated with seeing and touching actions. Such action-objects exhibit a characteristic 3D spatial distance and orientation with respect to the person, which allows us to build a predictive model using EgoNet. We demonstrate that we can predict momentary visual attention and motor actions for first-person videos without gaze tracking or tactile sensing.
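As a back-of-the-envelope illustration of the force decomposition in example (1), the sketch below finite-differences a reconstructed 1-D ego-trajectory into acceleration and subtracts a modeled passive force to recover the active one. The trajectory, mass, and drag coefficient are invented, the ground is assumed flat, and the talk's actual model (inverse optimal control over full 3D motion with scene semantics) is far richer.

```python
import numpy as np

# Assumed toy model (not the talk's actual method): recover the active
# force on a bike rider on flat ground from a reconstructed trajectory.
m, c_drag, dt = 75.0, 0.4, 0.1        # mass (kg), drag coeff., time step (s)
t = np.arange(0.0, 5.0, dt)
x = 2.0 * t + 0.3 * t ** 2            # synthetic forward position (m)

v = np.gradient(x, dt)                # velocity via finite differences
a = np.gradient(v, dt)                # acceleration
f_total = m * a                       # Newton's second law, per time step
f_drag = -c_drag * v * np.abs(v)      # quadratic air drag, opposing motion
f_active = f_total - f_drag           # pedaling force = total minus passive
```

For this constantly accelerating trajectory the recovered active force is positive throughout and grows with speed, exactly the pattern one would expect from steady pedaling against drag.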

This is joint work with Hyun Soo Park.

Jianbo Shi studied Computer Science and Mathematics as an undergraduate at Cornell University, where he received his B.A. in 1994. He received his Ph.D. degree in Computer Science from the University of California at Berkeley in 1998. He joined The Robotics Institute at Carnegie Mellon University in 1999 as research faculty, where he led the Human Identification at a Distance (HumanID) project, developing vision techniques for human identification and activity inference. In 2003 he joined the University of Pennsylvania, where he is currently a Professor of Computer and Information Science. In 2007, he was awarded the Longuet-Higgins Prize for his work on Normalized Cuts.

His current research focuses on first-person human behavior analysis and image recognition and segmentation. His other research interests include image/video retrieval, 3D vision, and vision-based desktop computing. His long-term interests center on the broader area of machine intelligence: he wishes to develop a "visual thinking" module that allows computers not only to understand the environment around us, but also to achieve cognitive abilities such as machine memory and learning.

Sponsored in part by Disney Research.

In recent years, commodity 3D sensors, such as the Microsoft Kinect, have become easily and widely available. These advances in sensing technology have inspired significant interest in using the captured data for mapping and understanding 3D environments. In this talk, I will present our current research in this fascinating field, show potential future research directions, and talk about long-term goals. More specifically, I will show how we can now easily obtain a 3D reconstruction of an environment, and how we can exploit these results in order to infer semantics of a scene.

Matthias Niessner is a visiting assistant professor at Stanford University. Prior to his appointment at Stanford, he earned his PhD from the University of Erlangen-Nuremberg, Germany, under the supervision of Günther Greiner. His research focuses on different fields of computer graphics and computer vision, including the reconstruction and semantic understanding of 3D scene environments.

Sponsored in part by Disney Research.

A successful autonomous system needs to not only understand the visual world but also communicate its understanding to humans. To make this possible, language can serve as a natural link between high-level semantic concepts and low-level visual perception. In this talk, I'll present our recent work in the interdisciplinary domain of vision and language. I'll show how we can exploit the alignment between movies and books in order to build more descriptive captioning systems. I'll also discuss our efforts towards automatic understanding of the stories behind long and complex videos.

Sanja Fidler is an Assistant Professor at the Department of Computer Science, University of Toronto. Previously she was a Research Assistant Professor at TTI-Chicago, a philanthropically endowed academic institute located on the campus of the University of Chicago. She was a postdoctoral fellow at the University of Toronto during 2011-2012. She completed her PhD in computer science at the University of Ljubljana in 2010, and was a visiting student at UC Berkeley in the final year of her PhD. She is serving as an Area Chair for the upcoming CVPR'16, EMNLP'16 and ACCV'16, and as a Program Chair of 3DV'16. Her main research interests are object recognition, 3D scene understanding, and combining vision and language.

Sponsored in part by Disney Research.

The focus of this talk will be on detailed scene understanding from RGB-D images. We approach this problem by studying a variety of central vision problems, such as bottom-up grouping, object detection, instance segmentation, and pose estimation, in the context of RGB-D images, and finally aligning CAD models to objects in the scene. This results in a detailed output that goes beyond what most current computer vision algorithms produce (a bounding box or a segmentation mask for the object of interest), and that is useful for a variety of real-world applications, such as perceptual robotics and augmented reality.

A central question in this work is how to learn good features for depth images, given that labeled RGB-D datasets are much smaller than the labeled RGB datasets (such as ImageNet) typically used for feature learning. I will describe our "cross-modal distillation" technique, which allows us to leverage easily available annotations on RGB images to learn representations on depth images. In addition, I will also briefly talk about some work on vision and language that I did during an internship at Microsoft Research.
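In spirit, cross-modal distillation trains a depth "student" network to regress the features that an RGB-pretrained "teacher" network produces on the paired RGB image, so no depth labels are needed. The linear models and random data below are toy stand-ins for the actual networks, not the talk's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (all names hypothetical): a frozen linear "teacher" playing
# the role of a network pretrained on labeled RGB data, and a linear
# "student" to be trained on paired but unlabeled depth images.
D_IN, D_FEAT, N = 12, 4, 64
W_teacher = rng.normal(size=(D_FEAT, D_IN))     # frozen RGB feature extractor
W_student = np.zeros((D_FEAT, D_IN))            # depth feature extractor

rgb = rng.normal(size=(D_IN, N))                # paired RGB / depth "images":
depth = rgb + 0.1 * rng.normal(size=(D_IN, N))  # same scene, other modality

def distill_loss(W):
    # The supervision is the teacher's feature on the paired RGB image,
    # not a class label: the student regresses onto it.
    diff = W @ depth - W_teacher @ rgb
    return 0.5 * np.mean(np.sum(diff ** 2, axis=0))

loss_before = distill_loss(W_student)
for _ in range(300):
    diff = W_student @ depth - W_teacher @ rgb
    W_student -= 0.1 * (diff @ depth.T) / N     # gradient step, student only
loss_after = distill_loss(W_student)            # depth features now mimic RGB
```

After training, the student's depth features approximate the teacher's RGB features on paired data, which is what lets annotations collected on RGB images transfer to the depth modality.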

Saurabh Gupta is a Ph.D. student at UC Berkeley, where he is advised by Jitendra Malik. His research interests include computer vision and machine learning. During his Ph.D., he has studied the problem of scene understanding from RGB-D images. His work has been supported by the Berkeley Fellowship and the Google Fellowship in Computer Vision.

Sponsored in part by Disney Research.

Object recognition from images is a longstanding and challenging problem in computer vision. The main challenge is that the appearance of objects in images is affected by a number of factors, such as illumination, scale, camera viewpoint, intra-class variability, occlusion, and truncation. How to handle all these factors in object recognition is still an open problem. In this talk, I present my efforts in building 3D object representations for object recognition. Compared to 2D appearance-based object representations, 3D object representations can capture the 3D nature of objects and better handle viewpoint variation, occlusion, and truncation in object recognition.

Yu Xiang is a Postdoctoral Researcher in the Computer Science Department at Stanford University. His research focuses on understanding objects and scenes from images and videos, with emphasis on recognizing both semantic and 3D geometric properties of objects and scenes. His current work attempts to develop 3D object representation and recognition methods that can be useful for real world applications. Yu Xiang received his Ph.D. in computer vision from the University of Michigan in 2015, M.S. degree in computer science from Fudan University in 2010, and B.S. degree in computer science from Fudan University in 2007.

Host: Kris Kitani

One primary goal of AI from its very beginning has been to develop systems that can understand an image in a meaningful way. While we have seen tremendous progress in recent years on naming-style tasks like image classification or object detection, a meaningful understanding requires going beyond this paradigm. Scenes are inherently 3D, so our understanding must also capture the underlying 3D and physical properties. Additionally, our understanding must be human-centric since any man-made scene has been built with humans in mind. Despite the importance of obtaining a 3D and human-centric understanding, we are only beginning to scratch the surface on both fronts: many fundamental questions, in terms of how to both frame and solve the problem, remain unanswered.

In this talk, I will discuss my efforts towards building a physical and human-centric understanding of images. I will present work addressing the questions: (1) what 3D properties should we model and predict from images, and do we actually need explicit 3D training data to do this? (2) how can we reconcile data-driven learning techniques with the physical constraints that exist in the world? and (3) how can understanding humans improve traditional 3D and object recognition tasks?

David Fouhey is a Ph.D. student at the Robotics Institute of Carnegie Mellon University, where he is advised by Abhinav Gupta and Martial Hebert. His research interests include computer vision and machine learning with a particular focus on scene understanding. David's work has been supported by both NSF and NDSEG fellowships. He has spent time at Microsoft Research and University of Oxford's Visual Geometry Group.

Sponsored in part by Disney Research.

Crowd-annotated datasets have become a mainstay in computer vision, enabling some of the most significant discoveries of recent years. However, the research community has yet to exploit the full range of human intelligence available in the crowd. This talk demonstrates that, by going beyond naive annotation, researchers can access the vast potential of crowdsourcing. Using polling, active learning, intelligent annotation protocols, and other techniques, we are able to leverage the crowd to discover a taxonomy of visual attributes, build detectors with minimal supervision, and economically label massive datasets.
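One of the ingredients above, active learning, can be sketched in a few lines: fit a model on the labels collected so far, then spend the next crowd query on the example the model is least certain about. The 1-D data and logistic model below are illustrative stand-ins for a real annotation pipeline, with an oracle function simulating the crowd.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy pool: 1-D points whose true label is sign(x); the "crowd" is simulated
# by an oracle that returns the true label whenever a point is queried.
X = rng.uniform(-1.0, 1.0, size=500)
y = (X > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(xs, ys, steps=500, lr=0.5):
    """Logistic regression trained by plain gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = sigmoid(w * xs + b)
        w -= lr * np.mean((p - ys) * xs)
        b -= lr * np.mean(p - ys)
    return w, b

labeled = list(rng.choice(len(X), size=5, replace=False))  # random seed set
for _ in range(20):                           # spend 20 crowd queries
    w, b = fit(X[labeled], y[labeled])
    p = sigmoid(w * X + b)
    uncertainty = -np.abs(p - 0.5)            # highest where p is nearest 0.5
    uncertainty[labeled] = -np.inf            # never re-ask a labeled point
    labeled.append(int(np.argmax(uncertainty)))

w, b = fit(X[labeled], y[labeled])
accuracy = np.mean((sigmoid(w * X + b) > 0.5) == (y > 0.5))  # on the full pool
```

Because each query lands near the current decision boundary, a few dozen crowd answers pin down the classifier far more economically than labeling the pool uniformly at random.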

Genevieve Patterson is a Ph.D. Candidate in Computer Vision at Brown University. Her work on crowd-driven visual classification was recently awarded runner-up for Best Paper at the AAAI Conference on Human Computation (HCOMP). She built and maintains the SUN Attribute dataset, a widely used resource for scene understanding. Genevieve received her master's degree in Electrical Machines from the University of Tokyo, where she won the ICEMS 2009 Outstanding Paper award for her work on transverse-flux motor design. She earned bachelor's degrees in Electrical Engineering and Mathematics from the University of Arizona.

Sponsored in part by Disney Research.
