In this talk, I will describe our recent work (presented at ICCV 2017) on monocular-camera-based 3D geometry reconstruction of a non-rigid dynamic scene. We aim to answer an open question in multi-view geometry, namely, “Is it possible to recover the 3D structure of a complex dynamic environment from two image frames captured by a single moving monocular camera?” Traditional methods for solving this task either employ a stereo camera, require multiple image frames from a long video sequence, or assume a pre-segmented deformable object following a simple low-order linear shape model. In contrast, we do not make such strong assumptions; rather, we show that, under very mild conditions, monocular 3D reconstruction of a complex dynamic scene from two frames is possible. Our new method achieves state-of-the-art performance on standard benchmark datasets, including “KITTI” for autonomous driving and “Sintel”, an open-source animated movie. If time allows, at the beginning of the talk I may also present a brief introduction to other research activities in computer vision at ANU and ACRV.

Dr. Hongdong Li is a Reader/Associate Professor with the Computer Vision Group of ANU (Australian National University). He is also a Chief Investigator of ACRV (Australian Centre for Robotic Vision). His research interests include 3D vision reconstruction, visual perception for robot navigation, and mathematical optimization in computer vision. He teaches “Computer Vision” and “Robotics” courses at the ANU. During 2009-2010, as a NICTA Scientist, he worked on the “Australia Bionic Eyes” project, whose goal is to develop an artificial retina implant to help restore vision to blind and visually impaired people. Dr. Li is an Associate Editor for IEEE T-PAMI and has served on Program Committees for recent ICCV, ECCV and CVPR conferences. He was a winner of a prestigious IEEE CVPR Best Paper Award, an ICCV Marr Prize (honorable mention), an ICPR Best Paper Award and an IEEE ICIP Best Student Paper Award, alongside several other paper awards jointly with students and coauthors. He was a Program Co-Chair for ACRA 2015 (Australasian Conference on Robotics and Automation) and is a Program Co-Chair for ACCV 2018.

Sponsored in part by Oculus Research Pittsburgh.

Visual reasoning is a core capability of artificial intelligence. It is necessary for effective communication, planning, and question-answering tasks. In this talk, I discuss some recent explorations into visual reasoning for question answering, game playing and dialog. I also describe ELF, our new Extensive, Lightweight and Flexible reinforcement learning research platform for real-time strategy games.

C. Lawrence Zitnick is a research lead at Facebook AI Research, and an affiliate associate professor at the University of Washington. He is interested in a broad range of topics related to artificial intelligence including object recognition, the relation of language and imagery, and methods for gathering common sense knowledge. He developed the PhotoDNA technology used by Microsoft, Facebook, Google, and various law enforcement agencies to combat illegal imagery on the web. Before joining FAIR, he was a principal researcher at Microsoft Research. He received the PhD degree in robotics from Carnegie Mellon University.

Sponsored in part by Oculus Research Pittsburgh.


This talk tells two tales about image-classification systems, both of which are motivated by the real-world deployment of such systems.

The first tale introduces a new convolutional neural network architecture, called multi-scale DenseNets, with the ability to adapt dynamically to computational resource limits at inference time. The network uses progressively growing multi-scale convolutions, dense connectivity, and a series of classifiers at intermediate layers of the network. At inference time, it spends less computation on “easy” images, and uses the surplus computation to obtain higher accuracy on “hard” images.

The second tale introduces a practical defense method against adversarial examples. Unlike prior work that focuses on robustness via regularization, we obtain robustness via input transformation. Our defense successfully counters 60% of white-box attacks and 90% of black-box attacks by all popular methods. Moreover, our defense is difficult to attack with current methods, in particular because it is non-differentiable and randomized.

Laurens van der Maaten is a Research Scientist at Facebook AI Research in New York. Previously, he worked as an Assistant Professor at Delft University of Technology (The Netherlands) and as a post-doctoral researcher at the University of California, San Diego. He received his PhD from Tilburg University (The Netherlands) in 2009. Laurens is interested in a variety of topics in machine learning and computer vision. Specific research topics include learning embeddings for visualization, large-scale learning, visual reasoning, and cost-sensitive learning.

Robot-object interaction requires several key perceptual building blocks, including object pose estimation, object classification, and partial-object completion. These tasks form the perceptual foundation for many higher-level operations, including object manipulation and world-state estimation. Most existing approaches to these problems in the context of 3D robot perception assume an existing database of objects that the robot expects to encounter. In real-world settings, robots will inevitably be required to interact with previously unseen objects; novel approaches are required to allow for generalization across highly variable objects. We introduce a new approach, Bayesian Eigenobjects (BEOs), a novel object representation for robots designed to facilitate this generalization. BEOs allow a robot to observe a previously unseen object from a single viewpoint and jointly estimate that object's class, pose, and hidden geometry. BEOs significantly outperform competing approaches to joint classification and completion and are the first representation to enable joint estimation of class, pose, and 3D geometry.

Ben Burchfiel is a PhD Candidate at Duke University in the fields of Robotics, Computer Vision, and Machine Learning, advised by Dr. George Konidaris. His primary work lies in the area of robot perception: how robots can better interpret and reason about the world around them. Ben's research seeks to enable robots to move out of the lab and into real-world settings by allowing them to reason about novel objects using information from previous interactions with other objects. Ben received his Bachelor's degree in Computer Science from the University of Wisconsin-Madison in 2012 and his Master's degree in Computer Science from Duke University in 2015. Ben's other interests include Reinforcement Learning (and its inverse), data-driven grounded symbolic planning, and reasoning under uncertainty with sub-optimal (noisy) data.


Recent advances in computational face research make possible a growing range of scientific, behavioral, and commercial applications. Many companies are focusing on the future of computational face products and services, but a number of critical research questions remain to be solved. These include 3D face alignment from 2D images, face analysis under extreme pose variation and occlusion, and the manipulation of facial likeness and expressions in videos.

These issues are strongly connected. Most face alignment and analysis methods treat the face as a 2D object, flat like a sheet of paper. That works well provided images are frontal or nearly so and pitch and yaw remain modest. In real-world conditions, these constraints are often violated by moderate to large head rotation, and the system’s ability to track and measure facial expressions degrades.

To answer these questions, this talk will zoom in on results and implications from recent computational face challenges I co-organized: The 3D Face Alignment in the Wild challenge (3DFAW 2016 @ ECCV’16) and the 3rd Facial Expression Recognition and Analysis challenge (FERA 2017 @ FG’17).

Advances in facial expression transfer bring exciting opportunities to behavioral and commercial applications. They also introduce significant potential threats, especially in the age of social media. Altered images can create false representations and videos can amplify that effect. What's interesting is that the very same technology that enables these threats also enables their countermeasures. I will briefly touch on these topics as well.

László A. Jeni, Ph.D., is a Project Scientist at Carnegie Mellon University, Pittsburgh, PA, USA. He received his M.S. degree in Computer Science from the Eötvös Loránd University, Hungary, and his Ph.D. degree in Electrical Engineering and Information Systems from the University of Tokyo, Japan. He worked as a Senior Computer Vision Specialist at Realeyes – Emotional Intelligence before joining the Robotics Institute. His research interests are in the fields of Computer Vision and Machine Learning. He develops advanced methods of 2D and 3D automatic analysis and synthesis of facial expressions and applies those tools to research in human emotion, non-verbal communication, and assistive technology. He co-organized the 3D Face Alignment workshop in 2016 and the third Facial Expression Recognition and Analysis Challenge in 2017, and he is an Area Chair at IEEE FG 2018. His honors include best paper awards at the IEEE HSI 2011 and IEEE FG 2015 conferences.

Sponsored in part by Disney Research.

The goal of my research is to develop human-centered algorithms for intelligent and autonomous systems. The research emphasizes modeling the perception, intent, and behavior of humans inside and around a vehicle. Over a decade has passed since the DARPA Grand Challenges, and the way in which we transport people and goods has yet to radically change. That is why I work to disrupt and transform transportation systems, with the urgent goals of reducing road traffic injuries and developing assistive technologies. The future of scalable and affordable self-driving cars is excitingly near!

My goal for this talk is to analyze human activities in the context of driving, navigation, and collaboration. I will discuss vision and learning algorithms for semantic video analysis, attention and situational awareness modeling, human state and style recognition, and event anticipation. I will propose holistic multi-modal (cameras, radar, lidar, IMU/GPS), multi-cue (driver/pedestrian body pose, head, hand, foot) frameworks in order to answer two predictive, safety-related questions: What is going to happen in a scene in the near future? And which of the surrounding agents are most relevant to the navigation task? An approach for modeling and evaluating the complex interplay between maneuvering task, object attributes, scene context, intent, and future scene state will take us a step closer towards a safe, enjoyable, and personalized driving experience.

Eshed Ohn-Bar is a postdoctoral researcher in the Computer Vision and Robotics Research (CVRR) lab and the Laboratory for Intelligent and Safe Automobiles (LISA) at UCSD. Eshed is interested in machine vision and learning, with a focus on recognition and understanding of human behavior for intelligent and interactive environments. He helped organize four workshops on understanding hand activity at CVPR (2015, 2016) and the IEEE Intelligent Vehicles Symposium (2015, 2016). He received the best paper award at the workshop on Analysis and Modeling of Faces and Gestures 2013, and was a best industry-related paper finalist at ICPR 2014 and a best Piero Zamperoni student paper award finalist at ICPR 2016. Eshed received the B.S. degree in mathematics from UCLA in 2010, and the Ph.D. degree in electrical engineering from UCSD in 2016.

Sponsored in part by Disney Research.
