
We study a fundamental question in pose estimation from vision-only video data: should the pose of a camera be determined from fixed and known correspondences, or should the correspondences be estimated simultaneously with the pose?

Determining pose from fixed correspondences is known as the feature-based approach, in which well-established tools from projective geometry are used to formulate and solve a plethora of pose estimation problems. Nonetheless, in degraded imaging conditions such as low light and blur, reliably detecting and precisely localizing interest points becomes challenging.

Conversely, estimating correspondences alongside motion is known as the direct approach, in which image data are used directly to determine geometric quantities without relying on sparse interest points as an intermediate representation. Direct methods are generally more precise by virtue of redundancy, as many measurements are used to estimate a few degrees of freedom; however, they are more sensitive to changes in illumination.

In this work, we combine the strengths of feature-based approaches with the precision of direct methods. Namely, we make use of densely and sparsely evaluated local feature descriptors in a direct image alignment framework to address pose estimation in challenging conditions. Applications include tracking planar targets under sudden and drastic changes in illumination as well as visual odometry in poorly lit subterranean mines.
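To make the idea concrete, the following is a minimal sketch, not the thesis implementation, of direct alignment carried out over densely evaluated descriptor channels rather than raw intensities. It assumes a translation-only warp and a fixed descriptor stack passed in as an H x W x C array; all names are our own.

    import numpy as np
    from scipy.ndimage import shift as warp_shift

    def align_translation(ref_desc, cur_desc, num_iters=30):
        """Gauss-Newton (inverse-compositional style) alignment of cur_desc
        onto ref_desc under a 2D translation. Both are H x W x C arrays of
        densely evaluated local feature descriptors."""
        # Per-channel spatial gradients of the reference, computed once.
        gy, gx = np.gradient(ref_desc, axis=(0, 1))
        J = np.stack([gx.ravel(), gy.ravel()], axis=1)   # (H*W*C) x 2 Jacobian
        H_gn = J.T @ J                                   # 2 x 2 normal-equations matrix
        p = np.zeros(2)                                  # current (dx, dy) estimate
        for _ in range(num_iters):
            # Warp the current descriptor image by the current translation.
            warped = warp_shift(cur_desc, shift=(-p[1], -p[0], 0), order=1)
            r = (warped - ref_desc).ravel()              # descriptor residual
            dp = np.linalg.solve(H_gn, J.T @ r)
            p -= dp                                      # inverse-compositional update
            if np.linalg.norm(dp) < 1e-4:
                break
        return p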

Motivated by the success of the approach, we introduce a novel formulation for the joint refinement of pose and structure across multiple views, akin to feature-based bundle adjustment (BA). In contrast to BA, which minimizes the reprojection error over known correspondences, our formulation refines initial estimates such that the photometric consistency of their image projections is maximized, without the need for correspondences. The technique is evaluated on a range of datasets and is shown to improve upon the accuracy of the current state of the art in vision-based simultaneous localization and mapping (VSLAM).
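As a rough illustration in our own notation (not necessarily the thesis's), the contrast can be written as follows: feature-based BA minimizes reprojection error against detected feature locations, whereas the photometric refinement compares image values sampled at the projections themselves,

    \[
    \min_{\{T_i\},\{X_j\}} \sum_{i,j} \bigl\| u_{ij} - \pi(T_i X_j) \bigr\|^2
    \qquad \text{(feature-based BA)}
    \]
    \[
    \min_{\{T_i\},\{X_j\}} \sum_{i,j} \bigl\| I_i\bigl(\pi(T_i X_j)\bigr) - I_r\bigl(\pi(T_r X_j)\bigr) \bigr\|^2
    \qquad \text{(photometric refinement)}
    \]

where the $T_i$ are camera poses, the $X_j$ scene points, $\pi$ the projection function, $I_i$ the images, and $r$ a reference view. The second objective requires no correspondences $u_{ij}$.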

Thesis Committee:
Brett Browning (Co-chair)
Simon Lucey (Co-chair)
Michael Kaess
Martial Hebert
Ian D. Reid (The University of Adelaide)

Perception and state estimation are critical robot competencies that remain difficult to harden and generalize. This is due in part to the complexity of modern perception systems, which commonly comprise dozens of components with hundreds of parameters overall. Selecting a configuration of parameters relies on a human's understanding of the parameters' interaction with the environment and the robot's behavior, which we refer to as the "context." Furthermore, evaluating the performance of the system entails multiple empirical trials, which often poorly predict the generality of the system.

We depart from the conventional wisdom that perception systems must generalize to be successful and instead suggest that a perception system need only do well in situations it encounters over the course of its deployment. This thesis proposes that greater overall perceptual generality can be achieved by designing perception systems that adapt to their local contexts by re-selecting perception system parameters. Towards this end, we have completed work on improving stochastic model fidelity and discuss our proposed work on applying reinforcement learning techniques to learn parameter selection policies from perceptual experience.

Thesis Committee:
George Kantor (Chair)
Sebastian Scherer
Katharina Muelling
Ingmar Posner (University of Oxford)

Copy of Proposal Document

We aim to build robots that can interact with people in public spaces. Such a robot receives a variety of sounds, including ambient noise and users' voices. In this talk I will present a machine learning-based method to estimate response obligation, i.e., whether or not the robot needs to respond to each input sound. This enables the robot to reject not only noise but also monologues and user utterances directed toward other users. Our method uses not only acoustic information but also users' motions and postures during a sound segment as features. In addition, user behaviors after a sound segment are taken into account to exploit typical user behaviors in human-robot interaction; for example, a user often stands still when he or she speaks to a robot. Experimental results showed that our proposed model significantly outperformed a baseline. We found that user behaviors both during and after sound segments are helpful for estimating response obligation.
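As a minimal sketch of how such a classifier could be assembled (our own illustration; the model and feature extraction used in the talk may differ), one might concatenate acoustic features with behavior features observed during and after the segment and train an off-the-shelf classifier:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def segment_features(acoustic, behavior_during, behavior_after):
        """Concatenate acoustic features with user motion/posture features
        observed during and after the sound segment."""
        return np.concatenate([acoustic, behavior_during, behavior_after])

    clf = LogisticRegression(max_iter=1000)
    # clf.fit(X_train, y_train)          # one row per labeled sound segment
    # respond = clf.predict([segment_features(acoustic, during, after)])[0] == 1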

Mikio Nakano is a principal researcher at Honda Research Institute Japan Co., Ltd. (HRI-JP). He received his M.S. degree in Coordinated Sciences and Sc.D. degree in Information Science from the University of Tokyo in 1990 and 1998, respectively. From 1990 to 2004, he worked for NTT (Nippon Telegraph and Telephone Corporation). He was a visiting scientist at the MIT Laboratory for Computer Science from 2000 to 2002. He joined HRI-JP in 2004. He has been studying various types of dialogue systems, including conversational robots and text-based chatbots.

He was a science advisory committee member of SIGDIAL from 2007 to 2012. He also served as a general chair for SIGDIAL 2010 and an area chair for ACL 2012. He was a visiting professor at Waseda University from 2011 to 2016.

Honda Research Institute Japan is an affiliated company of Honda Motor Company and, like its sister companies Honda Research Institute USA and Honda Research Institute Europe, is dedicated to fundamental research. HRI-JP focuses on "the intelligence supporting human and machine" by conducting research that takes a unique approach not bounded by conventional concepts. Current research areas of HRI-JP include dialogue systems, robot audition, psychology of vision, and machine learning.

Given a single image of a scene, humans have few issues answering questions about its 3D structure, like "is this facing upwards?", even though mathematically speaking this should be impossible. We similarly have few issues accounting for this 3D structure when answering viewpoint-independent questions like "is this the same carpet as the one in your office?", even if the carpets were seen from different viewpoints and have no pixels in common.

At the heart of the issue is that images are the result of two phenomena: the underlying 3D shape, which we call the 3D structure, and canonical texture that is applied to this shape, which we call the style. In the 3D world, these phenomena are distinct, but when we observe the world, they become mixed. Although the identity of both structure and style gets lost in the process, if we know about regularities in both phenomena, we can narrow down the possible combinations that could have produced our image.

This dissertation aims to better enable computers to understand images in a 3D way by factoring the image into 3D structure and style. The key is that we can take advantage of regularities in both phenomena to inform our interpretation. For instance, we do not expect carpet texture on ceilings or 75-degree angles between walls. By using such regularities, especially ones discovered from large-scale data, we can winnow down the possible combinations of 3D structure and style that could have produced our image.

Thesis Committee:
Abhinav Gupta (Co-Chair)
Martial Hebert (Co-Chair)
Deva Ramanan
William T. Freeman (Massachusetts Institute of Technology)
Andrew Zisserman (University of Oxford)

Copy of Thesis Document

Boris Sofman
As an engineer and researcher with experience in building diverse robotic systems - from consumer products to off-road autonomous vehicles and bomb-disposal robots - Boris is making it his life’s work to create products that people would not expect to be possible. He earned a B.S., M.S. and Ph.D. from the Robotics Institute of Carnegie Mellon University. Boris is an avid tennis player, but finds that Anki doesn’t allow him to play nearly as often as he’d like.

Hanns Tappeiner
Hanns is passionate about creating products he always wanted but that didn't exist. He has worked extensively at refining the connections between operator and robot, developing deeper senses of feel and control. Hanns has designed robots across the globe for companies in Germany, Italy, Austria, and the US. He studied at the University of Technology in Vienna before earning an M.S. in Robotics from the Robotics Institute at Carnegie Mellon University. On weekends, Hanns is probably working or outdoors somewhere with his motorcycle.

Faculty Host: Martial Hebert

Robots still struggle with everyday manipulation tasks. An overriding problem in robotic manipulation is uncertainty in the robot's state and calibration parameters; small amounts of uncertainty can lead to complete task failure. This thesis explores ways of tracking and calibrating noisy robot arms using computer vision, with an aim toward making them more robust. We consider three systems of increasing complexity: a noisy robot arm tracked by an external depth camera, a noisy arm that localizes itself using a hand-mounted depth sensor looking at an unstructured world, and a noisy arm with only a single hand-mounted monocular RGB camera that estimates its state while simultaneously calibrating its camera extrinsics. Using techniques from dense object tracking, fully dense SLAM, and sparse general SLAM, we are able to automatically track the robot and extract its calibration parameters. We also provide analysis linking these problems together, while exploring the fundamental limitations of SLAM-based approaches for calibrating robot arms.
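As a rough sketch of the kind of joint estimation involved (our own simplification, not the thesis pipeline), one can refine per-joint angle offsets together with the hand-camera extrinsics by minimizing the reprojection error of known points on the robot. Here forward_kinematics(), pose_to_matrix(), and project() are hypothetical placeholders.

    import numpy as np
    from scipy.optimize import least_squares

    def residuals(params, joint_angles, observed_uv, model_points, K):
        """Reprojection residuals for joint-angle offsets plus camera extrinsics."""
        n = joint_angles.shape[1]
        offsets, extrinsics = params[:n], params[n:]    # per-joint biases, 6-dof pose
        errs = []
        for q, uv in zip(joint_angles, observed_uv):
            # Base-to-camera pose: forward kinematics composed with hand-camera extrinsics.
            T_base_cam = forward_kinematics(q + offsets) @ pose_to_matrix(extrinsics)
            errs.append(project(K, np.linalg.inv(T_base_cam), model_points) - uv)
        return np.concatenate([e.ravel() for e in errs])

    # sol = least_squares(residuals, x0, args=(joint_angles, observed_uv, model_points, K))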

Thesis Committee:
Siddhartha Srinivasa (Co-chair)
Michael Kaess (Co-chair)
George Kantor
Andrew Davison (Imperial College London)

Copy of Draft Document

Loose, granular terrain can cause rovers to slip and sink, inhibiting mobility and sometimes even permanently entrapping a vehicle. Traversability of granular terrain is difficult to foresee using traditional, non-contact sensing methods, such as cameras and LIDAR. This inability to detect loose terrain hazards has caused significant delays for rovers on both the Moon and Mars and, most notably, contributed to Spirit's permanent entrapment in soft sand on Mars. These delays are caused both by slipping in unidentified loose sand and by wasting time analyzing or completely circumventing benign sand. Reliable prediction of terrain traversability would greatly improve both the safety and the operational speed of planetary rover operations. This thesis leverages thermal inertia measurements and physics-based terramechanics models to develop algorithms for slip prediction in planetary granular terrain.

The ability of a rover to traverse granular terrain is a complex function of the geometry of the terrain, the rover's configuration, and the physical properties of the granular material, such as density and particle geometry. Vision-based traversability prediction methods are inherently limited: subsurface characteristics are not exclusively correlated with the visual appearance of the surface layer, so vision alone does not provide enough information to fully understand all the physical properties that influence mobility. The inherent difficulty of estimating traversability is compounded by the conservative nature of planetary rover operations. Mission operators actively avoid potentially hazardous regions, which makes strictly data-driven regression approaches difficult due to limited data.

Pre-proposal research has shown that thermal inertia is correlated with, and improves estimates of, traversability. This has been demonstrated both in terrestrial experiments and with data from the Curiosity rover. Unlike visual appearance, the thermal properties of a material are influenced not only by the terrain surface but also by the physical properties of the underlying material. This thesis develops techniques for predicting the traversability of terrain by leveraging thermal inertia measurements to provide a greater understanding of material properties both at and below the surface.

The proposed research will develop computationally efficient traversability prediction techniques. Thermal inertia and geometric features, such as the angle of repose, will be used to estimate granular terrain properties. Surface geometry and soil parameters will then be used as inputs to a learning-based slip prediction algorithm. The algorithm will be trained on both in-situ and synthetic data to reduce overfitting and increase prediction accuracy. Synthetic data will be generated using state-of-the-art terramechanics simulators that produce accurate slip estimates given known terrain properties but are too computationally expensive to be used for tactical rover planning. Evaluation will occur on data from the Mars rovers, and results will be compared to vision-only methods in order to understand in which situations the addition of thermal inertia improves traversability prediction.
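A minimal sketch of the learning component under stated assumptions (the model choice, feature set, and placeholder data below are ours, not the proposal's): a regressor trained on pooled in-situ and simulator-generated samples, mapping thermal inertia and geometric features to a slip ratio.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    # Placeholder feature rows [thermal_inertia, slope_deg, angle_of_repose_deg];
    # in practice these would come from rover telemetry and terramechanics simulation.
    X_insitu = rng.uniform([150, 0, 30], [400, 25, 40], size=(50, 3))
    X_synth = rng.uniform([150, 0, 30], [400, 25, 40], size=(500, 3))
    y_insitu = rng.uniform(0, 1, size=50)     # slip ratio in [0, 1] (placeholder)
    y_synth = rng.uniform(0, 1, size=500)

    model = GradientBoostingRegressor()
    model.fit(np.vstack([X_insitu, X_synth]), np.concatenate([y_insitu, y_synth]))
    predicted_slip = model.predict(X_insitu[:1])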

Thesis Committee:
William "Red" Whittaker (Chair)
David Wettergreen
Steven Nuske
Issa Nesnas (Jet Propulsion Laboratory)

The last decade has seen remarkable advances in 3D perception for robotics. Advances in range sensing and SLAM now allow robots to easily acquire detailed 3D maps of their environment in real-time.

However, adaptive robot behavior requires an understanding of the environment that goes beyond pure geometry. A step above purely geometric maps are so-called semantic maps, which incorporate task-oriented semantic labels in addition to 3D geometry; in other words, a map of what is where. This is a straightforward representation that allows robots to use semantic labels for navigation and exploration planning.
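As a toy illustration of "what is where" (our own, not from the proposal): each map cell carries both occupancy and a distribution over task-relevant labels.

    from dataclasses import dataclass, field

    @dataclass
    class SemanticVoxel:
        occupancy: float = 0.5                            # probability the voxel is occupied
        label_probs: dict = field(default_factory=dict)   # e.g. {"trail": 0.7, "grass": 0.3}

    semantic_map = {(10, 4, 1): SemanticVoxel(0.9, {"obstacle": 0.8, "vegetation": 0.2})}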

In this proposal we develop learning-based approaches for semantic mapping with image and range sensors. We make three main contributions.

In our first contribution, which is completed work, we developed VoxNet, a system for accurate and efficient semantic classification of 3D point cloud data. The key novelty in this system is the integration of volumetric occupancy maps with 3D Convolutional Neural Networks (CNNs). The system showed state-of-the-art performance in 3D object recognition and helicopter landing zone detection.
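A minimal sketch in the spirit of this design (layer sizes here are approximations, not necessarily the published VoxNet architecture): a small 3D CNN classifying a 32x32x32 occupancy grid.

    import torch
    import torch.nn as nn

    class TinyVoxNet(nn.Module):
        def __init__(self, num_classes):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(1, 32, kernel_size=5, stride=2), nn.ReLU(),   # 32^3 -> 14^3
                nn.Conv3d(32, 32, kernel_size=3), nn.ReLU(),            # 14^3 -> 12^3
                nn.MaxPool3d(2),                                        # 12^3 -> 6^3
            )
            self.classifier = nn.Sequential(
                nn.Flatten(), nn.Linear(32 * 6 * 6 * 6, 128), nn.ReLU(),
                nn.Linear(128, num_classes),
            )

        def forward(self, occupancy):          # occupancy: (B, 1, 32, 32, 32)
            return self.classifier(self.features(occupancy))

    logits = TinyVoxNet(num_classes=10)(torch.zeros(1, 1, 32, 32, 32))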

In our second contribution, motivated by the complementary information in image and point cloud data, we propose a CNN architecture fusing both modalities. The architecture consists of two interconnected streams: a volumetric CNN stream for the point cloud data, and a more traditional 2D CNN stream for the image data. We will evaluate this architecture for the tasks of terrain classification and obstacle detection in an autonomous All Terrain Vehicle (ATV).
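A compact sketch of the two-stream idea under our own assumptions (the proposed architecture may differ): a volumetric stream and an image stream, each pooled to a feature vector and concatenated before a shared classifier head.

    import torch
    import torch.nn as nn

    class TwoStreamFusion(nn.Module):
        def __init__(self, num_classes):
            super().__init__()
            # Volumetric stream over occupancy voxels.
            self.vox_stream = nn.Sequential(
                nn.Conv3d(1, 16, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1), nn.Flatten())
            # 2D stream over the co-registered image patch.
            self.img_stream = nn.Sequential(
                nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.head = nn.Linear(16 + 16, num_classes)

        def forward(self, voxels, image):
            fused = torch.cat([self.vox_stream(voxels), self.img_stream(image)], dim=1)
            return self.head(fused)

    out = TwoStreamFusion(5)(torch.zeros(1, 1, 32, 32, 32), torch.zeros(1, 3, 64, 64))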

In the final contribution, we propose a semantic mapping system for intelligent information gathering on Micro Aerial Vehicles (MAVs). In pursuit of a lightweight solution, we forgo active range sensing and use monocular imagery as our main data source. This leads to various challenges, as we now must infer where as well as what. We outline our plan to solve these challenges using monocular cues, inertial sensing, and other information available to the vehicle.

Thesis Committee:
Sebastian Scherer (Chair)
Martial Hebert
Abhinav Gupta
Raquel Urtasun (University of Toronto)

Copy of Proposal Document
