Robotics Thesis Proposal
- Gates Hillman Centers
- Traffic21 Classroom 6501
- CHEN-HSUAN LIN
- Ph.D. Student
- Robotics Institute
- Carnegie Mellon University
Learning Dense 3D Object Reconstruction without Geometric Supervision
Geometric alignment across visual data has long been a fundamental issue for effective and efficient computer vision algorithms. Pixel correspondences established between images allow the underlying 3D geometry to be inferred indirectly, whether physically or semantically. While this forms the foundation of classical multi-view 3D reconstruction algorithms such as Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM), 3D reconstruction from a single image remains an ill-posed problem. Recent learning-based methods (especially those using deep learning) have shown remarkable advances by supervising directly with ground-truth 3D geometry; however, they have relied mostly on datasets in which crafted 3D models are manually paired and aligned with images, which are difficult to come by and do not scale.
In this thesis work, we discuss the general importance of factoring geometric information out of image datasets and focus on the problem of reconstructing dense 3D object shapes from 2D image observations. We build up the theoretical foundations of learning-based geometric alignment, and we show how it can encourage invariant representations for discriminative tasks as well as match realistic geometric configurations in generative applications. We also extend it to dense 3D reconstruction: motivated by 2D image alignment, we show how 3D objects can be densely reconstructed indirectly by optimizing for 2D pixel correspondences.
Finally, we propose a framework for self-supervised dense 3D object reconstruction from 2D images, where a space of dense 3D shapes can be learned from image data without ground-truth 3D data or camera viewpoint information. We propose to learn it through an adversarial objective, which automatically disentangles object identity from camera viewpoint. We also propose to analyze the learned 3D-reconstructive representations, which are expected to be viewpoint-invariant, and discuss the potential of such an approach to make downstream visual recognition more effective and efficient.
Simon Lucey, Chair
Andrea Vedaldi (University of Oxford)