In this talk, I will discuss the task of inferring 3D structure underlying an image, in particular focusing on two questions - a) how we can plausibly obtain supervisory signal for this task, and b) what forms of representation should we pursue. I will first show that we can leverage image-based supervision to learn single-view 3D prediction, by using geometry as a bridge between the learning systems and the available indirect supervision. We will see that this approach enables learning 3D structure across diverse setups e.g. predicting deformable models or volumetric 3D for objects, or inferring layered-depth images for scenes. I will then advocate the case for inferring interpretable and compositional 3D representations. I will present a method that discovers the coherent compositional structure across objects in a unsupervised manner by attempting to assemble shapes using volumetric primitives, and then demonstrate the advantages of predicting similar factored 3D representations for complex scenes.
Shubham Tulsiani is a graduate student in the Computer Vision group at UC, Berkeley where he is advised by Prof. Jitendra Malik. His research interests lie at the intersection of recognition, pose estimation and 3D reconstruction from a single image. His work is supported by a Berkeley Graduate Fellowship.