Teaser image.
Models trained on our automatically generated data from time-lapse imagery can reliably estimate amodal 2D bounding boxes and segmentation, as well as 3D shape and pose, despite the complex occlusions present in the input image.


Current methods for 2D and 3D object understanding struggle with severe occlusions in busy urban environments, partly due to the lack of large-scale labeled ground-truth annotations for learning occlusion. In this work, we introduce a novel framework for automatically generating a large, realistic dataset of dynamic objects under occlusions using freely available time-lapse imagery. By leveraging off-the-shelf 2D (bounding box, segmentation, keypoint) and 3D (pose, shape) predictions as pseudo-groundtruth, unoccluded 3D objects are identified automatically and composited into the background in a clip-art style, ensuring realistic appearances and physically accurate occlusion configurations. The resulting clip-art images with pseudo-groundtruth enable efficient training of object reconstruction methods that are robust to occlusions. Our method demonstrates significant improvements in both 2D and 3D reconstruction, particularly in scenarios with heavily occluded objects such as vehicles and people in urban scenes.

Supplementary Video


For an in-depth description of WALT3D, please refer to our paper.

WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects under Occlusion

Khiem Vuong, N. Dinesh Reddy, Robert Tamburo, Srinivasa G. Narasimhan
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2024.

Clip-Art Data Generation with 3D-based Compositing

Given a time-lapse video, we automatically generate 2D/3D training data under severe occlusions. We first detect every object in the video and identify those that are unoccluded (fully visible). Each unoccluded object is then reconstructed in 3D using the ground plane and camera parameters. Using the 3D pose, unoccluded objects are composited back into the image at their original locations (i.e., clip-art style) in a geometrically consistent manner, ensuring physically accurate and realistic occlusion configurations. The composited image and its pseudo-groundtruth from off-the-shelf methods (e.g., segmentation, keypoints, shapes) are used to train a model that produces accurate 2D/3D object reconstructions under severe occlusions.
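As a rough illustration of how a ground plane and camera parameters can place a detected object in 3D, the sketch below intersects the camera ray through an object's ground-contact pixel with the ground plane. It is not the WALT3D implementation: the function name `backproject_to_ground`, the pinhole camera model, and the camera-frame convention (x right, y down, z forward) are all assumptions made for this example.

```python
from typing import Sequence, Tuple


def backproject_to_ground(
    pixel: Tuple[float, float],
    focal: Tuple[float, float],
    principal: Tuple[float, float],
    plane_normal: Sequence[float],
    plane_offset: float,
) -> Tuple[float, float, float]:
    """Intersect the camera ray through `pixel` with the plane n.X + d = 0.

    Hypothetical helper: every quantity is expressed in the camera frame
    (x right, y down, z forward) and the camera is an ideal pinhole.
    """
    u, v = pixel
    fx, fy = focal
    cx, cy = principal
    # Direction of the ray through the pixel, normalized so that z = 1.
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    denom = sum(n * r for n, r in zip(plane_normal, ray))
    if abs(denom) < 1e-9:
        raise ValueError("ray is parallel to the ground plane")
    # Solve n.(t * ray) + d = 0 for the ray parameter t.
    t = -plane_offset / denom
    if t <= 0:
        raise ValueError("intersection lies behind the camera")
    return (t * ray[0], t * ray[1], t * ray[2])


# Camera 2 m above a flat ground: the plane is y = 2, i.e. n = (0, 1, 0), d = -2.
point = backproject_to_ground(
    pixel=(320.0, 340.0),
    focal=(500.0, 500.0),
    principal=(320.0, 240.0),
    plane_normal=(0.0, 1.0, 0.0),
    plane_offset=-2.0,
)
```

With these assumed values, the contact pixel lies 100 pixels below the principal point, so the recovered point sits on the ground roughly 10 m in front of the camera. The z coordinate of such a point is one way to obtain a per-object depth for ordering objects during compositing.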

Another example at a different location:

Comparison with 2D-based Compositing

Our 3D-based compositing method generates realistic and geometrically accurate occlusion configurations. In contrast, the 2D-based method can produce physically infeasible results (e.g., cars and people overlapping in ways that could not occur in the real scene).
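One way to see why 3D information matters is to composite objects in order of their distance from the camera, so that nearer objects always hide farther ones. The sketch below is a simplified illustration, not the authors' code: masks are plain sets of pixel coordinates, and `ObjectLayer`, `composite_by_depth`, `occlusion_ratio`, and `rect` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple

Pixel = Tuple[int, int]


@dataclass(frozen=True)
class ObjectLayer:
    name: str
    depth: float                    # assumed distance from the camera, in meters
    amodal_mask: FrozenSet[Pixel]   # full object extent, including hidden parts


def rect(x0: int, x1: int, y0: int, y1: int) -> FrozenSet[Pixel]:
    """Axis-aligned rectangular mask covering x0..x1 and y0..y1 inclusive."""
    return frozenset((x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1))


def composite_by_depth(layers: List[ObjectLayer]) -> Dict[str, FrozenSet[Pixel]]:
    """Return each object's visible mask after near-to-far compositing."""
    occupied: Set[Pixel] = set()
    visible: Dict[str, FrozenSet[Pixel]] = {}
    for layer in sorted(layers, key=lambda item: item.depth):
        # Pixels already claimed by a nearer object are hidden for this one.
        visible[layer.name] = frozenset(layer.amodal_mask - occupied)
        occupied |= layer.amodal_mask
    return visible


def occlusion_ratio(amodal: FrozenSet[Pixel], visible: FrozenSet[Pixel]) -> float:
    """Fraction of the amodal mask that is hidden by nearer objects."""
    return 1.0 - len(visible) / len(amodal) if amodal else 0.0


car = ObjectLayer("car", depth=12.0, amodal_mask=rect(0, 3, 0, 1))
person = ObjectLayer("person", depth=8.0, amodal_mask=rect(2, 5, 0, 1))
visible_masks = composite_by_depth([car, person])
car_occlusion = occlusion_ratio(car.amodal_mask, visible_masks["car"])
```

Here the person stands closer to the camera than the car, so the person stays fully visible and half of the car's amodal mask is hidden. A purely 2D paste has no depth to sort by, which is how it can end up drawing a farther object over a nearer one.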


Qualitative Results

Our method produces accurate amodal segmentation and keypoints, as well as 3D poses and shapes, across diverse poses and occlusion configurations.

Vehicle-People Occlusion

Vehicle-Vehicle Occlusion

People-People Occlusion

Potential Societal Impact

We do not perform any human subjects research using data from these cameras.


This work was supported in part by NSF grant CNS-2038612, US DOT grant 69A3551747111 through the Mobility21 UTC, and US DOT grants 69A3552344811 and 69A3552348316 through the Safety21 UTC.


If you have any questions, please feel free to contact Khiem Vuong.

Last Modified: April 2nd, 2024