TesseTrack: End-to-End Learnable Multi-Person Articulated 3D Pose Tracking


We consider the task of 3D pose estimation and tracking of multiple people seen in an arbitrary number of camera feeds. We propose TesseTrack, a novel top-down approach that simultaneously reasons about multiple individuals’ 3D body joint reconstructions and associations in space and time in a single end-to-end learnable framework. At the core of our approach is a novel spatio-temporal formulation that operates in a common voxelized feature space aggregated from single- or multiple camera views. After a person detection step, a 4D CNN produces short-term person-specific representations which are then linked across time by a differentiable matcher. The linked descriptions are then merged and deconvolved into 3D poses. This joint spatio-temporal formulation contrasts with previous piece-wise strategies that treat 2D pose estimation, 2D-to-3D lifting, and 3D pose tracking as independent sub-problems that are error-prone when solved in isolation. Furthermore, unlike previous methods, TesseTrack is robust to changes in the number of camera views and achieves very good results even if a single view is available at inference time. Quantitative evaluation of 3D pose reconstruction accuracy on standard benchmarks shows significant improvements over the state of the art. Evaluation of multi-person articulated 3D pose tracking in our novel evaluation framework demonstrates the superiority of TesseTrack over strong baselines.
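To make the idea of a differentiable matcher concrete, here is a minimal sketch of one common choice, Sinkhorn normalization of a descriptor-similarity matrix. This is an assumption for illustration only: the text above does not say which matcher TesseTrack uses, and the function names, the cosine similarity, and the `temperature` parameter are all hypothetical. The sketch uses plain NumPy with no gradient tracking. It also assumes the same number of people in both time windows, so it does not handle tracks that appear or disappear.

```python
import numpy as np

def _logsumexp(x, axis):
    # Numerically stable log-sum-exp along one axis (keeps dimensions).
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def sinkhorn_match(desc_prev, desc_next, n_iters=50, temperature=0.1):
    """Soft assignment between two equally sized sets of person descriptors.

    desc_prev, desc_next: arrays of shape (num_people, dim).
    Returns a (num_people, num_people) matrix whose entry [i, j] is the soft
    probability that person i in the earlier window is person j in the later one.
    """
    # Cosine similarity between every pair of descriptors (illustrative choice).
    a = desc_prev / np.linalg.norm(desc_prev, axis=1, keepdims=True)
    b = desc_next / np.linalg.norm(desc_next, axis=1, keepdims=True)
    log_p = (a @ b.T) / temperature

    # Alternate row and column normalization in log space. Every step is
    # smooth, so the same computation is differentiable in an autograd framework.
    for _ in range(n_iters):
        log_p = log_p - _logsumexp(log_p, axis=1)
        log_p = log_p - _logsumexp(log_p, axis=0)
    return np.exp(log_p)
```

Taking the arg-max of each column of the returned matrix gives a hard association at inference time, while the soft matrix itself can carry a training loss.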


The complete TesseTrack pipeline. First, the video feeds from multiple cameras are passed through a shared HRNet to compute the features required for detection and 3D pose tracking. The output of the final HRNet layer is passed through a 3D convolution to regress the centers of the 3D bounding boxes of the people in the scene. Each detection hypothesis is combined with the final-layer HRNet features to create a spatio-temporal tube called a tesseract. A learnable 3D tracking framework then associates people over time using spatio-temporal person descriptors. Finally, the associated descriptors are passed through deconvolution layers to infer the 3D poses. Note that the framework is end-to-end trainable except for the NMS layer in the detection network.
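The step that lifts per-view 2D features into a common voxelized space can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `aggregate_voxel_features`, the nearest-neighbor sampling, and the plain averaging across views are hypothetical choices made for brevity. Each camera is assumed to be described by a 3x4 projection matrix.

```python
import numpy as np

def aggregate_voxel_features(feature_maps, projections, voxel_centers):
    """Average per-view 2D features at the projections of 3D voxel centers.

    feature_maps:  list of (C, H, W) arrays, one per camera view.
    projections:   list of (3, 4) camera projection matrices (assumed known).
    voxel_centers: (N, 3) array of voxel centers in world coordinates.
    Returns an (N, C) array; voxels seen by no camera get zeros.
    """
    n = voxel_centers.shape[0]
    c = feature_maps[0].shape[0]
    accum = np.zeros((n, c))
    count = np.zeros((n, 1))
    homog = np.hstack([voxel_centers, np.ones((n, 1))])  # (N, 4)

    for feat, proj in zip(feature_maps, projections):
        _, h, w = feat.shape
        uvw = homog @ proj.T                      # (N, 3) homogeneous pixels
        depth = uvw[:, 2]
        in_front = depth > 1e-6                   # discard points behind the camera
        safe_depth = np.where(in_front, depth, 1.0)
        u = np.rint(uvw[:, 0] / safe_depth).astype(int)
        v = np.rint(uvw[:, 1] / safe_depth).astype(int)
        visible = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)

        # Nearest-neighbor lookup; bilinear sampling would be smoother.
        accum[visible] += feat[:, v[visible], u[visible]].T
        count[visible] += 1

    # Average over however many views saw each voxel (one view also works).
    return accum / np.maximum(count, 1)
```

Because each voxel is averaged only over the views that actually observe it, the same routine accepts one camera or many, which mirrors the robustness to the number of views described above.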



We show results on several datasets to demonstrate the robustness of the algorithm.

Haggling Sequence of Panoptic Dataset

We show 3D pose tracking on the Haggling sequence of the Panoptic dataset.

Pizza Sequence of Panoptic Dataset

We show 3D pose tracking on the Pizza sequence of the Panoptic dataset.

Monocular results on Panoptic dataset

We also show that TesseTrack is robust with monocular input.

Results on Tagging Sequence

We show robustness to view changes. Note that tracking remains accurate even though the input videos are captured by handheld cameras.

More Details

For an in-depth description of TesseTrack, please refer to our paper and the accompanying video.

"TesseTrack: End-to-End Learnable Multi-Person Articulated 3D Pose Tracking",

N. Dinesh Reddy, Laurent Guigues, Leonid Pishchulin, Jayan Eledath, and Srinivasa Narasimhan
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2021.