Recurrent Network Models for Human Dynamics

Katerina Fragkiadaki   Sergey Levine   Panna Felsen  Jitendra Malik
EECS, UC Berkeley

Abstract We propose the Encoder-Recurrent-Decoder (ERD) model for recognition and prediction of human body pose in videos and motion capture. The ERD model is a recurrent neural network that incorporates nonlinear encoder and decoder networks before and after recurrent layers. We test instantiations of ERD architectures on the tasks of motion capture (mocap) generation, body pose labeling, and body pose forecasting in videos. Our model handles mocap training data across multiple subjects and activity domains, and synthesizes novel motions while avoiding drift over long time horizons. For human pose labeling, ERD outperforms a per-frame body part detector by resolving left-right body part confusions. For video pose forecasting, ERD predicts body joint displacements across a temporal horizon of 400ms and outperforms a first-order motion model based on optical flow. ERDs extend previous Long Short-Term Memory (LSTM) models in the literature to jointly learn representations and their dynamics. Our experiments show that such representation learning is crucial for both labeling and prediction in space-time. We find this to be a distinguishing feature of the spatio-temporal visual domain in comparison to 1D text, speech, or handwriting, where straightforward hard-coded representations have shown excellent results when directly combined with recurrent units.
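To make the encoder-recurrent-decoder structure concrete, the following is a minimal NumPy sketch of one ERD forward pass. All layer sizes and weight initializations are illustrative assumptions, not the paper's; the recurrent unit here is a plain tanh RNN cell standing in for the LSTM layers used in the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not from the paper): 54-D pose input,
# 100-unit encoder/decoder hidden layers, 128-unit recurrent state.
D_IN, D_ENC, D_REC = 54, 100, 128

# Randomly initialized weights stand in for trained parameters.
W_enc1 = rng.standard_normal((D_IN, D_ENC)) * 0.1
W_enc2 = rng.standard_normal((D_ENC, D_REC)) * 0.1
W_xh = rng.standard_normal((D_REC, D_REC)) * 0.1
W_hh = rng.standard_normal((D_REC, D_REC)) * 0.1
W_dec1 = rng.standard_normal((D_REC, D_ENC)) * 0.1
W_dec2 = rng.standard_normal((D_ENC, D_IN)) * 0.1

def erd_step(x_t, h_prev):
    """One time step: encode the frame, update the recurrent state, decode."""
    e = np.tanh(np.tanh(x_t @ W_enc1) @ W_enc2)  # nonlinear encoder network
    h = np.tanh(e @ W_xh + h_prev @ W_hh)        # recurrent layer (LSTM in the paper)
    y = np.tanh(h @ W_dec1) @ W_dec2             # nonlinear decoder network
    return y, h

# Run a 10-frame rollout, feeding each prediction back as the next input,
# as in mocap generation / forecasting.
h = np.zeros(D_REC)
x = rng.standard_normal(D_IN)
for _ in range(10):
    x, h = erd_step(x, h)
```

The key design point the sketch illustrates is that the encoder and decoder are learned jointly with the recurrent dynamics, rather than feeding a fixed hand-coded representation into the recurrent layers.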

The figure shows the ERD architectures used for motion synthesis, kinematic tracking, and kinematic forecasting.

The video above shows comparisons in motion completion between our mocap ERD (2nd column), a 3-layer LSTM (3rd column), CRBMs (4th column), nearest-neighbor NGRAM (5th column), and Gaussian process dynamical models (6th column). Ground truth is shown in the 1st column.


Last update: Sept, 2015.