Bottom-Up and Top-Down Reasoning with Hierarchical Rectified Gaussians

Peiyun Hu, Deva Ramanan

CVPR 2016 (Spotlight Presentation)

Download Paper

Abstract

Convolutional neural nets (CNNs) have demonstrated remarkable performance in recent years. Such approaches tend to work in a “unidirectional” bottom-up feed-forward fashion. However, practical experience and biological evidence tell us that feedback plays a crucial role, particularly for detailed spatial understanding tasks. This work explores “bidirectional” architectures that also reason with top-down feedback: neural units are influenced by both lower- and higher-level units.

We do so by treating neural units as rectified latent variables in a quadratic energy function, which can be seen as hierarchical Rectified Gaussian models (RGs). We show that RGs can be optimized with a quadratic program (QP), which can in turn be optimized with a recurrent neural network (with rectified linear units). This allows RGs to be trained with GPU-optimized gradient descent. From a theoretical perspective, RGs help establish a connection between CNNs and hierarchical probabilistic models. From a practical perspective, RGs are well suited for detailed spatial tasks that benefit from top-down reasoning. We illustrate them on the challenging task of keypoint localization under occlusion, where local bottom-up evidence may be misleading. We demonstrate state-of-the-art results on challenging benchmarks.
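To make this concrete, below is a minimal sketch (not our released code) of bidirectional inference in a two-layer rectified Gaussian with dense weight matrices W1, W2 and biases b1, b2; all names are hypothetical. Coordinate updates of the rectified latent variables reduce to rectified linear operations, so repeated passes of inference unroll into a recurrent network built from ReLUs.

```python
# A minimal sketch of bidirectional inference in a two-layer rectified
# Gaussian. This is an illustration under simplifying assumptions (dense
# weights, a single image as a vector), not the released implementation.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def bidirectional_inference(x, W1, b1, W2, b2, num_passes=2):
    """Run `num_passes` sweeps of coordinate updates on hidden layers z1, z2.

    x  : input vector
    W1 : weights between the input and layer 1
    W2 : weights between layer 1 and layer 2
    A single pass (num_passes=1) is purely bottom-up (feed-forward);
    additional passes let layer 1 receive top-down feedback from layer 2.
    """
    z1 = relu(W1 @ x + b1)        # bottom-up initialization of layer 1
    z2 = relu(W2 @ z1 + b2)       # bottom-up initialization of layer 2
    for _ in range(num_passes - 1):
        # Layer 1 is now influenced by both the input (bottom-up)
        # and layer 2 (top-down feedback).
        z1 = relu(W1 @ x + W2.T @ z2 + b1)
        z2 = relu(W2 @ z1 + b2)
    return z1, z2
```

With num_passes=1 this reduces to a standard feed-forward pass; with num_passes=2, lower layers also receive top-down feedback. This is the distinction between the QP1 and QP2 models compared in the results below.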

Coarse-to-fine prediction


We visualize coarse-to-fine heatmap predictions, obtained with multi-scale classifiers defined on features extracted from multiple layers of our probabilistic model. Predictions based only on the coarse-scale layer are identical for the bottom-up and top-down models. As one extracts features from additional lower layers, the top-down model does a better job of incorporating contextual evidence, which cleans up the localization of the right knee and disambiguates the left and right ankles.
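As a rough illustration (assumed for exposition, not the released implementation), the multi-scale classifiers can be thought of as 1x1 convolutions, where the classifier at scale k sees features from the coarsest layer down through layer k. All names below (CoarseToFineHeatmaps, channels_per_layer, etc.) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineHeatmaps(nn.Module):
    """Hypothetical coarse-to-fine keypoint heatmap predictor."""

    def __init__(self, channels_per_layer, num_keypoints):
        super().__init__()
        # Classifier k scores features from layers 0..k (coarse through fine),
        # so its input channel count is the running sum of layer channels.
        running, totals = 0, []
        for c in channels_per_layer:
            running += c
            totals.append(running)
        self.classifiers = nn.ModuleList(
            nn.Conv2d(t, num_keypoints, kernel_size=1) for t in totals
        )

    def forward(self, features):
        """`features`: list of (N, C, H, W) feature maps, ordered coarse to fine."""
        heatmaps, collected = [], []
        for k, feat in enumerate(features):
            # Resample all coarser features to the current resolution and stack them.
            collected = [
                F.interpolate(f, size=feat.shape[-2:], mode="bilinear", align_corners=False)
                for f in collected
            ]
            collected.append(feat)
            heatmaps.append(self.classifiers[k](torch.cat(collected, dim=1)))
        # heatmaps[0] uses only the coarsest layer; each later map adds a finer layer.
        return heatmaps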


Low-level activation

Inspired by neurological experiments, we compare low-level activations at two different times: at 0.5ms (during bottom-up processing) and at 30ms (during top-down feedback). The convolutional activations of the same neural units appear different after top-down feedback: activations on facial parts (hair in the top row, facial skin in the bottom row) become stronger, while activations on the background (including clothing) are suppressed. This is best seen in the average activation across images, shown in the right-most column.
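A small sketch of the comparison described above, assuming a hypothetical run_inference(image, num_passes) helper that returns the low-level activation map of interest (one pass = bottom-up only, two passes = with top-down feedback):

```python
import numpy as np

def average_activations(images, run_inference):
    """Average a low-level activation map over `images`, before and after feedback.

    `run_inference` is a hypothetical helper: it runs the model for the given
    number of inference passes and returns one layer's activation map.
    """
    bottom_up = np.mean([run_inference(img, num_passes=1) for img in images], axis=0)
    with_feedback = np.mean([run_inference(img, num_passes=2) for img in images], axis=0)
    # Compare the two maps, e.g. by inspecting with_feedback - bottom_up.
    return bottom_up, with_feedback
```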


Human keypoint localization


On the left are human keypoint localization results produced by our proposed model. The top plot compares keypoint localization performance on MPII Human Pose between our approach and the state of the art at the time of submission. QP1 is equivalent to a "unidirectional" bottom-up feed-forward model, while QP2 is a "bidirectional" model with two passes of inference.
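In terms of the inference sketch given earlier, QP1 and QP2 differ only in the number of passes. The snippet below reuses the hypothetical bidirectional_inference function from that sketch; the weights and input are placeholder values purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)                              # placeholder input features
W1, b1 = rng.standard_normal((32, 16)), np.zeros(32)     # placeholder layer-1 weights
W2, b2 = rng.standard_normal((8, 32)), np.zeros(8)       # placeholder layer-2 weights

# QP1: a single pass, i.e. a purely bottom-up feed-forward model.
z1_qp1, z2_qp1 = bidirectional_inference(x, W1, b1, W2, b2, num_passes=1)

# QP2: two passes, so layer 1 also incorporates top-down feedback from layer 2.
z1_qp2, z2_qp2 = bidirectional_inference(x, W1, b1, W2, b2, num_passes=2)
```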


Facial landmark localization and visibility prediction


On the left are examples of facial landmark localization and visibility prediction (invisible vs. visible) from our model. The top plot compares landmark localization performance on COFW between our approach and prior art at the time of submission.


Code

We have released code for both human keypoint and facial landmark localization on GitHub. Links for downloading our models are included as well.

Acknowledgments

This research is supported by NSF Grant 0954083 and by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012.