Finding Tiny Faces

Peiyun Hu Deva Ramanan
Robotics Institute
Carnegie Mellon University

teaser
We describe a detector that can find ~8001 faces out of the reportedly 1000 present, by making use of novel characterization of scale, resolution, and context to find small objects. Detector confidence is given by the color bar on the right. Can you confidently identify errors?

1. Download our demo here. Last update on 12/22/2016.
teaser Download Paper

Abstract

Though tremendous strides have been made in object recognition, one of the remaining open challenges is detecting small objects. We explore three aspects of the problem in the context of finding small faces: the role of scale invariance, image resolution, and contextual reasoning. While most recognition approaches aim to be scale-invariant, the cues for recognizing a 3px tall face are fundamentally different than those for recognizing a 300px tall face. We take a different approach and train separate detectors for different scales. To maintain efficiency, detectors are trained in a multi-task fashion: they make use of features extracted from multiple layers of single (deep) feature hierarchy. While training detectors for large objects is straightforward, the crucial challenge remains training detectors for small objects. We show that context is crucial, and define templates that make use of massively-large receptive fields (where 99% of the template extends beyond the object of interest). Finally, we explore the role of scale in pre-trained deep networks, providing ways to extrapolate networks tuned for limited scales to rather extreme ranges. We demonstrate state-of-the-art results on massively-benchmarked face datasets (FDDB and WIDER FACE). In particular, when compared to prior art on WIDER FACE, our results reduce error by a factor of 2 (our models produce an AP of 81% while prior art ranges from 29-64%).

Multi-task modeling of scales

human_exp

Traditional approaches build a single-scale template that is applied on a finely-discretized image pyramid (a). To exploit different cues available at different resolutions, one could build different detectors for different object scales (b). Such an approach may fail on extreme object scales that are rarely observed in training (or pre-training) data. We make use of a coarse image pyramid to capture extreme scale challenges in (c). Finally, to improve performance on small faces, we model additional context, which is efficiently implemented as a fixed-size receptive field across all scale-specific templates (d). We define templates over features extracted from multiple layers of a deep model, which is analogous to foveal descriptors (e).

human_exp

Context in human vision

On the left, we visualize a large and small face, both without and with context. Context is not needed for a human user to recognize the large face, while the small face is dramatically unrecognizable without its context.

We quantify this observation with a simple human experiment (data) on the right, where users classify true and false positive faces of our proposed detector. Adding fixed context (300px) reduces error by 20% on small size (S) comparing to no context, but only 2% for extra large (XL). Also, proportional context (3X) becomes less helpful for smaller objects, suggesting context should also be modeled in a scale-variant manner. We operationalize this observation with foveal templates of massively-large receptive fields (around 300x300, the size of contextual images shown as as yellow boxes).

context remove_foveal

How best to encode context?

How much context should be modeled?

Modeling additional context helps, especially for finding small faces. The improvement from adding context to a tight-fitting template is greater for small faces (18.9%) than for large faces (1.5%). Interestingly smaller receptive fields do better for small faces, because the entire face is visible. The green box represents the actual face size, while dotted boxes represent receptive fields associated with features from different layers (cyan = res2, light-blue = res3, dark-blue = res4, black = res5). Same colors are used in Figures 5 and 7.

Understanding how foveal descriptors help

Foveal descriptors are crucial for accurate detection on small objects. The small template (top) performs 7% worse with only res4 and 33% worse with only res5. On the contrary, removing foveal structure does not hurt the large template (bottom), suggesting high-resolution features from lower layers are mostly useful for finding small objects!

scale case scale analysis

How to generalize pre-trained networks?

Say we want to find faces of 25x20 pixels. Surprisingly, we obtain better accuracy by using a 2X-larger template, of size 50x40 rather than 25x20. Alternatively, one can more accurately find 250x200 faces by using a 2X-smaller template, of size 125x100. Why?

We hypothesize that pre-trained models do not generalize to extreme scales. ImageNet has over 80% of its data between 40-140 pixel, considering a normalized world where images are all 224x2241. It is likely that pre-trained models are optimized for objects within such range of size. We've seen how to better generalize pre-trained networks for two specific scales. What about other scales?

We brute-forcely enumerate a number of scales, which roughly cover the scale space, and try three different strategies for every scale (shrink to 0.5X, stay as 1X, and interpolate to 2X). Natural regimes emerge in the figure: for finding large faces (more than 140px in height), build templates at 0.5X resolution; for finding small faces (less than 40px in height), build templates at 2X resolution. For sizes in between, build templates at 1X resolution.


1. Estimation is based on ImageNet's attribute annotations. See here for more info.

State-of-the-art results on WIDER Face and FDDB

WIDER Face test set 1

hard set (full dataset)
widerface test set hard
medium set
widerface test set medium
easy set
widerface test set easy

FDDB 2

Continuous score
FDDB published results on continue score
Discrete score
FDDB published results on discrete score

1. See easy, medium, and hard for performance curves on the validation set. Last update on 12/29/2016.
2. Only published results are shown here.

Qualitative results

Under different attributes

examples of different attributes

Under different scales

examples of different scales

Data

We release the data used in the human experiment of our paper. The data contains both true positives and false positives of different scales and with different amount of context from our detector. Download it here.

github

Code

We've released our training code plus a demo of our tiny face detector at Github.

Acknowledgments

This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012.