Context Encoders: Feature Learning by Inpainting

Deepak Pathak
Philipp Krähenbühl
Jeff Donahue
Trevor Darrell
Alexei A. Efros
UC Berkeley
CVPR 2016

Check out our brand new ImageNet results!


Semantic inpainting results on held-out images by the Context Encoder.


We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. By analogy with auto-encoders, we propose Context Encoders -- a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. In order to succeed at this task, context encoders need both to understand the content of the entire image and to produce a plausible hypothesis for the missing part(s). When training context encoders, we experimented with both a standard pixel-wise reconstruction loss and a reconstruction plus adversarial loss. The latter produces much sharper results because it can better handle multiple modes in the output. We found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures. We quantitatively demonstrate the effectiveness of our learned features for CNN pre-training on classification, detection, and segmentation tasks. Furthermore, context encoders can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
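For readers who want the joint objective in concrete form, below is a minimal PyTorch sketch of the reconstruction-plus-adversarial loss described in the abstract. The released inpainting code is in Torch/Lua and Caffe, so this is a sketch only; the function name, tensor shapes, and loss weights here are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def context_encoder_loss(pred_region, true_region, disc_logits,
                         lambda_rec=0.999, lambda_adv=0.001):
    # Sketch of the joint loss: lambda_rec * L2 + lambda_adv * L_adv.
    #   pred_region: generator output for the missing region, (B, C, H, W)
    #   true_region: ground-truth pixels for that region,     (B, C, H, W)
    #   disc_logits: discriminator logits on pred_region,     (B, 1)
    # The specific weight values are assumptions for illustration.
    # Pixel-wise L2 reconstruction loss over the masked region.
    l_rec = F.mse_loss(pred_region, true_region)
    # Adversarial term: encourage the discriminator to score the
    # generated region as real (standard GAN generator loss).
    l_adv = F.binary_cross_entropy_with_logits(
        disc_logits, torch.ones_like(disc_logits))
    return lambda_rec * l_rec + lambda_adv * l_adv

# Illustrative usage with random tensors:
pred = torch.randn(4, 3, 64, 64)
true = torch.randn(4, 3, 64, 64)
logits = torch.randn(4, 1)
loss = context_encoder_loss(pred, true, logits)

Weighting the loss heavily toward reconstruction is a common stabilizing choice for such joint objectives; as the abstract notes, it is the adversarial term that yields the sharper results by better handling multiple output modes.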


Demo and Source Code

Inpainting Code
[GitHub]
Features Caffemodel
[Prototxt] [Model 17MB]


ImageNet Results

These are Context Encoder results on random crops of held-out images; the central half region of each image is inpainted by our method. Images load in the browser on a rolling basis, so you can view results while the full page is still loading.

All models were trained entirely from scratch. The 1.2M-ImageNet model was trained on the complete 1.2M-image set of ILSVRC'12 for 110 epochs on a Titan X GPU, taking one month to train. The 100K-ImageNet model was trained on a random subset of 100K images from ILSVRC'12 for 500 epochs over 6 days. Since the 100K subset was sampled at random, we believe the results would not change with a different random subset; nevertheless, the exact list of 100K image IDs is available for download below. Results are shown on images from the ILSVRC'12 validation set.
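To make the setup concrete, masking the "center half region" amounts to blanking a centered rectangle spanning half the image's height and width, so the network sees only the surrounding context. Below is a minimal NumPy sketch; the function name, fill value, and lack of any border-overlap handling are assumptions for illustration, not the released preprocessing code.

import numpy as np

def mask_center_half(image, fill=0.0):
    # image: float array of shape (H, W, C).
    # Blank the centered region spanning half the height and width,
    # i.e. rows [H/4, 3H/4) and columns [W/4, 3W/4).
    h, w = image.shape[:2]
    top, left = h // 4, w // 4
    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + h // 2, left:left + w // 2] = True
    masked = image.copy()
    masked[mask] = fill  # the context encoder sees only the surround
    return masked, mask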

1.2M-ImageNet Trained (110 epochs)
[Browse Image Results] or [Download Tar 82MB]
[Browse Patch Results] or [Download Tar 76MB]


100K-ImageNet Trained (500 epochs)
[Training 100K Image IDs]
[Browse Image Results] or [Download Tar 82MB]
[Browse Patch Results] or [Download Tar 76MB]


Paris Street-View Results

StreetView Inpainting
[Browse Image Results]


Paper and Supplementary Material

[paper 15MB] [arXiv] [slides]

Citation
 
Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context Encoders: Feature Learning by Inpainting. In CVPR 2016.

[Bibtex]
@inproceedings{pathakCVPR16context,
    Author = {Pathak, Deepak and
    Kr\"ahenb\"uhl, Philipp and
    Donahue, Jeff and
    Darrell, Trevor and
    Efros, Alexei A.},
    Title = {Context Encoders:
    Feature Learning by Inpainting},
    Booktitle = {CVPR},
    Year = {2016}
}




Qualitative illustration of the task, with an example of inpainting by a human artist.



Acknowledgements

The authors would like to thank Amanda Buster for the artwork in the figure above, as well as Shubham Tulsiani and Saurabh Gupta for helpful discussions. This work was supported in part by DARPA, AFRL, Intel, DoD MURI award N000141110688, NSF awards IIS-1212798, IIS-1427425, and IIS-1536003, the Berkeley Vision and Learning Center, and Berkeley Deep Drive. We also thank NVIDIA for a GPU donation.
