Learning Features by Watching Objects Move

Deepak Pathak
Ross Girshick
Piotr Dollár
Trevor Darrell
Bharath Hariharan
UC Berkeley   &   Facebook AI Research (FAIR)
To appear at CVPR 2017

This paper presents a novel yet intuitive approach to unsupervised feature learning. Inspired by the human visual system, we explore whether low-level motion-based grouping cues can be used to learn an effective visual representation. Specifically, we use unsupervised motion-based segmentation on videos to obtain segments, which we use as 'pseudo ground truth' to train a convolutional network to segment objects from a single frame. Given the extensive evidence that motion plays a key role in the development of the human visual system, we hope that this straightforward approach to unsupervised learning will be more effective than cleverly designed 'pretext' tasks studied in the literature. Indeed, our extensive experiments show that this is the case. When used for transfer learning on object detection, our representation significantly outperforms previous unsupervised approaches across multiple settings, especially when training data for the target task is scarce.

Source Code and Models

We are releasing several pieces of software developed for this project that should also be generally useful for computer vision research:
(a) Unsupervised Learning: A GitHub repository containing the unsupervised-trained Caffe models and corresponding Caffe prototxts. The repository also hosts these models in Torch.
(b) uNLC: Code for unsupervised bottom-up video motion segmentation. uNLC is a reimplementation of the NLC algorithm by Faktor and Irani, BMVC 2014, that removes the trained edge detector and makes numerous other modifications and simplifications. For additional details, see section 5.1 in the paper.
(c) PyFlow: A Python wrapper around Ce Liu's C++ implementation of Coarse2Fine Optical Flow. It is used inside the uNLC implementation and is also useful as a standalone package.
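To caricature the idea of motion-based grouping in a few lines: given a dense optical flow field between two frames, pixels whose motion stands out from the background can be grouped into a foreground mask and used as pseudo ground truth. This is a minimal NumPy sketch for intuition only; it is NOT the uNLC algorithm (which uses non-local consensus voting, not a simple threshold), and the threshold rule is an assumption.

```python
import numpy as np

def motion_mask(u, v, k=2.0):
    """Group pixels whose flow magnitude is an outlier vs. the background.

    u, v : per-pixel horizontal/vertical displacements (e.g. from PyFlow).
    k    : outlier threshold in standard deviations (illustrative choice).
    """
    mag = np.hypot(u, v)                      # per-pixel flow magnitude
    thresh = mag.mean() + k * mag.std()       # simple outlier rule (assumption)
    return mag > thresh                       # boolean foreground mask

# Toy example: static background, one 10x10 block moving 5 px to the right.
u = np.zeros((64, 64))
v = np.zeros((64, 64))
u[20:30, 20:30] = 5.0
mask = motion_mask(u, v)
print(mask.sum())  # the 100 moving pixels are grouped as foreground
```

In the real pipeline the mask produced by motion segmentation becomes the training target for a convolutional network that must predict it from a single (static) frame.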

Unsupervised Learning
uNLC Code
Python Optical Flow

Unsupervised Video Segmentation Dataset

We ran uNLC on 205K videos automatically selected from YFCC100m. We sampled 5-10 frames per shot from each video to create a dataset of 1.6M images, so we have slightly more frames than there are images in ImageNet. Note, however, that frames from the same clip are highly correlated. For details, refer to the paper.
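The per-shot sampling described above can be sketched as follows. This is an illustrative implementation, not the paper's exact procedure: given shot boundaries as frame indices, it draws up to k evenly spaced frames from each shot.

```python
import numpy as np

def sample_frames(shot_bounds, k=5):
    """Sample up to k evenly spaced frame indices from each shot.

    shot_bounds : sorted frame indices delimiting shots, e.g. [0, 120, 200, 300]
                  means shots [0,120), [120,200), [200,300).
    """
    samples = []
    for start, end in zip(shot_bounds[:-1], shot_bounds[1:]):
        n = min(k, end - start)                           # short shots give fewer frames
        idx = np.linspace(start, end - 1, num=n, dtype=int)
        samples.extend(np.unique(idx).tolist())
    return samples

# A 300-frame video split into three shots; 5 frames sampled per shot.
frames = sample_frames([0, 120, 200, 300], k=5)
print(len(frames))  # 15
```

Sampling a handful of frames per shot (rather than every frame) keeps the dataset size manageable while limiting, though not eliminating, the within-clip correlation noted above.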

Video Frames
[Download Tar 45GB]
Unsupervised Segments
[Download Tar 22GB]

Unsupervised Learning Results

We compared the features learned by our unsupervised motion-grouping approach with prior and concurrent work. The representation learned by our model outperforms previous approaches when transferred to PASCAL VOC detection. A brief summary of the comparison is as follows:

One interesting question is how much data would be needed to surpass supervised learning performance. Our analysis is shown below. If the logarithmic growth continues, our representation would be on par with one trained on ImageNet given about 27M 'correlated' frames (or 3M-5M videos). We expect this number could be reduced with further algorithmic improvements.
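The extrapolation above amounts to fitting performance as a linear function of log(data) and solving for the crossover point. A minimal sketch with NumPy; the (frames, mAP) points and the supervised target below are made-up numbers for illustration, not values from the paper:

```python
import numpy as np

# Hypothetical (frames, detection mAP) measurements showing logarithmic growth.
frames = np.array([0.1e6, 0.4e6, 1.6e6])
mAP = np.array([44.0, 47.0, 50.0])

# Fit mAP = a + b * ln(frames).
b, a = np.polyfit(np.log(frames), mAP, 1)

# Solve a + b * ln(n) = target for n, the frame count matching supervised mAP.
target = 56.0                       # hypothetical supervised (ImageNet) number
needed = np.exp((target - a) / b)
print(f"fit: mAP ~ {a:.1f} + {b:.2f} * ln(frames)")
print(f"frames needed to reach {target} mAP: {needed / 1e6:.1f}M")
```

With these illustrative numbers the fit lands in the tens of millions of frames, the same order of magnitude as the ~27M estimate quoted above; the caveat is that log-linear extrapolation assumes the growth trend holds well beyond the measured range.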


[Paper 9MB]  [arXiv]

Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning Features by Watching Objects Move. In CVPR 2017.

@inproceedings{pathakCVPR17learning,
    Author = {Pathak, Deepak and Girshick, Ross and Doll{\'a}r, Piotr and
              Darrell, Trevor and Hariharan, Bharath},
    Title = {Learning Features by Watching Objects Move},
    Booktitle = {CVPR},
    Year = {2017}
}


This work was done during a summer internship at Facebook AI Research (FAIR). The authors would like to thank Larry Zitnick for helpful discussions. Template copied from Context Encoders, which is itself a modification of the Colorful template!