There have been several recent efforts to build visual knowledge bases from Internet images, but most of these approaches have focused on bounding-box representations of objects. In this paper, we propose to enrich such knowledge bases by automatically discovering objects and their segmentations from noisy Internet images. Specifically, our approach combines the power of generative modeling for segmentation with the effectiveness of discriminative models for detection. The key idea behind our approach is to learn and exploit top-down segmentation priors based on visual subcategories. The strong priors learned from these visual subcategories are then combined with discriminatively trained detectors and bottom-up cues to produce clean object segmentations. Our experimental results indicate state-of-the-art performance on the challenging Internet dataset. We have integrated our algorithm into NEIL to enrich its knowledge base.
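The fusion step described above can be illustrated with a minimal sketch. The function and variable names below are hypothetical (they are not from the released code): we average the aligned masks of one visual subcategory into a top-down prior, then blend that prior with a bottom-up cue (e.g. a saliency map) and threshold to get a binary segmentation. The equal-weight blend and the 0.5 threshold are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def subcategory_prior(masks):
    """Average the aligned binary masks of one visual subcategory
    to obtain a top-down segmentation prior with values in [0, 1]."""
    return np.mean(np.stack(masks, axis=0), axis=0)

def combine(prior, bottom_up, w=0.5, thresh=0.5):
    """Fuse the top-down prior with a bottom-up cue (e.g. saliency)
    by a weighted average, then threshold to a binary segmentation."""
    score = w * prior + (1.0 - w) * bottom_up
    return score > thresh

# Toy example: two aligned 2x2 masks from the same subcategory.
masks = [np.array([[1, 1], [0, 0]]), np.array([[1, 0], [0, 0]])]
prior = subcategory_prior(masks)             # [[1.0, 0.5], [0.0, 0.0]]
seg = combine(prior, np.full((2, 2), 0.6))   # [[True, True], [False, False]]
```

In the actual system the prior is learned per visual subcategory and combined with discriminatively trained detectors as well; this sketch only shows the prior/bottom-up blending.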


Paper & Presentation

CVPR paper (pdf, 4.7MB)
Supplementary Material (pdf, 980MB)


Xinlei Chen, Abhinav Shrivastava and Abhinav Gupta. Enriching Visual Knowledge Bases via Object Discovery and Segmentation. In CVPR 2014.

@inproceedings{chen2014enriching,
    Author = {Xinlei Chen and Abhinav Shrivastava and Abhinav Gupta},
    Title = {{E}nriching {V}isual {K}nowledge {B}ases via {O}bject {D}iscovery and {S}egmentation},
    Booktitle = {Computer Vision and Pattern Recognition (CVPR)},
    Year = {2014}
}

Related Papers


The code is already on GitHub! Note: due to randomness, the qualitative behavior should be the same, but the output may not be identical to the results in the paper. With the random stream fixed in the released code, the new quantitative results are summarized below. Bolded numbers indicate better performance than originally reported in the paper.

Full Internet Dataset

Category    P (precision, %)    J (Jaccard, %)
Airplane    92.19               60.87
Car         87.28               62.74
Horse       90.11               60.23

Sampled Internet Dataset (100 Images)

Category    P (precision, %)    J (Jaccard, %)
Airplane    89.92               54.62
Car         89.37               69.20
Horse       88.05               44.46
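Presumably, P and J above are the per-pixel labeling precision and the Jaccard similarity (intersection-over-union) of the foreground, as is standard for evaluation on this dataset. A minimal sketch of how these two scores can be computed from binary masks:

```python
import numpy as np

def precision(pred, gt):
    """Percentage of pixels whose predicted label (foreground or
    background) matches the ground truth."""
    return 100.0 * np.mean(pred == gt)

def jaccard(pred, gt):
    """Intersection-over-union of the foreground regions, in percent."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 100.0 * inter / union if union else 100.0

# Toy 2x2 example masks (True = foreground).
pred = np.array([[1, 1], [0, 0]], dtype=bool)
gt = np.array([[1, 0], [0, 0]], dtype=bool)
p = precision(pred, gt)  # 75.0
j = jaccard(pred, gt)    # 50.0
```

Both numbers in the tables are averages of these per-image scores over each category.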

Qualitative results (1.4MB) on the evaluation datasets are also available here.


The negative data (526MB) used for training the latent SVM detectors is available here; most of the images are Google Scene images. See the code to learn how to use it.


This research was supported by: