Language Technologies Institute
Student Research Symposium 2006

Do images have visual keywords?

Jun Yang

This talk describes my research on visual keywords, a sparse, high-dimensional representation of images analogous to terms in text documents. The visual keywords of an image are extracted by detecting local interest points ("keypoints") in the image and clustering their descriptors. We study the distribution of visual keywords in a large video corpus and find that it bears strong similarity to, yet important differences from, the term distribution in a text corpus. Representing images as a "bag of visual keywords" encourages the use of mature IR techniques to solve problems in image and video retrieval. In this talk, we show empirical results on using text categorization methods to classify video frames, with an emphasis on the influence of vocabulary size, term weighting (e.g., binary, TF, TF-IDF), stop-word removal, and feature selection. This approach achieves performance comparable to that of global image features, and significantly higher performance when the two are combined, showing the great promise of IR methods for image and video retrieval problems.
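The extraction pipeline above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the system used in the study): random vectors stand in for SIFT-like keypoint descriptors, a naive k-means builds the visual vocabulary, and each image becomes a histogram of visual-keyword counts.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_vocabulary(descriptors, k, iters=20):
    """Naive k-means over keypoint descriptors; cluster centers act as visual keywords."""
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest cluster center
        d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bag_of_keywords(image_descriptors, centers):
    """Quantize an image's descriptors and count occurrences of each visual keyword."""
    d = np.linalg.norm(image_descriptors[:, None, :] - centers[None, :, :], axis=2)
    words = d.argmin(axis=1)
    return np.bincount(words, minlength=len(centers))

# toy data: 200 fake 8-dimensional keypoint descriptors drawn from a "corpus"
corpus = rng.normal(size=(200, 8))
vocab = build_vocabulary(corpus, k=10)
# one "image" with 30 keypoints; its histogram sums to 30, one word per keypoint
hist = bag_of_keywords(rng.normal(size=(30, 8)), vocab)
```

In practice the vocabulary is orders of magnitude larger (see finding 3 below), and the resulting histograms are correspondingly sparse.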

Interesting findings in this study include:

1) The distribution of visual keywords roughly follows Zipf's law, but is less skewed than the term distribution in a text corpus.

2) Frequently-occurring visual keywords are not "stop words"; instead, they are more informative than rare keywords when it comes to image classification.

3) Classification performance increases with the size of the visual keyword vocabulary, and levels off once the size reaches the order of 100,000. The data become linearly separable when the vocabulary is sufficiently large.

4) While TF and TF-IDF weighting help classification performance at low feature dimensions, binary features are more effective at higher dimensions. Normalizing by the number of visual keywords in each image hurts performance.

5) Feature selection based on chi-square statistics and mutual information can reduce the vocabulary size by 60% with no loss of classification performance.
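The weighting schemes compared in finding 4 can be made concrete on a toy count matrix (the data and the plain log-based IDF below are illustrative assumptions, not the exact formulation used in the experiments):

```python
import numpy as np

# 3 images x 4 visual keywords: raw occurrence counts
counts = np.array([[3, 0, 1, 0],
                   [0, 2, 2, 1],
                   [1, 1, 0, 0]], dtype=float)

binary = (counts > 0).astype(float)   # presence/absence weighting
tf = counts                           # raw term frequency
df = (counts > 0).sum(axis=0)         # document frequency of each keyword
idf = np.log(len(counts) / df)        # inverse document frequency
tfidf = tf * idf                      # TF-IDF weighting
```

Finding 4 says that at high dimensions the simple `binary` matrix outperforms `tfidf`, and that further dividing each row by the image's keypoint count is harmful.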
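The chi-square scoring behind finding 5 can be sketched as follows. This is a hypothetical toy version for one binarized visual keyword against a binary class label; the actual study ranks the full vocabulary and keeps the top-scoring fraction.

```python
import numpy as np

def chi2_score(feature, label):
    """Chi-square statistic for a binary feature vs. a binary class label."""
    n = len(label)
    score = 0.0
    for f in (0, 1):
        for c in (0, 1):
            observed = np.sum((feature == f) & (label == c))
            expected = np.sum(feature == f) * np.sum(label == c) / n
            if expected > 0:
                score += (observed - expected) ** 2 / expected
    return score

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=100)
# feature 0 tracks the label (with 10% noise); feature 1 is pure noise
informative = (labels ^ (rng.random(100) < 0.1)).astype(int)
noise = rng.integers(0, 2, size=100)
scores = [chi2_score(informative, labels), chi2_score(noise, labels)]
# rank the vocabulary by score and keep, e.g., the top 40% of keywords
```

The informative feature scores far higher than the noise feature, which is why pruning the low-scoring 60% of the vocabulary can leave classification performance intact.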