Do images have visual keywords?
Jun Yang
This talk describes my research on visual keywords, a sparse, high-dimensional
representation of images in which the keywords are analogous to terms in text documents. The
visual keywords of an image are extracted by detecting and clustering the local
interest points or "keypoints" in the image. We study the distribution of
visual keywords in a large video corpus and find that it closely resembles, yet
differs in important ways from, the term distribution in a text
corpus. Representing images as a "bag of visual keywords" encourages the use of
mature techniques in IR to solve problems in image and video retrieval. In this
talk, we show empirical results on using text categorization methods for
classifying video frames, with an emphasis on the influence of vocabulary size,
term weighting (e.g., binary, TF, TF-IDF), stop word removal, and feature
selection. This approach performs comparably to global
image features alone and significantly better when the two are combined,
showing the great promise of IR methods in solving image and video retrieval
problems.
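The term-weighting schemes mentioned above (binary, TF, TF-IDF) can be illustrated with a minimal sketch. It assumes each image has already been reduced to a list of visual keyword IDs, i.e. the cluster indices of its keypoints; the function name `weight_bags`, the toy data, and the plain `log(N / df)` form of IDF are illustrative assumptions, not details taken from the talk.

```python
import math
from collections import Counter

def weight_bags(bags, scheme="tfidf"):
    """Turn lists of visual-keyword IDs into sparse weighted vectors (dicts).

    bags: one list of keyword IDs per image (hypothetical cluster indices).
    scheme: "binary", "tf", or "tfidf".
    """
    n_images = len(bags)
    # Document frequency: number of images in which each keyword appears.
    df = Counter()
    for bag in bags:
        df.update(set(bag))

    vectors = []
    for bag in bags:
        tf = Counter(bag)  # raw count of each keyword in this image
        if scheme == "binary":
            vec = {k: 1.0 for k in tf}
        elif scheme == "tf":
            vec = {k: float(c) for k, c in tf.items()}
        elif scheme == "tfidf":
            # Assumed IDF variant: log(N / df), with no smoothing.
            vec = {k: c * math.log(n_images / df[k]) for k, c in tf.items()}
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(vec)
    return vectors

# Toy data: three images described by made-up keyword IDs.
bags = [[0, 0, 3], [0, 2], [0, 3, 3, 3]]
binary_vecs = weight_bags(bags, "binary")
tf_vecs = weight_bags(bags, "tf")
tfidf_vecs = weight_bags(bags, "tfidf")
```

In this toy data keyword 0 occurs in every image, so its TF-IDF weight is zero under the assumed IDF form, while the binary and TF schemes still keep it in the vector.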
Interesting findings in this study include:
1) The distribution of visual keywords roughly follows Zipf's law, but is more
even than the term distribution in a text corpus.
2) Frequently-occurring visual keywords are not "stop words"; instead, they
are more informative than rare keywords when it comes to image classification.
3) Classification performance increases with the size of the visual keyword
vocabulary and levels off once the size reaches the order of 100,000. The
data are linearly separable when the vocabulary is sufficiently large.
4) While TF and TF-IDF weighting help classification performance at low
feature dimensions, binary features are more effective at higher
dimensions. Normalization by the number of visual keywords in each image hurts
performance.
5) Feature selection based on chi-square statistics and mutual information can
reduce the vocabulary size by 60% with no loss of classification performance.
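As a rough illustration of the chi-square feature selection in finding 5, the sketch below scores each visual keyword by the chi-square statistic of its presence against a binary class label and keeps the top-scoring fraction. The function names, the toy data, the two-class setting, and the use of keyword presence rather than counts are assumptions made for the example; the talk does not specify these details.

```python
def chi_square(presence, labels):
    """Chi-square statistic of a 2x2 table: keyword presence vs. binary label."""
    n = len(labels)
    a = sum(1 for p, y in zip(presence, labels) if p and y)      # present, positive
    b = sum(1 for p, y in zip(presence, labels) if p and not y)  # present, negative
    c = sum(1 for p, y in zip(presence, labels) if not p and y)  # absent, positive
    d = n - a - b - c                                            # absent, negative
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Keyword (or class) is constant across images: no discriminative signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def select_keywords(bags, labels, vocab_size, keep_fraction=0.4):
    """Return the IDs of the highest-scoring keywords, in ascending order.

    keep_fraction=0.4 corresponds to a 60% reduction of the vocabulary.
    """
    scores = []
    for k in range(vocab_size):
        presence = [k in bag for bag in bags]
        scores.append((chi_square(presence, labels), k))
    n_keep = max(1, round(keep_fraction * vocab_size))
    # Rank by score (descending); break ties by keyword ID for determinism.
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return sorted(k for _, k in ranked[:n_keep])

# Toy data: four images as sets of made-up keyword IDs, two per class.
bags = [{0, 1, 4}, {0, 2, 4}, {1, 3, 4}, {2, 3, 4}]
labels = [1, 1, 0, 0]
selected = select_keywords(bags, labels, vocab_size=5)
```

Here keywords 0 and 3 each occur in only one class and are retained, while keyword 4, which occurs in every image, scores zero and is dropped.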