Bag of Words (BoW) is a popular and successful framework for activity classification in videos. In BoW, we extract features, cluster them to learn a codebook of words, and then quantize each video by pooling its features. We address limitations in two fundamental aspects of this framework. First, we add structure to the clustering step to enable generalization across different execution styles. Second, we provide a method for pooling features in a structured way. In prior work, pooling is done over pre-determined rigid cuboids. It is natural to consider pooling features over a video segmentation instead, but this produces a video representation of variable size. We propose a fixed-size representation, Motion Words, in which we pool features over supervoxels. To segment the video into supervoxels, we propose a superpixel-based method, Globally Consistent Supervoxels, designed to preserve motion boundaries over the entire video. Evaluation on classification and retrieval tasks on two datasets shows that Motion Words achieves state-of-the-art performance. In addition to providing more flexible support for capturing actions, the proposed method makes the results interpretable, i.e., it explains why two videos are similar.
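To make the standard BoW pipeline concrete, here is a minimal sketch of the codebook-learning and quantization steps described above. It uses plain k-means and global histogram pooling; the function names, descriptor dimensions, and pooling choice are illustrative assumptions, not the proposed Motion Words method (which instead pools over supervoxels).

```python
import numpy as np

def learn_codebook(features, k, iters=10, seed=0):
    # Plain k-means over local descriptors: the k centers are the
    # "visual words" of the codebook. (Illustrative sketch only.)
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = features[labels == j].mean(axis=0)
    return centers

def quantize(video_features, centers):
    # Pool a video's descriptors by assigning each to its nearest
    # word and building a normalized histogram: a fixed-size vector
    # regardless of how many descriptors the video produced.
    dists = np.linalg.norm(video_features[:, None] - centers[None], axis=2)
    hist = np.bincount(dists.argmin(axis=1),
                       minlength=len(centers)).astype(float)
    return hist / max(hist.sum(), 1.0)
```

The fixed-size histogram is what makes whole-video comparison straightforward; the variable-size alternative mentioned above (pooling over an arbitrary video segmentation) is exactly what Motion Words avoids while still pooling over structured regions.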
Joint work with Fernando De la Torre and Martial Hebert.
Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.