Building video representations typically involves four steps: feature extraction, quantization, encoding, and pooling. While feature extraction and encoding have seen large advances, the questions of how to quantize video features and what kinds of regions to pool them over remain relatively unexplored. Tackling the challenges present in video data requires robust quantization and pooling methods.
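To make the four steps concrete, the sketch below runs a toy bag-of-words pipeline over hypothetical pre-extracted features. The function names, the hard-assignment quantizer, the histogram encoding, and the whole-video pooling are illustrative assumptions, not the methods developed in this thesis.

```python
def quantize(features, codebook):
    """Assign each feature vector to its nearest codeword (hard assignment)."""
    def nearest(f):
        return min(range(len(codebook)),
                   key=lambda k: sum((a - b) ** 2 for a, b in zip(f, codebook[k])))
    return [nearest(f) for f in features]

def encode(assignments, k):
    """Bag-of-words histogram over codeword indices."""
    hist = [0] * k
    for a in assignments:
        hist[a] += 1
    return hist

def pool(hist):
    """L1-normalize the histogram (pooling over the entire video)."""
    total = sum(hist) or 1
    return [h / total for h in hist]

# Toy example: three 2-D "features" (stand-ins for extracted descriptors)
# and a 2-word codebook.
features = [(0.0, 0.1), (0.2, 0.0), (1.0, 1.1)]
codebook = [(0.0, 0.0), (1.0, 1.0)]
assignments = quantize(features, codebook)            # [0, 0, 1]
descriptor = pool(encode(assignments, len(codebook))) # fixed-length video descriptor
```

In practice the codebook is learned (e.g. by clustering training features) and pooling may be done over many regions rather than the whole video; this sketch only fixes the vocabulary of the four steps.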
The first contribution of this thesis, Source Constrained Clustering, quantizes features into a codebook that generalizes better across actions. The main insight is to incorporate readily available labels of the sources that generated the data: sources can be the people who performed each cooking recipe, the directors who made each movie, or the YouTube users who shared their videos.

In the pooling step, it is common to pool feature vectors over local regions, such as the entire video, coarse spatio-temporal pyramids, or cuboids of pre-determined, fixed size. A consequence of using indiscriminately chosen cuboids is that widely dissimilar features may be pooled together merely because they lie in nearby locations. It is natural instead to pool video features over supervoxels, for example those obtained from a video segmentation. However, since different videos can have different numbers of supervoxels, this produces a video representation of variable size. The second contribution of this thesis is a new, fixed-size video representation, Motion Words, in which we pool features over video segments.
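The variable-size problem, and one plausible way around it, can be sketched as follows: pooling per supervoxel yields as many descriptors as a video has segments, but quantizing those pooled descriptors against a fixed vocabulary and histogramming the assignments yields a vector of fixed length. This is only an assumed illustration of the idea behind a fixed-size, segment-based representation, not the thesis's exact formulation of Motion Words.

```python
def pool_per_segment(features, segment_ids):
    """Average the features that fall inside each supervoxel.
    The number of pooled descriptors varies from video to video."""
    by_seg = {}
    for f, s in zip(features, segment_ids):
        by_seg.setdefault(s, []).append(f)
    return [[sum(col) / len(col) for col in zip(*fs)] for fs in by_seg.values()]

def fixed_size_histogram(segment_descriptors, vocabulary):
    """Quantize pooled segment descriptors against a learned vocabulary and
    histogram them, giving a vector of fixed length len(vocabulary)."""
    hist = [0] * len(vocabulary)
    for d in segment_descriptors:
        k = min(range(len(vocabulary)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(d, vocabulary[j])))
        hist[k] += 1
    return hist

# Toy example: four 2-D features split across two supervoxels,
# and a hypothetical vocabulary of two entries.
features = [(0.0, 0.0), (0.2, 0.2), (1.0, 1.0), (0.8, 1.2)]
segment_ids = [0, 0, 1, 1]
descs = pool_per_segment(features, segment_ids)  # one descriptor per segment
vocabulary = [(0.0, 0.0), (1.0, 1.0)]
rep = fixed_size_histogram(descs, vocabulary)    # always length 2
```

Because the histogram length depends only on the vocabulary, two videos with very different numbers of supervoxels still produce directly comparable vectors.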
The ultimate goal of video segmentation is to recover object boundaries, which often means grouping pixels from regions with very different motion. In the context of Motion Words, however, it is important that regions preserve motion boundaries. The third contribution of this thesis is a supervoxel segmentation, Globally Consistent Supervoxels, which respects motion boundaries and provides better spatio-temporal support for Motion Words.
Finally, it is essential that the representations we build are interpretable: we want to be able to visualize and understand why videos are similar. Motion Words enable localization of common regions across videos in both space and time, giving us the means to understand which regions make videos similar.
Martial Hebert (Co-Chair)
Fernando De la Torre (Co-Chair)
Silvio Savarese (Stanford University)