Part II: K-means and Video Retrieval

In this problem we will evaluate how $K$-means clustering can be used to retrieve videos from a collection given example queries.

You are provided a “test” collection of 3000 videos (for the purposes of this assignment, the test set will also serve as the “training” set for computing codebooks, as described later). Fifty of these videos are songs by various artists/groups. As a “query” you are provided one example song by each of these artists. You must use these examples to retrieve the rest of the songs by each artist from the collection.

Specifically, given any example recording by an artist, you must rank the 3000 videos in terms of their similarity to the provided example. If you do this right, the top-ranked entries will include other songs by the artist.

To do this, you must convert each video into a “feature” vector $V_i$ (where $V_i$ represents the feature vector for the $i^{\rm th}$ video). You will similarly convert your query video to a vector $Q$. You will then rank all of the videos according to the cosine similarity $D_i = \frac{V_i^\top Q}{\|V_i\|\,\|Q\|}$ (higher values indicate greater similarity to the query).
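For concreteness, here is a minimal NumPy sketch of this ranking step (the function name and array layout are illustrative assumptions, not part of the assignment):

```python
import numpy as np

def rank_by_cosine(Q, V):
    """Rank videos by cosine similarity to the query.

    Q : (d,) query feature vector
    V : (n, d) matrix whose rows are the video feature vectors V_i
    Returns video indices sorted from most to least similar.
    """
    eps = 1e-12  # guards against all-zero feature vectors
    sims = (V @ Q) / (np.linalg.norm(V, axis=1) * np.linalg.norm(Q) + eps)
    return np.argsort(-sims)  # descending order of similarity
```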

The actual problem is to arrive at the vector representation $V_i$ for the videos (and $Q$ for the query). This problem has two parts, in which we will investigate two ways of doing so.

As a first step, you must extract a set of “SIFT” vectors from every key frame of every video. SIFT (Scale-Invariant Feature Transform) is a feature-extraction method for images that extracts one vector “descriptor” for each “keypoint” in the image. A keypoint is an automatically identified salient feature in the image. The SIFT descriptor is invariant to the scale, translation, and rotation of the keypoint. Thus each image becomes a collection of SIFT descriptors.

The video itself thus becomes a sequence of collections of SIFT vectors, one collection per key frame. We will use this characterization to generate our vector representations of the videos.
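If you are working in Python rather than MATLAB, one possible way to extract SIFT descriptors is via OpenCV. This is only an illustrative sketch (the hints below use a MATLAB-based extractor), and the function name is hypothetical:

```python
import cv2  # requires an OpenCV build with SIFT (>= 4.4, or opencv-contrib)

def sift_descriptors(frame_path):
    """Return the (num_keypoints, 128) SIFT descriptor matrix for one key frame."""
    img = cv2.imread(frame_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)
    return descriptors  # None if the frame has no detected keypoints
```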

II-a: Bag-of-SIFT-Words Representation

Use $K$-means clustering to generate a codebook of size $K$ (to be specified shortly) from the set of SIFT feature vectors (for all the frames of all the training videos). Let $C_1,\cdots,C_K$ represent the codewords. $K$-means effectively partitions the space into $K$ regions; each codeword is the mean of all the vectors that fall in its region.
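A possible sketch of this step using scikit-learn's mini-batch $K$-means (chosen here only because the pooled descriptor set can be very large; the library and function name are assumptions, not requirements):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def train_codebook(all_descriptors, K):
    """Cluster the pooled SIFT descriptors into K codewords.

    all_descriptors : (N, 128) array stacking the descriptors from every
                      key frame of every training video (N may be huge,
                      hence the mini-batch variant).
    """
    km = MiniBatchKMeans(n_clusters=K, random_state=0).fit(all_descriptors)
    return km.cluster_centers_  # (K, 128): the codewords C_1, ..., C_K
```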

For each video do the following:

  1. Initialize your Bag-of-SIFT-Words (BOSW) vector representation $V$ for the video as a $K$-dimensional vector of zeros.
  2. For each key frame $k$ in the video
    1. For each SIFT vector $s_{i,k}$ in the frame (here $k$ indexes the frame and $i$ indexes the SIFT descriptor)
      1. Identify the index $J$ of the codeword $C_J$ that is closest to $s_{i,k}$, i.e. $J = \arg\min_j \mathrm{dist}(C_j, s_{i,k})$ (where $\mathrm{dist}(X,Y)$ is the Euclidean distance between $X$ and $Y$).
      2. Increment $V(J) = V(J)+1$, where $V(J)$ is the $J^{\rm th}$ component of the BOSW vector $V$.
  3. The resultant vector $V$ will be a population count of the number of SIFT descriptors that fell into each of the $K$ regions. So the BOSW representation is actually a histogram over SIFT descriptors for the video.
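The loop above can be vectorized per frame. A minimal NumPy sketch, assuming each video is given as a list of per-frame descriptor matrices (all names are illustrative):

```python
import numpy as np

def bosw_vector(frame_descriptors, codebook):
    """Bag-of-SIFT-Words histogram for one video.

    frame_descriptors : list of (n_k, 128) descriptor arrays, one per key frame
    codebook          : (K, 128) array of codewords C_1, ..., C_K
    """
    V = np.zeros(codebook.shape[0])
    for descs in frame_descriptors:
        if descs is None:  # key frame with no detected keypoints
            continue
        # Euclidean distance from each descriptor to each codeword: (n_k, K)
        d = np.linalg.norm(descs[:, None, :] - codebook[None, :, :], axis=2)
        np.add.at(V, d.argmin(axis=1), 1)  # V(J) += 1 per descriptor
    return V
```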

Compute a separate vector for every video in the test set. Also compute the vector for each of the supplied queries. Rank the test videos by their similarity to each query, as described above. For each query file named “QXX.mp4”, write out the numeric IDs of the test videos, in ranked order, one ID per line, to a file called “query.II-a.$K$.output”. Return all the output files (10 of them) as your output ($K$ in the filename is a number, as given below).

You may vary $K$ across answers. You may also vary the manner in which you compute the score for ranking (replace the cosine similarity with some other measure, for instance). You may also change the SIFT descriptor specifications.

You will not be provided an answer key. Instead, we will put up a leader board of the best submissions so far. Submissions will be ranked on the basis of Mean Average Precision (MAP), a standard metric for measuring the accuracy of information retrieval systems.
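For intuition only, here is a sketch of average precision for one query (MAP is its mean over all queries). You will not need to compute this yourself, since grading happens on the leader board:

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: average of precision@r over the ranks r at which
    a relevant video appears. MAP is the mean of AP over all queries."""
    relevant = set(relevant_ids)
    hits, total = 0, 0.0
    for r, vid in enumerate(ranked_ids, start=1):
        if vid in relevant:
            hits += 1
            total += hits / r  # precision at rank r
    return total / len(relevant) if relevant else 0.0
```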

Mail your outputs to Pallavi in a zipfile named “YourName.AnswerVersion.zip”. If you submit multiple files with the same version name, only the first one will be considered.

II-b: Bag-of-Frame-Words Representation

In part “a” the BOSW representation accumulated SIFT descriptors from all frames into a single, common representation. For part “b” we will maintain the distinction between frames.

We will begin by computing a codebook $C_1,\cdots,C_K$ of SIFT descriptors as before.

Now compute a separate BOSW vector for every frame of each video as follows.

  1. For each key frame $k$ in the video
    1. Initialize your Bag-of-SIFT-Words (BOSW) vector representation $V_k$ for the frame (the subscript $k$ represents the frame) as a $K$-dimensional vector of zeros.
    2. For each SIFT vector $s_{i,k}$ in the frame (here $k$ indexes the frame and $i$ indexes the SIFT descriptor)
      1. Identify the index $J$ of the codeword $C_J$ that is closest to $s_{i,k}$, i.e. $J = \arg\min_j \mathrm{dist}(C_j, s_{i,k})$ (where $\mathrm{dist}(X,Y)$ is the Euclidean distance between $X$ and $Y$).
      2. Increment $V_k(J) = V_k(J)+1$, where $V_k(J)$ is the $J^{\rm th}$ component of the BOSW vector $V_k$.
  2. The result will be a sequence of BOSW vectors $V_1, V_2, \ldots$ representing the video.

Each video is now represented by a sequence of BOSW vectors. Compute this representation for every video in your test set, as well as for each query.
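A minimal sketch of this per-frame version, reusing the same nearest-codeword logic as in part “a” (names are illustrative):

```python
import numpy as np

def frame_bosw_sequence(frame_descriptors, codebook):
    """Return the per-frame BOSW histograms [V_1, V_2, ...] for one video."""
    K = codebook.shape[0]
    sequence = []
    for descs in frame_descriptors:  # one entry per key frame k
        V_k = np.zeros(K)
        if descs is not None:
            d = np.linalg.norm(descs[:, None, :] - codebook[None, :, :], axis=2)
            np.add.at(V_k, d.argmin(axis=1), 1)
        sequence.append(V_k)
    return sequence
```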

Now, using the collection of BOSW vectors from all the videos in the test set, train a second codebook $D_1, D_2, \cdots, D_M$ of $M$ codewords. Note that these are now codewords over BOSW vectors. In effect, this is a deep codebook.

Now, for each video, compute a Bag-of-Frame-Words (BOFW) representation as follows:

  1. Initialize the BOFW vector $Z$ as an $M$-dimensional vector of zeros.
  2. For each BOSW vector $V_i$
    1. Find the index $J$ of the closest BOSW codeword: $J = \arg\min_m \mathrm{dist}(V_i, D_m)$.
    2. Increment $Z(J) = Z(J) + 1$.
  3. The resultant BOFW vector retains the distinction between the frames in the video.
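A sketch of the second-level codebook and the BOFW histogram, again using scikit-learn's mini-batch $K$-means as an assumed stand-in for whatever clustering code you used in part “a”:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def train_frame_codebook(all_bosw, M):
    """Second-level codebook D_1, ..., D_M over the pooled per-frame
    BOSW vectors (all_bosw is an (N_frames, K) array)."""
    return MiniBatchKMeans(n_clusters=M, random_state=0).fit(all_bosw).cluster_centers_

def bofw_vector(bosw_sequence, frame_codebook):
    """Bag-of-Frame-Words histogram Z for one video."""
    Z = np.zeros(frame_codebook.shape[0])
    for V_k in bosw_sequence:
        d = np.linalg.norm(frame_codebook - V_k, axis=1)  # distance to each D_m
        Z[d.argmin()] += 1
    return Z
```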

Using the BOFW representations of all the test videos and the queries, rank the test videos for each query. Return the ranked lists of test videos as before, now in files named “query.II-b.$K$.output” (note the II-b in the name).

Data and hints

The data, which includes both the test and query sets, can be downloaded here.

You are allowed to use any implementation of SIFT you are comfortable with. Some hints on how to do this with a well-known MATLAB-based SIFT feature extractor are given here.

Due date

The assignment is due on 30th April 2015. The solutions must be emailed to Bhiksha and Pallavi. Please send the solutions as a zipfile called yourandrewid_mlsp_hw3b.zip. This should include: