In this problem we will evaluate how $K$-means clustering can be used to retrieve videos from examples.
You are provided a “test” collection of 3000 videos (for the purposes of this assignment, the test set also serves as the “training” set for computing codebooks, as described later). Fifty of these videos are songs by various artists/groups. As “queries” you are provided one example song by each of these artists. You must use these examples to retrieve the remaining songs by each artist from the collection.
Specifically, given any example recording by an artist, you must rank the 3000 videos in terms of their similarity to the provided example. If you do this right, the top-ranked entries will include other songs by the artist.
To do this, you must convert each video into a “feature” vector $V_i$ (where $V_i$ represents the feature vector for the $i^{\rm th}$ video). You will then similarly convert your query video to a vector $Q$. Subsequently you will rank all of the videos according to the cosine distance $D_i = \frac{V_i^\top Q}{\|V_i\|\,\|Q\|}$ (note that, despite the name, larger values of $D_i$ indicate greater similarity).
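The ranking step can be sketched as below. This is a minimal NumPy illustration, not a required implementation; the function name is hypothetical, and the small epsilon guarding against zero-norm vectors is an added safety assumption.

```python
import numpy as np

def rank_by_cosine(Q, V):
    """Rank videos by cosine similarity to the query.

    Q : (d,) query feature vector
    V : (n, d) matrix whose rows are the video feature vectors V_i
    Returns video indices sorted from most to least similar.
    """
    # D_i = (V_i . Q) / (||V_i|| ||Q||); epsilon avoids division by zero
    sims = (V @ Q) / (np.linalg.norm(V, axis=1) * np.linalg.norm(Q) + 1e-12)
    return np.argsort(-sims)  # descending similarity
```

The returned index order is exactly the ranked list you will write to the output files.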
The actual problem is to arrive at the vector representation $V_i$ for the videos (and $Q$ for the query). This problem has two parts, in which we will investigate two ways of doing so.
As a first step, you must extract a set of “SIFT” vectors from every key frame of every video. SIFT (Scale-Invariant Feature Transform) is a feature-extraction method for images that extracts one vector “descriptor” for each “keypoint” in the image. A keypoint is an automatically identified salient location in the image. The SIFT descriptor is invariant to the scale, translation, and orientation of the keypoint. Thus each image becomes a collection of SIFT descriptors.
The video itself thus becomes a sequence of collections of SIFT vectors, one per keyframe. We will use this characterization to generate our vector representation of the videos.
Use $K$-means clustering to generate a codebook of size $K$ (to be specified shortly) from the set of SIFT feature vectors (for all the frames of all the training videos). Let $C_1,\cdots,C_K$ represent the codewords. $K$-means effectively partitions the space into $K$ regions. The codewords represent the mean of all the vectors that fall in each region.
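The codebook-training step could be sketched as below. This is a minimal NumPy rendering of Lloyd's algorithm; in practice you would likely use an off-the-shelf K-means implementation. The function name and the deterministic farthest-point initialisation are assumptions for illustration.

```python
import numpy as np

def kmeans_codebook(X, K, iters=20):
    """Lloyd's K-means sketch: returns K codewords (region means) for rows of X."""
    X = np.asarray(X, dtype=float)
    # deterministic farthest-point initialisation (an illustrative choice)
    C = [X[0]]
    while len(C) < K:
        d = ((X[:, None, :] - np.asarray(C)[None, :, :]) ** 2).sum(-1).min(1)
        C.append(X[d.argmax()])
    C = np.asarray(C)
    for _ in range(iters):
        # assign each vector to its nearest codeword (squared Euclidean)
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        # move each codeword to the mean of the vectors assigned to it
        for k in range(K):
            if (labels == k).any():
                C[k] = X[labels == k].mean(0)
    return C
```

Note that the brute-force distance computation above is only workable for small data; for the full set of SIFT descriptors you would want a batched or library implementation.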
For each video, do the following: assign every SIFT descriptor (pooled from all of the video's keyframes) to its nearest codeword, and count how many descriptors fall on each codeword. The resulting $K$-dimensional histogram is the Bag-of-SIFT-Words (BOSW) vector for the video.
Compute a separate vector for every video in the test set. Also compute the vector for each of the supplied queries. Rank the test videos based on their similarity to the query as mentioned above. For each query file named “QXX.mp4” write out the numeric IDs of the test videos in the order in which they were ranked into a file called “query.II-a.$K$.output”, one ID per line. Return all the output files (10 of them) as your output ($K$ in the filename is a number, as given below).
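The per-video BOSW vector (referred to in part “b” below as accumulating SIFT descriptors from all frames into a single representation) amounts to a quantise-and-count operation. A minimal NumPy sketch, with a hypothetical function name and a brute-force nearest-codeword search:

```python
import numpy as np

def bosw(descriptors, codebook):
    """Bag-of-SIFT-Words: histogram of nearest-codeword assignments.

    descriptors : (n, d) SIFT descriptors pooled from all keyframes of one video
    codebook    : (K, d) codewords from K-means
    Returns a K-dimensional count vector.
    """
    # squared Euclidean distance from each descriptor to each codeword
    d = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    labels = d.argmin(1)                       # nearest codeword per descriptor
    return np.bincount(labels, minlength=len(codebook))
```

The resulting count vectors can be fed directly into the cosine-distance ranking described above.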
You can vary the $K$ for different answers. You may also vary the manner in which you compute the score for ranking (change the cosine distance to something else, for instance). You may also change the SIFT descriptor specifications.
You will not be provided an answer key. Instead, we will put up a leaderboard of the best submissions so far. Submissions will be ranked on the basis of “Mean Average Precision” (MAP), a standard metric for measuring the accuracy of information retrieval systems.
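For reference, the Average Precision of a single ranked list can be computed as below (MAP is the mean of this value over all queries). The function name and list-of-IDs interface are assumptions for illustration:

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of precision@k over ranks k that hold a relevant item."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for k, vid in enumerate(ranked_ids, start=1):
        if vid in relevant:
            hits += 1
            precisions.append(hits / k)  # precision at this rank
    return sum(precisions) / max(len(relevant), 1)
```

This rewards rankings that place the other songs by the query's artist near the top of the list.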
Mail your outputs to Pallavi in a zipfile named “YourName.AnswerVersion.zip”. If you submit multiple files with the same version name, only the first one will be considered.
In part “a” the BOSW representation accumulated SIFT descriptors from all frames into a single, common representation. For part “b” we will maintain the distinction between frames.
We will begin by computing a codebook $C_1,\cdots,C_J$ of SIFT descriptors as before.
Now compute a separate BOSW vector for every frame of each video: assign each of the frame's SIFT descriptors to its nearest codeword among $C_1,\cdots,C_J$ and histogram the assignments, yielding a $J$-dimensional BOSW vector for that frame.
Each video now becomes a sequence of BOSW vectors. Do this for every video in your test set as well as for the queries.
Now, using the collection of BOSW vectors from all the videos in the test set, train a second codebook $D_1, D_2, \cdots, D_M$ of $M$ codewords. Note that these are now codewords over BOSW vectors. In effect, this is a deep codebook.
Now, for each video, compute a Bag of Frame Words (BOFW) representation as follows: assign each of the video's frame-level BOSW vectors to its nearest codeword among $D_1,\cdots,D_M$ and count the assignments, yielding an $M$-dimensional histogram that represents the entire video.
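Under the reading that the BOFW vector histograms a video's frame-level BOSW vectors against the second codebook $D_1,\cdots,D_M$, the second quantisation level can be sketched as below (the function name is hypothetical; this mirrors the first-level quantise-and-count, just applied to BOSW vectors instead of SIFT descriptors):

```python
import numpy as np

def bofw(frame_bosw, frame_codebook):
    """Bag-of-Frame-Words over the second-level codebook D_1..D_M.

    frame_bosw     : (T, J) one BOSW vector per keyframe of the video
    frame_codebook : (M, J) codewords learned over BOSW vectors
    Returns an M-dimensional count vector: the video's BOFW representation.
    """
    # nearest second-level codeword for each frame, then count assignments
    d = ((frame_bosw[:, None, :] - frame_codebook[None, :, :]) ** 2).sum(-1)
    return np.bincount(d.argmin(1), minlength=len(frame_codebook))
```

In effect the video is first summarised frame-by-frame, then the frames themselves are quantised, which is why the handout calls this a deep codebook.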
Using the BOFW representations of all the videos in the test set and the query, rank the test videos for each query. Return the ranked list of test videos as before, now named “query.II-b.$K$.output” (note the name II-b).
The data, which includes both the test and query sets, can be downloaded here.
You are allowed to use any implementation of SIFT you are comfortable with. Some hints on how to do this with a well-known MATLAB-based SIFT feature extractor are given here.
The assignment is due on 30th April 2015. The solutions must be emailed to Bhiksha and Pallavi. Please send the solutions as a zipfile called yourandrewid_mlsp_hw3b.zip. This should include: