In this problem we will consider shift-invariant mixtures of multi-variate multinomial distributions.
Consider data that have multiple discrete attributes. "Discrete" attributes are attributes that can take only one of a countable set of values. We will consider discrete attributes of a particular kind -- integers that have not only a natural rank ordering, but also a definite notion of distance.
Let $(X,Y)$ be the pair of discrete attributes defining any data instance. Since both $X$ and $Y$ are discrete, the probability distribution of $(X,Y)$ is a bi-variate multinomial.
We describe $(X,Y)$ as the outcome of generation by the following process:
The process has at its disposal several urns. Each urn has three sub-urns inside it. The first sub-urn represents a bi-variate multinomial: it contains balls, such that each ball has an $(X,Y)$ value marked on it. The second sub-urn represents a uni-variate multinomial -- it contains balls, such that each ball has a $Y$ value marked on it. The third sub-urn too represents a uni-variate multinomial -- it contains balls, such that each ball has an $X$ value marked on it.
Drawing procedure: At each draw, the drawing process performs the following operations:
1. Select one of the urns at random; let $Z$ denote the identity of the selected urn.
2. Draw a ball from the first sub-urn of urn $Z$, and note its $(x, y)$ value.
3. Draw a ball from the second sub-urn of urn $Z$, and note its value $y'$ (a shift along $Y$).
4. Draw a ball from the third sub-urn of urn $Z$, and note its value $x'$ (a shift along $X$).
The final observation is $(X, Y) = (x + x', y + y')$.
Give the expression for $P(X,Y)$ in terms of $P(Z)$, $P(X,Y|Z)$, $P(X|Z)$, and $P(Y|Z)$.
You are given a histogram of counts $H(X,Y)$ obtained from a large number of observations. $H(X,Y)$ represents the number of times $(X,Y)$ was observed. Give the EM update rules to estimate $P(Z)$, $P(X,Y|Z)$, $P(X|Z)$, and $P(Y|Z)$.
The image provided with this problem represents a histogram (the value of any pixel at position $(X,Y)$, which ranges from 0 to 255, is viewed as the count of ``light elements'' at that position). We model this distribution as a shift-invariant mixture of 4 components (large urns). Specifically, we assume that within each $(X,Y)$ sub-urn, $X$ can take integer values 0--90, and $Y$ can take integer values 0--90. The $X$ values in the $X$ sub-urns can range from 0 to (width of picture $-$ 90), and the $Y$ values in the $Y$ sub-urns can take values in the range 0 to (height of picture $-$ 90).
Estimate and plot $P(X,Y|Z)$. You will need your solution to problem 2 for this problem. If the solution to problem 2 is incorrect, the solution to problem 3 will not be considered for credit.
In this problem we will evaluate how $k$-means clustering can be used to retrieve videos from examples.
You are provided a “test” collection of 3000 videos (For the purposes of this assignment, the test set will also be a “training” set for computing codebooks as described later). A small number of these are songs by various artists/groups. As a “query” you are provided one example each of songs by each of these artists. You are required to use these examples to retrieve the rest of the songs by each of these artists from the collection.
Specifically, given any example recording by an artist, you must rank the 3000 videos in terms of their similarity to the provided example. If you do this right, the top-ranked entries will include other songs by the artist.
To do this, you must convert each video into a “feature” vector $V_i$ (where $V_i$ represents the feature vector for the $i^{\rm th}$ video). You will then similarly convert your query video to a vector $Q$. Subsequently you will rank all of the videos according to the cosine similarity $D_i = \frac{V_i^\top Q}{\|Q\|\,\|V_i\|}$, with the most similar videos ranked first.
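The ranking step is straightforward once the feature vectors exist; a minimal sketch (with made-up 3-dimensional feature vectors standing in for the real ones) is:

```python
import numpy as np

def rank_by_cosine(V, q):
    """Rank the rows of V (one feature vector per video) by cosine
    similarity to the query vector q. Returns video indices ordered
    from most to least similar."""
    V = np.asarray(V, dtype=float)
    q = np.asarray(q, dtype=float)
    sims = (V @ q) / (np.linalg.norm(V, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)  # negate for descending order

# Tiny illustrative example (vectors are invented, not real features).
V = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0],
              [0.0, 1.0, 0.0]])
q = np.array([1.0, 0.0, 0.0])
order = rank_by_cosine(V, q)  # → [0, 1, 2]
```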
The actual problem is to arrive at the vector representation $V_i$ for the videos (and $Q$ for the query). This problem has two parts, in which we will investigate two ways of doing so.
As a first step, you must extract a set of “SIFT” vectors from every key frame of every video. SIFT (or Scale-Invariant Feature Transform) is a feature-extraction method for images that extracts one vector “descriptor” for each “keypoint” in the image. A keypoint is an automatically-identified key feature in the image. The SIFT descriptor is invariant to scaling, shift or orientation of the keypoint. Thus each image becomes a collection of SIFT descriptors.
The video itself thus becomes a sequence of collections of SIFT vectors, one collection per keyframe. We will use this characterization to generate our vector representation of the videos.
Use $K$-means clustering to generate a codebook of size $K$ (to be specified shortly) from the set of SIFT feature vectors (for all the frames of all the training videos). Let $C_1,\cdots,C_K$ represent the codewords. $K$-means effectively partitions the space into $K$ regions; each codeword is the mean of all the vectors that fall in its region.
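A minimal Lloyd's-algorithm implementation of this codebook step might look as follows (the 2-dimensional synthetic data stands in for real 128-dimensional SIFT descriptors; in practice a library implementation such as scikit-learn's `KMeans` would be used):

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    """Minimal Lloyd's algorithm: returns K codewords (cluster means)
    and the cluster label of each input vector."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=K, replace=False)]  # init: random points
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each vector to its nearest codeword (squared Euclidean).
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each codeword as the mean of its assigned vectors.
        for k in range(K):
            if (labels == k).any():
                C[k] = X[labels == k].mean(axis=0)
    return C, labels

# Synthetic stand-in for SIFT descriptors (real ones are 128-dimensional).
X = np.vstack([np.random.default_rng(1).normal(m, 0.1, (50, 2))
               for m in (0.0, 5.0)])
C, labels = kmeans(X, K=2)
```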
For each video do the following:
Compute a separate vector for every video in the test set. Also compute the vector for each of the supplied queries. Rank the test videos based on their similarity to the query as mentioned above. For each query file named “query.mp4” write out the numeric IDs of the test videos in the order in which they were ranked into a file called “query.II-a.$K$.output”, one ID per line. Return all the output files (10 of them) as your output ($K$ in the filename is a number, as given below).
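Assuming the standard bag-of-words construction (pool the SIFT descriptors from all keyframes of a video, assign each to its nearest codeword, and histogram the assignments), the per-video vector can be sketched as; the normalization to unit total is one design choice among several:

```python
import numpy as np

def bosw_vector(descriptors, codebook):
    """Bag-of-SIFT-words: histogram of nearest-codeword assignments.

    descriptors: (N, d) array of SIFT descriptors pooled from all keyframes
    codebook:    (K, d) array of k-means codewords
    """
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)
    counts = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return counts / counts.sum()  # normalize so videos of any length compare

# Made-up example: 6 two-dimensional "descriptors", 3 codewords.
desc = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0],
                 [5.1, 4.9], [9.9, 0.1], [0.0, 0.2]])
book = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
v = bosw_vector(desc, book)  # → [0.5, 1/3, 1/6]
```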
You can vary the $K$ for different answers. You may also vary the manner in which you compute the score for ranking (change the cosine distance to something else, for instance). You may also change the SIFT descriptor specifications.
You will not be provided an answer key. Instead, we will put up a leader board of the best submissions so far. Submissions will be ranked on the basis of “Mean Average Precision”, which we will describe on the answers page. MAP is a standard metric for measuring the accuracy of information retrieval systems.
Mail your outputs to Pallavi in a zipfile named “YourName.AnswerVersion.zip”. If you submit multiple files with the same version name, only the first one will be considered.
In part “a” the BOSW representation accumulated SIFT descriptors from all frames into a single, common representation. For part “b” we will maintain the distinction between frames.
We will begin by computing a codebook $C_1,\cdots,C_K$ of SIFT descriptors as before.
Now compute a separate BOSW vector for every frame of each video as follows.
Each video now becomes a sequence of BOSW vectors. Do this for every video in your test set, as well as for the query.
Now, using the collection of BOSW vectors from all the videos in the test set, train a second codebook $D_1, D_2, \cdots, D_M$ of $M$ codewords. Note that these are now codewords over BOSW vectors. In effect, this is a deep codebook.
Now for each video, compute a Bag of Frame Words (BOFW) representation as follows
Using the BOFW representations of all the videos in the test set and the query, rank the test videos for each query. Return the ranked list of test videos as before, now named “query.II-b.$K$.output” (note the name II-b).
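The second quantization level described above can be sketched as follows, assuming each video has already been reduced to per-frame BOSW vectors; the frame vectors and second-level codebook below are invented for illustration:

```python
import numpy as np

def quantize(vectors, codebook):
    """Index of the nearest codeword (Euclidean) for each row of `vectors`."""
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bofw_vector(frame_bosw, D):
    """Bag of Frame Words: histogram the frame-level BOSW vectors of one
    video over the second codebook D (one BOSW vector per keyframe)."""
    counts = np.bincount(quantize(frame_bosw, D),
                         minlength=len(D)).astype(float)
    return counts / counts.sum()

# Made-up example: 4 frames, BOSW dimension 2, M = 2 second-level codewords.
frames = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.85, 0.15]])
D = np.array([[1.0, 0.0], [0.0, 1.0]])
v = bofw_vector(frames, D)  # → [0.75, 0.25]
```

The resulting BOFW vectors are then ranked against the query's BOFW vector by cosine similarity, exactly as in part "a".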