CARNEGIE MELLON UNIVERSITY
15-826 - Multimedia databases and data mining
Fall 2011 C. Faloutsos

Homework 3
Out: Nov. 10 2011
Due: Nov. 22 2011, before 12 noon (class time), via e-mail to the TA

Reminders:

Please e-mail your solutions in a single e-message, to Ina (mfiterau at cs dot cmu dot edu) with the subject: [Databases - Homework 3 - ANDREWID].

Q1.  Map-Reduce on directed graphs (Java Hadoop) [30 points]

Purpose: The goal is to become familiar with the map/reduce method for mining large datasets. Hadoop is an open-source implementation of map/reduce.

Description:
You are given the following directed graph data. Each line in the file represents an arc in the graph (source, destination) of the 'epinions' dataset we saw in the lectures (who-trusts-whom). The goal is to use the Hadoop API (release 0.20.205.0) to run several computations. Normally, MapReduce code runs on clusters, but for the purposes of this homework, you will run the code in standalone mode on a single machine.

Preparation:

See this example to get started. For many more details (which you won't need for this homework), see this extensive tutorial.

Deliverables:
  1. [5 points] Turn in the output you get when running the 'hadoop_wordcount' code.
  2. [3 points] As you probably saw, the map function in the provided code makes a list of the vocabulary words and their occurrence counts. Change the map function in the provided code so that only words that start with a vowel are considered. Submit your code and the result.
  3. [2 points] Now change the code to count the total number of words in the text. Submit your code and the result.
  4. [5 points] For the graph data, write code to determine the number of edges in the graph. Turn in the code and the result you get when running your code on the graph data.
  5. [5 points] Count all vertices in the graph. Turn in the code and the result you get when running your code on the graph data provided.
  6. [10 points] List the degrees of all nodes in the graph, in the format below. The two columns should be comma-separated. No sorting is needed; submit the code and the result in a CSV file.
    node_id degree
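
The map and reduce steps behind the graph deliverables can be sketched without Hadoop, as below. This is plain Python, not the Hadoop Java API, and it rests on assumptions that are not stated in the handout: each line holds two whitespace-separated node ids, lines starting with '#' are comments, and the degree of a node is its in-degree plus its out-degree. The function names are hypothetical.

```python
from collections import defaultdict

def map_edge(line):
    """Map step: emit (node_id, 1) for both endpoints of one arc."""
    if line.startswith("#"):  # assumption: '#' marks a comment line
        return []
    parts = line.split()
    if len(parts) != 2:       # skip blank or malformed lines
        return []
    src, dst = parts
    return [(src, 1), (dst, 1)]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_degree(node_id, counts):
    """Reduce step: sum the 1s emitted for one node."""
    return node_id, sum(counts)

def degree_list(lines):
    pairs = [kv for line in lines for kv in map_edge(line)]
    return dict(reduce_degree(k, v) for k, v in shuffle(pairs).items())

# Tiny made-up edge list, for illustration only.
edges = ["1 2", "1 3", "2 3", "3 1"]
degrees = degree_list(edges)
for node_id, degree in sorted(degrees.items()):
    print(f"{node_id},{degree}")
```

Under the same assumptions, the number of vertices is the number of keys in `degrees`, and the number of edges is half the sum of all degrees.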

Q2.  Fourier / Wavelets on noisy data [30 points]

Purpose: In practice, we are often given one or more time sequences, and we have to find regularities. Use the DSP methods of the lectures to find as many patterns as you can, in a seemingly random time sequence.

Description:
You are given a noisy time sequence with 2048 timeticks. The time-sequence consists mainly of noise (random value, at every time-tick), except that we have inserted k signals inside it.  The goal is for you to discover those 'buried' signals that we tried to hide. You can use any/all of the methods we saw in class, including, but not limited to:
Here are some examples of functions f(t) you may (or may not) find inside the input sequence - the list is by no means exhaustive:
Hints:
Deliverables:
  1. [1 point] Plot the time sequence. Do you see any pattern?
  2. [8 points] How many buried signals can you recover? That is, submit the value of k.
  3. [21 points] For each of the k 'buried' signals you discovered,
    1. Describe briefly how you discovered it: e.g., give the plot (e.g., DFT, DWT) and your verbal justification, and
    2. give as much information about it as you can: shape (e.g., sinusoid, chirp, etc.), amplitude, phase, start time, end time. Notice that for some signals, you won't be able to give the full equation, because the noise we added is too overpowering.
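
As an illustration of how a DFT amplitude spectrum can expose a buried sinusoid, here is a minimal sketch in plain Python. Everything about the test signal is an assumption made for the example (length 256, uniform noise, one sinusoid of amplitude 1 at frequency bin 20); it is not the homework sequence, and the naive O(N^2) transform is used only to keep the sketch free of external libraries.

```python
import cmath
import math
import random

def dft_magnitudes(x):
    """Naive O(N^2) DFT; returns |X[k]| for k = 0 .. N/2 - 1."""
    n = len(x)
    mags = []
    for k in range(n // 2):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

random.seed(0)
N = 256     # assumed length, shorter than the homework's 2048 timeticks
FREQ = 20   # assumed frequency (cycles per N samples) of the buried sinusoid

# Uniform noise in [-1, 1] plus one sinusoid of amplitude 1.
signal = [random.uniform(-1.0, 1.0) + math.sin(2 * math.pi * FREQ * t / N)
          for t in range(N)]

mags = dft_magnitudes(signal)
# Skip bin 0 (the mean) and pick the strongest remaining frequency.
peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
# For a sinusoid spanning the whole window, amplitude is about 2|X[k]|/N.
amplitude_estimate = 2.0 * mags[peak_bin] / N
print(peak_bin, round(amplitude_estimate, 2))
```

A signal that is present only for part of the sequence spreads its energy over neighbouring bins, so a short-window or wavelet view is usually needed to estimate its start and end time.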

Q3.  SVD [30 points]

Purpose: We are often given an m-dimensional cloud of points, and we need to find patterns in it. SVD  spots correlations, and it can help us visualize the cloud.

Description: Download the cloud of 1000 points in m = 10 dimensions. It could be, say, 1000 patients, with 10 numerical attributes each (body height, body weight, blood pressure, etc.). The goal is (a) to visualize it somehow and (b) to discover the linear correlations among the m dimensions.

Hints: We generated each tuple (x1, ... x10)  using equations similar (but not identical) to the ones below:

x1 = g1(u,v) = arctan (u)                       // (u= random(0,1), v= random(0,1) )
x2 = g2(u,v) = sinh( u + v )
x3 = g3(u,v) = cosh (v)
x4 = 3 x1 + 2 x2 + 5 x3                       // remaining attributes: linear combinations of the initial 3
x5 =    x1 -     x2 -     x3
....
xi = ai1 x1 + ai2 x2 + ai3 x3                // i=4, ... 10

Thus, the above equations would generate a cloud with f=3 linear degrees of freedom; if we somehow manage to discover those 3 axes, we can do the 3-choose-2 pair-wise scatter-plots ('pair-plots'), and get a good idea of what the cloud looks like.
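
The claim that such a cloud has 3 linear degrees of freedom can be checked numerically with a small sketch. It generates points from the example equations above (only x1 to x5, since the remaining coefficients are not given), and obtains the singular values as square roots of the eigenvalues of A^T A, computed with a hand-written Jacobi rotation routine so that no external library is needed. The sample size, the random seed, the relative threshold of 1e-9, and the choice not to center the data are all assumptions of this sketch.

```python
import math
import random

def gram_matrix(rows):
    """Return A^T A for a data matrix given as a list of rows."""
    m = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]

def jacobi_eigenvalues(a, sweeps=30):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    n = len(a)
    a = [row[:] for row in a]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-300:
                    continue  # already (numerically) zero
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def make_point(rng):
    """One tuple (x1..x5) following the example equations in the hints."""
    u, v = rng.random(), rng.random()
    x1, x2, x3 = math.atan(u), math.sinh(u + v), math.cosh(v)
    return [x1, x2, x3, 3 * x1 + 2 * x2 + 5 * x3, x1 - x2 - x3]

rng = random.Random(0)
cloud = [make_point(rng) for _ in range(200)]

eigs = jacobi_eigenvalues(gram_matrix(cloud))
singular_values = [math.sqrt(max(e, 0.0)) for e in eigs]
# Count eigenvalues that are not negligible relative to the largest one.
f = sum(1 for e in eigs if e > 1e-9 * eigs[0])
print(singular_values, f)
```

On real data the small singular values are rarely this close to zero, so a plot of the singular values in decreasing order is typically used to judge where the drop occurs.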

Deliverables:
  1. [15 points] How many linear degrees of freedom f are there in the dataset? Turn in the plots that helped you reach your conclusion and explain your reasoning, briefly.
  2. [10 points] Give the pair-wise scatterplots for the top f principal components. (You may use the command 'pairs()' in R, or a similar command in MATLAB.)
  3. [5 points] Describe the base functions g1(), g2(), ... gf() as much as you can (e.g., do they show a trend? a periodicity? an exponential growth? a zipf-like distribution?)

Q4.  ICA on Mixed Signal [10 points]

Purpose: SVD can find the latent topics (signals), but ICA (independent component analysis) usually does an even better job of separating them. The goal of this exercise is to make us familiar with this powerful technique. ICA is also called 'Blind source separation' (BSS), because it can often solve the 'cocktail party problem': several people are speaking simultaneously, but we are able to isolate and lock on to the discussion of interest.

Description: Consider this dataset, of m=5 signals (columns), each of T=1000 timeticks (rows). It tries to simulate a 'cocktail party' setting, where we have k=3 speakers, and m=5 listeners, each receiving a linear combination of the original 3 signals. Specifically, if x_{i,t} is the value of the i-th speaker at time t, listener j receives value y_{j,t}, which is given by a function like
    y_{j,t} = C1 * x_{1,t} + C2 * x_{2,t} + C3 * x_{3,t}
Use the Fast ICA toolkit to recover the original 3 signals of the speakers. You may use it in any of the languages provided.
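
To make the mixing model concrete, the sketch below builds a T x m table of listener signals from three source signals. The source shapes (a sine, a square wave, a sawtooth) and the mixing weights are hypothetical, chosen only for illustration; the sketch also assumes that each listener has its own set of weights (C1, C2, C3). It does not perform ICA itself.

```python
import math

T = 1000  # timeticks (rows), as in the dataset description
K = 3     # speakers
M = 5     # listeners (columns)

def sources(t):
    """Three hypothetical speaker signals at time t (illustrative shapes only)."""
    sine = math.sin(2 * math.pi * t / 50)
    square = 1.0 if (t // 40) % 2 == 0 else -1.0
    sawtooth = (t % 70) / 35.0 - 1.0
    return [sine, square, sawtooth]

# Hypothetical mixing weights: row j holds listener j's (C1, C2, C3).
MIXING = [
    [1.0, 0.5, 0.2],
    [0.3, 1.0, 0.6],
    [0.7, 0.1, 1.0],
    [0.4, 0.9, 0.3],
    [0.8, 0.2, 0.5],
]

def mix(x_t):
    """Apply y_{j,t} = C1 * x_{1,t} + C2 * x_{2,t} + C3 * x_{3,t} for every listener j."""
    return [sum(c * x for c, x in zip(row, x_t)) for row in MIXING]

# One row per timetick, one column per listener, like the homework dataset.
mixed = [mix(sources(t)) for t in range(T)]
print(len(mixed), len(mixed[0]))
```

An ICA routine takes a table shaped like `mixed` as input and returns estimates of the source signals, generally only up to scaling and ordering.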

Deliverables:

Last edited by Ina Fiterau and Christos Faloutsos, Nov. 8, 2011.