CARNEGIE MELLON UNIVERSITY
Out: Nov. 10, 2011
Due: Nov. 22, 2011, before 12 noon (class time), via e-mail to the TA
15-826 - Multimedia databases and data mining
Fall 2011 C. Faloutsos
Please e-mail your solutions in a single e-message, to Ina (mfiterau at cs dot cmu dot edu) with the subject: [Databases - Homework 3 - ANDREWID].
- Weight: 1/3 of the weight for homeworks, i.e., 3.33% of the course grade. The maximum score is 100 points.
- Policy: Like all homeworks, this has to be done individually - not in groups
- Policy: Please type all your answers.
- Rough estimate for time to completion: 15 hours = 5 hours (Hadoop) + 5 hours (Fourier/Wavelets) + 4 hours (SVD) + 1 hour (ICA)
- Your write-up, including plots and numerical results, should be attached, and named Databases_Homework3_ANDREWID.pdf
- Your code should be archived and submitted with the name Databases_Homework3_ANDREWID.zip. For every piece of code, there should be a makefile, so that typing 'make' runs your code and produces the results.
Q1. Map-Reduce on directed graphs (Java Hadoop) [30 points]
Purpose: The goal is to become familiar with the map/reduce method for mining
large datasets. Hadoop is an open-source implementation of map/reduce.
You are given the following directed graph data.
Each line in the file represents an arc in the graph (source,
destination) of the 'epinions' dataset we saw in the lectures
(who-trusts-whom). The goal is to use the Hadoop API (release
to run several computations. Normally, MapReduce code runs on
clusters, but for the purposes of this homework, you will run the code
in standalone mode on a single machine.
See this example to get started. For many more details (which you won't need for this homework), see this extensive tutorial.
- Download Hadoop from one of these mirrors and install it following these instructions - you only need the few commands in the 'Standalone Operation' section. Hadoop should run on the andrew machines without additional installations.
- Important: as part of the setup process, in conf/hadoop-env.sh, the path for JAVA_HOME should be set to /usr
- Download/unzip this code archive.
- Run the example 'hadoop_wordcount' by typing make.
- [5 points] Turn in the output you get when running the 'hadoop_wordcount' code.
- [3 points] As you probably saw, the map function in the provided code makes a list of the vocabulary words and their occurrence counts. Change the map function in the provided code, such that only words that start with a vowel are considered. Submit your code and the result.
- [2 points] Now change the code to count the total number of words in the text. Submit your code and the result.
- [5 points] For the graph data, write code to determine the number of edges in the graph. Turn in the code and the result you get when running your code on the graph data.
- [5 points] Count all vertices in the graph. Turn in the code and the result you get when running your code on the graph data provided.
- [10 points] Make a list with the format below to list the degrees of all nodes in the graph. The two columns should be comma separated. No sorting is needed; submit the code and the result in a csv file.
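To make the map/reduce pattern behind the edge, vertex, and degree counts concrete, here is a minimal in-memory sketch in Python. It is not the Hadoop API and not the expected Java solution. The names `map_edge`, `reduce_counts` and `run_map_reduce` are hypothetical. The sketch assumes each input line holds a whitespace-separated "source destination" pair, and it takes "degree" to mean in-degree plus out-degree. Both are assumptions that should be checked against the actual data file and the lecture definition.

```python
from collections import defaultdict


def map_edge(line):
    """Map step: emit (node, 1) for both endpoints of one arc."""
    source, destination = line.split()  # assumed format: "source destination"
    yield source, 1        # counts towards the out-degree of the source
    yield destination, 1   # counts towards the in-degree of the destination


def reduce_counts(node, values):
    """Reduce step: add up all the 1s emitted for one node."""
    return node, sum(values)


def run_map_reduce(lines, mapper, reducer):
    """In-memory stand-in for the shuffle phase: group values by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Made-up arcs for illustration, not the epinions data.
edges = ["1 2", "1 3", "2 3", "3 1"]
degrees = run_map_reduce(edges, map_edge, reduce_counts)

for node, degree in sorted(degrees.items()):
    print(f"{node},{degree}")  # comma-separated, one node per line
```

The same driver covers the other counts: the number of keys in `degrees` is the number of vertices, and a mapper that emits a single constant key per line yields the number of edges.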
Q2. Fourier / Wavelets on noisy data [30 points]
Purpose: In practice, we are often given one or more time sequences, and we have
to find regularities. Use the DSP methods of the lectures to find as
many patterns as you can, in a seemingly random time sequence.
You are given a noisy time sequence
with 2048 timeticks. The time-sequence consists mainly of noise (random
value, at every time-tick), except that we have inserted k
signals inside it. The goal is for you to discover those 'buried'
signals that we tried to hide. You can use any/all of the methods we
saw in class, including, but not limited to:
- DFT (Amplitude spectrum),
- (SWFT) short-window Fourier transform, with a window-size of your choice, or even several window-sizes, if you deem necessary.
- DWT (wavelets), Haar or any other wavelet basis.
Here are some examples of functions f(t) you may (or may not) find
inside the input sequence - the list is by no means exhaustive:
- Sinusoid, eg., f1(t) = sin(2 pi t / 100 + 20), t = [10, 100]
- Chirp, eg., f2(t) = 3 * sin(2 pi t^2 / 70), t = [120, 150]
- Ramp, eg., f3(t) = 20 * (t - 50), t = [250, 400]
- Triangle, eg., 0 1 2 3 4 5 4 3 2 1 0
- Sawtooth, eg., 0 1 2 3 0 1 2 3 0 1 2 3
- Staircase, eg., 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4
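To get a feel for these shapes, it can help to generate them yourself and look at their transforms before adding noise. The sketch below embeds the three example formulas into an otherwise empty sequence of 2048 timeticks. It assumes that "t2" in the chirp means t squared and that each interval is inclusive. The helper name `embed` is hypothetical.

```python
import math


def sinusoid(t):
    return math.sin(2 * math.pi * t / 100 + 20)


def chirp(t):
    # Assumes the chirp argument is t squared, so the frequency grows with t.
    return 3 * math.sin(2 * math.pi * t ** 2 / 70)


def ramp(t):
    return 20 * (t - 50)


def embed(sequence, f, start, end):
    """Add f(t) to the sequence for start <= t <= end (inclusive)."""
    for t in range(start, end + 1):
        sequence[t] += f(t)


signal = [0.0] * 2048
embed(signal, sinusoid, 10, 100)
embed(signal, chirp, 120, 150)
embed(signal, ramp, 250, 400)
```

Adding `random.gauss(mean, sigma)` to every timetick afterwards gives a test sequence whose buried signals you already know.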
- You may want to generate white noise of similar average and
variance, and visually (or mathematically) compare the
Fourier/wavelet/etc transform of your pure noise against that of the given time sequence.
- Once you find a way to spot the typical values of the
coefficients under white noise, set them to zero, and do the inverse
fourier/wavelet/etc transform, to recover the 'buried' signals.
- the number k of buried signals is small (<=5)
- we recommend using matlab with the Wavelet toolkit. Matlab is available on the Andrew machines.
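The "zero out the noise-level coefficients, then invert" hint can be sketched as follows, using a direct O(n^2) DFT from the Python standard library. Everything specific here is an assumption for illustration: the synthetic sequence of 256 timeticks, the buried sinusoid, the median amplitude as the "typical" noise level, and the factor 5 for the threshold. For the real sequence of 2048 timeticks, an FFT routine (e.g., in matlab) is the practical choice.

```python
import cmath
import math
import random


def dft(x):
    """Direct discrete Fourier transform (O(n^2), fine for short sequences)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def inverse_dft(coefficients):
    """Inverse DFT; returns the real part, since the input sequence was real."""
    n = len(coefficients)
    return [sum(coefficients[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


random.seed(0)
n = 256
clean = [3 * math.sin(2 * math.pi * 8 * t / n) for t in range(n)]  # buried signal
noisy = [value + random.gauss(0, 1) for value in clean]            # plus white noise

spectrum = dft(noisy)
amplitude = [abs(c) for c in spectrum]

# Assumption: the median amplitude is typical of pure noise, and anything
# more than 5 times larger is treated as signal.
typical = sorted(amplitude)[n // 2]
threshold = 5 * typical

kept = [c if a > threshold else 0 for c, a in zip(spectrum, amplitude)]
recovered = inverse_dft(kept)

peaks = [k for k in range(n // 2) if amplitude[k] > threshold]
print("frequencies above the noise level:", peaks)
```

A global DFT like this works for signals that last the whole sequence. For short bursts, the same thresholding idea applies per window (short-window Fourier transform) or per wavelet coefficient.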
- [1 point] Plot the time sequence. Do you see any pattern?
- [8 points] How many buried signals can you recover? That is, submit the value of k.
- [21 points] For each of the k 'buried' signals you discovered,
- Describe briefly how you discovered it: eg., give the plot (eg., DFT, DWT) and your verbal justification and
- give as much information
about it as you can: shape (eg., sinusoid, chirp, etc), amplitude,
phase, start time, end time. Notice that for some signals, you won't be
able to give the full equation, because the noise we added is too
Q3. SVD [30 points]
Purpose: We are often given an m-dimensional
cloud of points, and we need to find patterns in it. SVD spots
correlations, and it can help us visualize the cloud.
Description: Download the cloud of 1000 points in m = 10 dimensions. It could be, say, 1000 patients, with 10 numerical
attributes each (body height, body weight, blood-pressure, etc). The
goal is (a) to visualize it somehow (b) to discover the linear
correlations among the m dimensions.
Hints: We generated each tuple (x1, ... x10) using equations similar (but not identical) to the ones below:
x1 = g1(u,v) = arctan   // (u = random(0,1), v = random(0,1))
x2 = g2(u,v) = sinh(u + v)
x3 = g3(u,v) = cosh(v)
// remaining attributes: linear combinations of the initial 3
x4 = 3 x1 + 2 x2 + 5 x3
x5 = x1 - x2 - x3
xi = ai1 x1 + ai2 x2 + ai3 x3   // i = 4, ... 10
Thus, the above equations would generate a cloud with f=3
linear degrees of freedom; if we somehow manage to discover those 3
axes, we can do the 3-choose-2 pair-wise scatter-plots ('pair-plots'),
and get a good idea of what the cloud looks like.
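The claim that such a cloud has 3 linear degrees of freedom can be checked numerically. The sketch below builds a small cloud from equations like the ones in the hints and counts its non-negligible singular values. Several details are assumptions: `atan(u)` stands in for the unspecified arctan argument, only 5 attributes and 200 points are generated, the data are not mean-centered, and the cutoff of 1e-4 relative to the largest singular value is arbitrary. The singular values are obtained as square roots of the eigenvalues of A^T A, computed with a plain Jacobi rotation method, which is a stand-in for the `svd` routine of matlab or R.

```python
import math
import random


def gram(rows):
    """Return A^T A for a data matrix A given as a list of rows."""
    m = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]


def jacobi_eigenvalues(symmetric, sweeps=30):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations."""
    n = len(symmetric)
    a = [row[:] for row in symmetric]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                sign = math.copysign(1.0, diff)
                # Rotation angle in [-pi/4, pi/4] that zeroes a[p][q].
                theta = 0.5 * math.atan2(2 * a[p][q] * sign, abs(diff))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # apply the rotation to columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # apply the rotation to rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


random.seed(0)
rows = []
for _ in range(200):
    u, v = random.random(), random.random()
    x1 = math.atan(u)               # assumed argument, not given in the hints
    x2 = math.sinh(u + v)
    x3 = math.cosh(v)
    x4 = 3 * x1 + 2 * x2 + 5 * x3   # linear combinations of the first 3
    x5 = x1 - x2 - x3
    rows.append([x1, x2, x3, x4, x5])

eigenvalues = jacobi_eigenvalues(gram(rows))
singular_values = [math.sqrt(max(ev, 0.0)) for ev in eigenvalues]
f = sum(1 for sv in singular_values if sv > 1e-4 * singular_values[0])
print("singular values:", singular_values)
print("linear degrees of freedom:", f)
```

On real data the drop is rarely this clean, so a plot of the singular values (a scree plot) is usually the more convincing evidence.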
- [15 points] How many linear degrees of freedom f are there in the dataset? Turn in the plots that helped you reach your conclusion and explain your reasoning, briefly.
- [10 points] Give the pair-wise scatterplots for the top f principal components. (you may use the command 'pairs()' in R, or similar, in matlab).
- [5 points] Describe the base functions g1(), g2(), ... gf() as much as you can (eg., do they show a trend? a periodicity? an exponential growth? a zipf-like distribution?)
Q4. ICA on Mixed Signal [10 points]
Purpose: SVD can find the latent topics (signals), but ICA (independent
component analysis) usually does an even better job of separating them.
The goal of this exercise is to make us familiar with this powerful
technique. ICA is also called 'Blind source separation' (BSS), because
it can often solve the 'cocktail party problem': several people are
speaking simultaneously, but we are able to isolate and lock-on to the
discussion of interest.
Description: Consider this dataset, of m=5 signals (columns), each of T=1000 timeticks (rows). It tries to simulate a 'cocktail party' setting, where we have k=3 speakers, and m=5
listeners, each receiving a linear combination of the original 3
signals. Specifically, if x_{i,t} is the value of the i-th speaker
at time t, listener j receives value y_{j,t} which is given by a function like
y_{j,t} = C1 * x_{1,t} + C2 * x_{2,t} + C3 * x_{3,t}
Use the Fast ICA toolkit to recover the original 3 signals of the speakers. You may use it in any of the languages provided.
- Submit the time-plots of the 3 signals ('speakers') you recovered.
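For intuition about the mixing model, the sketch below builds a synthetic 'cocktail party' of the same shape as the dataset (T=1000 rows, m=5 columns) from k=3 made-up sources. It only illustrates the linear-combination formula. It does not perform ICA and does not call the Fast ICA toolkit. The source shapes, the coefficient range, and the name `mix` are assumptions for illustration. A mixture like this, whose true sources you know, is a convenient sanity check before running the toolkit on the real data.

```python
import math
import random


def mix(sources, coefficients):
    """Apply y[t][j] = sum over i of coefficients[j][i] * sources[t][i].

    sources: T rows of k speaker values; coefficients: m rows of k weights.
    Returns T rows of m listener values.
    """
    return [[sum(c * x for c, x in zip(weights, row)) for weights in coefficients]
            for row in sources]


random.seed(0)
T, k, m = 1000, 3, 5

# Three made-up speakers: a sinusoid, a sawtooth, and uniform noise.
sources = [[math.sin(2 * math.pi * t / 50),
            (t % 40) / 40.0,
            random.uniform(-1, 1)]
           for t in range(T)]

# One row of mixing weights per listener (the C1, C2, C3 of that listener).
coefficients = [[random.uniform(0.1, 1.0) for _ in range(k)] for _ in range(m)]

observed = mix(sources, coefficients)
print(len(observed), "timeticks x", len(observed[0]), "listeners")
```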
Last edited by Ina Fiterau and Christos Faloutsos, Nov. 8, 2011.