15-826 - Multimedia databases and data mining

Fall 2011 C. Faloutsos

- Weight: 1/3 of the weight for homeworks, i.e., 3.33% of the course grade. The maximum score is 100 points.
- Policy: Like all homeworks, this has to be done individually - not in groups
- Policy: Please type all your answers.
- Rough estimate for time to completion: 15 hours = 5 hours (Hadoop) + 5 hours (Fourier/Wavelets) + 4 hours (SVD) + 1 hour (ICA)

- Your write-up, including plots and numerical results, should be attached and named **Databases_Homework3_ANDREWID.pdf**.
- Your code should be archived and submitted with the name **Databases_Homework3_ANDREWID.zip**. For every piece of code, there should be a makefile, so that 'make' runs your code and produces the results.

Description:

You are given the following directed graph data. Each line in the file represents an arc in the graph (source, destination) of the 'epinions' dataset we saw in the lectures (who-trusts-whom). The goal is to use the Hadoop API (release 0.20.205.0), to run several computations. Normally, MapReduce code runs on clusters, but for the purposes of this homework, you will run the code in

**Preparation:**

- **Download Hadoop from one of these mirrors and install it** following these instructions; you only need the few commands in the 'Standalone Operation' section. Hadoop should run on the andrew machines without additional installations. **Important:** as part of the setup process, in conf/hadoop-env.sh, the path for JAVA_HOME should be set to /usr

- Download/unzip this code archive.
- Run the example 'hadoop_wordcount' by typing make.

Deliverables:

- [5 points] Turn in the output you get when running the 'hadoop_wordcount' code.
- [3 points] As you probably saw, the map function in the provided code makes a list of the vocabulary words and their occurrence counts. Change the map function in the provided code so that only words that start with a vowel are considered. Submit your code and the result.

- [2 points] Now change the code to count the total number of words in the text. Submit your code and the result.
- [5 points] For the graph data, write code to determine the number of edges in the graph. Turn in the code and the result you get when running your code on the graph data.
- [5 points] Count all vertices in the graph. Turn in the code and the result you get when running your code on the graph data provided.
- [10 points] List the degrees of all nodes in the graph, using the format below. The two columns should be comma-separated. No sorting is needed; submit the code and the result in a csv file.
  node_id, degree
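The graph deliverables above all follow the same map/reduce pattern as word count. The following is a minimal sketch of that logic in plain Python, not Hadoop API code. It assumes whitespace-separated arcs, lines starting with '#' as comments, and "degree" meaning in-degree plus out-degree; all three are assumptions, and the function names are hypothetical.

```python
from collections import defaultdict


def map_edge(line):
    """Map step: emit (node, 1) once for each endpoint of an arc."""
    fields = line.split()
    # Assumed input format: "source destination"; '#' starts a comment line.
    if len(fields) < 2 or fields[0].startswith("#"):
        return
    yield fields[0], 1
    yield fields[1], 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def graph_stats(lines):
    """Return (number of edges, number of vertices, degree per node)."""
    pairs = [pair for line in lines for pair in map_edge(line)]
    degree = reduce_counts(pairs)
    # Each arc emits exactly two pairs, so the edge count is half the total.
    return len(pairs) // 2, len(degree), degree


arcs = ["# source destination", "0 1", "0 2", "1 2", "3 0"]
num_edges, num_vertices, degree = graph_stats(arcs)
print(num_edges, num_vertices)
for node in sorted(degree):
    print(f"{node},{degree[node]}")  # comma-separated: node_id,degree
```

In a Hadoop job, `map_edge` would correspond to the mapper and `reduce_counts` to the reducer; the framework performs the grouping by key between the two steps.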

Description:

You are given a noisy time sequence with 2048 timeticks. The time-sequence consists mainly of noise (random value, at every time-tick), except that we have inserted k signals inside it. The goal is for you to discover those 'buried' signals that we tried to hide. You can use any/all of the methods we saw in class, including, but not limited to:

- DFT (Amplitude spectrum),

- (SWFT) short-window Fourier transform, with a window-size of your choice, or even several window-sizes, if you deem necessary.
- DWT (wavelets), Haar or any other wavelet basis.

- Sinusoid, eg., f1(t) = sin(2 pi t / 100 + 20), t = [10, 100]
- Chirp, eg., f2(t) = 3 * sin(2 pi t^{2} / 70), t = [120, 150]
- Ramp, eg., f3(t) = 20 * (t - 50), t = [250, 400]
- Triangle, eg., 0 1 2 3 4 5 4 3 2 1 0

- Sawtooth, eg., 0 1 2 3 0 1 2 3 0 1 2 3

- Staircase, eg., 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4

- etc
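To get a feel for how the shapes listed above look once buried, it may help to build a synthetic sequence. The sketch below generates some of the example shapes and adds them to noise. Gaussian noise, the noise level, the start positions, and all function names are assumptions made for illustration; the actual sequence may have been built differently.

```python
import math
import random


def sinusoid(t):
    """f1 from the list above, intended for t in [10, 100]."""
    return math.sin(2 * math.pi * t / 100 + 20)


def chirp(t):
    """f2 from the list above, intended for t in [120, 150]."""
    return 3 * math.sin(2 * math.pi * t ** 2 / 70)


def triangle(peak):
    """Rise from 0 to peak, then fall back to 0."""
    return list(range(peak + 1)) + list(range(peak - 1, -1, -1))


def sawtooth(period, repeats):
    """Repeat the ramp 0 .. period-1 the given number of times."""
    return [i % period for i in range(period * repeats)]


def staircase(steps, width):
    """Hold each level 0 .. steps-1 for `width` timeticks."""
    return [level for level in range(steps) for _ in range(width)]


def bury(length, pieces, noise_std=1.0, seed=0):
    """Add each (start, values) piece to Gaussian noise (an assumed noise model)."""
    rng = random.Random(seed)
    sequence = [rng.gauss(0.0, noise_std) for _ in range(length)]
    for start, values in pieces:
        for offset, value in enumerate(values):
            sequence[start + offset] += value
    return sequence


pieces = [
    (10, [sinusoid(t) for t in range(10, 101)]),
    (120, [chirp(t) for t in range(120, 151)]),
    (500, triangle(5)),  # hypothetical start position
]
sequence = bury(2048, pieces)
```

Running the detection methods on such a sequence, where the answer is known, is one way to check a pipeline before applying it to the given data.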

- You may want to generate white noise of similar average and variance, and visually (or mathematically) compare the Fourier/wavelet/etc transform of your pure noise against the given sequence
- Once you find a way to spot the typical values of the coefficients under white noise, set those coefficients to zero, and do the inverse Fourier/wavelet/etc transform, to recover the 'buried' signals.

- The number k of buried signals is small (<=5).

- We recommend using MATLAB with the Wavelet toolkit. MATLAB is available on the Andrew machines.
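The noise-comparison hint above can be sketched for the DFT case as follows. This is one possible reading of the hint, not a prescribed method: it draws Gaussian reference noise with the same mean and variance as the sequence, takes a multiple of the largest reference amplitude as the threshold, zeroes every coefficient below it, and inverts. The Gaussian assumption, the `factor` value, and the function names are all illustrative choices. The demo buries a full-length sinusoid; for short, localized signals a windowed or wavelet transform would likely work better.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for short sequences)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def inverse_dft(coeffs):
    """Inverse of dft(); returns the real part of the reconstruction."""
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def denoise(sequence, factor=1.5, seed=1):
    """Zero the coefficients that look like noise, then invert the transform."""
    n = len(sequence)
    mean = sum(sequence) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in sequence) / n)
    # Reference noise with the same mean and variance (Gaussian is an assumption).
    rng = random.Random(seed)
    noise = [rng.gauss(mean, std) for _ in range(n)]
    # Skip coefficient 0, which only carries the mean.
    threshold = factor * max(abs(c) for c in dft(noise)[1:])
    coeffs = [c if abs(c) > threshold else 0 for c in dft(sequence)]
    return inverse_dft(coeffs), threshold


# Demo: a sinusoid of amplitude 2 and period 16, buried in unit-variance noise.
rng = random.Random(7)
clean = [2 * math.sin(2 * math.pi * t / 16) for t in range(256)]
noisy = [value + rng.gauss(0.0, 1.0) for value in clean]
recovered, threshold = denoise(noisy)
```

A library FFT would replace the naive `dft` for the full 2048-point sequence; the thresholding step stays the same.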

- [1 point] Plot the time sequence. Do you see any pattern?

- [8 points] How many buried signals can you recover? I.e., submit the value of k.

- [21 points] For each of the k 'buried' signals you discovered:

  - Describe briefly how you discovered it: eg., give the plot (eg., DFT, DWT) and your verbal justification.
  - Give as much information about it as you can: shape (eg., sinusoid, chirp, etc), amplitude, phase, start time, end time. Notice that for some signals, you won't be able to give the full equation, because the noise we added is too overpowering.

Description: Download the cloud of 1000 points in m=10 dimensions. It could be, say, 1000 patients, with 10 numerical attributes each (body height, body weight, blood pressure, etc). The goal is (a) to visualize it somehow and (b) to discover the linear correlations among the m dimensions.

Hints: We generated each tuple (x1, ... x10) using equations similar (but not identical) to the ones below:

x1 = g1(u,v) = arctan(u) // (u = random(0,1), v = random(0,1))

x2 = g2(u,v) = sinh( u + v )

x3 = g3(u,v) = cosh (v)

x4 = 3 x1 + 2 x2 + 5 x3 // remaining attributes: linear combinations of the initial 3

x5 = x1 - x2 - x3

....

xi = a_{i1} x1 + a_{i2} x2 + a_{i3} x3 // i=4, ... 10

Thus, the above equations would generate a cloud with f=3 linear degrees of freedom; if we somehow manage to discover those 3 axes, we can do the 3-choose-2 pair-wise scatter-plots ('pair-plots'), and get a good idea of what the cloud looks like.
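The claim of f=3 linear degrees of freedom can be checked numerically on a synthetic cloud built from the example equations above. The sketch below is only an illustration on made-up data with 5 attributes: it forms the Gram matrix X^T X, whose eigenvalues are the squared singular values of X, and counts how many are not negligible. The Jacobi eigenvalue routine stands in for a library SVD, and the relative tolerance and function names are assumptions; on real data the cut-off is usually judged from a plot of the singular values.

```python
import math
import random


def make_cloud(n_points=1000, seed=0):
    """Synthetic tuples from the example equations (x1..x5 only)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_points):
        u, v = rng.random(), rng.random()
        x1, x2, x3 = math.atan(u), math.sinh(u + v), math.cosh(v)
        rows.append([x1, x2, x3, 3 * x1 + 2 * x2 + 5 * x3, x1 - x2 - x3])
    return rows


def gram_matrix(rows):
    """X^T X for the data matrix X (one row per point)."""
    m = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]


def symmetric_eigenvalues(matrix, sweeps=30):
    """Cyclic Jacobi rotations; returns eigenvalues in decreasing order."""
    a = [list(row) for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                # Fold into [-pi/4, pi/4]; tan(2*theta) is unchanged.
                if theta > math.pi / 4:
                    theta -= math.pi / 2
                elif theta < -math.pi / 4:
                    theta += math.pi / 2
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                for k in range(n):  # rotate rows p and q
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    return sorted((a[i][i] for i in range(n)), reverse=True)


def degrees_of_freedom(rows, rel_tol=1e-9):
    """Count eigenvalues above rel_tol times the largest one (assumed cut-off)."""
    eigenvalues = symmetric_eigenvalues(gram_matrix(rows))
    count = sum(1 for value in eigenvalues if value > rel_tol * eigenvalues[0])
    return count, eigenvalues


f, eigenvalues = degrees_of_freedom(make_cloud())
singular_values = [math.sqrt(max(value, 0.0)) for value in eigenvalues]
print(f, singular_values)
```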

Deliverables:

- [15 points] How many linear degrees of freedom f are there in the dataset? Turn in the plots that helped you reach your conclusion and explain your reasoning, briefly.

- [10 points] Give the pair-wise scatterplots for the top f principal components. (you may use the command 'pairs()' in R, or similar, in matlab).

- [5 points] Describe the base functions g1(), g2(), ... g_{f}() as much as you can (eg., do they show a trend? a periodicity? an exponential growth? a zipf-like distribution?)

Description: Consider this dataset, of m=5 signals (columns), each of T=1000 timeticks (rows). It tries to simulate a 'cocktail party' setting, where we have k=3 speakers, and m=5 listeners, each receiving a linear combination of the original 3 signals. Specifically, if x

y

Use the Fast ICA toolkit to recover the original 3 signals of the speakers. You may use it in any of the languages provided.
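Before running ICA on the given data, it can be useful to test the pipeline on a mixture whose sources are known. The sketch below builds such a mixture with the same layout as the dataset (T=1000 rows, m=5 columns) from k=3 sources. The source shapes, the random mixing weights, and the function names are hypothetical; they are not the ones used to create the dataset.

```python
import math
import random


def make_sources(timeticks=1000, seed=0):
    """Three hypothetical 'speakers': sinusoid, sawtooth, uniform noise."""
    rng = random.Random(seed)
    return [
        [math.sin(2 * math.pi * t / 50) for t in range(timeticks)],
        [(t % 40) / 40.0 for t in range(timeticks)],
        [rng.uniform(-1.0, 1.0) for _ in range(timeticks)],
    ]


def mix(sources, mixing):
    """Each listener (one row of `mixing`) hears a weighted sum of the sources.

    Returns one row per timetick and one column per listener.
    """
    timeticks = len(sources[0])
    return [[sum(w * s[t] for w, s in zip(weights, sources)) for weights in mixing]
            for t in range(timeticks)]


# Hypothetical 5 x 3 mixing matrix: 5 listeners, 3 speakers.
rng = random.Random(1)
mixing = [[rng.uniform(0.5, 2.0) for _ in range(3)] for _ in range(5)]
observed = mix(make_sources(), mixing)
```

Feeding `observed` to an ICA implementation should return signals that resemble the three sources, up to ordering, sign, and scale.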

Deliverables:

- Submit the time-plots of the 3 signals ('speakers') you recovered.

Last edited by Ina Fiterau and Christos Faloutsos, Nov. 8, 2011.