Carnegie Mellon University
15-826 – Multimedia Databases and Data Mining
Spring 2010 – C. Faloutsos
Homework 3
Due date: April 20, 3:00pm
 Hand in a hard copy of your homework in class. It must be typed;
handwritten material may not be graded.
 Put a soft copy and code in YourAndrewID_826hw3.tar and email it to the TA (dchau at cs) in
a single email, with the title YourAndrewID_826hw3.
 Homework must be done individually.
 For all the code that you write, please make sure it can
be compiled and run on a Linux machine of the Andrew cluster.
To connect to one such machine, use your favorite SSH client (e.g., PuTTY)
to connect to linux.andrew.cmu.edu, and
log in using your Andrew ID and password.
 You may also develop your code using other software and/or operating
systems (e.g., cygwin on Windows), as long as your code works on an
Andrew cluster machine.
 Weight: 1/3 of the homework grade, that is, 3.33% of the total course grade
 Rough time estimates: 12 hours total, specifically:
 Q1: 1.5 hours
 Q2: 4 hours
 Q3: 1 hour
 Q4: 1 hour
 Q5: 1 hour
 Q6: 2.5 hours
 Q7: 1 hour
 Total possible points: 100
Q1. Recursive Least Squares [10 points]
Fit a 2D surface to this set of points (10,000 triplets of <x,y,value>, in MS-DOS newline convention). The solution has the form
g(x,y) = a_{1}x^{2} + a_{2}y^{2} + a_{3}xy + a_{4}x + a_{5}y + a_{6}
which minimizes the least squares error.
 [3 points] Set up the problem as a least squares problem, in matrix-vector form Ax=b (as on slide 24 of the lecture slides). Write down how you generate the data matrix A, the vector of unknowns x and the vector of knowns b.
 [5 points] Solve the problem, and write down the values for a_{1}, a_{2}, a_{3}, a_{4}, a_{5} and a_{6}, using either
 (a) your own program that implements least squares or recursive least squares as described in equations 3-14 of the Chen+Roussopoulos (SIGMOD94) paper, or
 (b) this recursive least squares (Python) package, run on your own computer; you will need the older Python 2.3 and the Numeric package (which does not work with the latest Python versions, 2.6.5 and 3.1.2).
 [2 points] Solve the problem using MATLAB's regress command, or similar R commands (lm, or lsfit), and compare the results to those from Q1.2. Submit your code and your comparison.
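To make the matrix-vector setup concrete, here is a minimal sketch in plain Python. It assumes one row of A per data point, with columns [x^2, y^2, xy, x, y, 1], and solves the normal equations (A^T A) a = A^T b by Gaussian elimination. This is ordinary batch least squares, not the recursive update from the paper, and the function names and the synthetic test data are hypothetical.

```python
def design_row(x, y):
    """One row of A for g(x,y) = a1*x^2 + a2*y^2 + a3*x*y + a4*x + a5*y + a6."""
    return [x * x, y * y, x * y, x, y, 1.0]


def solve_linear(M, v):
    """Solve M z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [list(M[i]) + [v[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def fit_surface(triplets):
    """Least-squares coefficients [a1..a6] via the normal equations."""
    rows = [design_row(x, y) for x, y, _ in triplets]
    b = [value for _, _, value in triplets]
    k = 6
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * bi for r, bi in zip(rows, b)) for i in range(k)]
    return solve_linear(ata, atb)


# Synthetic, noise-free check (made-up coefficients, not the homework data).
true_a = [1.0, -2.0, 0.5, 3.0, -1.0, 4.0]
points = [(x, y, sum(c * t for c, t in zip(true_a, design_row(x, y))))
          for x in range(-3, 4) for y in range(-3, 4)]
coeffs = fit_surface(points)
```

On noise-free synthetic data the recovered coefficients should match the generating ones up to rounding; on real data, a library routine (as in Q1.3) is the more numerically robust choice.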
Q2. Hadoop [30 points]
We will learn to use Hadoop to analyze the text of the Moby-Dick novel. Hadoop is a popular choice for running large-scale distributed computations. Note that while the main
benefit of Hadoop is the ability to run on multiple machines, it can also run on a single machine in Standalone mode,
which we will learn to use.
Prerequisites
 Download and set up Hadoop following the official instructions. Make sure you can run the example listed under the Standalone Operation section. We have verified that Hadoop runs on the Andrew cluster machines; no additional software (except Hadoop itself) needs to be installed.
 Hint: as part of the setup process, in conf/hadoop-env.sh, the path for JAVA_HOME should be set to /usr.
 Download and set up PIG following the official instructions.
 Hint: if your account on the Andrew cluster machines uses the csh shell, type the command setenv JAVA_HOME "/usr" to set JAVA_HOME. If your account uses the bash shell, type export JAVA_HOME=/usr instead. You can type chsh to switch between shells.
You will analyze the vocabulary of Moby-Dick, obtained from the fascinating Gutenberg project, which has already been transformed into one word
per line, in this file, in the
MS-DOS newline convention.
 You are going to write two pieces of code (Hadoop and PIG) that compute the frequency of every word. The output should be pairs of <word, frequency-of-occurrence>.
 [8 points] Submit the map() and reduce() functions for Hadoop that solve the problem. Hint: reuse existing code as much as possible, and mark your changes, if any.
 [8 points] Submit a PIG script that solves the same problem. Again, reuse existing code as much as possible, marking your changes.
 [4 points] Submit the top five most frequent words and their frequencies in descending frequency order, in the form of <word, frequency>.
 [5 points] Using the output from above, submit your Hadoop code (the map() and reduce() functions) that computes the frequency-count plot (pairs of the form <frequency-of-occurrence, count>). Hint: reuse the existing code as much as possible, and mark your changes, if any.
 [5 points] Submit the Zipf plot (rank-frequency plot) in log-log scales, and report the slope of the least-squares fitting line, to check whether it follows the Zipf distribution. You may use MATLAB or any other fitting package to fit a line.
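The two counting jobs above share the same map/group/reduce shape, which the following plain-Python simulation tries to illustrate. It is not Hadoop or PIG code and does not use their APIs; the helper run_mapreduce, the mapper and reducer names, and the seven-word toy input are all hypothetical.

```python
from collections import defaultdict


def map_word(line):
    """Emit <word, 1>; strip() also removes the '\\r' left by MS-DOS newlines."""
    word = line.strip()
    if word:
        yield (word, 1)


def map_frequency(pair):
    """Second job: turn <word, frequency> into <frequency, 1>."""
    _word, frequency = pair
    yield (frequency, 1)


def reduce_sum(key, values):
    """Sum all values that were grouped under the same key."""
    return (key, sum(values))


def run_mapreduce(records, mapper, reducer):
    """Toy single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in groups.items()]


# Made-up input, one word per line with MS-DOS newlines.
lines = ["the\r\n", "whale\r\n", "the\r\n", "sea\r\n",
         "the\r\n", "whale\r\n", "ship\r\n"]
word_counts = run_mapreduce(lines, map_word, reduce_sum)
freq_counts = run_mapreduce(word_counts, map_frequency, reduce_sum)
top_words = sorted(word_counts, key=lambda p: p[1], reverse=True)
```

Note that the second job reuses the same reducer and changes only the mapper, which is the kind of small, marked change the hints ask for.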
Q3. Singular Value Decomposition [10 points]
We would like to visualize this collection of 9-dimensional points. One typical way to do that is to project the points to a lower-dimensional space, using Singular Value Decomposition. We will learn how.

[5 points] Perform PCA on the data (i.e., center the data points and apply SVD on them). You can use any eigen libraries or statistical packages. Submit your code.
 [5 points] Plot the points' projections on the first two principal components. Submit the code and the scatterplot (in jpg, eps, or any other popular format).
Q4. Tensors [10 points]
Consider this tensor that describes the sales for a data cube X of three modes: customer, product, branch; each line in the tensor file is a tab-separated quadruplet <customer, product, branch, dollar-spent>. Apply PARAFAC on the tensor, using the Tensor Toolbox for MATLAB.
 [5 points] Write down the first 2 components (each component has three vectors).
 [5 points] What does the PARAFAC analysis tell you about those fictitious customers, products and branches? Specifically, state which customers belong to what groups. Repeat for products, and branches.
Hint: You may use this Perl script to convert names in the data to unique numeric IDs (type make to see an example, and ./anonymize.pl -h for usage).
Q5. Discrete Fourier Transform [10 points]
We are given this time sequence A of arterial blood pressure (ABP), which we obtained from physionet.org. The sequence was sampled at 125Hz, and has 4096 samples (the original sequence was longer and we truncated it so that the number of samples is a power of 2).
 Apply the Discrete Fourier Transform (DFT) to the sequence, and plot the amplitude spectrum. Your plot should only include the first half of the spectrum (dropping the second half), as discussed on slides 27-41 of the lecture slides. You can use any Fourier package you want, such as fft from MATLAB, or fft from R.
 [3 points] Submit your code and the amplitude spectrum plot. Excluding the DC component, consider the two most dominant frequencies, and write down their periods in (a) timeticks and (b) seconds.
 [2 points] Do the periods make physiological sense? (Reminder: the typical heartbeat rate of a healthy adult human is about 70 beats per minute.)
 [5 points] Repeat the above with this time sequence B of the electrocardiogram (ECG). This sequence was also sampled at 125Hz, and has 4096 samples (truncated version of the original).
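Converting a dominant DFT bin into a period is simple arithmetic: for a sequence of N samples, bin k corresponds to a period of N/k timeticks, or N/(k * sampling rate) seconds. The sketch below shows this on a made-up toy signal, using a naive O(N^2) DFT so that it needs only the standard library; for the 4096-sample sequences an FFT routine is the practical choice. The function names and the toy signal are hypothetical.

```python
import cmath
import math


def amplitude_spectrum(x):
    """Amplitudes |X_k| for the first half of the spectrum, k = 0..N/2."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def period_from_bin(k, n_samples, sampling_hz):
    """Period of DFT bin k (k > 0) as (timeticks, seconds)."""
    ticks = n_samples / k
    return ticks, ticks / sampling_hz


# Toy signal: a DC offset plus a sinusoid that repeats every 8 samples.
n = 64
toy = [5.0 + math.sin(2 * math.pi * t / 8) for t in range(n)]
spectrum = amplitude_spectrum(toy)

# Skip k = 0 (the DC component) when looking for the dominant frequency.
dominant = max(range(1, len(spectrum)), key=lambda k: spectrum[k])
ticks, seconds = period_from_bin(dominant, n, 125.0)
```

For the toy signal the dominant bin is k = 8, giving a period of 64/8 = 8 timeticks, which matches how the signal was built. As a sanity check for the real data, a 70 beats-per-minute rhythm has a period of 60/70 seconds, or about 107 timeticks at 125Hz.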
Q6. Discrete Wavelet Transform [25 points]
We will use the same sequence B (ECGV - heartbeats) from Q5.
 Apply the Discrete Wavelet Transform (DWT) to it, using the Haar basis. You may implement your own, or optionally use this Perl code. The Perl code expects the sequence length to be a power of 2, and we already trimmed the sequence to length 4096. The code is also available on slide 111 of the lecture slides. Similar code can be found in Numerical Recipes in C.
 [5 points] Write down the coefficients s_{10,0}, d_{10,0}, d_{9,0}, d_{9,1} (as defined on slide 107 of the lecture slides). If you used your own implementation, please submit your code. If you used the provided Perl code, you do not need to.
 Implement the inverse discrete wavelet transform, using your favorite programming language.
 [10 points] Submit your code
 [3 points] Reconstruct the sequence using the 6 strongest coefficients (in absolute value), and report the MSE (Mean Square Error), defined as MSE = (1/N) \sum_{t=1}^{N} (x_{t} - \hat{x}_{t})^{2}, where x_{t} is the original signal, \hat{x}_{t} is the reconstructed signal, and N = 4096 is the number of timeticks. Submit your code and the plot of the reconstructed sequence (of 4096 timeticks).
 [2 points] Submit a plot of the MSE as a function of k=1, 2, 4, 8, ... 4096, where k is the number of strongest coefficients (in absolute value).
 [3 points] Repeat for the k strongest Fourier coefficients.
Hint: it is tricky to find the k strongest amplitudes.
For each amplitude, say A_{f} (at frequency f, f != 0 and f != 2048),
make sure you consider both X_{f} and its mirror conjugate
X_{4096-f},
and count them as two coefficients. If chosen,
frequencies f=0 and f=2048 each
contribute one coefficient.
Counted as above,
the number of coefficients k might not be exactly a power of 2;
try to get as close to a power of 2 as you can (small deviations
are acceptable).
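One possible way to apply the counting rule in this hint is sketched below. It assumes the amplitudes are given for f = 0..N/2 only, walks through them from strongest to weakest, charges two coefficients for every frequency except f = 0 and f = N/2, and stops before the budget k would be exceeded. The function name, the stopping policy, and the toy amplitudes are hypothetical; other policies that land closer to k are equally valid.

```python
def select_frequencies(amplitudes, k):
    """Pick the strongest frequencies within a budget of k coefficients.

    amplitudes[f] is |X_f| for f = 0..N/2. Returns (chosen frequencies,
    number of coefficients actually used).
    """
    half = len(amplitudes) - 1  # index of the f = N/2 bin
    order = sorted(range(len(amplitudes)),
                   key=lambda f: amplitudes[f], reverse=True)
    chosen, count = [], 0
    for f in order:
        # X_f and its mirror conjugate X_{N-f} count as two coefficients,
        # except at f = 0 and f = N/2, which have no distinct mirror.
        cost = 1 if f in (0, half) else 2
        if count + cost > k:
            break
        chosen.append(f)
        count += cost
    return chosen, count


# Made-up amplitudes for a length-8 sequence (f = 0..4).
toy_amplitudes = [10.0, 1.0, 7.0, 3.0, 2.0]
chosen, used = select_frequencies(toy_amplitudes, 4)
```

With the toy amplitudes and k = 4, the sketch keeps f = 0 (one coefficient) and f = 2 (two coefficients), and stops at three coefficients because adding f = 3 would bring the total to five. When reconstructing, both X_{f} and X_{N-f} must be kept for every chosen f, so that the inverse transform stays real-valued.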
 [2 points] Compare the two previous plots (i.e., MSE for DWT and MSE for DFT).
Which transform is better,
with respect to energy concentration (= approximation accuracy)?
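The DWT part of this question (transform, keep the k strongest coefficients, invert, measure the MSE) can be sketched as follows. The sketch assumes the orthonormal Haar normalization (sums and differences divided by sqrt(2)) and orders the output as the coarsest smooth coefficient followed by details from coarsest to finest; the lecture slides and the provided Perl code may use a different scaling or ordering, so the coefficient values need not match theirs. All function names and the toy signal are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(x):
    """Orthonormal Haar DWT: [coarsest smooth, details coarse -> fine]."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("sequence length must be a power of 2")
    smooth, details = list(x), []
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        level = [(a - b) / SQRT2 for a, b in pairs]
        smooth = [(a + b) / SQRT2 for a, b in pairs]
        details = level + details  # coarser levels go in front
    return smooth + details


def haar_idwt(coeffs):
    """Inverse of haar_dwt."""
    smooth, pos = [coeffs[0]], 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(smooth)]
        pos += len(smooth)
        finer = []
        for s, d in zip(smooth, level):
            finer.append((s + d) / SQRT2)
            finer.append((s - d) / SQRT2)
        smooth = finer
    return smooth


def keep_strongest(coeffs, k):
    """Zero out all but the k coefficients largest in absolute value."""
    order = sorted(range(len(coeffs)),
                   key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


def mse(x, y):
    """Mean square error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Made-up step signal: only two Haar coefficients are non-zero.
signal = [4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0]
coeffs = haar_dwt(signal)
mse_k1 = mse(signal, haar_idwt(keep_strongest(coeffs, 1)))
mse_k2 = mse(signal, haar_idwt(keep_strongest(coeffs, 2)))
```

For the step signal, keeping one coefficient reconstructs the constant mean 6 (MSE of 4), and keeping two reconstructs the signal exactly up to rounding. With an orthonormal basis, the MSE equals the sum of squares of the dropped coefficients divided by N, which is why energy concentration and approximation accuracy go together.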
Q7. String Editing Distance / Dynamic Programming [5 points]
 [1 point] You are given the words desperate and separate. Find and submit the string editing distance between them (assuming cost 1 for insertion, deletion, and substitution). You may use either
 [4 points] Extend the code you used above to show the "string editing distance matrix" for the words desperate and separate. An example matrix for surgery and survey is shown below (and on slide 36 of the lecture slides). Submit
 your code, marking the changes you made
 the matrix for desperate and separate
0 1 2 3 4 5 6
1 0 1 2 3 4 5
2 1 0 1 2 3 4
3 2 1 0 1 2 3
4 3 2 1 1 2 3
5 4 3 2 2 1 2
6 5 4 3 3 2 2
7 6 5 4 4 3 2
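The matrix above can be reproduced with the standard dynamic-programming recurrence, sketched below under the stated unit costs. Entry (i, j) is the distance between the first i letters of the first word and the first j letters of the second word, so rows correspond to surgery and columns to survey. The function name is hypothetical and this is not the code provided with the assignment.

```python
def edit_distance_matrix(source, target):
    """Full DP matrix of string editing distances (unit costs)."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i letters of the source prefix
    for j in range(cols):
        d[0][j] = j  # insert all j letters of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                  # deletion
                          d[i][j - 1] + 1,                  # insertion
                          d[i - 1][j - 1] + substitution)   # match or substitute
    return d


matrix = edit_distance_matrix("surgery", "survey")
for row in matrix:
    print(" ".join(str(value) for value in row))
distance = matrix[-1][-1]  # bottom-right entry is the editing distance
```

The bottom-right entry of the matrix is the editing distance between the two full words.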