Carnegie Mellon University
15-826 – Multimedia Databases and Data Mining
Spring 2010 – C. Faloutsos
Homework 3
Due date: April 20, 3:00pm

• Hand in a hard copy of your homework in class. It must be typed; handwritten material may not be graded
• Put a soft copy and code in YourAndrewID_826hw3.tar and email it to the TA (dchau at cs) in a single e-mail, with the title YourAndrewID_826hw3
• Homework must be done individually.
• For all the code that you write, please make sure it compiles and runs on a Linux machine of the Andrew cluster. To connect to one such machine, use your favorite SSH client (e.g., PuTTY) to connect to linux.andrew.cmu.edu, and log in using your Andrew ID and password.
• You may also develop your code using other software and/or operating systems (e.g., cygwin on Windows), as long as your code works on an Andrew cluster machine.
• Weight: 1/3 of the homework grade, that is, 3.33% of the total course grade
• Rough time estimates: 12 hours total, specifically:
  • Q1: 1.5 hours
  • Q2: 4 hours
  • Q3: 1 hour
  • Q4: 1 hour
  • Q5: 1 hour
  • Q6: 2.5 hours
  • Q7: 1 hour
• Total possible points: 100

# Q1. Recursive Least Squares [10 points]

Fit a 2-D surface to this set of points (10,000 triplets of <x,y,value>, in MS-DOS newline convention). The solution has the form

$$g(x,y) = a_1 x^2 + a_2 y^2 + a_3 xy + a_4 x + a_5 y + a_6$$

which minimizes the least squares error.

1. [3 points] Set up the problem as a least squares problem, in matrix-vector form Ax=b (as on slide 24 of the lecture slides). Write down how you generate the data matrix A, the vector of unknowns x and the vector of knowns b.
2. [5 points] Solve the problem, and write down the values for $a_1$, $a_2$, $a_3$, $a_4$, $a_5$ and $a_6$, using either
   • (a) your own program that implements least squares or recursive least squares as described in equations 3-14 from the Chen+Roussopoulos (SIGMOD94) paper, or
   • (b) this recursive least squares (python) package and run it on your own computer; you will need to use the older Python 2.3 and the Numeric package (which does not work with the latest Python versions 2.6.5 and 3.1.2).
3. [2 points] Solve the problem using MATLAB's regress command, or similar R commands (lm or lsfit), and compare the results to those from Q1.2. Submit your code and your comparison.
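
As a rough illustration of how the matrix-vector setup in items 1 and 2 might look, the sketch below builds one row of A per data point and solves the normal equations in plain Python. It is ordinary batch least squares, not the recursive variant from the paper. The helper names (`design_row`, `solve_linear`, `fit_quadratic_surface`) and the synthetic data are assumptions made for this example and are not part of the homework files.

```python
import random


def design_row(x, y):
    """One row of A for g(x,y) = a1*x^2 + a2*y^2 + a3*x*y + a4*x + a5*y + a6."""
    return [x * x, y * y, x * y, x, y, 1.0]


def solve_linear(M, v):
    """Solve M z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def fit_quadratic_surface(triplets):
    """Least-squares fit via the normal equations (A^T A) a = A^T b."""
    A = [design_row(x, y) for x, y, _ in triplets]
    b = [value for _, _, value in triplets]
    k = 6
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(k)] for i in range(k)]
    Atb = [sum(row[i] * bi for row, bi in zip(A, b)) for i in range(k)]
    return solve_linear(AtA, Atb)


# Synthetic, noise-free data from made-up coefficients (not the homework file).
true_a = [1.0, -2.0, 0.5, 3.0, -1.0, 4.0]
rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
data = [(x, y, sum(c * t for c, t in zip(true_a, design_row(x, y)))) for x, y in points]
fitted = fit_quadratic_surface(data)
```

On noise-free data the fitted coefficients should match the generating ones up to rounding. For real data a library solver (for example one based on QR or SVD) is usually the more numerically robust choice than forming the normal equations by hand.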

# Q2. Hadoop [30 points]

We will learn to use Hadoop to analyze the text of the Moby-Dick novel. Hadoop is a popular choice for running large-scale distributed computations. Note that while the main benefit of Hadoop is the ability to run on multiple machines, it can also run on a single machine in Standalone mode, which we will learn to use.

Prerequisites

• Download and set up Hadoop following the official instructions. Make sure you can run the example listed under the Standalone Operation section. We have verified that Hadoop runs on the Andrew cluster machines; no additional software (except Hadoop itself) needs to be installed.
  • Hint: as part of the setup process, in conf/hadoop-env.sh, the path for JAVA_HOME should be set to /usr
  • Hint: if your account on the Andrew cluster machines uses the csh shell, type the command setenv JAVA_HOME "/usr" to set JAVA_HOME. If your account uses the bash shell, type export JAVA_HOME=/usr instead. You can type chsh to switch between shells.

You will analyze the vocabulary of Moby-Dick, obtained from the fascinating Gutenberg project. The text has already been transformed into one word per line in this file, which uses the MS-DOS newline convention.

1. You are going to write two pieces of code (Hadoop and PIG) that compute the frequency of every word. The output should be pairs of <word, frequency-of-occurrence>.
   1. [8 points] Submit the map() and reduce() functions for Hadoop that solve the problem. Hint: re-use existing code as much as possible, and mark your changes, if any.
   2. [8 points] Submit a PIG script that solves the same problem. Again, re-use existing code as much as possible, marking your changes.
   3. [4 points] Submit the top five most frequent words and their frequencies in descending frequency order, in the form of <word, frequency>.
2. [5 points] Using the output from above, submit your Hadoop code (the map() and reduce() functions) that computes the frequency-count plot (pairs of the form <frequency-of-occurrence, count>). Hint: re-use the existing code as much as possible, and mark your changes, if any.
3. [5 points] Submit the Zipf plot (rank-frequency plot) in log-log scales, and report the slope of the least-squares fitting line, to check whether it follows the Zipf distribution. You may use MATLAB or any other fitting package to fit a line.
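
To make the data flow of the three items above concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce steps, chained twice (word to frequency, then frequency to count), followed by a log-log line fit. It does not use the Hadoop or PIG APIs; `run_job`, `map_word`, `map_frequency`, `reduce_sum` and `loglog_slope` are hypothetical names chosen for this illustration, and the six input lines are made up.

```python
import math
from collections import defaultdict


def map_word(line):
    """Map step: emit <word, 1>; rstrip drops the MS-DOS CRLF line ending."""
    word = line.rstrip("\r\n")
    if word:
        yield word, 1


def map_frequency(pair):
    """Map step of the second job: <word, freq> becomes <freq, 1>."""
    _word, freq = pair
    yield freq, 1


def reduce_sum(key, values):
    """Reduce step shared by both jobs: add up the values for one key."""
    return key, sum(values)


def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for the map -> shuffle -> reduce pipeline."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in groups.items()]


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Made-up input in the one-word-per-line, CRLF-terminated layout.
lines = ["the\r\n", "whale\r\n", "the\r\n", "sea\r\n", "the\r\n", "whale\r\n"]
word_freq = run_job(lines, map_word, reduce_sum)
freq_count = run_job(word_freq, map_frequency, reduce_sum)
top = sorted(word_freq, key=lambda p: -p[1])
```

The same reducer serves both jobs because each one only sums the ones emitted for a key. A frequency list that follows Zipf's law exactly, such as 60/rank, gives a slope of -1 in `loglog_slope`.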

# Q3. Singular Value Decomposition [10 points]

We would like to visualize this collection of 9-dimensional points. One typical way to do that is to project the points to a lower-dimensional space, using Singular Value Decomposition. We will learn how.

1. [5 points] Perform PCA on the data (i.e., center the data points and apply SVD on them). You can use any eigenvalue library or statistical package. Submit your code.

2. [5 points] Plot the points' projections on the first two principal components. Submit the code and the scatterplot (in jpg, eps, or any other popular format).
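
The centering and projection steps can be sketched without any external library. Note the substitution: instead of calling an SVD routine, this example finds the leading eigenvectors of the covariance matrix by power iteration with deflation, which yields the same principal directions as the right singular vectors of the centered data. All function names and the small 3-D data set are assumptions for illustration only.

```python
import math


def center(points):
    """Subtract the per-dimension mean from every point."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    return [[p[j] - mean[j] for j in range(d)] for p in points]


def covariance(centered):
    """Sample covariance matrix X^T X / (n - 1) of centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(p[i] * p[j] for p in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvector(C, iterations=500):
    """Power iteration: repeatedly apply C and renormalize."""
    d = len(C)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # C has no variance left along v
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(C[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v


def principal_components(points, k=2):
    """Return the centered data and the k leading principal directions."""
    X = center(points)
    C = covariance(X)
    d = len(C)
    components = []
    for _ in range(k):
        lam, v = top_eigenvector(C)
        components.append(v)
        # Deflation: remove the direction just found before the next search.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return X, components


def project(X, components):
    """Coordinates of each centered point along each component."""
    return [[sum(a * b for a, b in zip(p, v)) for v in components] for p in X]


# Made-up 3-D points: large spread along (1, 2, 0), small spread along (0, 0, 1).
pts = [[t, 2.0 * t, 0.1 * s] for t in (-2, -1, 0, 1, 2) for s in (-1, 1)]
X, pcs = principal_components(pts)
coords = project(X, pcs)
```

Each row of `coords` holds the two values one would put on the axes of the scatterplot. The sign of a principal direction is arbitrary, so a plot may appear mirrored compared with the output of another package.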

# Q4. Tensors [10 points]

Consider this tensor that describes the sales for a data cube X of three modes: customer, product, branch; each line in the tensor file is a tab-separated quadruplet <customer, product, branch, dollar-spent>. Apply PARAFAC on the tensor, using the tensor-toolbox for MATLAB.

1. [5 points] Write down the first 2 components (each component has three vectors).
2. [5 points] What does the PARAFAC analysis tell you about those fictitious customers, products and branches? Specifically, state which customers belong to which groups. Repeat for products and branches.

Hint: You may use this Perl script to convert names in the data to unique numeric IDs (type make to see an example, and ./anonymize.pl -h for usage).

# Q5. Discrete Fourier Transform [10 points]

We are given this time sequence A of arterial blood pressure (ABP), which we obtained from physionet.org. The sequence was sampled at 125Hz, and has 4096 samples (the original sequence was longer and we truncated it so that the number of samples is a power of 2).

1. Apply Discrete Fourier Transform (DFT) on the sequence, and plot the amplitude spectrum. Your plot should only include the first half of the spectrum (dropping the second half), as discussed on slides 27-41 of the lecture slides. You can use any Fourier package you want, such as fft from MATLAB, or fft from R.
   1. [3 points] Submit your code and the amplitude spectrum plot. Excluding the DC component, consider the two most dominant frequencies, and write down their periods (a) in timetick numbers and (b) in seconds.
   2. [2 points] Do the periods make physiological sense? (Reminder: the typical heartbeat rate of a healthy adult human is about 70 beats per minute.)
2. [5 points] Repeat the above with this time sequence B of the electrocardiogram (ECG). This sequence was also sampled at 125Hz, and has 4096 samples (truncated version of the original).
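
The conversion from a peak in the amplitude spectrum to a period can be illustrated as follows: for a sequence of N samples, frequency index f corresponds to a period of N/f timeticks, or N/(f × sampling rate) seconds. The sketch uses a naive O(N²) DFT on a short synthetic signal to stay dependency-free; for the 4096-sample sequences an FFT routine is the practical choice. The function names and the test signal are assumptions made for this example.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (fine for short sequences)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def amplitude_spectrum(x):
    """Amplitudes |X_f| for the first half of the spectrum, f = 0 .. N/2 - 1."""
    return [abs(c) for c in dft(x)[: len(x) // 2]]


def dominant_periods(x, sample_rate_hz, how_many=2):
    """Strongest non-DC frequencies as (index, period in ticks, period in s)."""
    amps = amplitude_spectrum(x)
    # Start at index 1 so the DC component (f = 0) is excluded.
    order = sorted(range(1, len(amps)), key=lambda f: -amps[f])[:how_many]
    n = len(x)
    return [(f, n / f, n / (f * sample_rate_hz)) for f in order]


# Made-up signal: constant offset plus a strong 8-cycle and a weaker 20-cycle sinusoid.
N, RATE = 128, 125.0
signal = [5.0
          + 3.0 * math.sin(2 * math.pi * 8 * t / N)
          + 1.0 * math.sin(2 * math.pi * 20 * t / N)
          for t in range(N)]
peaks = dominant_periods(signal, RATE)
```

Here the strongest peak sits at index 8, which corresponds to a period of 128/8 = 16 timeticks, or 16/125 = 0.128 seconds at the assumed 125 Hz rate.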

# Q6. Discrete Wavelet Transform [25 points]

We will use the same sequence B (ECG-V - heartbeats) from Q5.
1. Apply Discrete Wavelet Transform (DWT) on it, using the Haar basis. You may implement your own, or optionally use this Perl code. The Perl code expects the sequence length to be a power of 2, and we already trimmed the sequence to length 4096. The code is also available on slide 111 of the lecture slides. Similar code can be found in Numerical Recipes in C.
   1. [5 points] Write down the coefficients $s_{10,0}$, $d_{10,0}$, $d_{9,0}$, $d_{9,1}$ (as defined on slide 107 of the lecture slides). If you used your own implementation, please submit your code. If you used the provided Perl code, you do not need to.
2. Implement the inverse discrete wavelet transform, using your favorite programming language.
   1. [10 points] Submit your code.
   2. [3 points] Reconstruct the sequence using the 6 strongest coefficients (in absolute value), and report the MSE (Mean Square Error) defined as $\mathrm{MSE} = \frac{1}{N}\sum_{t=1}^{N}(x_t - \hat{x}_t)^2$, where $x_t$ is the original signal, $\hat{x}_t$ is the reconstructed signal, and $N$ is the sequence length. Submit your code and the plot of the reconstructed sequence (of 4096 timeticks).
   3. [2 points] Submit a plot of the MSE as a function of k=1, 2, 4, 8, ... 4096, where k is the number of strongest coefficients (in absolute value).
   4. [3 points] Repeat for the k strongest Fourier coefficients. Hint: it is tricky to find the k strongest amplitudes; for each amplitude, say $A_f$ (at frequency f, f != 0 and f != 2048), make sure you consider both $X_f$ as well as its mirror conjugate $X_{4096-f}$ and count them as two coefficients. If chosen, frequencies f=0 and f=2048 each contribute one coefficient. Counted as above, the number of coefficients k might not be exactly a power of 2; try to get as close to a power of 2 as you can (small deviations are acceptable).
   5. [2 points] Compare the two previous plots (i.e., MSE for DWT and MSE for DFT). Which transform is better, with respect to energy concentration (= approximation accuracy)?
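
The forward transform, its inverse, and the "keep the k strongest coefficients" experiment can be sketched as below. This version assumes the orthonormal Haar convention (sums and differences divided by √2); the normalization and coefficient ordering used in the lecture slides or the provided Perl code may differ, so the numbers are not directly comparable. The function names and the length-8 sequence are made up for illustration.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(x):
    """Orthonormal Haar DWT; output is [smooth, coarsest detail, ..., finest details]."""
    smooth = list(x)
    details = []
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        # Coarser levels go in front of the finer ones already collected.
        details = [(a - b) / SQRT2 for a, b in pairs] + details
        smooth = [(a + b) / SQRT2 for a, b in pairs]
    return smooth + details


def haar_inverse(coeffs):
    """Invert haar_forward by undoing one level at a time, coarsest first."""
    smooth = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        level_details = coeffs[pos: pos + len(smooth)]
        pos += len(smooth)
        rebuilt = []
        for s, d in zip(smooth, level_details):
            rebuilt.extend([(s + d) / SQRT2, (s - d) / SQRT2])
        smooth = rebuilt
    return smooth


def keep_strongest(coeffs, k):
    """Zero all but the k coefficients of largest absolute value."""
    keep = set(sorted(range(len(coeffs)), key=lambda i: -abs(coeffs[i]))[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


def mse(x, y):
    """Mean square error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Made-up length-8 sequence (a power of 2), not the ECG data.
x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
w = haar_forward(x)
errors = [mse(x, haar_inverse(keep_strongest(w, k))) for k in (1, 2, 4, 8)]
```

Because this transform is orthonormal, the squared error of a truncated reconstruction equals the sum of the squared coefficients that were dropped, so `errors` cannot increase as k grows and reaches zero (up to rounding) when all coefficients are kept.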

# Q7. String Editing Distance / Dynamic Programming [5 points]

1. [1 point] You are given the words desperate and separate. Find and submit the string editing distance between them (assuming cost 1 for insertion, deletion, and substitution). You may use either
2. [4 points] Extend the code you used above to show the "string editing distance matrix" for the words desperate and separate. An example matrix for surgery and survey is shown below (and on slide 36 of the lecture slides). Submit
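
For reference, a minimal dynamic-programming sketch of the string editing distance with unit costs is shown below, run on the surgery/survey pair mentioned above. Entry D[i][j] holds the distance between the first i characters of the source and the first j characters of the target, so the bottom-right entry is the final answer. The function names and the text layout of the printed matrix are assumptions, not the format used on the slides.

```python
def edit_distance_matrix(source, target):
    """Matrix D where D[i][j] is the distance between source[:i] and target[:j]."""
    rows, cols = len(source) + 1, len(target) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i  # delete all i characters of the source prefix
    for j in range(cols):
        D[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,                 # deletion
                          D[i][j - 1] + 1,                 # insertion
                          D[i - 1][j - 1] + substitution)  # match or substitution
    return D


def format_matrix(source, target, D):
    """Render D as text with the target across the top and the source down the side."""
    header = "    " + " ".join(f"{c:>2}" for c in " " + target)
    lines = [header]
    for label, row in zip(" " + source, D):
        lines.append(f"{label:>2}  " + " ".join(f"{v:>2}" for v in row))
    return "\n".join(lines)


matrix = edit_distance_matrix("surgery", "survey")
distance = matrix[-1][-1]
print(format_matrix("surgery", "survey", matrix))
```

For surgery and survey the bottom-right entry is 2 (substitute g with v, then delete one r).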