CARNEGIE MELLON UNIVERSITY
15-826 - Multimedia databases and data mining
Spring 2006
Homework 3 - Due: April 25,
3:01pm, in class.
Important:
- Please turn in a typed
report - handwritten material will
not
be graded.
- Due date: April 25,
3:01pm, in class
- No collaborations are allowed.
For your information:
- Expected time effort for this homework:
- Q1: 1h
- Q2: 3h
- Q3: 5-7h
- Q4: 3h
- Points are in [bold pts] and add to 100.
Q1: 1D DFT, DWT and SWFT[15 pts]
1.1 Problem description
Apply DFT to following datasets. Now repeat this with DWT (using Haar
wavelets) instead of DFT.
1.2 What to hand in:
For DS1 and DS2, hand in the following:
- [2 pt] A plot of the original
dataset,
- [4 pt] the amplitude-frequency plot
for DFT. Does the plots agree with what you expected? Explain briefly (Hint:
You might want to create a log-log amplitude-freq plot for the Yahoo dataset).
- [4 pt] the scalogram for DWT. Do
the full level decomposition of the signals.
Moreover, for the Wavy dataset:
4. [5 pts] For the wavy dataset, plot
the spectrogram generated using SWFT. Use a window size ~100. Try to deduce the generator function by
inspection of the SWFT spectrogram. It is of the form A sin(2π.(f0.t
+ X)),
where you have to estimate the form of X(t).1.3 Hints
- You can use the 'wavemenu' toolbox in MATLAB. For SWFT you can lookup the
function 'specgram'.
- Or else you might consider using
xwpl
- Feel free to use any software (eg. R) you wish.
Q2: Iterated Function Systems [20 pts]
2.1 Problem description
Get the IFS code here. Now use it to generate and plot the following curves (i.e.,
10000
points of each):
- The Koch curve

- The Hilbert curve: (Assume
ε = 0.02)

2.2 What to hand in:
- [10 pts] The coefficients for both curves
- [10 pts] The plot of the curves
Q3: K-Means and K-Harmonic Means[40pts]
3.1 Problem description
Implement the K-Means and the K-Harmonic Means (Zhang
et.al - some corrections in the paper) clustering algorithms. You can use any programming language.
Input Data (ASCII files with windows end-of-line termination):
Input format:
- Input the set of points from the data file as specified above.
- Input the number of centers and the initial coordinates from the given
initialization file. The first line in the initialization file indicates the
number of centers k, to use. Then it gives the x y coordinate values of all the k
centers. Sample file: init
Output format:
- The number of iterations n, taken to converge.
- The (x, y) coordinate values of the k-centers after convergence.
3.2 What to hand in:
- [10 pts] Hard copy of the code.
- [10 pts] K-Means: Output of the
program on the given 2d Dataset
i) Using a good initialization of centers:
init_good
ii) Using a bad initialization of centers:
init_bad
- [5 pts] K-Means: Plot the cluster
centers superimposed on the dataset for both initialization cases.
- [10 pts] K-Harmonic Means: Output
of the program on the given 2d Dataset with the same initial configurations as
in 2 above.
- [5 pts] K-Harmonic Means: Plot the
cluster centers superimposed on the dataset.
Q4: Recursive Least Squares and Autoregression [25pts]
4.1 Problem description:
Implement the method of Recursive Least Squares,
with forgetting, from the paper by [Yi et
al, ICDE 2000], or from [Chen & Roussopoulos, SIGMOD 94].
- Language:
You may choose among C/C++, perl, python, Java, R, Matlab. However, your
algorithm must be incremental,
processing one row of data at a time.
- Input format:
n+1 numbers per line, blank separated; the last column is the dependent
variable
- Output format:
n regression coefficients, w1 ... wn, so that
<last-column> ~ w1 * <column-1> + ... wn *
<column-n>
For example, you could see this python
code which (a) needs "Numerical Python", and (b) has no forgetting factor.
You can "tar xvf" and then "make" it.
Note: The code is provided to mainly show you the expected formats of
the input and output. Make sure you test it thoroughly, or rewrite it from scratch.
Again, make sure that your algorithm is incremental, processing one line at a time.
No points if you load the full dataset in
memory!
You will use the above code to perform a Autoregression on the following
time-series dataset: Wave Data.
- The Autoregressive model AR(2) is:
xt = w1*xt-1 + w2*xt-2
,where xt denotes the value of x at time t.
- Preprocessed data: For your convenience we have
preprocessed the data, so that each line is of the format: <xi-2 ,xi-1,xi>
You can use this as input for your program.
4.2 What to hand in:
1. [5 pts] Hard Copy of your code
For the first half of the signal (t ≤ 256):
2. [1 pts] Regression coefficients
at t=256 without forgetting.
3. [3 pts] Regression coefficients
at t=256 when the forgetting factor is lambda = 0.98.
4. [2 pts] Estimate the frequency f, of the signal
from the formula: w2 = -1, and 2πf
= arctan ( sqrt(-w12 + 4) / w1).
Assume sampling rate = 1 per time step.
5. [3 pts] Verify the frequency you calculated using
DFT. Give the amplitude-frequency plot.
6. [3 pts] Assuming the 2 initial values of x (x1 and x2), forecast the values of x till
t=256 using the coefficients from 1 above. Plot the original and
forecasted curves for both cases.For the full signal (t ≤ 512):
7. [1 pts] Regression coefficients
at t=512 without forgetting.
8. [5 pts] Regression coefficients
t=512 when the forgetting factor is lambda = 0.98
9. [2 pts] To what frequency do the coefficients correspond?