CARNEGIE MELLON UNIVERSITY
15-826 - Multimedia databases and data mining
Fall 2011 C. Faloutsos
Homework 2
Out: Oct. 27 2011
Due: Nov. 8, 2011, before 12 noon (class time), via e-mail to the TA
Reminders:
- Weight: 1/3 of the weight for homeworks, i.e., 3.33% of the course grade. The maximum score is 100 points.
- Policy: Like all homeworks, this has to be done individually, not in groups.
- Rough estimate for time to completion: 16-20 hours, with 4-5 hours for each of the 4 coding tasks and 10-30 minutes for each task with plots.
Check-list: What to deliver
- Fractals
- code that generates the Bn fractal
- plots of the fractals, Hausdorff plot
- values for fractal and Hausdorff dimension
- Multifractals
- code that generates traffic
- time series and FDNQ plots for the generated data, and the correlation fractal dimension determined for each plot
- Power laws
- graph for 'mystery' dataset; observations
- code to generate data, correlation plot, scatterplot
- String editing distance (SED)
- code for modified SED
- list of top 3 suggestions for each word in the list
Please e-mail your solutions in a single message to Ina (mfiterau at cs dot cmu dot edu) with the subject: [Databases - Homework 2 - ANDREWID].
- Your write-up, including plots and numerical results, should be attached and named Databases_Homework2_ANDREWID.pdf
- Your code should be archived and submitted with the name Databases_Homework2_ANDREWID.zip. For every piece of code, there should be a makefile, so that 'make' runs your code and produces the results.
Q1. Fractals [25 points]
- [10 points] Write code to generate the 'Battlement' (B) curve Bn of order n (for any n), as shown in the following image:
(We call it 'Battlement' because it looks like the footprint of a castle.) At each iteration, all lines change as follows:
Hint: use recursion, to plot each of the nine little pieces.
- [2 points] Plot the next two iterations of the curve: B2 and B3.
- [3 points] From its description, estimate the fractal dimension D of the Battlement curve B∞ (the limit of Bn as n goes to infinity).
- [8 points] Use A0 = 1000 as the initial value, and generate the corner points of curve B4. Give the Hausdorff plot of those points.
- [2 points] What is the Hausdorff dimension of the B4 fractal?
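As a rough illustration of how a Hausdorff plot can be produced from a list of corner points, here is a minimal box-counting sketch. It assumes that the Hausdorff plot is the log-log plot of the number of occupied grid cells against the cell side, and that the dimension is read off as the negative slope of its linear part. The function names `box_counts` and `loglog_slope` are hypothetical, and the choice of cell sizes is left to you.

```python
import math


def box_counts(points, sizes):
    """For each cell side s, count the grid cells that contain at least one point."""
    result = []
    for s in sizes:
        cells = {(math.floor(x / s), math.floor(y / s)) for x, y in points}
        result.append((s, len(cells)))
    return result


def loglog_slope(pairs):
    """Least-squares slope of log(y) against log(x) for a list of (x, y) pairs."""
    xs = [math.log(x) for x, _ in pairs]
    ys = [math.log(y) for _, y in pairs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Sanity check on a shape of known dimension: points along a straight line.
    line = [(i, 0) for i in range(1024)]
    counts = box_counts(line, [1, 2, 4, 8, 16])
    print("estimated dimension:", -loglog_slope(counts))
```

Only the range of cell sizes over which the plot is roughly linear should be used for the fit; very small and very large cells saturate and flatten the curve.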
Q2. Multifractals (Bursty Time Series) [25 points]
If we are given a bursty time sequence, how can we find its 'bias' parameter? The goal of this question is to show that the correlation fractal
dimension can help us answer this question.
We will implement the b-model proposed by Wang+ [ICDE 2002]
to generate traces (or sequences) of values: each value corresponds to,
say, the number of disk accesses at a time interval. These traces
follow the multifractal distribution with b:(1-b) splitting probabilities.
The data you generate should be in the following format, which we shall refer to as the 'histogram' format:
timetick | #accesses
1 | 13
2 | 2
5 | 3012
This format is expected by the FDNQ package. Notice that timeticks with no disk accesses are omitted (like timeticks #3 and #4 in the toy example above).
More details: As explained in the foils (slide 42 of the lecture slides), a bias of b=0.8, for example,
means that within a given time interval, 80% of the accesses happen in
one half of the interval and the remaining 20% in the other half, and
this splitting of accesses happens recursively for each of those
halves. Your implementation should create deterministic splits, meaning
that, for all splits, b should be applied to the left half.
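The recursive splitting described above can be sketched as follows. This is a minimal illustration, not the method of the paper: it assumes that the number of timeticks is a power of two, that fractional counts are handled by rounding the left share to the nearest integer (so the total is preserved), and that the histogram file uses whitespace-separated columns. The names `b_model_trace` and `histogram_lines` are hypothetical; check the FDNQ documentation for the exact file format it expects.

```python
def b_model_trace(total, ticks, bias):
    """Deterministic b-model: split counts level by level, giving `bias` to the left half."""
    if ticks < 1 or ticks & (ticks - 1):
        raise ValueError("ticks must be a power of two")
    counts = [total]
    while len(counts) < ticks:
        next_level = []
        for n in counts:
            left = round(bias * n)      # assumption: round the left share to an integer
            next_level.extend((left, n - left))
        counts = next_level
    return counts


def histogram_lines(counts):
    """Format counts as 'timetick #accesses' lines, omitting empty timeticks (1-based)."""
    return [f"{t} {n}" for t, n in enumerate(counts, start=1) if n > 0]


if __name__ == "__main__":
    for line in histogram_lines(b_model_trace(100, 4, 0.8)):
        print(line)
```

Rounding at every level is only one of several ways to keep the counts integral; another option is to carry fractional weights down to the leaves and round once at the end.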
Generate three traces of values. Each trace has T=2048 timeticks (some
timeticks may not have disk accesses) and the total number of disk
accesses is N=20,000. The biases of the traces are b=0.7, b=0.9, and
b=0.5 respectively. Turn in:
- [15 points] Your code that generates the traces (again, with a makefile)
- [5 points] A plot for the histogram format of each trace (similar to the figure below or the paper); three plots total.
- [4 points] The correlation integral for each trace (treating each disk access as a point in 1-d space). You may use the FDNQ package for this, by running the command
perl fdnq.pl -h -q2 <histogram-file-name>
- [1 point] Compute and report the correlation fractal dimension for each trace, as well as the estimate of the bias factor, using the formula:
$D_2 = -\log_2\left(b^2 + (1-b)^2\right)$
Hints:
- Section 4.1 of the paper describes the naive method for generating the traces; section 4.4 describes an efficient alternative
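To go from a measured correlation fractal dimension back to a bias estimate, the formula relating D2 and b can be inverted. Setting 2^(-D2) = b^2 + (1-b)^2 = 2b^2 - 2b + 1 and solving the quadratic gives b = (1 + sqrt(2^(1-D2) - 1)) / 2, taking the root with b >= 0.5. The sketch below shows both directions; the function names are hypothetical, and the inversion assumes 0 <= D2 <= 1.

```python
import math


def d2_from_bias(b):
    """Correlation fractal dimension of a b-model trace with bias b."""
    return -math.log2(b * b + (1.0 - b) * (1.0 - b))


def bias_from_d2(d2):
    """Invert d2_from_bias, returning the root with b >= 0.5."""
    disc = 2.0 ** (1.0 - d2) - 1.0
    if disc < -1e-12:
        raise ValueError("D2 must not exceed 1 for a b-model trace")
    return 0.5 * (1.0 + math.sqrt(max(disc, 0.0)))


if __name__ == "__main__":
    for b in (0.5, 0.7, 0.9):
        d2 = d2_from_bias(b)
        print(f"b={b}: D2={d2:.4f}, recovered b={bias_from_d2(d2):.4f}")
```

Because the formula is symmetric in b and 1-b, a trace with bias 0.3 has the same D2 as one with bias 0.7; the estimate can only recover the larger of the two shares.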
Q3. Power laws - 'forensics' [25 points]
What can you say about a high-dimensional dataset that you cannot easily plot and visualize? The goal here is to show how the correlation integral can help.
- [5 points] You are given a 'mystery' dataset containing 3D points
(e.g., star coordinates). The goal is to find information about this
dataset. Plot (and hand-in) the correlation integral for these points.
- [5 points] Based on the correlation integral, what can you say about
the points? Are they uniformly distributed? Do they fall on a line? Do
they form clusters? Do they have characteristic scales? If yes, which one(s)? List as many observations as you can.
- [15 points] Reverse 'forensics': Generate a 2-d dataset whose
correlation integral has the following segments, from left to right: flat, slope = 1, flat, slope = 2, flat. You
may generate as many points as you want; you may use any values you
want for the characteristic scales. Hand in the code (with makefile), the correlation plot and the XY plot.
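For small datasets, the correlation integral can also be computed directly, which is handy for checking a generated dataset before running a dedicated tool. The sketch below counts, for each radius r, the number of point pairs whose Euclidean distance is at most r; plotting these counts against r on log-log axes gives the correlation plot, and the slope of a linear segment is the correlation fractal dimension over that range of scales. The function name `correlation_integral` is hypothetical, and the quadratic time and memory cost makes this approach suitable only for a few thousand points.

```python
import bisect
import math
from itertools import combinations


def correlation_integral(points, radii):
    """Return [(r, number of point pairs with distance <= r)] for each r in radii."""
    # All pairwise distances, sorted once so each radius is a binary search.
    dists = sorted(math.dist(p, q) for p, q in combinations(points, 2))
    return [(r, bisect.bisect_right(dists, r)) for r in radii]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    print(correlation_integral(pts, [1.0, 2.0]))
```

Flat segments in the resulting plot correspond to ranges of r in which no new pairs appear, which is how gaps between clusters show up as characteristic scales.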
Q4. String editing distance (S.E.D.) [25 points]
How can we correct typing errors, where finger-slips are frequent? In this
question, you will build a keyboard-driven spell-checker. The intuition
is that the closer two keys are, the more likely it is for one to be
substituted for the other.
- [20 points] Implement the spell-checker for the keyboard in this file, which describes a typical American QWERTY keyboard, and where the *
characters indicate separators. Change the substitution cost to 0.1 if
the keys are on the same row and adjacent (like q and w). Otherwise, the substitution cost is 1, as are the insertion and deletion costs. You may use this SED code as a base for your program.
- [5 points] Run your program on the following words, with the following dictionary. For each word, give the top 3 replacement options. Break ties by returning the alphabetically first word.
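The modified cost scheme can be sketched with the standard dynamic-programming recurrence for editing distance. This is an illustration only: the `ROWS` tuple is a hypothetical stand-in for the keyboard file (whose layout and * separators are not modeled here), words are assumed to be lowercase, and the names `substitution_cost`, `keyboard_sed` and `top_suggestions` are not taken from the provided SED code.

```python
# Hypothetical letter rows; the real layout should be read from the keyboard file.
ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")


def substitution_cost(a, b):
    """0 for identical keys, 0.1 for adjacent keys on the same row, 1 otherwise."""
    if a == b:
        return 0.0
    for row in ROWS:
        i, j = row.find(a), row.find(b)
        if i >= 0 and j >= 0 and abs(i - j) == 1:
            return 0.1
    return 1.0


def keyboard_sed(s, t):
    """String editing distance with unit insert/delete costs and keyboard-aware substitution."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        cur = [float(i)]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1.0,                            # delete cs
                           cur[j - 1] + 1.0,                         # insert ct
                           prev[j - 1] + substitution_cost(cs, ct)))  # substitute
        prev = cur
    return prev[-1]


def top_suggestions(word, dictionary, k=3):
    """The k closest dictionary words; ties are broken alphabetically."""
    # Rounding guards against floating-point noise when comparing sums of 0.1.
    return sorted(dictionary, key=lambda w: (round(keyboard_sed(word, w), 6), w))[:k]


if __name__ == "__main__":
    print(top_suggestions("csr", ["bar", "cat", "car"]))
```

Sorting on the pair (distance, word) implements the alphabetical tie-break in a single pass; for a large dictionary, keeping only the best k candidates with a heap would avoid sorting the whole list.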
Last edited: 25 October 2011, by Ina Fiterau and Christos Faloutsos