15-826 - Multimedia databases and data mining

Fall 2011 C. Faloutsos

- Weight: 1/3 of the weight for homeworks, i.e., 3.33% of the course grade. The maximum score is 100 points.
- Policy: Like all homeworks, this has to be done individually - not in groups
- Rough estimate for time to completion: 16-20 hours, with 4-5 hours for each of the 4 coding tasks, and 10-30 minutes for each task with plots.

- Fractals
  - code that generates the B_{n} fractal
  - plots of the fractals, Hausdorff plot
  - values for fractal and Hausdorff dimension
- Multifractals
  - code that generates traffic
  - time series and fdnq plots for generated data, and the correlation fractal dimension determined for each plot
- Power laws
  - graph for 'mystery' dataset; observations
  - code to generate data, correlation plot, scatterplot
- String editing distance (SED)
  - code for modified SED
  - list of top 3 suggestions for each word in the list

Please e-mail your solutions, in a single message, to Ina.

- Your write-up, including plots and numerical results, should be attached and named **Databases_Homework2_ANDREWID.pdf**.
- Your code should be archived and submitted with the name **Databases_Homework2_ANDREWID.zip**. For every piece of code, there should be a makefile, so that 'make' runs your code and produces the results.

- [10 points] Write code to generate the 'Battlement' (B) curve B_{n} of order n (for any n), as shown in the following image:

(We call it 'Battlement' because it looks like the footprint of a castle.) At each iteration, all lines change as follows:

Hint: use recursion to plot each of the nine little pieces.

- [2 points] Plot the next two iterations of the curve: B_{2} and B_{3}.
- [3 points] From its description, estimate the fractal dimension D of the Battlement curve B_{infinity}.
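The recursion in the hint can be sketched generically. The Battlement generator itself is only given in the figure, so the sketch below uses the Koch curve as a stand-in generator; `KOCH_GENERATOR`, `expand`, and `curve` are hypothetical names, and the generator list would need to be replaced with the nine-piece Battlement pattern.

```python
from typing import List, Tuple

Point = Tuple[float, float]

# Stand-in generator (an assumption): the Koch curve, given as breakpoints
# along the unit segment from (0, 0) to (1, 0). Replace these breakpoints
# with the ones read off the Battlement figure.
KOCH_GENERATOR: List[Point] = [
    (0.0, 0.0), (1 / 3, 0.0), (0.5, 3 ** 0.5 / 6), (2 / 3, 0.0), (1.0, 0.0),
]


def expand(p: Point, q: Point, order: int, generator: List[Point]) -> List[Point]:
    """Corner points of the order-`order` curve from p to q, excluding q."""
    if order == 0:
        return [p]
    dx, dy = q[0] - p[0], q[1] - p[1]
    # Map each generator point (u, v) onto the segment p->q:
    # u runs along the segment, v perpendicular to it.
    mapped = [(p[0] + u * dx - v * dy, p[1] + u * dy + v * dx) for u, v in generator]
    points: List[Point] = []
    for a, b in zip(mapped, mapped[1:]):
        points.extend(expand(a, b, order - 1, generator))
    return points


def curve(order: int, length: float = 1000.0,
          generator: List[Point] = KOCH_GENERATOR) -> List[Point]:
    """All corner points of the curve of the given order, including both ends."""
    start, end = (0.0, 0.0), (length, 0.0)
    return expand(start, end, order, generator) + [end]
```

A generator with k pieces yields k^n segments at order n, so the point list grows quickly; writing the points to a file and plotting them with any plotting tool is enough for the hand-in.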

- [8 points] Use as initial A_{0} = 1000, and generate the corner points of curve B_{4}. Give the Hausdorff plot of those points.
- [2 points] What is the Hausdorff dimension of the B_{4} fractal?
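One way to produce such a plot is box counting. The sketch below assumes that the 'Hausdorff plot' means the log-log plot of the number of occupied grid cells against the cell side r, with the dimension estimated as the negative of the fitted slope; `box_counts` and `loglog_slope` are hypothetical helper names.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def box_counts(points: Iterable[Tuple[float, float]],
               side_lengths: Sequence[float]) -> List[Tuple[float, int]]:
    """For each cell side r, count the grid cells that contain at least one point."""
    pts = list(points)
    result = []
    for r in side_lengths:
        occupied = {(math.floor(x / r), math.floor(y / r)) for x, y in pts}
        result.append((r, len(occupied)))
    return result


def loglog_slope(pairs: Sequence[Tuple[float, int]]) -> float:
    """Least-squares slope of log(count) versus log(r)."""
    xs = [math.log(r) for r, _ in pairs]
    ys = [math.log(n) for _, n in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Sanity check on a shape of known dimension: points on a straight line.
line = [(float(i), 0.0) for i in range(1024)]
line_dimension = -loglog_slope(box_counts(line, [1, 2, 4, 8, 16, 32, 64]))
```

The fit should only use the range of r where the plot is roughly linear; very small and very large cell sides flatten out because the point set is finite.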

If we are given a bursty time sequence, how can we find its 'bias' parameter? The goal of this question is to show that the correlation fractal
dimension can help us answer this question.

We will implement the *b-model* proposed by Wang+ [ICDE 2002]
to generate traces (or sequences) of values: each value corresponds to,
say, the number of disk accesses at a time interval. These traces
follow the multifractal distribution with b:(1-b) splitting probabilities.

The data you generate should be in the following format, which we shall refer to as the 'histogram' format:

timetick | #accesses |
---|---|
1 | 13 |
2 | 2 |
5 | 3012 |

More details: as explained in the foils (slide 42 of the lecture slides), a bias of, for example, b=0.8 means that within a given time interval, 80% of the accesses happen in one half of the interval and the remaining 20% in the other half; this splitting of accesses happens recursively for each of those halves. Your implementation should create deterministic splits, meaning that, for all splits, b should be applied to the left half.
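The deterministic splitting described above can be sketched as follows. This is a minimal illustration, not the method from the paper: `b_model` and `to_histogram` are hypothetical names, and rounding the left share to the nearest integer is an assumption made so that the counts stay whole numbers and still add up to the total.

```python
from typing import List, Tuple


def b_model(total: int, levels: int, bias: float) -> List[int]:
    """Spread `total` accesses over 2**levels timeticks.

    At every split, the left half receives a `bias` fraction of the accesses
    (rounded to an integer, which is an assumption) and the right half the rest.
    """
    counts = [total]
    for _ in range(levels):
        next_counts: List[int] = []
        for n in counts:
            left = round(bias * n)
            next_counts.extend((left, n - left))
        counts = next_counts
    return counts


def to_histogram(counts: List[int]) -> List[Tuple[int, int]]:
    """Rows of (timetick, #accesses), 1-based, skipping empty timeticks."""
    return [(t, n) for t, n in enumerate(counts, start=1) if n > 0]


trace = b_model(total=20000, levels=11, bias=0.7)  # 2**11 = 2048 timeticks
rows = to_histogram(trace)
```

Each level doubles the number of intervals, so 11 levels give 2048 timeticks; the histogram rows can then be written out one per line.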

Generate three traces of values. Each trace has T=2048 timeticks (some timeticks may not have disk accesses) and the total number of disk accesses is N=20,000. The biases of the traces are b=0.7, b=0.9, and b=0.5 respectively. Turn in:

- [15 points] Your code that generates the traces (again, with a makefile)

- [5 points] A plot for the histogram format of each trace (similar to the figure below or the paper); three plots total.
- [4 points] The correlation integral for each trace (treating each disk access as a point in 1-d space). You may use the FDNQ package for this, by running the command

    perl fdnq.pl -h -q2 <histogram-file-name>

- [1 point] Compute and report the correlation fractal dimension for each trace, as well as the estimate of the bias factor, using the formula:

    $D_2 = -\log_2\left(b^2 + (1-b)^2\right)$

- Section 4.1 of the paper describes the naive method for generating the traces; section 4.4 describes an efficient alternative
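To turn a measured correlation fractal dimension into a bias estimate, the formula for D2 has to be inverted. Setting c = 2^(-D2) gives the quadratic 2b^2 - 2b + (1 - c) = 0, whose roots are b = (1 ± sqrt(2c - 1)) / 2. The sketch below returns the root with b >= 0.5 (the other root is 1 - b, the same split mirrored); the function names are hypothetical.

```python
import math


def d2_from_bias(b: float) -> float:
    """Correlation fractal dimension of a b-model trace: D2 = -log2(b^2 + (1-b)^2)."""
    return -math.log2(b * b + (1 - b) * (1 - b))


def bias_from_d2(d2: float) -> float:
    """Invert the formula above, returning the root with b >= 0.5.

    Meaningful for 0 <= d2 <= 1; slightly larger measured values are clamped to b = 0.5.
    """
    c = 2.0 ** (-d2)
    disc = max(0.0, 2.0 * c - 1.0)  # clamp small negatives caused by noisy estimates
    return 0.5 * (1.0 + math.sqrt(disc))
```

For instance, `bias_from_d2(d2_from_bias(0.8))` recovers 0.8 up to floating-point error, and a trace with D2 = 1 corresponds to the uniform case b = 0.5.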

- [5 points] You are given a 'mystery' dataset containing 3D points (e.g., star coordinates). The goal is to find information about this dataset. Plot (and hand in) the correlation integral for these points.

- [5 points] Based on the correlation integral, what can you say about the points? Are they uniformly distributed? Do they fall on a line? Do they form clusters? Do they have characteristic scales? If yes, which one(s)? List as many observations as you can.
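For a small dataset, the correlation integral can also be computed directly, which helps when checking what a given slope or plateau means. The sketch below is a brute-force version (quadratic in the number of points, so only suitable for a few thousand points) that counts pairs of distinct points within distance r; `correlation_integral` is a hypothetical name and is not part of the FDNQ package.

```python
import math
from bisect import bisect_right
from typing import List, Sequence, Tuple


def correlation_integral(points: Sequence[Sequence[float]],
                         radii: Sequence[float]) -> List[Tuple[float, int]]:
    """For each radius r, count the pairs of distinct points at distance <= r."""
    pts = list(points)
    # All pairwise Euclidean distances, sorted once so each radius is a binary search.
    dists = sorted(
        math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]
    )
    return [(r, bisect_right(dists, r)) for r in radii]


# Tiny 3-d example: pairwise distances are 1, 2 and sqrt(5).
demo = correlation_integral([(0, 0, 0), (1, 0, 0), (0, 2, 0)], [0.5, 1, 2, 3])
```

Plotting the counts against r on log-log axes gives the correlation plot: a flat stretch means no new pairs appear over that range of r, while a stretch of slope s suggests the points behave like an s-dimensional set at those scales.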

- [15 points] Reverse 'forensics': Generate a 2-d dataset with correlation integral as follows, from left to right: (flat, slope = 1, flat, slope = 2, flat). You may generate as many points as you want; you may use any values you want for the characteristic scales. Hand in the code (with makefile), the correlation plot and the XY plot.

- [20 points] Implement the spell-checker for the keyboard in this file, which describes a typical American QWERTY keyboard, and where the '*' characters indicate separators. Change the substitution cost to 0.1 if the keys are on the same row and adjacent (like *q* and *w*). Otherwise, the substitution cost is 1, and so are the insertion and deletion costs. You may use this SED code as a base for your program.
- [5 points] Run your program on the following words, with the following dictionary. For each word, give the top 3 replacement options. Break ties by returning the alphabetically first word.
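The modified cost scheme can be sketched with the usual dynamic-programming edit distance. This is a minimal illustration and not the provided SED code: the keyboard file is not reproduced here, so the three letter rows in `ROWS` are an assumption about its contents, and `sub_cost`, `keyboard_sed`, and `suggest` are hypothetical names.

```python
from typing import Iterable, List

# Assumed letter rows of a US QWERTY keyboard; the real layout should be
# read from the keyboard file supplied with the homework.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def sub_cost(a: str, b: str) -> float:
    """0 for identical keys, 0.1 for adjacent keys on the same row, 1 otherwise."""
    if a == b:
        return 0.0
    for row in ROWS:
        i, j = row.find(a), row.find(b)
        if i >= 0 and j >= 0 and abs(i - j) == 1:
            return 0.1
    return 1.0


def keyboard_sed(s: str, t: str) -> float:
    """String editing distance with unit insert/delete cost and keyboard-aware substitution."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        cur = [float(i)]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1.0,                   # delete cs
                           cur[j - 1] + 1.0,                # insert ct
                           prev[j - 1] + sub_cost(cs, ct)))  # substitute
        prev = cur
    return prev[-1]


def suggest(word: str, dictionary: Iterable[str], k: int = 3) -> List[str]:
    """Top-k dictionary words by distance, ties broken alphabetically."""
    return sorted(dictionary, key=lambda w: (keyboard_sed(word, w), w))[:k]
```

Sorting on the pair (distance, word) implements the tie-breaking rule directly, since words at equal distance are ordered alphabetically.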

Last edited: 25 October 2011, by Ina Fiterau and Christos Faloutsos