15-826 - Multimedia databases and data mining
Fall 2011 C. Faloutsos

Homework 2
Out: Oct. 27 2011
Due: Nov. 8 2011, before 12 noon (class time), via e-mail to the TA


Check-list: What to deliver

  1. Fractals
  2. Multifractals
  3. Power laws
  4. String editing distance (SED)

Please e-mail your solutions in a single message to Ina (mfiterau at cs dot cmu dot edu) with the subject: [Databases - Homework 2 - ANDREWID].

Q1.  Fractals [25 points]

  1. [10 points] Write code to generate the 'Battlement' (B) curve Bn of order n (for any n), as shown in the following image:

    (We call it 'Battlement' because it looks like the footprint of a castle.) At each iteration, all lines change as follows:
    Hint: use recursion, to plot each of the nine little pieces.
  2. [2 points] Plot the next two iterations of the curve:  B2 and B3.
  3. [3 points] From its description, estimate the fractal dimension D of the Battlement curve B∞.
  4. [8 points] Use A0 = 1000 as the initial value, and generate the corner points of curve B4. Give the Hausdorff plot of those points.
  5. [2 points] What is the Hausdorff dimension of the B4 fractal? 
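
For part 3, one common estimate for a strictly self-similar curve is the similarity dimension D = log(N) / log(1/r), where each segment is replaced by N copies, each scaled by a factor r. The Python sketch below is illustrative only: the function name is made up here, and the Koch curve (N = 4, r = 1/3) stands in as the example because the Battlement generator is defined by the figure. For the Battlement curve, N would be the nine pieces mentioned in the hint, and r would have to be read off the figure.

```python
import math


def similarity_dimension(num_pieces: int, scale: float) -> float:
    """Return D = log(N) / log(1/r) for N copies, each scaled by r (0 < r < 1)."""
    if num_pieces < 1 or not (0.0 < scale < 1.0):
        raise ValueError("need num_pieces >= 1 and 0 < scale < 1")
    return math.log(num_pieces) / math.log(1.0 / scale)


# Stand-in example: the Koch curve replaces each segment by 4 pieces of 1/3 the length.
print(round(similarity_dimension(4, 1 / 3), 4))  # 1.2619
```

The value estimated this way can then be compared against the slope measured from the Hausdorff plot in parts 4 and 5.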

Q2. Multifractals (Bursty Time Series) [25 points]

If we are given a bursty time sequence, how can we find its 'bias' parameter? The goal of this question is to show that the correlation fractal dimension can help us answer this question.

We will implement the b-model proposed by Wang+ [ICDE 2002] to generate traces (or sequences) of values: each value corresponds to, say, the number of disk accesses at a time interval. These traces follow the multifractal distribution with b:(1-b) splitting probabilities. 

The data you generate should be in the following format, which we shall refer to as the 'histogram' format:

timetick #accesses
1 13
2 2
5 3012
This format is expected by the FDNQ package. Notice that timeticks with no disk accesses are omitted (like timeticks #3 and #4 in the toy example above).

More details: As explained in the foils (slide 42 of the lecture slides), a bias of, say, b=0.8 means that within a given time interval, 80% of the accesses happen in one half of the interval and the remaining 20% in the other half, and this splitting of accesses happens recursively for each of those halves. Your implementation should create deterministic splits, meaning that, for every split, the fraction b goes to the left half and 1-b to the right half.
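
The deterministic splitting described above can be sketched in Python as follows. This is a minimal illustration rather than a reference solution: the function names are hypothetical, the number of timeticks is assumed to be a power of two, and rounding each split to whole accesses is an assumption of this sketch (the handout does not say how fractional accesses should be handled).

```python
from typing import List


def b_model(total: int, ticks: int, bias: float) -> List[int]:
    """Split `total` accesses over `ticks` timeticks with a fixed bias.

    At every split the left half receives round(bias * count) accesses and
    the right half receives the remainder, so the grand total is preserved.
    """
    if ticks < 1 or ticks & (ticks - 1):
        raise ValueError("ticks must be a power of two")
    if not 0.0 <= bias <= 1.0:
        raise ValueError("bias must lie in [0, 1]")
    counts = [total]
    while len(counts) < ticks:  # each pass halves every interval
        halved = []
        for c in counts:
            left = round(bias * c)  # assumption: round to whole accesses
            halved.extend((left, c - left))
        counts = halved
    return counts


def to_histogram(counts: List[int]) -> List[str]:
    """Format as 'timetick #accesses' lines, omitting empty timeticks."""
    return [f"{t} {c}" for t, c in enumerate(counts, start=1) if c > 0]


print(b_model(100, 4, 0.8))  # [64, 16, 16, 4]
```

The loop doubles the number of intervals on each pass, which is equivalent to the recursive description but avoids deep recursion.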

Generate three traces of values. Each trace has T=2048 timeticks (some timeticks may not have disk accesses) and the total number of disk accesses is N=20,000. The biases of the traces are b=0.7, b=0.9, and b=0.5 respectively. Turn in:

  1. [15 points] Your code that generates the traces (again, with a makefile)
  2. [5 points] A plot for the histogram format of each trace (similar to the figure below or the paper); three plots total.

  3. [4 points] The correlation integral for each trace (treating each disk access as a point in 1-d space). You may use the FDNQ package for this, by running the command
    perl -h -q2 <histogram-file-name>
  4. [1 point] Compute and report the correlation fractal dimension for each trace, as well as the estimate of the bias factor, using the formula: D2 = -log2(b^2 + (1-b)^2), where log2 is the base-2 logarithm.
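
To turn a measured correlation fractal dimension back into a bias estimate, the formula in part 4 can be solved for b. Since 2^(-D2) = b^2 + (1-b)^2 is a quadratic in b, the root with b >= 0.5 is b = (1 + sqrt(2 * 2^(-D2) - 1)) / 2. The Python sketch below uses hypothetical function names; clamping a slightly negative discriminant to zero (which can happen when a noisy slope estimate exceeds 1) is an assumption of this sketch.

```python
import math


def d2_from_bias(b: float) -> float:
    """Correlation fractal dimension predicted for bias b."""
    return -math.log2(b * b + (1.0 - b) ** 2)


def bias_from_d2(d2: float) -> float:
    """Invert the formula above, returning the root with b >= 0.5."""
    disc = 2.0 * 2.0 ** (-d2) - 1.0
    disc = max(disc, 0.0)  # assumption: clamp noise when d2 is slightly above 1
    return (1.0 + math.sqrt(disc)) / 2.0


for b in (0.5, 0.7, 0.9):
    d2 = d2_from_bias(b)
    print(b, round(d2, 4), round(bias_from_d2(d2), 4))
```

Note that b and 1-b give the same D2, so the formula alone cannot tell which half is favored.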


Q3. Power laws - 'forensics' [25 points]

What can you say about a high-dimensional dataset that you cannot easily plot and visualize? The goal here is to show how the correlation integral can help.

  1. [5 points] You are given a 'mystery' dataset containing 3D points (e.g., star coordinates). The goal is to find information about this dataset. Plot (and hand in) the correlation integral for these points.
  2. [5 points] Based on the correlation integral, what can you say about the points? Are they uniformly distributed? Do they fall on a line? Do they form clusters? Do they have characteristic scales? If yes, which one(s)? List as many observations as you can.
  3. [15 points] Reverse 'forensics': Generate a 2-d dataset with  correlation integral as follows, from left to right: (flat, slope = 1, flat, slope = 2, flat). You may generate as many points as you want; you may use any values you want for the characteristic scales. Hand in the code (with makefile), the correlation plot and the XY plot.
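
For reference, the correlation integral used throughout this question can be sketched by brute force: for each radius r, count the point pairs whose distance is at most r, then look at the slope on a log-log plot. The Python sketch below is a quadratic-time illustration with made-up function names; it is not the FDNQ implementation, whose exact conventions may differ.

```python
import math
from bisect import bisect_right
from itertools import combinations


def correlation_integral(points, radii):
    """For each r in radii, count the point pairs at distance <= r."""
    dists = sorted(math.dist(p, q) for p, q in combinations(points, 2))
    return [bisect_right(dists, r) for r in radii]


def loglog_slope(radii, counts):
    """Least-squares slope of log(count) versus log(r); counts must be > 0."""
    xs = [math.log(r) for r in radii]
    ys = [math.log(c) for c in counts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# 100 evenly spaced points on a line: the slope should be close to 1.
line = [(float(i), 0.0) for i in range(100)]
radii = [1, 2, 4, 8, 16]
counts = correlation_integral(line, radii)
print(counts)  # [99, 197, 390, 764, 1464]
print(loglog_slope(radii, counts))
```

Flat stretches in the plot correspond to ranges of r in which no new pairs appear, which is what the characteristic scales in parts 2 and 3 refer to.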

Q4. String editing distance (S.E.D.) [25 points]

How can we correct typing errors, where finger-slips are frequent? In this question, you will build a keyboard-driven spell-checker. The intuition is that the closer two keys are, the more likely it is for them to substitute each other.
  1. [20 points] Implement the spell-checker for the keyboard in this file, which describes a typical American QWERTY keyboard, and where the * characters indicate separators. Change the substitution cost to 0.1 if the keys are on the same row and adjacent (like q and w). Otherwise, the substitution cost is 1, and so are the insertion and deletion costs. You may use this SED code as a base for your program.
  2. [5 points] Run your program on the following words, with the following dictionary. For each word, give the top 3 replacement options. Break ties by returning the alphabetically first word.
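
The cost scheme in part 1 and the tie-breaking rule in part 2 can be sketched as a weighted edit distance in Python. Everything named here is an assumption for illustration: the three letter rows are hard-coded instead of being parsed from the keyboard file, only lowercase letters are handled, and the function names are hypothetical. Costs are kept as integer tenths internally so that ties are compared exactly.

```python
# Assumed letter rows of a QWERTY keyboard (not parsed from the keyboard file).
ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")

# Pairs of keys that sit next to each other on the same row, in both orders.
ADJACENT = set()
for row in ROWS:
    for a, b in zip(row, row[1:]):
        ADJACENT.update({(a, b), (b, a)})


def sub_cost(a: str, b: str) -> int:
    """Substitution cost in tenths: 0 if equal, 1 if adjacent, else 10."""
    if a == b:
        return 0
    return 1 if (a, b) in ADJACENT else 10


def keyboard_sed(s: str, t: str) -> float:
    """Edit distance with insert/delete cost 1 and keyboard-aware substitution."""
    prev = [10 * j for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [10 * i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 10,               # delete a
                           cur[j - 1] + 10,            # insert b
                           prev[j - 1] + sub_cost(a, b)))
        prev = cur
    return prev[-1] / 10


def top_matches(word, dictionary, k=3):
    """Closest k words; ties go to the alphabetically first word."""
    return sorted(dictionary, key=lambda w: (keyboard_sed(word, w), w))[:k]


print(top_matches("wat", ["cat", "qat", "eat", "wet"]))  # ['eat', 'qat', 'cat']
```

Sorting on the pair (distance, word) is one simple way to implement the alphabetical tie-break.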

Last edited: 25 October 2011,  by Ina Fiterau and Christos Faloutsos