Carnegie Mellon University
15-826 – Multimedia Databases and Data Mining
Spring 2010 – C. Faloutsos
Homework 2
Due date: March 16, 3:00pm

Hand in a hard copy and a soft copy of your homework in class. It must be typed; handwritten material may not be graded.
Put the soft copy and code in YourAndrewID_826hw2.tar and email it to the TA (dchau at cs) in a single e-mail, whose title must be YourAndrewID_826hw2.
Homework must be done individually.
For all the code that you write, please make sure it compiles and runs on a Linux machine of the Andrew cluster. To connect to one such machine, use your favorite SSH client (e.g., putty) to connect to linux.andrew.cmu.edu, and log in using your Andrew ID and password.
You may also develop your code using other software and/or operating systems (e.g., cygwin on Windows), as long as your code works on an Andrew cluster machine.
Weight: 1/3 of the homework grade, that is, 3.33% of the total course grade
Rough time estimates: 9-10 hours total, specifically:
- Q1: 3 hours
- Q2: 1 hour
- Q3: 3 hours
- Q4: 2-3 hours

Q1. R-tree [30 points]

We will implement a new command l, for inc(l)usion query, that prints out all the data rectangles that are contained in the closed bounding box specified by two points: low = (l₁, l₂, ..., l_n) and high = (h₁, h₂, ..., h_n). The command should ask for the low and high values dimension by dimension, i.e., in the order l₁, h₁, l₂, h₂, l₃, h₃, .... Your implementation should not modify the tree's data structure itself.

Please build the R-Tree Package (tar xvf; make demo). This creates the bin/DRmain program and runs it on some small datasets. Before you make any changes, running make test1 should print an "Algorithm not yet implemented" message after loading the appropriate dataset.

Turn in:

[5 points] The list of qualifying rectangles for dataset 1 (2-d). This should run using make test1.
[5 points] The list of qualifying rectangles for dataset 2 (3-d). This should run using make test2.
[20 points] A tarball (YourAndrewID_rtree.tar) of your code emailed to the TA (dchau at cs) and a hard copy of the files you changed. On the hard copy, highlight or circle the code that you modified or added.

Hint #1: you may want to modify the following files, and possibly others:

DRTree\binsrc\DRmain.c
DRTree\libsrc\src\DR.c
DRTree\libsrc\src\DRtree.c

Hint #2: you may want to look at the DRrectCover function in DRTree\libsrc\src\DRrect.c, which checks whether one rectangle contains another rectangle.
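
As an illustration only, the containment test behind the inclusion query can be sketched in Python (the package itself is written in C). The function names parse_query_box, is_contained and inclusion_query are hypothetical and are not part of the DRTree package; the linear scan stands in for the tree traversal you would actually write.

```python
def parse_query_box(values):
    """Split [l1, h1, l2, h2, ...] into the low and high corner points."""
    if len(values) % 2 != 0:
        raise ValueError("expected one low and one high value per dimension")
    low = tuple(values[0::2])   # l1, l2, ...
    high = tuple(values[1::2])  # h1, h2, ...
    return low, high


def is_contained(rect_low, rect_high, box_low, box_high):
    """True if the data rectangle lies inside the closed query box."""
    return all(
        bl <= rl and rh <= bh  # closed box: touching the boundary qualifies
        for rl, rh, bl, bh in zip(rect_low, rect_high, box_low, box_high)
    )


def inclusion_query(rectangles, box_low, box_high):
    """Linear scan over (low, high) pairs; a stand-in for the tree search."""
    return [r for r in rectangles if is_contained(r[0], r[1], box_low, box_high)]
```

Note that inside the tree, a subtree can hold qualifying rectangles whenever its bounding rectangle overlaps the query box, so pruning internal nodes on containment alone would miss results; the containment test applies to the data rectangles at the leaves.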

Q2. Measuring Mystery Data Fractal Dimension [20 points]

We will use the fdnq software package to calculate the correlation integral of a collection of 6-dimensional points. Build the fdnq package (unzip; make). This computes the fractal dimensions for several example datasets, and shows the plot for one of them.

[5 points] Turn in a plot of the correlation integral (fdnq -q2). Write down the command you used to generate the plot. We suggest that you use the default values for r (0.001 for minimum radius) and R (10000, for maximum radius).
[10 points] For every line segment in the plot that is roughly linear, write down its range and its slope.
[5 points] What do you think the points in the mystery dataset look like?
For example, do they form a line, some clusters, or something else?
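
To make the quantity in the plot concrete, here is a minimal Python sketch of a correlation integral and of a log-log slope estimate. It is a naive O(n²) pair count written for illustration and is not a description of how fdnq computes its result; the function names are hypothetical, and math.dist requires Python 3.8 or later.

```python
import bisect
import math


def correlation_integral(points, radii):
    """For each radius r, count the pairs of points at distance <= r."""
    dists = sorted(
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )
    # bisect_right gives the number of sorted distances that are <= r
    return [bisect.bisect_right(dists, r) for r in radii]


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x); all values must be > 0."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den
```

For points spread evenly along a line, the slope over small radii comes out close to 1; fitting the slope separately over each roughly linear range of radii gives the per-segment values asked for above.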

Hint for Cygwin users: you will need to install the "tcsh" and "gnuplot" package, and maybe others.

Q3. 2D Cantor Dust [30 points]

[Figure: Cantor dust, iterations 0, 1, 2, 3]

The 2D Cantor dust can be generated recursively, as shown in the figure above, by deleting the middle third segment on both dimensions.

[15 points] Write a program that generates the points (shown as black dots above) for the Cantor dust, up to 6 iterations (4⁽⁶⁺¹⁾=16,384 points total). Turn in:
1. your code (any major language is acceptable: C/C++, Perl, Python, Java, etc.)
2. a scatterplot of the points
[10 points] Using the fdnq package, or your own program (written in a major language, such as C/C++, Java, Perl, Python), turn in:
1. the boxcounting plot (fdnq -q2)
2. slope (fractal dimension).
[5 points] Using the same code, turn in:
1. the "Hausdorff" fractal dimension (fdnq -q0)
2. the corresponding plot.
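
One possible way to generate the points is sketched below in Python. It assumes, based on the count 4⁽⁶⁺¹⁾ given above, that the points at iteration k are the four corners of each of the 4^k surviving squares, so iteration 0 yields the 4 corners of the unit square; check this against the figure before relying on it. The function names are hypothetical.

```python
from itertools import product


def cantor_endpoints(iterations):
    """Endpoints of the 1-D Cantor set intervals after the given iterations."""
    scale = 3 ** iterations          # work on an integer grid to stay exact
    intervals = [(0, scale)]
    for _ in range(iterations):
        refined = []
        for lo, hi in intervals:
            third = (hi - lo) // 3
            refined.append((lo, lo + third))   # keep the left third
            refined.append((hi - third, hi))   # keep the right third
        intervals = refined
    # rescale the integer endpoints back to the unit interval
    return [p / scale for interval in intervals for p in interval]


def cantor_dust(iterations):
    """2-D points: every pairing of a 1-D endpoint in x with one in y."""
    coords = cantor_endpoints(iterations)
    return list(product(coords, repeat=2))
```

Writing each point as "x y" on its own line gives a file that can be used for the scatterplot; whether fdnq accepts that layout directly should be checked against its documentation.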

Q4. Multifractal [20 points]

We will implement the b-model devised by Wang+ [ICDE 2002] to generate several traces (or sequences) of values, each value corresponding to, say, the number of disk accesses at a time interval. These traces follow the multifractal distribution with b:(1-b) splitting probabilities, where 0.5<=b<1. The data you generate could come in two formats:

histogram format <timetick #accesses>, e.g.,
1 3
2 10
5 30
...
timestamp format <timestamp>, e.g.,
1
1
1
2
2
2
...

We prefer that you use the histogram version because it is more compact.

For example, a bias b=0.8 means that within a given time interval, 80% of the accesses happen in one half of the interval and the remaining 20% in the other half, and this splitting of accesses happens recursively for each of those halves.

Your implementation should create deterministic splits: at every split, the fraction b of the accesses goes to the left half and the remaining 1-b to the right half.
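
The recursive, deterministic splitting can be sketched as follows in Python. This is a direct recursive illustration and is not the efficient method from the paper. Rounding the left share to the nearest integer is an assumption made here so that counts stay whole and the total is preserved; the function names and the 1-based timeticks in the output are likewise assumptions.

```python
def b_model(total, ticks, b):
    """Split `total` accesses over `ticks` timeticks, with bias b to the left."""
    if ticks < 1 or ticks & (ticks - 1) != 0:
        raise ValueError("ticks must be a power of two")
    if ticks == 1:
        return [total]
    left = round(total * b)          # assumption: round the left share
    half = ticks // 2
    # the right half receives whatever is left, so the sum stays exact
    return b_model(left, half, b) + b_model(total - left, half, b)


def histogram_lines(counts):
    """Format as '<timetick> <#accesses>', skipping empty timeticks."""
    return [f"{t} {c}" for t, c in enumerate(counts, start=1) if c > 0]
```

For example, b_model(100, 4, 0.8) splits 100 into 80 and 20, then each half again, giving [64, 16, 16, 4].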

Generate three traces of values. Each trace has T=1024 timeticks (some timeticks may not have disk accesses) and the total number of disk accesses is N=10,000. The biases of the traces are 0.7, 0.9, and 0.5 respectively. Turn in:

[10 points] your code that generates the traces
[5 points] a plot for the histogram format of each trace (similar to Figure 2b of the paper); three plots total.
[5 points] the correlation fractal dimension plot for each trace. We recommend using the command fdnq -h -q2 <histogram-file-name> to find it. Optionally, you may use fdnq -q2 <timestamp-file-name> to verify the result.

Hints:

You may refer to the method described on slide 42 of the lecture slides. Section 4.1 of the paper describes the naive method for generating the traces; Section 4.4 describes an efficient alternative.