Carnegie Mellon University

15-826 – Multimedia Databases and Data Mining

Spring 2010 – C. Faloutsos

Homework 2

Due date: **March 16, 3:00pm**

- Hand in hard and soft copy of your homework in class. It must be **typed**; handwritten material may not be graded.
- Put the **soft copy** and **code** in YourAndrewID_826hw2.tar and email it to the TA (dchau at cs) in a **single** e-mail, whose title must be **YourAndrewID_826hw2**.
- Homework must be done **individually**.
- For all the code that you write, please make sure it **can be compiled and can run on a Linux machine of the** *Andrew cluster*. To connect to one such machine, use your favorite SSH client (e.g., putty) to connect to linux.andrew.cmu.edu, and log in using your Andrew ID and password.
- You may also develop your code using other software and/or operating systems (e.g., cygwin on Windows), as long as your code works on an Andrew cluster machine.
- Weight: 1/3 of the homework grade, that is, 3.33% of the total course grade
- Rough time estimates: 9-10 hours total, specifically:
- Q1: 3 hours
- Q2: 1 hour
- Q3: 3 hours
- Q4: 2-3 hours

We will implement a new command, l, for inc(l)usion query, that prints out all the data rectangles that are contained in the closed bounding box specified by two points: *low = (l_{1}, l_{2}, ..., l_{n})* and

Please build the R-Tree Package (tar xvf; make demo). This
creates the bin/DRmain program and runs it on some small datasets.
Running make test1 should return an *Algorithm not yet implemented* message
after loading the appropriate dataset.

- [5 points] The list of qualifying rectangles for dataset 1 (2-d). This should run using make test1.
- [5 points] The list of qualifying rectangles for dataset 2 (3-d). This should run using make test2.
- [20 points] A tarball (**YourAndrewID_rtree.tar**) of your code emailed to the TA (dchau at cs) and a hard copy of the **files you changed**. On the hard copy, **highlight or circle** the code that you modified/added.

**Hint #1:** you may want to modify the following files, and possibly others:

- DRTree\binsrc\DRmain.c
- DRTree\libsrc\src\DR.c
- DRTree\libsrc\src\DRtree.c

**Hint #2:** you may want to look at the DRrectCover function in DRTree\libsrc\src\DRrect.c, which checks whether one rectangle contains another rectangle.
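The containment test behind the inclusion query can be sketched outside the R-Tree package. The Python below is only an illustration of the predicate, not the DRTree code: the function names, the `(low, high)` tuple layout, and the name `query_high` for the second corner of the query box are all assumptions made for this sketch.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: a rectangle is a (low, high) pair of coordinate tuples.
Rect = Tuple[Sequence[float], Sequence[float]]


def rect_contained(rect_low, rect_high, query_low, query_high) -> bool:
    """True if the rectangle lies entirely inside the closed query box.

    "Closed" means a rectangle touching the query boundary still qualifies,
    so the comparisons are <= rather than <.
    """
    return all(
        ql <= rl and rh <= qh
        for rl, rh, ql, qh in zip(rect_low, rect_high, query_low, query_high)
    )


def inclusion_query(rects: List[Rect], query_low, query_high) -> List[Rect]:
    """Linear scan returning every rectangle contained in the query box."""
    return [r for r in rects if rect_contained(r[0], r[1], query_low, query_high)]
```

In an R-Tree, the same predicate would be applied to leaf entries only; an internal node needs just an overlap test against the query box before descending, since a contained rectangle can sit in any subtree whose bounding box intersects the query.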

We will use the *fdnq* software package to
calculate the correlation integral of a collection
of 6-dimensional points. Build the *fdnq*
package (unzip; make). This computes the fractal dimensions for
several example datasets, and shows the plot for one of them.
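For intuition about what the plot shows, the correlation integral can also be computed by brute force on a small sample. The sketch below is independent of *fdnq* and makes its own assumptions: Euclidean distance, the fraction of distinct pairs within radius r as the quantity plotted, and a least-squares fit for the slope. The function names are hypothetical.

```python
import math
from itertools import combinations


def correlation_integral(points, radii):
    """For each radius r, the fraction of distinct point pairs at distance <= r.

    Brute force over all pairs, so only suitable for small samples.
    """
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    n_pairs = len(dists)
    return [sum(d <= r for d in dists) / n_pairs for r in radii]


def loglog_slope(xs, ys):
    """Least-squares slope of log(ys) against log(xs); all values must be > 0."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den
```

Fitting `loglog_slope` over only the radii of one roughly linear segment gives the slope of that segment.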

- [5 points] Turn in a plot of the **correlation integral** (fdnq -q2). Write down the command you used to generate the plot. We suggest that you use the default values for r (0.001 for minimum radius) and R (10000, for maximum radius).
- [10 points] For every line segment in the plot that is roughly linear, write down its range and its slope.
- [5 points] What do you think the points in the mystery dataset look like?

For example, do they form a line, some clusters, or something else?

**Hint for Cygwin users:** you will need to install the "tcsh" and "gnuplot" packages, and maybe others.

The 2D Cantor dust can be generated recursively, as shown in the figure above, by deleting the middle third segment on both dimensions.
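One way to read the construction, sketched here in Python: since the figure is not reproduced in this text, the sketch assumes the dots are the corner points of the squares that survive each deletion step. That reading gives 4 points before any deletion and 4^(k+1) points after k iterations. The function names are hypothetical.

```python
from itertools import product


def cantor_endpoints(iterations):
    """Endpoints of the intervals left after removing middle thirds `iterations` times.

    Works in integer units of 3**-iterations to avoid accumulating float error;
    `starts` holds the left end of each surviving interval in those units.
    """
    starts = [0]
    for _ in range(iterations):
        # Each interval keeps its left third (3*s) and its right third (3*s + 2).
        starts = [t for s in starts for t in (3 * s, 3 * s + 2)]
    scale = 3 ** iterations
    return sorted({e / scale for s in starts for e in (s, s + 1)})


def cantor_dust(iterations):
    """2D Cantor dust as the Cartesian product of the 1D endpoint set with itself."""
    xs = cantor_endpoints(iterations)
    return list(product(xs, xs))


if __name__ == "__main__":
    # Print one "x y" pair per line, ready for a scatterplot tool.
    for x, y in cantor_dust(6):
        print(x, y)
```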

- [15 points] Write a program that generates the points (shown as black dots above) for the Cantor dust, up to 6 iterations (4^{(6+1)} = 16,384 points total). Turn in:
  - your code (any major language is acceptable: C/C++, Perl, Python, Java, etc.)
  - a scatterplot of the points

- [10 points] Using the *fdnq* package, or your own program (written in a major language, such as C/C++, Java, Perl, Python), turn in:
  - the boxcounting plot (fdnq -q2)
  - slope (fractal dimension).

- [5 points] Using the same code, turn in:
  - the "Hausdorff" fractal dimension (fdnq -q0)
  - the corresponding plot.
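For those writing their own program, box counting can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of what *fdnq* does internally: the grid is anchored at the origin, the q=0 quantity is the number of occupied cells N(r), and the q=2 quantity is the sum of squared occupancy fractions S2(r). The function names are hypothetical.

```python
import math
from collections import Counter


def box_occupancies(points, side):
    """Count how many points fall in each grid cell of the given side length."""
    return Counter(tuple(math.floor(c / side) for c in p) for p in points)


def box_count_measures(points, sides):
    """For each cell side r, return (N(r), S2(r)).

    N(r)  = number of occupied cells           (used for the q=0 dimension)
    S2(r) = sum over cells of (count / n)**2   (used for the q=2 dimension)
    """
    n = len(points)
    measures = []
    for side in sides:
        counts = box_occupancies(points, side)
        s2 = sum((c / n) ** 2 for c in counts.values())
        measures.append((len(counts), s2))
    return measures
```

Plotting log N(r) against log r gives a line whose slope is the negative of the q=0 dimension, while the slope of log S2(r) against log r estimates the correlation (q=2) dimension. Choosing sides that are powers of 1/3 keeps the grid aligned with the Cantor construction.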

We will implement the *b-model* devised by Wang+ [ICDE 2002] to generate several traces (or sequences) of values, each value corresponding to, say, the number of disk accesses at a time interval. These traces follow the multifractal distribution with b:(1-b) splitting probabilities, where 0.5<=b<1. The data you generate could come in two formats:

- histogram format <timetick #accesses>, e.g.,

      1 3
      2 10
      5 30
      ...

- timestamp format <timestamp>, e.g.,

      1
      1
      1
      2
      2
      2
      ...

We prefer that you use the histogram version because it is more compact.
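The two formats carry the same information, and converting between them is straightforward. The Python sketch below uses hypothetical function names and assumes a histogram is held as a list of `(timetick, count)` pairs.

```python
from collections import Counter


def histogram_to_timestamps(hist):
    """Expand (timetick, count) pairs into one timestamp per access."""
    return [t for t, count in hist for _ in range(count)]


def timestamps_to_histogram(stamps):
    """Collapse a list of timestamps into sorted (timetick, count) pairs."""
    return sorted(Counter(stamps).items())
```

Timeticks with no accesses simply do not appear in either format.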

For example, a bias b=0.8 means that within a given time interval, 80% of the accesses happen in one half of the interval and the remaining 20% in the other half, and this splitting of accesses happens recursively for each of those halves.

Your implementation should create deterministic splits, meaning for all splits, b should be applied to the left half.
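The deterministic recursive split can be sketched as below. This is one possible reading, not the paper's reference implementation: it works level by level rather than recursively, and it rounds the left share to the nearest integer and gives the remainder to the right half so that the total is preserved. That rounding rule and the function names are assumptions of this sketch.

```python
def b_model(total, b, levels):
    """Split `total` accesses over 2**levels timeticks with bias b on the left half.

    Assumption: the left share is round(b * n) and the right half receives
    the remainder, so the counts always sum to `total`.
    """
    counts = [total]
    for _ in range(levels):
        next_counts = []
        for n in counts:
            left = round(b * n)
            next_counts.extend((left, n - left))
        counts = next_counts
    return counts


def to_histogram(counts):
    """(timetick, count) pairs, numbering timeticks from 1 and skipping empty ones."""
    return [(t, c) for t, c in enumerate(counts, start=1) if c > 0]


if __name__ == "__main__":
    # 1024 timeticks = 2**10, so ten levels of splitting.
    for tick, accesses in to_histogram(b_model(10000, 0.7, 10)):
        print(tick, accesses)
```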

Generate three traces of values. Each trace has T=1024 timeticks (some timeticks may not have disk accesses) and the total number of disk accesses is N=10,000. The biases of the traces are 0.7, 0.9, and 0.5 respectively. Turn in:

- [10 points] your code that generates the traces
- [5 points] a plot for the histogram format of each trace (similar to Figure 2b of the paper); three plots total.
- [5 points] the correlation fractal dimension plot for each trace. We recommend using the command fdnq -h -q2 <histogram-file-name> to find it. Optionally, you may use fdnq -q2 <timestamp-file-name> to verify the result.

**Hints:**

- You may refer to the method described on slide 42 of the lecture slides. Section 4.1 of the paper describes the naive method for generating the traces; Section 4.4 describes an efficient alternative.

Last updated by Polo Chau, Mar 1, 2010