CARNEGIE MELLON UNIVERSITY

15-826 - Multimedia databases and data mining

Spring 2008

Homework 2 - Due: Tuesday Mar. 18, 1:30pm, in class.

Important:

Due date: Tuesday Mar. 18, 1:30pm
For all questions please BOTH:

e-mail the TA (htong at cs) a soft-copy of all code, output and auxilary scripts tar'ed up in a SINGLE file (tar, or zip). Please exclude binaries and other redundant elements.
hand in a hard-copy of all code, output and auxilary scripts

IMPORTANT: make sure your tar/zip-file has a makefile, so that the command ‘make’ should generate the answers to all these homework questions. If you are not familiar with ‘makefile’s, please have a script ‘runme.bash’ or ‘runme.bat’, instead.

Reminders

Please turn in a typed report - handwritten material may not be graded, at the grader's discretion.
All homeworks including this one, are to be done individually.
Estimated effort for this homework:

Q1: 3 hours
Q2: 8 hours
Q3: 5 hours
Q4: 4 hours

Q1: Fractals [20pts]

Generate a 3-D dataset with N=50,000 points, so that it has the correlation integral of the Figure 1. Notice that there is more than one correct solution – we’ll accept any one of them. Then, plot the correlation integral of your dataset. You may use the fractal-dimension code from www.cs.cmu.edu/~christos/SRC/fdnq.tar (feel free to modify it); or some other package from the web, or write your own code.

Figure 1: Expected Correlation Integral

What to hand in:

(1) [3 pts] Briefly describe the mechanism used to generate your dataset.

(2) [2 pts] The scatter-plot of the dataset you generate (In matlab, you can use plot3 to plot a 3-d dataset; in gnuplot, you can use splot; in R, you can use cloud)

(3) [10 pts] The fractal dimensions as well as the corresponding range. (It is fine to have small deviations from the specified numbers)

(4) [5 pts] The hard copy of your code that calculates such fractal dimensions. If you use the provided ‘fdnq’ package, highlight your changes, if any.

Q2: Fat Fractals [40 pts]

We want to generate 'fat fractals', as shown in Figures 2 below: We start with a square of side 1; each side is divided into n equal pieces (n=7 in the Figure), and each side gives 'birth' to an 'island' of side 1/n, above the m-th piece (m=3 there), at distance d * side length (side length=1 for iteration 1). Notice that, at the second iteration, every 1/n piece also gives birth to a little island.

Write a program that will read the parameters n, m, d and k (= number of iterations) and it will generate points on the periphery of the k-th iteration of such a fat fractal. Specifically, just print the end-points of each segment of side (1/n)**k. That is, for k=0, just print the four corners of the unit square; for k=1, also print the corners of each new little island, as well as the end-points of each 1/n segment on the original unit square; and so on. Hand in your code (again, in any major language).


Fat fractal – iter.#0	Iteration 1	Recursive rule	Iteration 2

Figure 2: Illustration of first 2 iterations

What to hand in:

(1) [20 pts] Run your program with n=7, m=3, d=1/n and k=4 iterations, and hand-in the scatterplot of these points.

(2) [10 pts] Give the correlation fractal dimension of these points.

(3) [10 pts] Email a tarfile of your source code to the TA (htong at cs). Reminder: exclude binaries and other redundant files.

Q3. Generating Power Laws – Random walks [25 pts]

In this exercise, we will learn how to generate power law by Random Walk.

Problem Description

In this model, assume that you have zero balance at the very beginning. Then at each step, you toss a coin; you win 1 dollar if it is head, and you lose 1 dollar if it is the tail. Consider the zero-crossings: the distribution of these intervals follows a power law (See Lecture 13, page 19-20, and the paper [Newman, 2005]). Write a program in your favorite language to verify this claim. There is only one parameter (e.g., the number of steps/iterations Tt) in your program.

What to hand in:

(1) [5 pts] Simulate a random walk with T=1,000,000 steps with your program. Plot the balance (e.g. the number of dollars you have at that iteration) vs. the iteration step t, for the last T_last=10,000 steps.

(2) [15 pts] Plot the length of interval vs. its rank (i.e., Zipf plot), in log-log scales. Fit a line and plot (in the same figure) the fitted line. What is the slope of your fitted line?

Hints: When fitting the line in log-log scales, you might want to delete the tail part of the curve. We accept any reasonable heuristic to define the tail. For example, you can use the same strategy used in the fdnq package that you have already seen in Q1.

(3) [5 pts] Email a tarfile of your source code to the TA (htong at cs). Reminder: Again, please exclude binaries etc.

Q4. Generating Power Laws – CRP [15 pts]

In this exercise, we will learn how to generate a power law distribution, using the ``Chinese Restaurant Process’’.

Problem Description

In this model, imagine there is a new customer entering a Chinese restaurant at each time step. He/she has two choices: (1) joining in an existing table with probability that is proportional to the size (e.g. the number of customers that are already in that table); (2) starting a new table with probability c. For simplicity, we assume that initially, there is only one table with one customer. The tail distribution of the size of the tables follows a power law (see Lecture 13, and [Newman, 2005]). Write a program in your favorite language to verify this claim. Your program should input two parameters: t and c (the number of steps/iterations t and the probability to start a new table c)

What to hand in:

(1) [5 pts] Simulate a Chinese restaurant process with c=0.1 and t=10,000 steps with your program. Plot the PDF for size of the tables in log-log scale. Fit a line and plot (in the same figure) the fitted line. What is the slope of your fitted line?

(2) [5 pts] Repeat, for c=0.5 (and same duration t=10,000). That is, show the PDF of the table sizes, fit a line in log-log scales, plot it, and report its slope.

(3) [5 pts] Email a tarfile of your source code to the TA (htong at cs). Please include just the source code and not the binaries