CARNEGIE MELLON UNIVERSITY

15-826 - Multimedia databases and data mining

Spring 2008

Homework 3 - Due: Tuesday Apr. 22, 1:30pm, in class.

Important:

  • Due date: Tuesday Apr. 22, 1:30pm, hard copy in class and soft copy e-mailed to the TA (htong at cs) in a single e-mail.
  • For all questions please BOTH:
    • e-mail the TA (htong at cs) a soft-copy of all code, output and auxilary scripts tar'ed up in a SINGLE file
    • hand in a hard-copy of all code, output and auxilary scripts

Reminders – Hints:

  • All homeworks including this one are to be done individually.
  • Expected effort for this homework:
    • Q1: 2 hrs
    • Q2: 1 hr
    • Q3: 2 hrs
    • Q4: 6 hrs
    • Q5: 1.5 hrs
    • Q6: 1 hr
    • Q7: 2 hrs

 

Q1: Generating fractals with IFS [15 pts]

In homework 2, we saw how to generate a (fat) fractal recursively. We can also generate fractal using the iterated function system. Get the IFS code here and use it to plot the following dragon curve. The picture below shows how a dragon curve is generated:

Order 0

Order 1

Order 2

Order 3

 

Hint #1: find the necessary transformations, and feed their coefficients to the IFS program

Hint #2: To rotate point (x,y) by angle a (counter-clockwise), we need to do a matrix-vector multiplication with matrix :

What to hand in:

(1)   [5 pts] The scatter-plot of the dragon curve after 3,000 iterations of the IFS method (roughly 3000 points).

(2)   [5 pts] Hard copy of the source code.

(3)   [4 pt] The correlation integral of this cloud of 3,000 points, as well as its slope, to 2 decimal digits of accuracy (recall that the slope is the fractal dimension).

(4)   [1 pt] Are you surprised by the value of the fractal dimension? Explain its value.

 

Q2: Tensor Decomposition [10pts]

Download the tensor a 4x6x5 tensor here (each row represents an non-zero element of the tensor in the format “(i,j,k) value’). Find its Kruskal decomposition of rank 2.

What to hand in:

(1)   [10 pts] the Kruskal decomposition of the original tensor, i.e. the three matrices (4x2, 6x2, 5x2 respectively) as well as the 2x1 vector of the super-diagonal of the core tensor (2x2x2).

Hint: Use the tensor toolbox at http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox/

 

Q3: DFT [10 pts]

Download the following time series. We are told that it is generated by the following formula:

.

Find those unknown parameters: (for each i=1,…,k).

What to hand in:

(1)   [2 pts] Your estimation for k. Give your justification.

(2)   [6 pts] For each i=1,…,k, your estimations for .

(3)   [2 pts] A plot of the original dataset, together with your estimated curve.

Hint: Look at the amplitude spectrum and the frequency spectrum (amplitude-frequency and phase-frequency plots, respectively) of the DFT.

 

Q4: The forest fire model [20 pts]

In class, we saw that some systems will lead to power-laws at some critical points in their parameter space (i.e. phase transition). Furthermore, we saw that some systems can actually arrange themselves so that they are always close to such a critical point, no matter what state we start off in. In this exercise, we will study one such model, the ``forest fire’’ model. We follow the description from the paper by [Newman, 2005]).

Description: Consider a 100 x 100 lattice. Each cell is either empty, or has a tree.  At every time-tick, we have two activities:

(1)   each empty cell may grow a tree with probability p

(2)   once all new trees have appeared, a lightning hits a randomly chosen cell. If this cell has a tree, the tree burns instantaneously, and so does every other tree adjacent to a burning tree (we consider only four neighbors of every cell, north, south, east, west). That is, a lightning burns out the whole cluster of trees. See Fig.1. for illustrations of clusters, and Fig.2. for the burning step.

Let r(t) be the ratio of occupied cells at time-tick t. After several time-ticks, the ratio (r(t)) will be oscillating around the phase-transition point. Furthermore, the size of the tree-clusters will follow a power-law.

Fig.1., In this 6 x 6 lattice, each occupied square represents a tree. A cluster is a connected component where the trees can reach each other in a 4-neighborhood manner. In this figure, there are 4 clusters, each of which is represented by the same color as well as the same number.

 

Fig.2. (a) Before burning.

(b) After burning.

 

What to hand in:

(1)   [10 pts] Simulate a forest fire with p=0.2 and with  p=0.4 until the system is stable (i.e. r remains unchanged except some small oscillations). Alternatively, try T=1,000 time-ticks, and report the last 500 ones. For each value of  p,  plot  r(t) vs. t. What is the final r(t)? What conclusion can you draw from these experiments?

(2)   [5 pts] For the settings above (p=0.2 and  p=0.4), and after T=1,000 time-ticks, plot the NCDF of the tree-cluster sizes (i.e., count of clusters with area s or greater,  vs. s) in log-log scales. Fit a line and report the slope.

(3)   [5 pts] Email a tarfile of your source code to the TA (htong at cs). Please include just the source code; no binaries, no intermediate results.

Q5: SVD for latent semantic index [15 pts]

Consider the following Document by Term matrix with some noise.  In this file each row denotes a document, and the columns denote terms. An entry (i,j) denotes the number of occurrences of the jth term in the ith document. Based on their occurrence in the documents, the terms can be clustered as belonging to a particular topic. Run SVD to determine the number of topics present. You may use any standard package (e.g., Matlab, R, mathematica, python/Numeric/numpy, e.t.c.).

What to Hand in:

(1)   [5 pts] Give your estimation of the number of topics present in the dataset, together with your justification.

(2)   [10 pts] Give the top 5 terms for each topic.

Q6: SVD for missing values [10 pts]

When there are some entries in the matrix missed, we can use the reconstructed value of SVD to estimate them. Consider the following 2-d data set (the missing values are denoted by NA), where each line is a data point in 2-d space, you are asked to run SVD to recover the missing values.

What to Hand in:

(1)   [2 pts] Treat the missing values (i.e., ‘NA’) as zeros and give the scatter-plot the whole dataset. Use dots (‘.’) for the existing points, and stars (‘*’) for the points with missing values.

(2)   [8 pts] Now, run the SVD on the whole dataset (by treating the ‘NA’ as zeros). We can use the reconstructed values (i.e., the rank-1 approximation) as the prediction for the missing values. Give the scatter-plot again, with dots ‘.’ for the existing points, and circles ‘o’ for the reconstructed ones. What is your prediction for the missing values?

Comments & interesting facts [0 pts]

(1)   A possible improvement for the above technique is to do it in a recursive way, - to (a) use the reconstructed values as the initial prediction for missing value, (2) do SVD one more time and (3) use the new reconstructed values as a ‘better’ prediction and so on.

(2)   This problem is related with the ratio rules we saw in the class, where we only have one missing value;

(3)   A (much more elaborate) version of this technique is leading the Netflix competition. See the paper by Yehuda Koren for details.

Q7: Ranking on graphs [10 pts]

Download the co-authorship network for the machine learning community. Each line of the dataset has three numbers: the index (author-id) for author i, the index for author j, and the count of paper they have been authored. You can also download the (author-id, name) list here. You are asked to find the top 5 most influential authors by the PageRank algorithm (c=0.85, that is, fly-out probability = 1-c = 0.15, as Google suggests).

Hint: the PageRank vector v satisfies the equation

v = c A v + (1-c)/n 1

where v is an nx1 column vector, A is the nxn to-from column normalized transition matrix, and 1 (in bold) is an nx1 vector full of ‘ones’.

What to Hand in:

(1)   [10 pts], The names of top 5 authors in decreasing order, together with their ranking values.