Homework 3 - Due: Tuesday Apr. 22
Reminders – Hints:
In homework 2, we saw how to generate a (fat) fractal recursively. We can also generate a fractal with an iterated function system (IFS). Get the IFS code here and use it to plot the dragon curve. The picture below shows how a dragon curve is generated:
[Figure: the dragon curve at orders 0, 1, 2, and 3]
Hint #1: find the necessary transformations, and feed their coefficients to the IFS program.
Hint #2: To rotate point (x,y) by angle a (counter-clockwise), we need to do a matrix-vector multiplication with the matrix:
    [ cos a   -sin a ]
    [ sin a    cos a ]
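As a sanity check on Hint #2, the rotation can be verified numerically; a minimal numpy sketch (the angle and point are illustrative):

```python
import numpy as np

def rotate(point, a):
    """Rotate a 2-d point counter-clockwise by angle a (radians)."""
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    return R @ point

# Rotating (1, 0) by 90 degrees should give (0, 1).
p = rotate(np.array([1.0, 0.0]), np.pi / 2)
```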
What to hand in:
(1) [5 pts] The scatter-plot of the dragon curve after 3,000 iterations of the IFS method (roughly 3,000 points).
(2) [5 pts] Hard copy of the source code.
(3) [4 pts] The correlation integral of this cloud of 3,000 points, as well as its slope, to 2 decimal digits of accuracy (recall that the slope is the fractal dimension).
(4) [1 pt] Are you surprised by the value of the fractal dimension? Explain its value.
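For the correlation integral, C(r) is the fraction of point pairs within distance r, and its slope in log-log scale estimates the fractal (correlation) dimension. A sketch on synthetic data (points on a line segment, whose dimension should come out near 1); for the homework, the dragon-curve cloud would replace `pts`:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic cloud: 500 points uniform on a line segment (dimension 1).
pts = np.zeros((500, 2))
pts[:, 0] = rng.random(500)

# Pairwise distances; keep the upper triangle so each pair counts once.
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
dists = d[np.triu_indices(len(pts), k=1)]

# Correlation integral C(r) over a range of radii, then slope in log-log.
radii = np.logspace(-2, -1, 20)
C = np.array([(dists < r).mean() for r in radii])
slope = np.polyfit(np.log(radii), np.log(C), 1)[0]
```

For points on a segment the fitted slope comes out close to 1, matching the intuition that the slope is the fractal dimension.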
Download a 4x6x5 tensor here (each row represents a non-zero element of the tensor in the format “(i,j,k) value”). Find its Kruskal decomposition of rank 2.
What to hand in:
(1) [10 pts] The Kruskal decomposition of the original tensor, i.e., the three factor matrices (4x2, 6x2, and 5x2, respectively) as well as the 2x1 vector of the super-diagonal of the core tensor (2x2x2).
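Packages such as Matlab's Tensor Toolbox compute this directly; as one possible approach, here is a minimal alternating-least-squares (CP-ALS) sketch in numpy. The demo tensor is random illustrative data of the homework's 4x6x5 shape, not the actual file:

```python
import numpy as np

def kr(U, V):
    """Column-wise Khatri-Rao product."""
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, U.shape[1])

def cp_als(T, rank, iters=200, seed=0):
    """Rank-`rank` Kruskal (CP) decomposition by alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    X0 = T.reshape(I, J * K)                     # mode-0 unfolding
    X1 = np.moveaxis(T, 1, 0).reshape(J, I * K)  # mode-1 unfolding
    X2 = np.moveaxis(T, 2, 0).reshape(K, I * J)  # mode-2 unfolding
    for _ in range(iters):
        A = X0 @ kr(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X1 @ kr(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X2 @ kr(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    # Pull the column norms out into the super-diagonal weights.
    lam = (np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
           * np.linalg.norm(C, axis=0))
    return (A / np.linalg.norm(A, axis=0), B / np.linalg.norm(B, axis=0),
            C / np.linalg.norm(C, axis=0), lam)

# Demo on a random rank-2 tensor of the homework's shape (4 x 6 x 5).
rng = np.random.default_rng(1)
T = np.einsum('ir,jr,kr->ijk', rng.standard_normal((4, 2)),
              rng.standard_normal((6, 2)), rng.standard_normal((5, 2)))
A, B, C, lam = cp_als(T, 2)
```

The three returned matrices have unit-norm columns, and `lam` is the 2x1 super-diagonal of the core tensor.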
What to hand in:
(1) [2 pts] Your estimate for k. Give your justification.
(2) [6 pts] For each i=1,…,k, your estimates for .
(3) [2 pts] A plot of the original dataset, together with your estimated curve.
In class, we saw that some systems lead to power laws at certain critical points in their parameter space (i.e., phase transitions). Furthermore, we saw that some systems can actually arrange themselves so that they are always close to such a critical point, no matter what state we start off in. In this exercise, we will study one such model, the “forest fire” model. We follow the description from the paper by [Newman, 2005].
Description:
Consider a 100 x 100 lattice. Each cell is either empty or has a tree. At every time-tick, we have two activities:
(1) each empty cell may grow a tree with probability p;
(2) once all new trees have appeared, lightning hits a randomly chosen cell. If this cell has a tree, the tree burns instantaneously, and so does every other tree adjacent to a burning tree (we consider only the four neighbors of every cell: north, south, east, west). That is, a lightning strike burns out the whole cluster of trees. See Fig.1 for illustrations of clusters, and Fig.2 for the burning step.
Let r(t) be the ratio of occupied cells at time-tick t. After several time-ticks, the ratio r(t) will oscillate around the phase-transition point. Furthermore, the sizes of the tree-clusters will follow a power law.
Fig.1. In this 6 x 6 lattice, each occupied square represents a tree. A cluster is a connected component where the trees can reach each other in a 4-neighborhood manner. In this figure, there are 4 clusters; cells in the same cluster share a color and a number.
Fig.2. (a) Before burning. (b) After burning.
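The two activities above can be sketched in a few lines of numpy; grid size, p, and the number of ticks below are illustrative, and the burn step is a BFS over 4-neighbors:

```python
import numpy as np
from collections import deque

def step(grid, p, rng):
    """One time-tick: tree growth, then a single lightning strike."""
    n = grid.shape[0]
    # (1) each empty cell grows a tree with probability p
    grid |= (rng.random(grid.shape) < p) & (grid == 0)
    # (2) lightning hits a random cell; burn its whole 4-neighbor cluster
    i, j = rng.integers(n), rng.integers(n)
    if grid[i, j]:
        q = deque([(i, j)])
        grid[i, j] = 0
        while q:
            x, y = q.popleft()
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                u, v = x + dx, y + dy
                if 0 <= u < n and 0 <= v < n and grid[u, v]:
                    grid[u, v] = 0
                    q.append((u, v))
    return grid

rng = np.random.default_rng(0)
grid = np.zeros((100, 100), dtype=int)
r = []                       # occupancy ratio r(t), one entry per tick
for t in range(200):
    grid = step(grid, 0.2, rng)
    r.append(grid.mean())
```

Plotting `r` against the tick index gives the r(t) vs. t curve asked for below.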
What to hand in:
(1) [10 pts] Simulate a forest fire with p=0.2 and with p=0.4 until the system is stable (i.e., r remains unchanged except for small oscillations). Alternatively, run T=1,000 time-ticks and report the last 500. For each value of p, plot r(t) vs. t. What is the final r(t)? What conclusions can you draw from these experiments?
(2) [5 pts] For the settings above (p=0.2 and p=0.4), and after T=1,000 time-ticks, plot the NCDF of the tree-cluster sizes (i.e., the count of clusters with area s or greater, vs. s) in log-log scales. Fit a line and report the slope.
(3) [5 pts] Email a tarfile of your source code to the TA (htong at cs). Please include just the source code; no binaries, no intermediate results.
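For part (2), cluster sizes can be extracted with the same BFS idea; a sketch, with the NCDF computed as the count of clusters of size >= s (function names are my own):

```python
import numpy as np
from collections import deque

def cluster_sizes(grid):
    """Sizes of all 4-neighbor tree clusters in a 0/1 grid (BFS labeling)."""
    n, m = grid.shape
    seen = np.zeros_like(grid, dtype=bool)
    sizes = []
    for i in range(n):
        for j in range(m):
            if grid[i, j] and not seen[i, j]:
                size, q = 0, deque([(i, j)])
                seen[i, j] = True
                while q:
                    x, y = q.popleft()
                    size += 1
                    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        u, v = x + dx, y + dy
                        if (0 <= u < n and 0 <= v < m
                                and grid[u, v] and not seen[u, v]):
                            seen[u, v] = True
                            q.append((u, v))
                sizes.append(size)
    return np.array(sizes)

def ncdf(sizes):
    """For each observed size s, the number of clusters of size >= s."""
    s = np.unique(sizes)
    return s, np.array([(sizes >= v).sum() for v in s])
```

Plotting the NCDF counts vs. s on log-log axes and fitting a line (e.g., with `np.polyfit(np.log(s), np.log(counts), 1)`) gives the slope to report.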
Consider the following document-by-term matrix with some noise. In this file, each row denotes a document, and the columns denote terms. An entry (i,j) denotes the number of occurrences of the j-th term in the i-th document. Based on their occurrence in the documents, the terms can be clustered as belonging to particular topics. Run SVD to determine the number of topics present. You may use any standard package (e.g., Matlab, R, Mathematica, python/Numeric/numpy, etc.).
What to Hand in:
(1) [5 pts] Give your estimate of the number of topics present in the dataset, together with your justification.
(2) [10 pts] Give the top 5 terms for each topic.
When some entries in the matrix are missing, we can use the values reconstructed by SVD to estimate them. Consider the following 2-d dataset (the missing values are denoted by NA), where each line is a data point in 2-d space. You are asked to run SVD to recover the missing values.
What to Hand in:
(1) [2 pts] Treat the missing values (i.e., ‘NA’) as zeros and give the scatter-plot of the whole dataset. Use dots (‘.’) for the existing points, and stars (‘*’) for the points with missing values.
(2) [8 pts] Now, run the SVD on the whole dataset (treating the ‘NA’ entries as zeros). We can use the reconstructed values (i.e., the rank-1 approximation) as the prediction for the missing values. Give the scatter-plot again, with dots (‘.’) for the existing points and circles (‘o’) for the reconstructed ones. What is your prediction for the missing values?
Comments & interesting facts [0 pts]
(1) A possible improvement of the above technique is to apply it iteratively: (a) use the reconstructed values as the initial prediction for the missing values, (b) run SVD one more time, and (c) use the new reconstructed values as a ‘better’ prediction, and so on.
(2) This problem is related to the ratio rules we saw in class, where we only have one missing value;
(3) A (much more elaborate) version of this technique is leading the Netflix competition. See the paper by Yehuda Koren for details.
Download the co-authorship network for the machine learning community. Each line of the dataset has three numbers: the index (author-id) for author i, the index for author j, and the count of papers they have co-authored. You can also download the (author-id, name) list here. You are asked to find the top 5 most influential authors by the PageRank algorithm (c=0.85, that is, fly-out probability = 1-c = 0.15, as Google suggests).
Hint: the PageRank vector v satisfies the equation
v = c A v + (1-c)/n 1
where v is an nx1 column vector, A is the nxn to-from column-normalized transition matrix, and 1 (in bold) is an nx1 vector full of ones.
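The hint's fixed-point equation can be solved by power iteration; a sketch on a toy 4-node graph (the adjacency matrix is made up for illustration):

```python
import numpy as np

c, n = 0.85, 4
# Toy directed graph: entry (i, j) = 1 if node j links to node i.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 1, 0, 0],
                [0, 0, 1, 0]], dtype=float)
A = adj / adj.sum(axis=0)          # to-from, column-normalized

v = np.full(n, 1.0 / n)            # start from the uniform vector
for _ in range(200):
    v = c * A @ v + (1 - c) / n    # v = c A v + (1-c)/n * 1

ranking = np.argsort(-v)           # most influential node first
```

Since A is column-stochastic, v stays a probability vector; for the homework, A is built from the co-authorship counts instead.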
What to Hand in:
(1) [10 pts] The names of the top 5 authors in decreasing order, together with their ranking values.