CARNEGIE MELLON UNIVERSITY

15-826 - Multimedia Databases and Data Mining

Spring 2007

Homework 2 - Due: Tuesday March 20, 1:30pm, in class.


Q1: String-edit and time-warping distances [40 pts]

This string-editing distance code penalizes insertions, deletions, and substitutions by one unit each.
  1. [15 pts] Modify it so that the penalty for a vowel-vowel substitution is 0.5, while all other substitutions (consonant-consonant, vowel-consonant, and consonant-vowel) still cost 1. Let the vowels be 'a', 'e', 'i', 'o', 'u'. (If Perl is cumbersome for you, feel free to use your favorite language.) HAND IN your code. (A sketch appears after this list.)
  2. [5 pts] Using your modified string-editing distance function, plot the correlation integral for a subset of the UNIX dictionary words. (A sketch appears after this list.)
  3. [10 pts] Write code to calculate the time-warping distance between two sequences of numbers. The rules of the time-warping distance are fully described in an ICDE '98 paper (gzipped ps, pdf). In summary, they are:
    • Deletions are impossible (infinite cost)
    • Stuttering of either time sequence is free. Specifically:
      • Insertions in either time sequence are free, but you can only insert a value equal to the previous value (in other words, you may repeat any value as many times in a row as you like, for free)
    • Substitutions cost the difference between the old and the new value, i.e., |x_old - x_new|
    Your code should take two command-line arguments: the names of the files containing the numerical sequences (not necessarily of the same length). Each data file should contain one number per line, with the line number implicitly corresponding to the timestamp. E.g.:
    bash> cat file-1.dat
    1
    12
    12
    13


    bash> cat file-2.dat
    10
    12
    13
    and print the time-warping distance to standard output:
    bash> timewarp file-1.dat file-2.dat
    9
    HAND IN your code. (A sketch appears after this list.)

  4. [5 pts] Using your time-warping distance function, calculate and report the time-warping distance between time_1.dat and time_2.dat.
  5. [5 pts] Plot the original (un-warped) time_1.dat and time_2.dat together in the same plot. In a separate plot, show both sequences together after alignment by time warping.
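A minimal sketch in Python for Q1.1 (the assignment allows any language); the function and variable names here are our own, not part of the provided Perl code:

VOWELS = set("aeiou")

def sub_cost(a, b):
    """Substitution penalty: 0 for a match, 0.5 for vowel-vowel, 1 otherwise."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5
    return 1.0

def edit_distance(s, t):
    """Standard dynamic program over a (len(s)+1) x (len(t)+1) cost table."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)                        # i deletions
    for j in range(1, n + 1):
        d[0][j] = float(j)                        # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,        # delete s[i-1]
                          d[i][j - 1] + 1,        # insert t[j-1]
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return d[m][n]

print(edit_distance("data", "dote"))              # 1.0: two vowel-vowel substitutions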

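For Q1.2, a hedged sketch, assuming the correlation integral is the pair-count function: for each radius r, count the word pairs whose modified edit distance is at most r, and plot the counts against r in log-log scale. It reuses edit_distance from the sketch above, and the word-list file is whatever dictionary subset you choose:

import sys
from math import log

words = [w.strip() for w in open(sys.argv[1]) if w.strip()]   # your dictionary subset
dists = [edit_distance(a, b)                                  # all O(n^2) pairwise distances
         for i, a in enumerate(words) for b in words[i + 1:]]
for r in (0.5, 1, 2, 4, 8, 16):
    pairs = sum(1 for d in dists if d <= r)                   # pairs within radius r
    if pairs > 0:
        print(log(r), log(pairs))   # slope of the linear part estimates the intrinsic dimension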
 
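For Q1.3, a minimal Python sketch of the recurrence the rules above imply: D[i][j] = |x_i - y_j| + min(D[i-1][j], D[i][j-1], D[i-1][j-1]), where the two one-sided moves are the free stutters. It prints 9 on the file-1.dat / file-2.dat example:

import sys

def time_warp(x, y):
    """Time-warping distance: stuttering free, deletions impossible."""
    INF = float("inf")
    m, n = len(x), len(y)
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = abs(x[i - 1] - y[j - 1]) + min(D[i - 1][j],      # y[j] stutters
                                                     D[i][j - 1],      # x[i] stutters
                                                     D[i - 1][j - 1])  # advance both
    return D[m][n]

def read_seq(fname):
    # one number per line; the line number is the implicit timestamp
    return [float(line) for line in open(fname) if line.strip()]

print(time_warp(read_seq(sys.argv[1]), read_seq(sys.argv[2])))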

Q2: Fractals [15 pts]

Download the fractal-dimension code here and untar it. Run it on the following datasets (points are specified as x and y coordinate values):

  1. [4 pts] Elliptical galaxies (notice that the range of x is ~ [0, 40] while y is ~ [-1, 1]).
  2. [4 pts] Spiral galaxies (notice that the range of x is ~ [0, 40] while y is ~ [-1, 1]).
  3. [4 pts] Montgomery county.
  4. [3 pts] "Mystery" dataset.

For each dataset, hand in the following:
(a) the fractal dimensions of the dataset (both D0 and D2), and
(b) the corresponding plots.
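This is not the provided fractal-dimension code; for intuition only, here is a minimal box-counting sketch in Python, assuming 2-D points given as "x y" pairs, one per line. D0 is the slope of log(occupied-box count) versus log(1/r), and D2 is the slope of log(sum of p_i^2) versus log(r), where p_i is the fraction of points falling in box i:

import sys
from collections import Counter
from math import log

points = [tuple(map(float, line.split())) for line in open(sys.argv[1]) if line.strip()]
for k in range(1, 9):                          # halve the box side at each step
    r = 1.0 / 2 ** k                           # box side length
    cells = Counter((int(x // r), int(y // r)) for x, y in points)
    n = sum(cells.values())
    s2 = sum((c / n) ** 2 for c in cells.values())   # sum of p_i^2
    # columns: log(1/r), log(box count), log(sum p_i^2);
    # D0 = slope of col 2 vs col 1, D2 = minus the slope of col 3 vs col 1,
    # each fitted over the linear region of the plot
    print(log(1 / r), log(len(cells)), log(s2))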

 

Q3: Multi-Fractals [30 pts]

Using the b-model paper (gzipped ps, pdf) as a guide:
  1. [10 pts] Generate a dataset of 1,000,000 disk accesses distributed over 1,024 time intervals, according to a 30/70 b-model (70% of the volume on the right). Submit a time plot (time stamp vs. number of disk accesses) of this dataset, along with both a soft and a hard copy of the code used to generate the data. (A generator sketch appears after this list.)
  2. [10 pts] Implement a function to calculate the "entropy plot" of this dataset. Your code should zero-pad the sequence so that its length is brought up to the next power of 2. Hand this code in. (A sketch appears after this list.)
  3. [5 pts] Generate and hand in the entropy plot of the synthetic data you generated.
  4. [5 pts] Generate and hand in the entropy plot of this real dataset of disk accesses, which spans 2,048 lines/time intervals (each line contains the number of bytes transferred in that time interval).
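For Q3.1, a minimal sketch of a deterministic b-model generator, assuming the multiplicative-cascade construction from the paper: recursively halve the time interval, handing 30% of the volume to the left half and 70% to the right, for 10 levels (2^10 = 1,024 intervals). The integer rounding may perturb the grand total slightly:

def b_model(total, levels, bias=0.7):
    """Return 2**levels bucket values summing to (roughly) total."""
    buckets = [float(total)]
    for _ in range(levels):
        nxt = []
        for v in buckets:
            nxt.append(v * (1 - bias))        # left half gets 30%
            nxt.append(v * bias)              # right half gets 70%
        buckets = nxt
    return [int(round(v)) for v in buckets]   # integer access counts

for t, v in enumerate(b_model(1000000, 10)):  # 1,024 time intervals
    print(t, v)                               # (time stamp, number of disk accesses)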

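For Q3.2, a minimal sketch of the entropy plot, assuming the definition from the b-model paper: at aggregation level l, split the (zero-padded) sequence into 2^l equal segments, let p_i be segment i's share of the total volume, and plot H(l) = -sum_i p_i log2(p_i) against l:

import sys
from math import log2

def entropy_plot(seq):
    n = 1
    while n < len(seq):
        n *= 2
    seq = list(seq) + [0.0] * (n - len(seq))  # zero-pad to the next power of 2
    total = float(sum(seq))
    for l in range(n.bit_length()):           # levels 0 .. log2(n)
        k = n >> l                            # segment length at level l
        h = 0.0
        for i in range(0, n, k):
            p = sum(seq[i:i + k]) / total     # segment i's share of the volume
            if p > 0:                         # 0 * log 0 = 0 by convention
                h -= p * log2(p)
        print(l, h)                           # pipe into gnuplot for the plot

entropy_plot([float(line) for line in open(sys.argv[1]) if line.strip()])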
 

Q4: Text and SQL [15 pts]

Download these five electronic books from the Project Gutenberg website. You will need to decompress them and remove the header text from the uncompressed files:
  1. Herman Melville's Moby Dick (document_id = 1)
  2. Leo Tolstoy's War and Peace (document_id = 2)
  3. Jane Austen's Pride and Prejudice (document_id = 3)
  4. Albert Einstein's Relativity: the Special and General Theory (document_id = 4) (Just ignore the equation files. No need to remove the image pointers from the text).
  5. Laozi's Tao Te Ching (document_id = 5) (or Dao De Jing. If curious, optionally read more here)
Turn this data into an SQL database with a schema like:
create table LIBRARY(document_id integer, term varchar(150), frequency integer);
For example:
document_id  term          frequency
1            whale         1685
4            acceleration  11
1            ahab          512
5            way           53
...          ...           ...
Given this data, please (sample SQL queries are sketched after this list):
  1. [5 pts] Generate and hand in the Zipf plots (rank-frequency, in log-log scale) for each of the texts. Also hand in the code used to generate these plots.
  2. [5 pts] Generate and hand in the probability density function (pdf) plots (count-frequency, in log-log scale) for each of the texts. Also hand in the code used to generate these plots.
  3. [2 pts] Report which author has the largest vocabulary.
  4. [1 pt] What else can these plots tell you about how each author uses his/her words?
  5. [2 pts] Report which ten words are the most popular across all authors. Also report which ten words are most popular for each author.
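For reference, hedged SQL sketches against the LIBRARY schema above (LIMIT is the SQLite/MySQL/PostgreSQL spelling; adapt to your DBMS):

-- Q4.3: vocabulary size (number of distinct terms) per document
select document_id, count(*) as vocabulary
from LIBRARY group by document_id order by vocabulary desc;

-- Q4.5: the ten most popular words across all authors
select term, sum(frequency) as total
from LIBRARY group by term order by total desc limit 10;

-- Q4.5: the ten most popular words for one author, e.g. document_id = 1
select term, frequency from LIBRARY
where document_id = 1 order by frequency desc limit 10;

-- Q4.1: frequencies in rank order for one document; number the output
-- lines to get the (rank, frequency) pairs for the Zipf plot
select frequency from LIBRARY
where document_id = 1 order by frequency desc;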
HINTS for preprocessing:

You can easily parse each text file into an easier-to-manage file, with each word on its own line, by typing, in Linux/Unix/Cygwin:
bash> tr -cs 'a-zA-Z' '[\n*]' < original-file.txt > one-word-per-line-file.txt
And then convert all words to lower case with:
bash> tr A-Z a-z < one-word-per-line-file.txt > one-word-per-line-all-lower-case-file.txt
You can get help with this command by typing man tr or looking here.

Note that you will still have to collapse, count, and remove duplicate words. You can do this with the Linux/Unix/Cygwin commands sort and uniq -c, e.g.:
bash> sort one-word-per-line-all-lower-case-file.txt | uniq -c > term-counts.txt

Search the internet or use the man pages for more details.