CARNEGIE MELLON UNIVERSITY
15-826 - Multimedia databases and data mining
Spring 2006
Homework 2 - Due: March 21,
3:01pm, in class.
Important:
- Please turn in a typed
report - handwritten material will
not
be graded.
- Due date: March 21,
3:01pm, in class
- No collaborations are allowed.
For your information:
- Expected time effort for this homework:
- Q1: 1h
- Q2: 30'
- Q3: 5h
- Q4: 1h
- Q5: 3h
- Points are in [bold pts] and add to 100.
Q1: Indexing [15pts]
1.1 Problem description
In this question you are required to create a keyword search system, by using
PostgreSQL.
Dataset:
dblp.gz
Run 'gunzip dblp.gz' to unzip.
The first column is the document id, and the second column is the word present
in the document.
- Create a table termDocPairs.
create table termDocPairs ( docId
varchar(50), term
varchar(50));
- Store the <document-id, keyword>
pairs in the table termDocPairs.
- Run the query
select count(*)
from termDocPairs
where term='using';
- Determine the average wall-clock time required to run the query. Take the
average over 5 runs of the query. Also report the type of the
machine. (Hint: lookup: Explain Analyze).
- Use the SQL EXPLAIN command to
examine how the optimizer treats the above query.
- Now create an index:
create index termIndex on termDocPairs(term);
and repeat steps 3, 4 and 5.
1.2 What to hand in:
- [3 pts] Hard copy result of the query.
- [4 pts] The average wall clock time for the query without an index.
- [4 pts] The average wall clock time for the query with an index.
- [4 pts] Output of EXPLAIN command in both the cases, and highlight the line(s)
indicating that the index was used (if it was).
1.3 Hints
Please follow the following instructions to run PostgreSQL in any of the
Andrew machines:
-
You should include /usr/local/lib/pgsql/bin
in your PATH. Type 'which psql' at the command line and
make sure the result is
'/usr/local/lib/pgsql/bin/psql'.
setenv PATH ${PATH}:/usr/local/lib/pgsql/bin/
-
Set PGPORT properly. The port number has to be unique for
every user. If you experience problem connecting with postgres, please try
changing the port number. You can select the 'some-number'
as the last four digit of your student id number or some unique number for your
own.
setenv PGPORT some-number
Please find
more details at
http://www.cs.cmu.edu/~olston/15-415/F05/HW/PostgreSQL_Readme.htm.
You may find the following documents useful:
http://www.cs.cmu.edu/~olston/15-415/F05/HW/PostgreSQL_Readme.htm
Syntax of EXPLAIN command:
http://www.postgresql.org/docs/7.3/static/sql-explain.html
How to use EXPLAIN command and understand its output:
http://www.postgresql.org/docs/7.3/static/performance-tips.html
Check the statistics collected by PostgreSQL:
http://www.postgresql.org/docs/7.3/static/planner-stats.html
Change the run-time configurations:
http://www.postgresql.org/docs/7.3/static/runtime-config.html
Create an Index for a table:
http://www.postgresql.org/docs/7.3/static/sql-createindex.html
Q2: Fractals [5]
2.1 Problem description
Download the fractal-dimension code
here and untar it.
Run it on the following datasets:
- [1pt] The
Koch Curve.
- [2pts] The
Periphery of
England
- [2pts]
3d dataset
2.2 What to hand in:
For each dataset, hand in the following:
(a) the fractal dimensions of the dataset (both D0 and D2),
and
(b) the corresponding plots.
Q3: Z-ordering [35pts]
3.1 Problem description
Consider the case of a 2-dimensional address space. We want to store rectangles (xlow, xhigh
, ylow , yhigh), after we break them into z-values.
The z-value for the (x=0,y=1) cell is 1 in decimal.
Write a program 'zvalues' in your favorite language that will do this job. The
first line should give the number of z-values output. For
example, from the UNIX prompt:
%
./zvalues 00 11 00 11 #xlow xhigh ylow yhigh
should return:
1
****
% ./zvalues 0100 1100 0100 1100 #xlow xhigh
ylow yhigh
should return:
17
0011****
0110****
01110000
01110010
01111000
01111010
1001****
1011000*
1011010*
1100****
11010000
11010010
11011000
11011010
1110000*
1110010*
11110000
Notice that:
- In general, it will return more than one z-values.
- For simplicity, you may assume that the input values are already
binary strings, without errors (i.e., same length, with only '0's and '1's,
and with xlow ≤ xhigh ; ylow ≤ yhigh
).
- Ideally, the length of the input strings could be arbitrary. For the
purpose of this assignment, assume that each binary string will have length at
most Lmax = 10.
3.2 What to hand in:
Run your program on the following
script, and hand
in the following:
- [20 pts] Hard copy of your
program's output on the script.
- [10 pts] Hard copy of the source
code.
- [5 pts] Email a tarfile of your
source code to the TA (kaustav+15826 at cs). He should be able to simply run "make
hw2" and get the output. Please include just the source code and not the
binaries.
Q4: SVD [10pts]
4.1.1 Problem description:
Consider the following Document by Term matrix:
DocTerm
The file is in csv (comma-separated-values) format with windows end of line
termination. In this file each row denotes a document, and the columns denote terms. An
entry (i,j) denotes the number of occurrences of the jth term in the
ith document. Based on their occurrence in the documents, the terms
can be clustered as belonging to a particular Topic. You need to run SVD on the
dataset to determine the number of topics present. You can use any standard
software (e.g., Matlab).
4.1.2 Noisy Dataset:
NoisyDocTerm
Run the same analysis on this dataset. Additionally you need to take care of
a small addition of noise to the dataset.
4.2 What to hand in:
- [5pts] The SVD components along with
your estimate of the number of topics for the DocTerm dataset.
- [5pts] The SVD components, and an
explanation of how you handle the noise for the NoisyDocTerm dataset.
Give your estimate of the number of topics present in the dataset.
Q5: String edit distance [35pts]
5.1 Problem description
The edit distance (ED) of two strings, s and t, is defined as the
minimum number of mutations required to change s into t, where
a mutation is one of:
- change a letter,
- insert a letter or
- delete a letter
Example:
- If s is "test" and t is "tent", then ED(s,t) = 1, because one substitution
(change 's' to 'n') is sufficient to transform s into t.
- If s is "test" and t is "scent", then ED(s,t) = 3, (insert 's', change 't'
to 'c', change 's' to 'n').
5.1.1 String editing distance
You need to implement a dynamic programming algorithm to compute the string
edit distance between two strings. Write a program 'sedit' in your favorite
language. For example, from the UNIX prompt:
% ./sedit test tent
should give output 1.
5.1.2 Cheap substitutions
Consider different costs for the three types
of mutations
- change a letter: cost = 1
- insert a letter: cost = 2
- delete a letter: cost = 2
Example:
- If s is "test" and t is "tent", then ED(s,t) = 1, because one substitution
(change 's' to 'n') is sufficient to transform s into t.
- If s is "test" and t is "scent", then ED(s,t) = 4, (insert 's', change 't'
to 'c', change 's' to 'n').
Write a program 'sedit2' in your favorite language. For example, from the
UNIX prompt:
% ./sedit2 test scent
should give output 4.
5.1.3
Substring Edit Distance (SED)
We are given a string s and a substring t.
Deletions from either end of the string are free. All other mutations have
cost 1. The Substring edit distance is the minimum cost needed to
transform s to t.
Example:
s is 'overnumerousnesses' t is 'number': SED = 1.
- Delete 'over'; result: 'numerousnesses' (cost = 0)
- Insert 'b'; result: 'numberousnesses' (cost = 1)
- Delete 'ousnesses'; result: 'number' (cost = 0)
Note that unlike the previous cases the substring edit distance is not
symmetric with respect to s and t.
Write a program 'subsedit' in your favorite language. From the UNIX prompt:
% ./subsedit overnumerousnesses number
should give output 1.
5.2 What to hand in:
Run your programs on the following
script, and hand
in the following:
- [5+5+10 pts] Hard copy of your
program's output on the script.
- [10 pts] Hard copy of the source
code.
- [5 pts] Email a tarfile of your
source code to the TA (kaustav+15826 at cs). He should be able to simply run "make
hw2" and get the output. Please do not include the binaries.