CARNEGIE MELLON UNIVERSITY

15-826 - Multimedia databases and data mining

Spring 2006

Homework 2 - Due: March 21, 3:01pm, in class.

Important:

For your information:

Q1: Indexing [15pts]

1.1 Problem description

In this question you are required to create a keyword search system, by using PostgreSQL. 

Dataset: dblp.gz
Run 'gunzip dblp.gz' to unzip. The first column is the document id, and the second column is the word present in the document.

  1. Create a table termDocPairs.
    create table termDocPairs ( docId varchar(50), term varchar(50));
  2. Store the <document-id, keyword> pairs in the table termDocPairs.
  3. Run the query
    select count(*)
    from termDocPairs
    where term='using';
     
  4. Determine the average wall-clock time required to run the query. Take the average over 5 runs of the query. Also report the type of the machine. (Hint: lookup: Explain Analyze).
  5. Use the SQL EXPLAIN command to examine how the optimizer treats the above query.
  6. Now create an index:
    create index termIndex on termDocPairs(term);
    and repeat steps 3, 4 and 5.
     

1.2 What to hand in:

  1. [3 pts] Hard copy result of the query.
  2. [4 pts] The average wall clock time for the query without an index.
  3. [4 pts] The average wall clock time for the query with an index.
  4. [4 pts] Output of EXPLAIN command in both the cases, and highlight the line(s) indicating that the index was used (if it was).

1.3 Hints

Please follow the following instructions to run PostgreSQL in any of the Andrew machines:

Please find more details at http://www.cs.cmu.edu/~olston/15-415/F05/HW/PostgreSQL_Readme.htm.

You may find the following documents useful:

 

Q2: Fractals [5]

2.1 Problem description

Download the fractal-dimension code here and untar it. Run it on the following datasets:
 

  1. [1pt] The Koch Curve.
  2. [2pts] The Periphery of England
  3. [2pts] 3d dataset
     

2.2 What to hand in:

For each dataset, hand in the following:
(a) the fractal dimensions of the dataset (both D0 and D2), and
(b) the corresponding plots.

 

Q3: Z-ordering [35pts]

3.1 Problem description

Consider the case of a 2-dimensional address space. We want to store rectangles (xlow, xhigh , ylow , yhigh), after we break them into z-values. The z-value for the (x=0,y=1) cell is 1 in decimal. Write a program 'zvalues' in your favorite language that will do this job. The first line should give the number of z-values output. For example, from the UNIX prompt:


    % ./zvalues 00 11 00 11 #xlow xhigh ylow yhigh
should return:

1
****

    % ./zvalues 0100 1100 0100 1100 #xlow xhigh ylow yhigh
should return:

17
0011****
0110****
01110000
01110010
01111000
01111010
1001****
1011000*
1011010*
1100****
11010000
11010010
11011000
11011010
1110000*
1110010*
11110000

Notice that:

3.2 What to hand in:

Run your program on the following script, and hand in the following:

  1. [20 pts] Hard copy of your program's output on the script.
  2. [10 pts] Hard copy of the source code.
  3. [5 pts] Email a tarfile of your source code to the TA (kaustav+15826 at cs). He should be able to simply run "make hw2" and get the output. Please include just the source code and not the binaries.
     

Q4: SVD [10pts]

4.1.1 Problem description:

Consider the following Document by Term matrix: DocTerm

The file is in csv (comma-separated-values) format with windows end of line termination. In this file each row denotes a document, and the columns denote terms. An entry (i,j) denotes the number of occurrences of the jth term in the ith document. Based on their occurrence in the documents, the terms can be clustered as belonging to a particular Topic. You need to run SVD on the dataset to determine the number of topics present. You can use any standard software (e.g., Matlab).

4.1.2 Noisy Dataset: NoisyDocTerm 

Run the same analysis on this dataset. Additionally you need to take care of a small addition of noise to the dataset.

4.2 What to hand in:

  1. [5pts] The SVD components along with your estimate of the number of topics for the DocTerm dataset.
  2. [5pts] The SVD components, and an explanation of how you handle the noise for the NoisyDocTerm  dataset. Give your estimate of the number of topics present in the dataset.

 

Q5: String edit distance [35pts]

5.1 Problem description

The edit distance (ED) of two strings, s and t, is defined as the minimum number of mutations required to change s into t, where a mutation is one of:

  1. change a letter,
  2. insert a letter or
  3. delete a letter

Example:

5.1.1 String editing distance

You need to implement a dynamic programming algorithm to compute the string edit distance between two strings. Write a program 'sedit' in your favorite language. For example, from the UNIX prompt:

   % ./sedit test tent

should give output 1.

5.1.2 Cheap substitutions

Consider different costs for the three types of mutations

  1. change a letter: cost = 1
  2. insert a letter: cost  = 2
  3. delete a letter: cost = 2

Example:

Write a program 'sedit2' in your favorite language. For example, from the UNIX prompt:

   % ./sedit2 test scent

should give output 4.

5.1.3 Substring Edit Distance (SED)

We are given a string s and a substring t. Deletions from either end of the string are free. All other mutations have cost 1. The Substring edit distance is the minimum cost needed to transform s to t.

Example:
s is 'overnumerousnesses' t is 'number': SED = 1.

  1. Delete 'over'; result: 'numerousnesses' (cost = 0)
  2. Insert 'b'; result: 'numberousnesses' (cost = 1)
  3. Delete 'ousnesses'; result: 'number' (cost  = 0)

Note that unlike the previous cases the substring edit distance is not symmetric with respect to s and t.

Write a program 'subsedit' in your favorite language. From the UNIX prompt:

   % ./subsedit overnumerousnesses number
should give output 1.

5.2 What to hand in:

Run your programs on the following script, and hand in the following:

  1. [5+5+10 pts] Hard copy of your program's output on the script.
  2. [10 pts] Hard copy of the source code.
  3. [5 pts] Email a tarfile of your source code to the TA (kaustav+15826 at cs). He should be able to simply run "make hw2" and get the output. Please do not include the binaries.