CARNEGIE MELLON UNIVERSITY
15-826 - Multimedia databases and data mining
Fall 2012 C. Faloutsos

Homework 0

Out: Sept. 11 2012

Due: Sept. 18 2012, before 12:00 pm (class time), via blackboard  deposit

Introduction


Task

Estimated time

Setup SQLite

5-20 minutes

SQL test queries

5-10 minutes

KD tree

5-30 minutes


Q1: Test SQL

For this question we will be using the Marvel dataset, answering questions like 'who is the most popular Marvel-comic character'.


(For your information, the Marvel dataset lists the appearances of Marvel characters in comic books, and was obtained from: http://bioinfo.uib.es/~joemiro/marvel.html. If you are interested, you can find the names of the comic books and characters in the files marvelCharacters.txt and marvelComicBooks.txt.)

This question assumes you have access to SQLite, which is already installed on the Andrew Linux servers.


(Alternatively, you are welcome to install sqlite3 (or mysql, or postgres) on your own machine - click here for instructions for sqlite3.)


Once you have a running sqlite3 engine, run the scripts below. These commands have been tested with SQLite version 3.6.20 - earlier versions may give syntax errors.

1.     On the SQLite3  prompt, create a table with the following commands

linux_prompt% sqlite3 marvel.db

sqlite> create table marvel (
     char_id int,
     comic_id int
 );

2.     Import data from the file marvel.txt by running the following commands:

sqlite> .separator ' '

sqlite> .import marvel.txt marvel

3.     Replace '/absolute/path/to/marvel.txt' with the absolute file path - for instance, if you save the file in your home directory, the full path will be something like /afs/andrew.cmu.edu/usr2/ANDREWID/marvel.txt. Count the number of records in the table. Report the answer in your submission.

sqlite> select count(*) from marvel;

4.     How many distinct comic books are there in the dataset? Report the answer in your submission.

sqlite> select count(distinct(comic_id))
     from marvel;

       5.       (Optional) If you are curious, find which characters appear in the most comic books:

sqlite> select char_id,count(*) ct
     from marvel
     group by char_id
     order by ct;


Q2: Build and run the KD-tree package

The following instructions are for the andrew linux servers. You may use alternative settings (cygwin, mac-os), at your own risk.

1.     Download the kdtree package in your home dir on the andrew linux server (log in with your andrew credentials on linux.andrew.cmu.edu)

2.     Untar the archive

linux_prompt% tar -xvf kdtree_base.tar.gz

3.     Go in the kdtree_base directory.

4.     Clean the directory of any previous builds.

 linux_prompt% make clean 

5.     Build the package

linux_prompt% make 

6.     Run the test scripts provided (script1, script2, script3) and report the number of data points found in your submission for each command.

linux_prompt% kdtree_main -d 3 < script1
linux_prompt% kdtree_main -d 3 < script2
linux_prompt% kdtree_main -d 3 < script3


What to submit:

Submit your answers (5 numbers, all on the same line, blank separated), on the CMU Blackboard system (http://www.cmu.edu/blackboard), under 'HW0'. If (and only if) there is trouble depositing it, please email your answers mailto:ajuarez@andrew.cmu.edu, with the exact header "15826 HW0".

*       Q1.3 (a single number)

*       Q1.4 (a single number)

*       Q2.6 (three numbers, one per script)


Last modified by Antonio Juarez and Christos Faloutsos, Sept. 9, 2011