Homework 0

CARNEGIE MELLON UNIVERSITY
15-826 - Multimedia databases and data mining
Fall 2012 C. Faloutsos

Homework 0

Out: Sept. 11 2012

Due: Sept. 18 2012, before 12:00 pm (class time), via blackboard deposit

Introduction

Purpose: The point of this homework is to help you setup the tools you need for Homework 1.
Weight: It has no points, only a 'tick-mark', but it is mandatory.
Support: please post questions to the blackboard (discussion forum 'HW0').
Time to completion: the rough estimates are below

Task	Estimated time
Setup SQLite	5-20 minutes
SQL test queries	5-10 minutes
KD tree	5-30 minutes

Q1: Test SQL

For this question we will be using the Marvel dataset, answering questions like 'who is the most popular Marvel-comic character'.

(For your information, the Marvel dataset lists the appearances of Marvel characters in comic books, and was obtained from: http://bioinfo.uib.es/~joemiro/marvel.html. If you are interested, you can find the names of the comic books and characters in the files marvelCharacters.txt and marvelComicBooks.txt.)

This question assumes you have access to SQLite, which is already installed on the Andrew Linux servers.

(Alternatively, you are welcome to install sqlite3 (or mysql, or postgres) on your own machine - click here for instructions for sqlite3.)

Once you have a running sqlite3 engine, run the scripts below. These commands have been tested with SQLite version 3.6.20 - earlier versions may give syntax errors.

1. On the SQLite3 prompt, create a table with the following commands

linux_prompt% sqlite3 marvel.db

sqlite> create table marvel (
char_id int,
comic_id int
);

2. Import data from the file marvel.txt by running the following commands:

sqlite> .separator ' '

sqlite> .import marvel.txt marvel

3. Replace '/absolute/path/to/marvel.txt' with the absolute file path - for instance, if you save the file in your home directory, the full path will be something like /afs/andrew.cmu.edu/usr2/ANDREWID/marvel.txt. Count the number of records in the table. Report the answer in your submission.

sqlite> select count(*) from marvel;

4. How many distinct comic books are there in the dataset? Report the answer in your submission.

sqlite> select count(distinct(comic_id))
from marvel;

5. (Optional) If you are curious, find which characters appear in the most comic books:

sqlite> select char_id,count(*) ct
     from marvel
     group by char_id
     order by ct;

Q2: Build and run the KD-tree package
The following instructions are for the andrew linux servers. You may use alternative settings (cygwin, mac-os), at your own risk.

1. Download the kdtree package in your home dir on the andrew linux server (log in with your andrew credentials on linux.andrew.cmu.edu)

2. Untar the archive

linux_prompt% tar -xvf kdtree_base.tar.gz

3. Go in the kdtree_base directory.

4. Clean the directory of any previous builds.

linux_prompt% make clean

5. Build the package

linux_prompt% make

6. Run the test scripts provided (script1, script2, script3) and report the number of data points found in your submission for each command.

linux_prompt% kdtree_main -d 3 < script1
linux_prompt% kdtree_main -d 3 < script2
linux_prompt% kdtree_main -d 3 < script3

What to submit:
Submit your answers (5 numbers, all on the same line, blank separated), on the CMU Blackboard system (http://www.cmu.edu/blackboard), under 'HW0'. If (and only if) there is trouble depositing it, please email your answers mailto:ajuarez@andrew.cmu.edu, with the exact header "15826 HW0".

* Q1.3 (a single number)

* Q1.4 (a single number)

* Q2.6 (three numbers, one per script)

Last modified by Antonio Juarez and Christos Faloutsos, Sept. 9, 2011

CARNEGIE MELLON UNIVERSITY 15-826 - Multimedia databases and data mining Fall 2012 C. Faloutsos

Introduction

Q1: Test SQL

CARNEGIE MELLON UNIVERSITY
15-826 - Multimedia databases and data mining
Fall 2012 C. Faloutsos