Carnegie Mellon University
15-826 – Multimedia Databases and Data Mining
Spring 2010 – C. Faloutsos
Homework 1
Due date: Feb 2, 3:00pm
- Hand in hard copy of your homework in class. It must be typed; handwritten material may not be graded
- Put soft copy and code in 826hw1_YourAndrewID.tar and email it to TA (dchau at cs) in a single e-mail, whose title must be 826hw1_YourAndrewID
- Homework must be done individually. It is also MANDATORY, meaning you must get at least a passing grade: 50 points or higher
- For all the code that you write, please make sure they can be compiled and can run on a Linux machine of the Andrew cluster. To connect to one such machine, use your favorite SSH client (e.g., putty) to connect to linux.andrew.cmu.edu, and log in using your Andrew ID and password.
- You may also develop your code on other software and/or operating systems (e.g, cygwin on Windows), as long as your code works on an Andrew cluster machine.
Q1. SQL [25 pts]
You must use SQLite for Q1. The machines on the Andrew cluster already have SQLite 3.3.6 installed.
For your information, you may want to download SQLite to your local machine to try out your queries; SQLite is cross-platform. This page lists the SQL supported by SQLite. Notice that the latest version is 3.6.22 (as of Jan 19, 2009).
Both version 3.3.6 and 3.6.22 should work fine for the following questions; but if anything goes wrong, make sure your version is not older than 3.3.6.
Part I. Basic Queries & Join Queries
For each question in this part, provide both the SQL statement(s) and the resulting answer(s), unless specified otherwise.
The following tables list the estimated population of some of the most densely populated cities in the United States (from the United States Census Bureau, for July 1, 2008). Each city is uniquely identified by cityID, a numerical value that starts at 1.
- [3 points] Load both tables into a SQLite database. The table names and column names must follow those specified above. You don't need to turn in anything for this question. (Hint: use the SQLite .import and .separator commands)
- [2 points] What is the average population of the cities?
-
[3 points] What are the five most populated cities, in descending order of population? Your query should retrieve exactly five (city, state) tuples. (Hint: use limit)
-
[3 points] Which city name(s), if any, appears exactly three times in the cities table (we ignore the states)?
Part II. Accessing Two Databases
You now have the following two tables, stored in two separate SQLite databases (
cities.db and
ranks.db). When combined, the tables tell you which 10 cities in the US are the most livable (rank 1 means most livable), according to
Places Rated Almanac in 2009.
- [2 points] Download the two .db databases. Load cities.db with SQLite, using a command similar to sqlite3.exe cities.db. Then, use SQLite's attach statement to add the database ranks.db to the current database connection (for cities.db). Write down the attach statement that you used.
-
[3 points] Which city is the most livable? Your query should return exactly one (city, state) tuple. Write down both the query and the tuple.
Part III. Indexing
The following two tables, stored in
this SQLite database, contain the age and salary information of some employees of a company, each identified by an
id.
personal(id, age)
employment(id, salary)
The following query finds the number of employees who are in their 40s (i.e., 40 to 49, inclusive) and who make more than 100K (>100000).
SELECT count() FROM personal, employment
WHERE personal.id=employment.id AND personal.age >= 40 AND personal.age <=49
AND employment.salary>100000;
- [1 point] Report how much time (to the second) it takes to run the above query. Approximate, wall-clock time is fine. (You may use a stop watch; you may also use the Unix/Linux time command.)
- [3 points] Use SQLite's explain statement to examine why the query took a while to run. What is the reason? (Hint: use .explain to make the outputs easier to read)
- [2 points] Create one index on the id column of the personal table, another on that of the employment table. Write down the statements.
- [1 point] Report how much time it takes now.
- [2 points] Use the explain statement again to find out why it's faster. What is the reason?
Hints/Notes:
- The ASCII data files may have ms-windows/dos end-of-line termination. For unix/linux use "dos2unix" to convert them.
- You may use views or additional tables to compute intermediary results.
Q2. K-D Tree [25 points]
$
./kdtree_main -d 2
num. dimensions = 2
i for insertion
n for nn search
r for range search
p to print the tree
c for count
x to exit
h to print this help message
kdtree>
i
inserting ...
0-th attr. value=
0
1-th attr. value=
0
inserting point: 0 0
i for insertion
n for nn search
r for range search
p to print the tree
c for count
x to exit
h to print this help message
kdtree>
i
inserting ...
0-th attr. value=
1
1-th attr. value=
1
inserting point: 1 1
i for insertion
n for nn search
r for range search
p to print the tree
c for count
x to exit
h to print this help message
kdtree> c
counting ...
0-th attr. low value= 0
1-th attr. low value= 0
0-th attr. high value= 1
1-th attr. high value= 1
counting - low values: 0 0
counting - high values: 1 1
0 0
1 1
2 points found.
Problem Description: We will add functionality to an existing
K-D Tree package. First, build the package (
tar xvf; make). Then run
make test1, which will load a dataset and, finally, return a
Function not yet implemented! message.
Implementation:
-
Implement a new command c (for count) that finds the total number of points in the tree that are within the closed bounding box defined by two points: low = (l1, l2, ..., ln) and high = (h1, h2, ..., hn). In other words, the command should count any point p = (p1, p2, ..., pn) where li<=pi<=hi for 1<=i<=n. The command should first ask for the values of low, then for high. See the usage example on the right for the exact input and output interfaces of the c command (highlighted in yellow). Your interfaces should be the same. User's inputs are in red.
Turn In:
- [10 points] The count returned by c (number of points within the bounding box) when it is applied to the KD-Tree
constructed with (i) dataset 1, using make test1, (ii) dataset 2, using make test2, and (iii) dataset 3, using make test3.
- [15 points] A tarball (YourAndrewID_kdtree.tar) and a hard copy of your code. On the hard copy, highlight or circle the code that you modified\added.
Q3. Hilbert Curve [50 points]
Problem Description: You will write two programs:
- horder, to compute the Hilbert value of a 2D point
- ihorder, to compute the inverse of a Hilbert value
Implementation:
$ ./horder -g 2 0 1
3
horder should have input and output interfaces similar to that shown on the right (user's inputs are in
red). The arguments
-g 2 specifies 2 as the curve's granularity\order. And the arguments
0 1 (integers) are the x and y coordinates of a 2D point. The program then prints out the point's Hilbert value 3.
More specifically, the command-line syntax of horder and ihorder should be:
horder -g <granularity-of-curve> <x-value> <y-value>
ihorder -g <granularity-of-curve> <hilbert-value>
Thus:
horder -g 2 0 0
horder -g 1 0 1
horder -g 2 0 1
ihorder -g 2 0
ihorder -g 1 1
ihorder -g 2 3
Things to turn in:
- [20 points for horder, 20 points for ihorder] A tarball (YourAndrewID_hilbert.tar) and a hard copy of your code. You may use a programming language of your choice (e.g., C, C++, Java), but you must include, with your code, a makefile, that when I type make hilbert will run your program which prints out the answers for question 2. See the makefile included in the K-D Tree package in Q2 for reference.
STRONG HINTS
-
Use or modify the code by Jagadish
[SIGMOD 90]; see the algorithm by Roseman+
[PODS 89].
It is also fine to copy and/or modify code from the web, but you must cite the sources and make sure that your programs follow all of the above
specifications.
-
Make sure your program avoids arithmetic overflows
- [5 points] The answers for the following. For the soft copy, put the outputs in the text file YourAndrewID_hilbert.txt.
horder -g 1 1 0
horder -g 1 1 1
horder -g 2 0 0
horder -g 2 2 1
ihorder -g 2 2
ihorder -g 1 2
ihorder -g 2 3
horder -g 7 100 101
horder -g 8 100 101
ihorder -g 8 3000
-
[5 points] Using your ihorder, plot a Hilbert curve of granularity 4 (16x16 grid). You only need to hand in the hard copy of the plot (soft copy not needed).