Carnegie Mellon University
15-826 – Multimedia Databases and Data Mining
Spring 2010 – C. Faloutsos
Homework 1
Due date: Feb 2, 3:00pm

Q1. SQL [25 pts]

You must use SQLite for Q1. The machines on the Andrew cluster already have SQLite 3.3.6 installed.

For your information, you may want to download SQLite to your local machine to try out your queries; SQLite is cross-platform. This page lists the SQL supported by SQLite. Notice that the latest version is 3.6.22 (as of Jan 19, 2009).

Both version 3.3.6 and 3.6.22 should work fine for the following questions; but if anything goes wrong, make sure your version is not older than 3.3.6.

Part I. Basic Queries & Join Queries

For each question in this part, provide both the SQL statement(s) and the resulting answer(s), unless specified otherwise.

The following tables list the estimated population of some of the most densely populated cities in the United States (from the United States Census Bureau, for July 1, 2008). Each city is uniquely identified by cityID, a numerical value that starts at 1.

cities(cityID, cityName, state) population(cityID, value)
  1. [3 points] Load both tables into a SQLite database. The table names and column names must follow those specified above. You don't need to turn in anything for this question. (Hint: use the SQLite .import and .separator commands)
  2. [2 points] What is the average population of the cities?
  3. [3 points] What are the five most populated cities, in descending order of population? Your query should retrieve exactly five (city, state) tuples. (Hint: use limit)
  4. [3 points] Which city name(s), if any, appears exactly three times in the cities table (we ignore the states)?

Part II. Accessing Two Databases

You now have the following two tables, stored in two separate SQLite databases (cities.db and ranks.db). When combined, the tables tell you which 10 cities in the US are the most livable (rank 1 means most livable), according to Places Rated Almanac in 2009.
cities.db cities(cityID, cityName, state) ranks.db ranks(cityID, rank)
  1. [2 points] Download the two .db databases. Load cities.db with SQLite, using a command similar to sqlite3.exe cities.db. Then, use SQLite's attach statement to add the database ranks.db to the current database connection (for cities.db). Write down the attach statement that you used.
  2. [3 points] Which city is the most livable? Your query should return exactly one (city, state) tuple. Write down both the query and the tuple.

Part III. Indexing

The following two tables, stored in this SQLite database, contain the age and salary information of some employees of a company, each identified by an id.
personal(id, age) employment(id, salary)
The following query finds the number of employees who are in their 40s (i.e., 40 to 49, inclusive) and who make more than 100K (>100000).
SELECT count() FROM personal, employment WHERE personal.id=employment.id AND personal.age >= 40 AND personal.age <=49 AND employment.salary>100000;
  1. [1 point] Report how much time (to the second) it takes to run the above query. Approximate, wall-clock time is fine. (You may use a stop watch; you may also use the Unix/Linux time command.)
  2. [3 points] Use SQLite's explain statement to examine why the query took a while to run. What is the reason? (Hint: use .explain to make the outputs easier to read)
  3. [2 points] Create one index on the id column of the personal table, another on that of the employment table. Write down the statements.
  4. [1 point] Report how much time it takes now.
  5. [2 points] Use the explain statement again to find out why it's faster. What is the reason?

Hints/Notes:

Q2. K-D Tree [25 points]

$ ./kdtree_main -d 2 # create an empty 2-D tree num. dimensions = 2 i for insertion n for nn search r for range search p to print the tree c for count x to exit h to print this help message kdtree> i inserting ... 0-th attr. value= 0 1-th attr. value= 0 inserting point: 0 0 i for insertion n for nn search r for range search p to print the tree c for count x to exit h to print this help message kdtree> i inserting ... 0-th attr. value= 1 1-th attr. value= 1 inserting point: 1 1 i for insertion n for nn search r for range search p to print the tree c for count x to exit h to print this help message
kdtree> c counting ... 0-th attr. low value= 0 1-th attr. low value= 0 0-th attr. high value= 1 1-th attr. high value= 1 counting - low values: 0 0 counting - high values: 1 1 0 0 1 1 2 points found.
Problem Description: We will add functionality to an existing K-D Tree package. First, build the package (tar xvf; make). Then run make test1, which will load a dataset and, finally, return a Function not yet implemented! message.

Implementation:
Turn In:
  1. [10 points] The count returned by c (number of points within the bounding box) when it is applied to the KD-Tree constructed with (i) dataset 1, using make test1, (ii) dataset 2, using make test2, and (iii) dataset 3, using make test3.
  2. [15 points] A tarball (YourAndrewID_kdtree.tar) and a hard copy of your code. On the hard copy, highlight or circle the code that you modified\added.


Q3. Hilbert Curve [50 points]

Problem Description: You will write two programs: Implementation:
$ ./horder -g 2 0 1 3
horder should have input and output interfaces similar to that shown on the right (user's inputs are in red). The arguments -g 2 specifies 2 as the curve's granularity\order. And the arguments 0 1 (integers) are the x and y coordinates of a 2D point. The program then prints out the point's Hilbert value 3.

More specifically, the command-line syntax of horder and ihorder should be:

horder -g <granularity-of-curve> <x-value> <y-value> ihorder -g <granularity-of-curve> <hilbert-value>
Thus:
horder -g 2 0 0 #should return '0' horder -g 1 0 1 # should return '1' horder -g 2 0 1 # should return '3' ihorder -g 2 0 # should return '0 0' ihorder -g 1 1 # should return '0 1' ihorder -g 2 3 # should return '0 1'


Things to turn in:

  1. [20 points for horder, 20 points for ihorder] A tarball (YourAndrewID_hilbert.tar) and a hard copy of your code. You may use a programming language of your choice (e.g., C, C++, Java), but you must include, with your code, a makefile, that when I type make hilbert will run your program which prints out the answers for question 2. See the makefile included in the K-D Tree package in Q2 for reference.

    STRONG HINTS
    • Use or modify the code by Jagadish [SIGMOD 90]; see the algorithm by Roseman+ [PODS 89]. It is also fine to copy and/or modify code from the web, but you must cite the sources and make sure that your programs follow all of the above specifications.
    • Make sure your program avoids arithmetic overflows
  2. [5 points] The answers for the following. For the soft copy, put the outputs in the text file YourAndrewID_hilbert.txt.
  3. horder -g 1 1 0 horder -g 1 1 1 horder -g 2 0 0 horder -g 2 2 1 ihorder -g 2 2 ihorder -g 1 2 ihorder -g 2 3 horder -g 7 100 101 horder -g 8 100 101 ihorder -g 8 3000
  4. [5 points] Using your ihorder, plot a Hilbert curve of granularity 4 (16x16 grid). You only need to hand in the hard copy of the plot (soft copy not needed).