15-826 - Multimedia databases and data mining
Fall 2011 C. Faloutsos

Homework 1
Out: Sept. 22 2011
Due: Oct. 4 2011, before 12 noon (class time), via e-mail to the TA
Task Estimated time
Q1 SQL - 1 20 minutes
Q1 SQL - 2.1 10 minutes
Q1 SQL - 2.2 10-30 minutes
Q1 SQL - 2.3 10-30 minutes
Q1 SQL - 3.1 20-40 minutes
Q1 SQL - 3.2 20-40 minutes
Q1 SQL - 3.3 20-40 minutes
Q1 SQL - 4.1 10-20 minutes
Q1 SQL - 4.2 10-20 minutes
KD-trees - understand code 1 hour
KD-trees - develop algorithm 2-3 hours
KD-trees - coding 30 minutes
KD-trees - debugging 10 minutes - 6 hours
KD-trees - run scripts 10 minutes

Check-list: What to deliver

Please e-mail your solutions to Ina (mfiterau at cs dot cmu dot edu) with subject: [Databases - Homework 1 - ANDREWID]. Please use the following template for your answers: pdf or docx. Your e-message would have two attachments:
  1. A filled-in copy of the template, in pdf. It should have
  2. A tar file with your  modified kd-tree package, for Q2. 'tar xvf; make' should compile and run your code on the demo script.

Q1: SQL [60 points]

You are given the following sets of tables holding the same information about user access to web servers (download them from this page):
Set 1
user_id - unique int
name - string of max. length 50
transaction_id - unique int
user_id - int
time_of_transaction - datetime
transaction_id - unique int
server_IP_address - string
transaction_size_kb - double
server_traffic_size_kb - int
transaction_bytes - double
Set 2
user_id - unique int
name - string of max. length 50
server_IP_address - unique string
server_traffic_size_kb - double
user_id - int
server_IP_address - string
transaction_size_kb - double
time_of_transaction - datetime
In the schema above, a transaction has a single user and server. There can be multiple transactions to a server (even by the same user). The request/traffic limits are constant characteristics of the server. An IP_adress identifies a server. Time_of_transaction is reported at one minute granularity, for instance, a session that last three minutes will correspond to 3 transactions in both schemas.
  1. [4 points] Import both schemas into MySQL. Do any of the schemas break BCNF? If yes, which one(s).
  2. We would like to obtain the following - specify the queries and the answers you obtained.
    1. [3 points] What is the total size of each schema? (after the tables are loaded)
    2. [4 points] What is the server with the largest traffic size in kb? In case of a tie, report the one with the alphabetically first IP.
    3. [6 points] Count all pairs of users with common interests - excluding self-pairs and mirror-pairs. Two users have common interests if they make queries to the same IP_address.
  3. Only for schema 2: Assume we're looking for attacks and we would like reports on the following:
    1. [8 points] A list of the names of the top 5 users in terms of:
      1. number of simultaneous requests summed over all servers
      2. total kB downloaded summed over all servers over all timestamps (the entire time period recorded in the database)
    2. [9 points] For every server, report the total number of distinct users (aggregate over all timestamps) logged in and total bandwidth ever requested. The list contains:

      - server_id
      - number_of_distinct_users
      - total_traffic_size_kb

    3. [8 points] Find the server with the highest number of distinct users. (Hint: create a view for the previous query)
  4. Only for schema 2: Get the total traffic for user Smith
    1. [4 points] Report the wall-clock time it took for the query. Use  explain select  to find out what type of join was used.
    2. [4 points] Index user table by name and traffic table by user id. Run your query again and report time. Now report the type of join that was used.

Q2: KD-trees [50 points]

The full deletion in Kd-trees is tricky. Instead, you are to implement the tomb-stone deletion: that is, you only need to flag a node as 'deleted', but leave it in the tree. And, of course, you have to ignore it, for subsequent queries. Use  the provided kdtree code (note: this version is slightly different from the one in the previous homework: it should run on a mac, too). 
  1. [15 points] Execute script 1 (included, in the tar file above). What is the nearest neighbor to point of coordinates (0.1,0.1)?
    Run  kdtree_main -d 2 < script1  to get the answer.
  2. [15 points] Execute script 2 (also in the tar file). List all the points within the range determine by the points (0.2, 0.5) and (0.25, 0.55).
    Run  kdtree_main -d 2 < script2  to get the answer.
  3. [20 points] Evaluation on hidden scripts: Your code will also be evaluated on a series of scripts that will be made public after the homeworks are returned. The idea is to make sure that  you test your code throughly. Make sure it works on corner cases (e.g., empty tree; newly deleted leaf; deleting a node twice;  deleting, and re-inserting a node; etc).
As we mentioned, in the pdf, submit the functions you changed, highlighting your changes (in red). And, also submit a tar archive with the updated kd-tree package.

Last edited: 21 September 2011,  by Ina Fiterau and Christos Faloutsos