hw1

CARNEGIE MELLON UNIVERSITY
15-826 - Multimedia databases and data mining
Fall 2011 C. Faloutsos

Homework 1 Out: Sept. 22 2011 Due: Oct. 4 2011, before 12 noon (class time), via e-mail to the TA

Purpose: The point of this homework is help you get familiarized with SQL queries and working with KD-trees.
Weight: 1/3 of the weight for homeworks. The maximum score is 100 points.
Time to completion: Estimated total time: 1-4h for Q1; 3-10h for Q2. In more detail:

Task	Estimated time
Q1 SQL - 1	20 minutes
Q1 SQL - 2.1	10 minutes
Q1 SQL - 2.2	10-30 minutes
Q1 SQL - 2.3	10-30 minutes
Q1 SQL - 3.1	20-40 minutes
Q1 SQL - 3.2	20-40 minutes
Q1 SQL - 3.3	20-40 minutes
Q1 SQL - 4.1	10-20 minutes
Q1 SQL - 4.2	10-20 minutes
KD-trees - understand code	1 hour
KD-trees - develop algorithm	2-3 hours
KD-trees - coding	30 minutes
KD-trees - debugging	10 minutes - 6 hours
KD-trees - run scripts	10 minutes

Check-list: What to deliver

Please e-mail your solutions to Ina (mfiterau at cs dot cmu dot edu) with subject: [Databases - Homework 1 - ANDREWID]. Please use the following template for your answers: pdf or docx. Your e-message would have two attachments:

A filled-in copy of the template, in pdf. It should have

the answers to all the subquestions of Q1 and Q2, as well as
the code you changed for the kd-tree. Highlight your changes in red

A tar file with your modified kd-tree package, for Q2. 'tar xvf; make' should compile and run your code on the demo script.

Q1: SQL [60 points]

You are given the following sets of tables holding the same information about user access to web servers (download them from this page):

Set 1
schema1_users.csv user_id - unique int name - string of max. length 50
schema1_traffic_log_client_side.csv transaction_id - unique int user_id - int time_of_transaction - datetime
schema1_traffic_log_server_side.csv transaction_id - unique int server_IP_address - string transaction_size_kb - double server_traffic_size_kb - int transaction_bytes - double

Set 2
schema2_users.csv user_id - unique int name - string of max. length 50
schema2_servers.csv server_IP_address - unique string server_traffic_size_kb - double
schema2_traffic.csv user_id - int server_IP_address - string transaction_size_kb - double time_of_transaction - datetime

In the schema above, a transaction has a single user and server. There can be multiple transactions to a server (even by the same user). The request/traffic limits are constant characteristics of the server. An IP_adress identifies a server. Time_of_transaction is reported at one minute granularity, for instance, a session that last three minutes will correspond to 3 transactions in both schemas.

[4 points] Import both schemas into MySQL. Do any of the schemas break BCNF? If yes, which one(s).
We would like to obtain the following - specify the queries and the answers you obtained.
1. [3 points] What is the total size of each schema? (after the tables are loaded)
2. [4 points] What is the server with the largest traffic size in kb? In case of a tie, report the one with the alphabetically first IP.
3. [6 points] Count all pairs of users with common interests - excluding self-pairs and mirror-pairs. Two users have common interests if they make queries to the same IP_address.
Only for schema 2: Assume we're looking for attacks and we would like reports on the following:
1. [8 points] A list of the names of the top 5 users in terms of:
  1. number of simultaneous requests summed over all servers
  2. total kB downloaded summed over all servers over all timestamps (the entire time period recorded in the database)
2. [9 points] For every server, report the total number of distinct users (aggregate over all timestamps) logged in and total bandwidth ever requested. The list contains:
  
  - server_id
  - number_of_distinct_users
  - total_traffic_size_kb
3. [8 points] Find the server with the highest number of distinct users. (Hint: create a view for the previous query)
Only for schema 2: Get the total traffic for user Smith
1. [4 points] Report the wall-clock time it took for the query. Use explain select to find out what type of join was used.
2. [4 points] Index user table by name and traffic table by user id. Run your query again and report time. Now report the type of join that was used.

Q2: KD-trees [50 points]

The full deletion in Kd-trees is tricky. Instead, you are to implement the tomb-stone deletion: that is, you only need to flag a node as 'deleted', but leave it in the tree. And, of course, you have to ignore it, for subsequent queries. Use the provided kdtree code (note: this version is slightly different from the one in the previous homework: it should run on a mac, too).

[15 points] Execute script 1 (included, in the tar file above). What is the nearest neighbor to point of coordinates (0.1,0.1)?
Run kdtree_main -d 2 < script1 to get the answer.
[15 points] Execute script 2 (also in the tar file). List all the points within the range determine by the points (0.2, 0.5) and (0.25, 0.55).
Run kdtree_main -d 2 < script2 to get the answer.
[20 points] Evaluation on hidden scripts: Your code will also be evaluated on a series of scripts that will be made public after the homeworks are returned. The idea is to make sure that you test your code throughly. Make sure it works on corner cases (e.g., empty tree; newly deleted leaf; deleting a node twice; deleting, and re-inserting a node; etc).

As we mentioned, in the pdf, submit the functions you changed, highlighting your changes (in red). And, also submit a tar archive with the updated kd-tree package.

Last edited: 21 September 2011, by Ina Fiterau and Christos Faloutsos

CARNEGIE MELLON UNIVERSITY 15-826 - Multimedia databases and data mining Fall 2011 C. Faloutsos

Check-list: What to deliver

Q1: SQL [60 points]

Q2: KD-trees [50 points]

CARNEGIE MELLON UNIVERSITY
15-826 - Multimedia databases and data mining
Fall 2011 C. Faloutsos