CARNEGIE MELLON UNIVERSITY
15-826 - Multimedia databases and data mining
Spring 2006
Homework 1 - Due: Feb. 7, 3:01pm,
in class.
Important:
- Please turn in a typed
report - handwritten material will
not
be graded.
- Due date: Feb. 7,
3:01pm, in class
- All homeworks including this are to be
done INDIVIDUALLY
- Expected effort for this homework:
- Q1: 1-2 hours
- Q2: 5-6 hours: 1h to compile and run the
code, 2h to design the algorithm and 2-3 hours for coding and debugging.
- Q2: 5-7 hours: 1h to compile and run the
code, 3-5h for coding and debugging and 1h for generating results
Q1: SQL [20pts]
Retrieve the following tables and load them in a DBMS of your choice:
MS-Access is recommended, but any other is acceptable (eg., MySQL, postgres,
etc). Please refer to
http://www.cs.cmu.edu/~olston/15-415/F05/HW/PostgreSQL_Readme.htm if you
want to run postgres in one of the andrew machines.
Note for postgres users:
- Be careful to use your own unique port number (setenv PGPORT) as
described in the above link.
- You can use this script:
load.sql to
load the data into the server. Use the command '\i load.sql' at the psql
promt.
Tables:
- MOVIE (title, year, type)
- PLAY_IN (first_name, last_name, title)
- ACTOR (first_name, last_name, gender)
The tables are comma-separated ASCII files. Answer the following queries:
- [2pts] List the titles of all movies having more than 3 actors.
- [2pts] List the titles of all type=M movies of "Tom Cruise" made after
the year 1996 (>1996).
- [5pts] Report the actor (first_name, last_name) who has acted in the maximum
number of movies. In case of tie, report all such actors.
- [4pts] List the titles of all movies that have at least one common actor with
the movie "EuroTrip".
- [7pts] List the titles of all movies that have more than one common actor with
the movie "EuroTrip".
For each query give
- the resulting table [half of the
points] and
- the corresponding SQL statement [rest half
of the points].
Caution: The ASCII data files have windows end-of-line termination. For unix/linux
use "dos2unix" to convert them.
Q2 K-d trees : Spatial Join [30pts]
Consider the k-d-tree package, in C, at
www.cs.cmu.edu/~kaustav/15826/hw1/kdtree1_1.tar (tar
xvf; make). You are required to augment it to handle spatial joins.
Input Data (ASCII files with windows end-of-line termination):
- 2D data: (www.cs.cmu.edu/~kaustav/15826/hw1/2D.txt).
- Each line specifies a point. The (blank separated) format is: point_id x
y.
- point_id specifies a unique id for each point. x y are the two
coordinate values
- 3D data: (www.cs.cmu.edu/~kaustav/15826/hw1/3D.txt)
- Each line specifies a point. The (blank separated) format is: point_id x
y z.
- point_id specifies a unique id for each point. x y z are the three
coordinate values
Implement an option 's' for spatial join. Then your program should:
- read an integer value ε.
- print the coordinates of the pairs of points
that are within (<=) Euclidean distance ε of each other. Eliminate
self-pairs and mirror-pairs. Eg. (a,b) is a mirror pair of (b,a).
- print the count of such pairs.
You are not allowed to make n (= number of
points) range queries. You should be able to exploit the space partition
represented by the kd tree to do it more efficiently.
Turn in:
- [15pts] Hard copy: a printout of your source code (please give only the parts that you have added/modified).
Email the tar-ball to kaustav+15826 at cs. Please make sure that the program
correctly compiles with make.
- [8pts] Hard copy results: The coordinates of the pair of points and the count for
ε = 1, 10, 20 for the 2D dataset.
- [7pts] Hard copy results: The counts for
ε = 1, 10, 20, 50, 100, 200 for both data sets.
Q3: R trees [50pts]
You are required to extend the capabilities of an R tree package, and
implement plot and count queries.
- Code: The code for the R-Tree is at
www.cs.cmu.edu/~kaustav/15826/hw1/DRtree.tar.gz It
is in C; 'gunzip; tar xvf' and do 'make demo'. This creates
the bin/DRmain program and runs it on some small datasets. It has
been tested on unix platform in the andrew machines.
- Do 'bin/DRmain' and insert some points of your own, to become familiar
with the package.
- the program expects rectangles - treat points as degenerate rectangles (x_low
= x_high etc).
- in case of unexplained errors, do 'make spotless' , to delete all
libraries etc.
- Implement : Currently the R-tree package supports 's' for
range search, 'i' for insertion etc. You have to implement
- 'r' for 'mb(r) coordinates' : your program should print the
coordinates(x_low, x_high, y_low, y_high) of the MBRs of all the nodes of the
specified level.
- 'p' for '(p)lot' : your program should plot all the MBRs of the specified
level.
- 'c' for '(c)ount' : your program should print the count of the number of
nodes in the specified level.
- For each of the three queries above, your program should ask the user to
specify the level.
- Input Data : MGCounty
www.cs.cmu.edu/~kaustav/15826/hw1/MGCounty.txt (points specified as x and
y coordinate values). Please take into account that the file has windows
end-of-line termination.
- Turn in :
- [15pts] Hard copy: a printout of your source code (you are
welcome to give only the parts that you have added/modified).
- [5pts] Email the whole tar-ball to kaustav+15826 at cs. Please
make sure that the program correctly compiles with make.
- [15pts] Hard copy: Give the coordinates (x_low, x_high, y_low, y_high) of
level 1 parent MBRs. Assume the root to be level 0.
- [10pts] Hard copy: Make a plot of all the level 1 MBRs. (You may use
gnuplot to plot the rectangles).
- [5pts] Hard copy: Report the counts (a) of level 1 and (b)
of level 2 nodes.