Homework 1  Due: Tuesday Feb. 5, 1:30pm,
in class.
Problem Description. Retrieve the following tables and load them in a DBMS of your choice: MSAccess is recommended, but any other is acceptable (eg., MySQL, postgres, etc). Caution: The ASCII data files have windows endofline termination. For unix/linux use "dos2unix" to convert them.
Tables:
The tables are commaseparated ASCII files. Note that in ‘Rating’ table, we have omitted the userid. So, you might find that some movies have been rated multiple times with the same rating. The rating values range from 1 to 5 (5 being the best). Answer the following queries:
Q.1.1. [5 pts] List all movie titles where “Tom Cruise” has played.
SELECT title FROM playin WHERE first_name
= 'Tom' and
last_name = 'Cruise' 

Comment: the result on the old dataset is also fine.
Q.1.1. [5 pts] List the title of popular movies, that is, whose rating is great than 4 (>4) as well as their average ratings. Order the results by their average rating (from highest to lowest).
Clarification: we want to find the title as well
as the average rating for those popular movies. And here, a popular movie is
defined as the one whose average rating is great than 4.
SELECT title, avg(rating) FROM rating GROUP BY title HAVING
avg(rating)>4 ORDER BY
avg(rating) DESC 

Q.1.2. [10 pts] List the actors (first_name, last_name) and titles of movies, which are produced before 1960 and whose type is 'TV'.
SELECT
playin.first_name, playin.last_name, movie.title FROM playin,
movie WHERE movie.title
= playin.title and movie.year
< 1960 and movie.type = 'TV' 

Q.1.3. [10 pts] List the actors (first_name, last_name) as well as their genders, who have played in those well discussed movies, that is, which have been rated more than 3 times (>3). Note that, here the actual ratings do not matter. So a movie that has been rated 4 times, and each time it receives a rating of 1 will still qualify.
SELECT
actor.first_name,actor.last_name, actor.gender FROM playin,
actor WHERE
playin.first_name = actor.first_name and
playin.last_name = actor.last_name and playin.title
in (SELECT title
FROM rating
GROUP BY title HAVING count(*)>3) 

What to hand in: for each query give
(1) the resulting tables [half of the points] and
(2)
the hardcopy of the corresponding SQL statement [half
of the points].
Fun fact: this is closely related to the NetFlix prize, which is $1M (see http://www.netflixprize.com/index).
Q.2.1. Problem description (1dimensional case): given a region (= segment) in 1dimension (x_{low}, x_{high}), break it into some homogeneous regions (i.e. one number, which consists of 1/0/*, for one such region). Here, x_{low}, and x_{high} will be binary strings with n bits, assuming that x_{low} is always less or equal than x_{high}. Write a program 'zrange_1d' in your favorite language that will do this job. For example, (with the usual unix/linux convention, that ‘%’ is the promp and ‘#’ is the comment symbol)
% ./zrange_1d 00 11 #xlow xhigh
**
Here, the answer ‘**’ means that it covers the whole region, which is the interval from 0 (=00_{2}) to 3 (=11_{2}). Another example is,
% ./zrange_1d 010 111 #xlow xhigh
01*
1**
Notice that in this example, the program returns two pieces and each of them is 3 characters long, indicating that the original region (010, 111) consists of two homogeneous regions.
Answer:
% ./zrange_1d 001 101
001
01*
10*
% ./zrange_1d 0100 1100
01**
10**
1100
Comment:
· A homogeneous must satisfy the following two features: (a) xlow and xhigh agree with the prefix; and (2) the suffix of xlow must be all 0s and the suffix of xhigh must be all 1s.
· The above claim has two exceptions: (a) a single point, where the suffix is empty and (2) the whole region where the prefix is empty.
· Try to do it recursively based on the above two claims.
Q.2.2. Problem description (2dimensional case): now, consider the case of
2dimensional space. We want to break a range query (x_{low}, x_{high
, }y_{low , }y_{high}), into zvalues. Write a program 'zrange_2d ' in your favorite language
that will do this job. Again, the four input numbers are n bits long, with the
x_{low} is less or equal than x_{high}, and the y_{low},
is less or equal than y_{high}.
For example,
% ./zrange_2d 00 11
00 11 #xlow xhigh ylow yhigh
****
This means that we cover the whole area and the output has 4 ‘don’tcare’ (‘*’) characters.
% ./zrange_2d
0100 1100 0100 1100 #xlow xhigh
ylow yhigh
0011****
0110****
01110000
01110010
01111000
01111010
1001****
1011000*
1011010*
1100****
11010000
11010010
11011000
11011010
1110000*
1110010*
11110000
Observation #1: we consider ‘1110010*’ as a legitimate zvalue despite the fact that it is not a square
region.
Observation #2: ‘*’ in the middle is not allowed.
Observation #3: The zvalue for the (x=0,y=1) cell is 1 in decimal.
Answer:
% ./zrange_2d 01010 01010 11001 11001
0111001001
% ./zrange_2d 0001 0001 0000 1111
0000001*
0100001*
0001001*
0101001*
0000011*
0100011*
0001011*
0101011*
% ./zrange_2d 01000 10011 01000 10011
0011******
011000****
011010****
10010*****
110000****
% ./zrange_2d 0001000 0101111 0001000 0101111
00000011******
0000011*******
0001001*******
00001001******
00001011******
000011********
000110********
00100001******
00100011******
001001********
001100********
Comment:
· Try to break this problem into two 1d problems and then do bitshuffling;
· Note that we allow odd number of ‘*’ at the end. In other words, we can represent a rectangle (instead of square) region by one single z value.
· Note that the 0s at the beginning are necessary. In other words, ‘0010’ and ‘10’ are two different regions.
Q.2.3. Given a 8×8
grid, we want to find out the region (x_{low}, x_{high , }y_{low , }y_{high}) that has largest number of zvalues. List the qualifying (x_{low}, x_{high , }y_{low
, }y_{high}) as
well as the count of its zvalues. If there is a tie, any answer is fine.
Answer: (000 111 001
110), which leads to 24 zvalues.
Comment: note that 1** is counted as 1 zvalue.
What to hand in:
(1)
[16 pts] Hard copy of your program's output on the
script.
(2)
[5 pts] Hard copy of the source code for Q.2.1 and Q.2.2.
(3)
[5 pts] Email a tarfile of your source code for Q.2.1 and Q.2.2 to the TA (htong at cs). Please
include just the source code and not the binaries.
(4)
[4 pts] List the qualifying (x_{low},
x_{high , }y_{low , }y_{high}) as well as the count of its zvalues. If there is a tie, any answer is
fine.
Q.3.1 Problem
Description. Consider the kdtree
package, in
C, (tar xvf; make). You are required to implement ‘w’ for
dra(w)ing the current space partition that is used by kdtree. For example, if
you insert these
three data points (each
row is a 2d data point) into the kdtree, you program should return the
following figure:
For this subproblem,
you can assume that all data points are located within a rectangle, where
x_low=y_low = 0; and x_high=y_high = 256.
What to Hand in:
(1)
[10 pts ] the plot of space partition that is used by kdtree if you insert the
following points into
the tree (each row is a 2d point);
(2)
[10 pts] the plot of space partition that is used by kdtree if you insert the
following points into
the tree (each row is a 2d point);
(3)
[5 pts] the hard copy of your source code (please give only the parts that
you have added/ modified). Email
the tarball to htong at cs. Please make sure that the program correctly
compiles with make.
Answer:
Comment:
·
We can
leverage ‘p’ function to solve this problem.
·
For those
who are not familiar with c, here is the matlab package for kdtree
·
The related
code is as follows:
/* C
code for finding the lines */
/*
generates intervals in the tree */
void draw(TREENODE * subroot){
printf("\n0 0 256
0\n");
printf("0 0 0
256\n");
printf("256 0 256
256\n");
printf("0 256 256
256\n");
NUMBER low[2];
low[0] = 0;
low[1] = 0;
NUMBER high[2];
high[0] = 256;
high[1] = 256;
rdraw(subroot, 0, low,
high);
}
void rdraw(TREENODE * subroot, int level, NUMBER * low, NUMBER * high){
if(subroot != NULL)
{
if(level == 0)
{
printf("%g %g %g
%g\n", ((subroot>pvec)>vec)[0], low[1], ((subroot>pvec)>vec)[0],
high[1]);
}
else
{
printf("%g
%g %g %g\n", low[0], ((subroot>pvec)>vec)[1], high[0],
((subroot>pvec)>vec)[1]);
}
NUMBER
tmp = high[level];
high[level]
= ((subroot>pvec)>vec)[level];
rdraw(subroot>left,
(level + 1) % 2, low, high);
high[level]
= tmp;
tmp
= (low)[level];
low[level]
= ((subroot>pvec)>vec)[level];
rdraw(subroot>right,
(level + 1) % 2, low, high);
low[level]
= tmp;
}
}
Q.3.2. According to
Bentley in 1975, the search time in a ddimensional
space is empirically expected to be O(log(n)), where n is the size of the dataset. The purpose of this subproblem is to
verify this claim, and to compare the search time of kdtrees against the time
for sequential scanning. You are
asked to randomly generate n data
points in 2d space; then, find the nearest neighbor to each of the following data
points (each row
is a data point) by kdtree; and compare the search time with sequential scan.
For sequential scan, if your machine is too low, you are allowed to only find
the nearest neighbor to the first data point. Make sure that you *EXCLUDE* the
time to construct the kdtree.
What to Hand in:
(1)
[10 pts], report plot (the search time vs. the size of the dataset) for the
two methods, where the size of the dataset n=1,000,000; 2,000,000; 4,000,000;
8,000,000; 16,000, 000. For each n,
run the experiment 10 times and report the mean search time and the _{}1 standard
deviation. Also, report the same plot in loglog scales.
(2)
[5 pts] the hard copy of your script which generates the above curve.
Hints:
(1)
Ideally,
the results should be run on a dedicated machine.
(2)
If your
machine is too slow, scale down the size of the dataset n by a factor of 10
(that is n=100,000; 200,000; 400,000; 800,000; 1,600,000).
(3)
You can use
gnuplot, excel, matlab (or anything else you prefer to) to generate the plot.
For example, you can use the command ‘errorbar’ in matlab to plot the mean search time as well as its standard
deviation.
(4)
You can use
a profiler to find the CPU/running time, or the function clock() in C
(and, of course, #include
<time.h> )
Answer:
For
sequential scan 

Linearlinear
plot 
Loglog
plot 


For
KDTree 

Linearlinear
plot 
Loglog
plot 


Comment:
·
The
absolute value does not matter.
·
Some
people found that the search time for KDtree (using clock()) is always be
0.000 because of the granuality of clock(). To avoid this, we can run the same
query multiple times and then use clock() the measure the total search time.
·
The
code for finding search time is as follows:
/* C code for finding the running time */
clock_t start = 0, finish = 0;
start = clock();
/* start your kdtree nn search, or sequential search here*/
finish = clock();
NUMBER duration = (NUMBER)(finish  start) / CLOCKS_PER_SEC;
printf("\n%g\n", duration);