**Homework 1 - Due: Tuesday Feb. 5, 1:30pm,
in class.**

- Due date: Tuesday Feb. 5,
**1:30pm**, hard copy in class and soft copy e-mailed to the TA (*htong at cs*) in a single e-mail. - This
homework is
**MANDATORY**, meaning you must get at least a passing grade:**50 points**or higher. - For
all questions please
**BOTH**: - e-mail the TA (
*htong at cs*) a soft-copy of all code, output and auxilary scripts tar'ed up in a single file - hand in a hard-copy of all code, output and auxilary scripts
- Please
turn in a
**typed**report - handwritten material may not be graded, at the grader's discretion. - All
homeworks including this are to be done INDIVIDUALLY
- Expected
effort for this homework (approximate times):
- Q1: 1 hour
- Q2: 15 hours: 6 hours for 2.1, 6h for 2.2, and 3 hours for 2.3.
- Q3: 6
hours: 3 hours for 3.1, and 3 hours for 3.2

Problem Description. Retrieve the following tables and load them in a DBMS of your choice: MS-Access is recommended, but any other is acceptable (eg., MySQL, postgres, etc). Caution: The ASCII data files have windows end-of-line termination. For unix/linux use "dos2unix" to convert them.

Tables:

- MOVIE
(title, year, type)
- PLAY_IN (first_name, last_name, title)
- ACTOR (first_name, last_name, gender)
- Rating (title, rating)

The tables are comma-separated ASCII files. Note that in ‘Rating’ table, we have omitted the user-id. So, you might find that some movies have been rated multiple times with the same rating. The rating values range from 1 to 5 (5 being the best). Answer the following queries:

Q.1.1.
**[5 pts]** List all movie titles where “Tom
Cruise” has played.

Q.1.2.
**[5 pts]** List the title of popular movies,
that is, whose rating is great than 4 (>4) as well as their average ratings.
Order the results by their average rating (from highest to lowest).

**Clarification:** we want to find the title as well
as the average rating for those popular movies. And here, a popular movie is
defined as the one whose average rating is great than 4.

Q.1.3.
**[10 pts] **List the actors (first_name,
last_name) and titles of movies, which are produced before 1960 and whose type
is 'TV'.

Q.1.4.
[**10 pts**] List the actors (first_name,
last_name) as well as their genders, who have played in those well discussed
movies, that is, which have been rated more than 3 times (>3). Note that,
here the actual ratings do not matter. So a movie that has been rated 4 times,
and each time it receives a rating of 1 will still qualify.

**What to hand in:** for each query
give

(1) the resulting tables [**half** of the points] and

(2)
the hardcopy of the corresponding SQL statement [**half
**of the points].

Fun fact: this is closely related to the NetFlix prize, which is $1M (see http://www.netflixprize.com/index).

Q.2.1. Problem description (1-dimensional case): given a region (= segment)
in 1-dimension (x_{low}, x_{high}), break it into some
homogeneous regions (i.e. one number, which consists of 1/0/*, for one such
region). Here, x_{low}, and x_{high} will be binary strings
with n bits, assuming that x_{low} is always less or equal than x_{high}. Write a program 'zrange_1d' in your
favorite language that will do this job. For example, (with the usual
unix/linux convention, that ‘%’ is the promp and ‘#’ is the comment symbol)

% ./zrange_1d 00 11 #xlow xhigh

**

Here, the answer ‘**’ means that it covers the whole region, which is the
interval from 0 (=00_{2}) to 3 (=11_{2}). Another example is,

% ./zrange_1d 010 111 #xlow xhigh

01*

1**

Notice that in this example, the program returns two pieces and each of them is 3 characters long, indicating that the original region (010, 111) consists of two homogeneous regions.

Q.2.2. Problem description (2-dimensional case): now, consider the case of
2-dimensional space. We want to break a range query (x_{low}, x_{high
, }y_{low , }y_{high}), into z-values. Write a program 'zrange_2d ' in your favorite language
that will do this job. Again, the four input numbers are n bits long, with the
x_{low} is less or equal than x_{high}, and the y_{low},
is less or equal than y_{high}.
For example,

% ./zrange_2d 00 11
00 11 #xlow xhigh ylow yhigh

****

This means that we cover the whole area and the output has 4 ‘don’t-care’ (‘*’) characters.

% ./zrange_2d
0100 1100 0100 1100 #xlow xhigh
ylow yhigh

0011****

0110****

01110000

01110010

01111000

01111010

1001****

1011000*

1011010*

1100****

11010000

11010010

11011000

11011010

1110000*

1110010*

11110000

**Observation #1:** we consider ‘1110010*’ as a legitimate z-value despite the fact that it is not a square
region.

**Observation #2:** ‘*’ in the middle is not allowed.

**Observation #3:** The z-value for the (x=0,y=1) cell is 1 in decimal.

Q.2.3. Given a 8×8
grid, we want to find out the region (x_{low}, x_{high , }y_{low , }y_{high}) that has largest number of zvalues. List the qualifying (x_{low}, x_{high , }y_{low
, }y_{high}) as
well as the count of its zvalues. If there is a tie, any answer is fine.

**What to hand in:**

(1)
**[16 pts]** Hard copy of your program's output on the
script.

(2)
**[5 pts]** Hard copy of the source code for Q.2.1 and Q.2.2.

*(3)
***[5 pts]** Email a tarfile of your source code for Q.2.1 and Q.2.2 to the TA (htong at cs). *Please
include just the source code and not the binaries.*

(4)
**[4 pts]** List the qualifying (x_{low},
x_{high , }y_{low , }y_{high}) as well as the count of its zvalues. If there is a tie, any answer is
fine.

Q.3.1 Problem
Description. Consider the k-d-tree
package, in
C, (`tar xvf; make`). You are required to implement ‘w’ for
dra(w)ing the current space partition that is used by k-d-tree. For example, if
you insert these
three data points (each
row is a 2-d data point) into the k-d-tree, you program should return the
following figure:

For this sub-problem,
you can assume that all data points are located within a rectangle, where
x_low=y_low = 0; and x_high=y_high = 256.

**What to Hand in:**

(1)
**[10 pts ]** the plot of space partition that is used by k-d-tree if you insert the
following points into
the tree (each row is a 2d point);

(2)
**[10 pts]** the plot of space partition that is used by k-d-tree if you insert the
following points into
the tree (each row is a 2d point);

(3)
**[5 pts]** the hard copy of your source code (please give only the parts that
you have added/ modified). Email
the tar-ball to htong at cs. Please make sure that the program correctly
compiles with `make`.

Q.3.2. According to
Bentley in 1975, the search time in a *d*-dimensional
space is empirically expected to be **O(log( n))**, where

**What to Hand in:**

(1)
**[10 pts]**, report plot (the search time vs. the size of the dataset) for the
two methods, where the size of the dataset n=1,000,000; 2,000,000; 4,000,000;
8,000,000; 16,000, 000. For each *n*,
run the experiment 10 times and report the mean search time and the _{}1 standard
deviation. Also, report the same plot in log-log scales.

(2)
**[5 pts]** the hard copy of your script which generates the above curve.

Hints:

(1)
Ideally,
the results should be run on a dedicated machine.

(2)
If your
machine is too slow, scale down the size of the dataset n by a factor of 10
(that is n=100,000; 200,000; 400,000; 800,000; 1,600,000).

(3)
You can use
gnuplot, excel, matlab (or anything else you prefer to) to generate the plot.
For example, you can use the command ‘errorbar’ in matlab to plot the mean search time as well as its standard
deviation.

(4)
You can use
a profiler to find the CPU/running time, or the function clock() in C
(and, of course, #include
<time.h> )