Carnegie Mellon University
15-826 – Multimedia Databases and Data Mining
Spring 2010 – C. Faloutsos
Homework 1 Solution

Q1

  1. create table cities(cityID INTEGER, cityName TEXT, state TEXT); create table population(cityID INTEGER, value INTEGER); .separator "," .import "cities.csv" cities .import "population.csv" population
  2. select avg(value) from population; 309228.538461539
  3. select cityName, state from cities, population where cities.cityID = population.cityID order by population.value desc limit 5; New York|New York Los Angeles|California Chicago|Illinois Houston|Texas Phoenix|Arizona
  4. select cityName from cities, population where cities.cityID=population.cityID group by cityName having count() = 3; Springfield
  5. attach "ranks.db" as ranks;
  6. select cityName, state from cities, ranks where cities.cityID=ranks.cityID and ranks.rank=1; Pittsburgh|Pennsylvania
  7. Typically between 30s and 120s, depending on the speed of your computer.
  8. SQLite uses nested for loops to perform the join of "personal.id=employment.id", one for scanning the id column of the personal table, the other for scanning the id column of the employment table (for each id in the personal table). This "sequential scan" is, naturally, slow.

  9. create index index_p_id on personal(id); create index index_e_id on employment(id);
  10. Around 1s.

  11. The two indexes created were based on B-trees. When performing the join of "personal.id=employment.id", for each id in the first table, SQLite looks for a matching id in the second table's id column using the column's index; each look-up is faster than O(n) when sequential/for-loop is used, assuming the number of records is n (empirically, fewer disk accesses are need).

Q2

  1. test1: 11 points
    test2: 81 points
    test3: 36 points
  2. In kdtree_main.c, we implement case c as follows:
    case 'c': printf("counting ...\n"); for(i=0; i<numdims; i++){ printf("%d-th attr. low value= ", i); scanf("%lf", &val); vecput( vpLow, i, val); } for (i=0; i<numdims; i++){ printf("%d-th attr. high value= ", i); scanf("%lf", &val); vecput( vpHigh, i, val); } printf(" counting - low values: "); vecprint( vpLow); printf(" counting - high values: "); vecprint( vpHigh); int cnt=0; rcount(root, vpLow, vpHigh, 0, &cnt); printf("%d points found.\n", cnt); break;
    In kdtree.c, we add the rcount function:
    void rcount(TREENODE *subroot, VECTOR *vpLow, VECTOR *vpHigh, int level, int *count){ int numdims; if( subroot != NULL ){ numdims = (subroot->pvec)->len; if( contains( vpLow, vpHigh, subroot->pvec ) ){ (*count)++; vecprint(subroot->pvec); } if( (vpLow->vec)[level] <= ((subroot->pvec)->vec)[level] ){ rcount( subroot->left, vpLow, vpHigh, (level+1)% numdims, count); } if( (vpHigh->vec)[level] > ((subroot->pvec)->vec)[level] ){ rcount( subroot->right, vpLow, vpHigh, (level+1)% numdims, count); } } return; }

Q3

  1. Java code, and the points of the 2D Cantor dust
  2. 3 2 0 13 1 1 1 1 0 1 10275 10273 54 34
  3. order-4 hilbert curve
  4. There are many possible methods that can generate the curve above. Here's one method. First run the following few lines of code to generate the x- and y- coordinates of all points.
    for i in {0..255} do ./ihorder -g 4 $i >> out.dat done
    Then using gnuplot, we run the following lines to create the figure.
    set xrange [-1:16] set yrange [-1:16] plot "out.dat" with linepoints