Carnegie Mellon University
15-826 – Multimedia Databases and Data Mining
Spring 2010 – C. Faloutsos
Homework 1 Solution

Q1

create table cities(cityID INTEGER, cityName TEXT, state TEXT);
create table population(cityID INTEGER, value INTEGER);
.separator ","
.import "cities.csv" cities
.import "population.csv" population
select avg(value) from population;
309228.538461539
select cityName, state from cities, population where cities.cityID = population.cityID order by population.value desc limit 5;
New York|New York
Los Angeles|California
Chicago|Illinois
Houston|Texas
Phoenix|Arizona
select cityName from cities, population where cities.cityID=population.cityID group by cityName having count() = 3;
Springfield
attach "ranks.db" as ranks;
select cityName, state from cities, ranks where cities.cityID=ranks.cityID and ranks.rank=1;
Pittsburgh|Pennsylvania
Typically between 30s and 120s, depending on the speed of your computer.
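SQLite uses nested for-loops to perform the join "personal.id=employment.id": the outer loop scans the id column of the personal table, and the inner loop scans the id column of the employment table once for each id in the personal table. With no index to consult, every inner pass is a sequential scan of the whole table, so the number of comparisons grows with the product of the two table sizes. This is, naturally, slow.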
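The nested-loop strategy described above can be sketched in a few lines of C. This is only an illustration of the access pattern, not SQLite's actual code: the Row type and the function name nested_loop_join_count are hypothetical, and the extra filters of the homework query are left out so that only the id comparison remains.

```c
#include <stddef.h>

/* Hypothetical row layout: only the id column matters for the join. */
typedef struct {
    int id;
    int attr;   /* placeholder for the remaining columns */
} Row;

/*
 * Count the pairs (outer[i], inner[j]) whose ids are equal.
 * For every outer row the whole inner table is scanned,
 * so the function performs n * m comparisons.
 */
size_t nested_loop_join_count(const Row *outer, size_t n,
                              const Row *inner, size_t m)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < m; j++) {
            if (outer[i].id == inner[j].id) {
                count++;
            }
        }
    }
    return count;
}
```

Doubling both tables quadruples the work, which is consistent with the long running time reported above.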

addr  opcode         p1      p2    p3    p4                 p5  comment
----  -------------  ------  ----  ----  -----------------  --  -------
0     Trace          0       0     0                        00
1     Null           0       1     0                        00
2     Integer        40      2     0                        00
3     Integer        49      3     0                        00
4     Integer        100000  4     0                        00
5     Goto           0       28    0                        00
6     OpenRead       0       2     0     2                  00
7     OpenRead       1       3     0     2                  00
8     Rewind         0       22    0                        00
9     Column         0       1     5                        00
10    Lt             2       21    5     collseq(BINARY)    6c
11    Column         0       1     5                        00
12    Gt             3       21    5     collseq(BINARY)    6c
13    Rewind         1       21    0                        00
14    Column         0       0     5                        00
15    Column         1       0     6                        00
16    Ne             6       20    5     collseq(BINARY)    6b
17    Column         1       1     6                        00
18    Le             4       20    6     collseq(BINARY)    6c
19    AggStep        0       0     1     count(0)           00
20    Next           1       14    0                        01
21    Next           0       9     0                        01
22    Close          0       0     0                        00
23    Close          1       0     0                        00
24    AggFinal       1       0     0     count(0)           00
25    SCopy          1       7     0                        00
26    ResultRow      7       1     0                        00
27    Halt           0       0     0                        00
28    Transaction    0       0     0                        00
29    VerifyCookie   0       7     0                        00
30    TableLock      0       2     0     personal           00
31    TableLock      0       3     0     employment         00
32    Goto           0       6     0                        00
create index index_p_id on personal(id);
create index index_e_id on employment(id);
Around 1s.

addr  opcode         p1      p2    p3    p4                 p5  comment
----  -------------  ------  ----  ----  -----------------  --  -------
0     Trace          0       0     0                        00
1     Null           0       1     0                        00
2     Integer        40      2     0                        00
3     Integer        49      3     0                        00
4     Integer        100000  4     0                        00
5     Goto           0       33    0                        00
6     OpenRead       0       2     0     2                  00
7     OpenRead       1       3     0     2                  00
8     OpenRead       2       750   0     keyinfo(1,BINARY)  00
9     Rewind         0       26    0                        00
10    Column         0       1     5                        00
11    Lt             2       25    5     collseq(BINARY)    6c
12    Column         0       1     5                        00
13    Gt             3       25    5     collseq(BINARY)    6c
14    Column         0       0     7                        00
15    IsNull         7       25    0                        00
16    Affinity       7       1     0     d                  00
17    SeekGe         2       25    7     1                  00
18    IdxGE          2       25    7     1                  01
19    IdxRowid       2       5     0                        00
20    Seek           1       5     0                        00
21    Column         1       1     6                        00
22    Le             4       24    6     collseq(BINARY)    6c
23    AggStep        0       0     1     count(0)           00
24    Next           2       18    0                        00
25    Next           0       10    0                        01
26    Close          0       0     0                        00
27    Close          1       0     0                        00
28    Close          2       0     0                        00
29    AggFinal       1       0     0     count(0)           00
30    SCopy          1       9     0                        00
31    ResultRow      9       1     0                        00
32    Halt           0       0     0                        00
33    Transaction    0       0     0                        00
34    VerifyCookie   0       9     0                        00
35    TableLock      0       2     0     personal           00
36    TableLock      0       3     0     employment         00
37    Goto           0       6     0                        00
The two indexes created are B-trees. When performing the join "personal.id=employment.id", for each id in the first table SQLite looks for a matching id in the second table's id column through that column's index. With n records in the second table, each look-up takes O(log n) steps instead of the O(n) steps of the sequential for-loop scan; empirically, fewer disk accesses are needed.
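The effect of the index can be illustrated with a small C sketch. It rests on a loudly stated assumption: a sorted array searched by binary search stands in for the B-tree, which is not how SQLite stores an index but gives the same O(log n) look-up cost. The names lower_bound and index_join_count are hypothetical. The inner while-loop mirrors the seek-then-advance pattern visible in the listing above (SeekGe followed by IdxGE and Next).

```c
#include <stddef.h>

/*
 * Return the first position in the sorted array `keys` whose value
 * is >= key, or m if there is none. Takes O(log m) steps.
 */
static size_t lower_bound(const int *keys, size_t m, int key)
{
    size_t lo = 0, hi = m;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (keys[mid] < key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}

/*
 * Count matching id pairs. The outer ids are scanned once; for each
 * of them the sorted inner ids are probed instead of scanned, then
 * the run of equal keys (duplicates) is walked.
 */
size_t index_join_count(const int *outer_ids, size_t n,
                        const int *sorted_inner_ids, size_t m)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        size_t pos = lower_bound(sorted_inner_ids, m, outer_ids[i]);
        while (pos < m && sorted_inner_ids[pos] == outer_ids[i]) {
            count++;
            pos++;
        }
    }
    return count;
}
```

The outer loop is unchanged from the nested-loop plan; only the inner scan is replaced by a probe, which is where the speed-up comes from.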

Q2

test1: 11 points
test2: 81 points
test3: 36 points
In kdtree_main.c, we implement case c as follows:
case 'c':
    printf("counting ...\n");
    for(i=0; i<numdims; i++){
        printf("%d-th attr. low value= ", i);
        scanf("%lf", &val);
        vecput( vpLow, i, val);
    }
    for (i=0; i<numdims; i++){
        printf("%d-th attr. high value= ", i);
        scanf("%lf", &val);
        vecput( vpHigh, i, val);
    }
    printf(" counting - low values: ");
    vecprint( vpLow);
    printf(" counting - high values: ");
    vecprint( vpHigh);
    int cnt=0;
    rcount(root, vpLow, vpHigh, 0, &cnt);
    printf("%d points found.\n", cnt);
    break;
In kdtree.c, we add the rcount function:
void rcount(TREENODE *subroot, VECTOR *vpLow, VECTOR *vpHigh, int level, int *count){
    int numdims;

    if( subroot != NULL ){
        numdims = (subroot->pvec)->len;
        /* count (and print) the current point if it lies inside the query box */
        if( contains( vpLow, vpHigh, subroot->pvec ) ){
            (*count)++;
            vecprint(subroot->pvec);
        }
        /* the box reaches at or below the split value: search the left subtree */
        if( (vpLow->vec)[level] <= ((subroot->pvec)->vec)[level] ){
            rcount( subroot->left, vpLow, vpHigh, (level+1)% numdims, count);
        }
        /* the box reaches above the split value: search the right subtree */
        if( (vpHigh->vec)[level] > ((subroot->pvec)->vec)[level] ){
            rcount( subroot->right, vpLow, vpHigh, (level+1)% numdims, count);
        }
    }
    return;
}
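To try the counting logic without the course's k-d tree package, the following stand-alone sketch reproduces the same pruning rule on a simplified tree. Everything in it is an assumption made for illustration: the KdNode type, the fixed dimensionality DIMS, and the functions kd_insert, in_box and kd_count are hypothetical and are not the package's TREENODE, VECTOR or contains. The insertion rule (values less than or equal to the split value go left) is chosen to be consistent with the two pruning tests in rcount.

```c
#include <stddef.h>
#include <stdlib.h>

#define DIMS 2   /* hypothetical fixed dimensionality */

typedef struct KdNode {
    double pt[DIMS];
    struct KdNode *left, *right;
} KdNode;

/* Insert pt; values <= the split value go left, larger ones go right. */
KdNode *kd_insert(KdNode *root, const double pt[DIMS], int level)
{
    if (root == NULL) {
        KdNode *n = malloc(sizeof *n);
        if (n == NULL) {
            abort();
        }
        for (int d = 0; d < DIMS; d++) {
            n->pt[d] = pt[d];
        }
        n->left = n->right = NULL;
        return n;
    }
    if (pt[level] <= root->pt[level]) {
        root->left = kd_insert(root->left, pt, (level + 1) % DIMS);
    } else {
        root->right = kd_insert(root->right, pt, (level + 1) % DIMS);
    }
    return root;
}

/* Return 1 if pt lies inside the closed box [lo, hi], else 0. */
static int in_box(const double lo[DIMS], const double hi[DIMS],
                  const double pt[DIMS])
{
    for (int d = 0; d < DIMS; d++) {
        if (pt[d] < lo[d] || pt[d] > hi[d]) {
            return 0;
        }
    }
    return 1;
}

/* Count the points inside [lo, hi], skipping subtrees the box cannot reach. */
int kd_count(const KdNode *root, const double lo[DIMS],
             const double hi[DIMS], int level)
{
    if (root == NULL) {
        return 0;
    }
    int count = in_box(lo, hi, root->pt);
    if (lo[level] <= root->pt[level]) {
        count += kd_count(root->left, lo, hi, (level + 1) % DIMS);
    }
    if (hi[level] > root->pt[level]) {
        count += kd_count(root->right, lo, hi, (level + 1) % DIMS);
    }
    return count;
}
```

Unlike rcount, this sketch returns the count instead of updating it through a pointer and does not print the matching points; the traversal and pruning are otherwise the same.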

Q3

Java code, and the points of the 2D Cantor dust
3 2 0 13 1 1 1 1 0 1 10275 10273 54 34

for i in {0..255}
do
    ./ihorder -g 4 $i >> out.dat
done

set xrange [-1:16]
set yrange [-1:16]
plot "out.dat" with linepoints