Carnegie Mellon University 15-826 Multimedia Databases and Data Mining Fall 2012 - C. Faloutsos Homework #1 - Solutions [Comments preceded by "RB" refer to the grading rubric used] Q1: 1) sqlite3 marvel.db .separator " " create table characters (char_id int, name varchar[500]); create table comics (comic_id int, name varchar[500]); create table appearances (char_id int,comic_id int); .import marvel.txt appearances .import marvel_characters.txt characters .import marvel_comic_books.txt comics RB: Did not include import statements: -1pt 2) RB: Some people assumed that each row in the tables was a distinct entity, some assumed that there were replicated names in the dataset. Both answers are considered correct. a) select distinct count(*) from characters; b) select count(distinct(name)) from characters; a) 6486 b) 6445 3) a) select distinct count(*) from comics; b) select count(distinct(name)) from comics; a) 12942 b) 12850 4) select c.name char_name,count(*) ct from appearances a,characters c where c.char_id=a.char_id group by a.char_id order by ct desc limit 1; SPIDER-MAN/PETER PARKER 1625 RB: 2pt 5) select c.name comic_name,count(*) ct from appearances a,comics c where c.comic_id=a.comic_id group by a.comic_id order by ct desc limit 1; COC 1 111 RB: 2pt 6) create view coactors as select distinct a.char_id id1,b.char_id id2 from appearances a,appearances b where a.comic_id=b.comic_id and a.char_id!=b.char_id; 7) select c.name,count(ca.id2) ct from coactors ca,characters c where c.char_id=ca.id1 group by c.char_id order by ct desc limit 3; CAPTAIN AMERICA 1933 SPIDER-MAN/PETER PAR 1741 IRON MAN/TONY STARK 1528 26,27s RB: Wrong co-actor counts included: -2pt RB: No co-actor counts included: -4pt RB: Incorrect co-actors: -8pt 8) create index app_char_idx on appearances (char_id); create index app_comic_idx on appearances (comic_id); create index char_idx on characters (char_id); 3s RB: Did not include index statements: -2pt 9) explain query plan select c.name,count(ca.id2) ct from coactors ca,characters c where c.char_id=ca.id1 group by c.char_id order by ct desc limit 3; output without index: 0 0 0 SCAN TABLE appearances AS a (~1000000 rows) 0 1 1 SEARCH TABLE appearances AS b USING AUTOMATIC COVERING INDEX (comic_id=?) (~7 rows) 0 2 2 SEARCH TABLE characters AS c USING AUTOMATIC COVERING INDEX (char_id=?) (~7 rows) 0 0 0 USE TEMP B-TREE FOR GROUP BY 0 0 0 USE TEMP B-TREE FOR ORDER BY output with index: 0 0 2 SCAN TABLE characters AS c USING INDEX char_idx (~1000000 rows) 0 1 0 SEARCH TABLE appearances AS a USING INDEX app_char_idx (char_id=?) (~10 rows) 0 2 1 SEARCH TABLE appearances AS b USING INDEX app_comic_idx (comic_id=?) (~5 rows) 0 0 0 USE TEMP B-TREE FOR ORDER BY Q2 RB: Each incorrect run's boundaries: -4pt RB: If code does not run: -12pt a) Low coords of bounding box: 0.100126 0.100131 High coords of bounding box: 0.999975 0.999988 b) Low coords of bounding box: 0.100132 0.015248 0.100013 High coords of bounding box: 0.999821 0.999822 0.999948 c) Low coords of bounding box: 0.028505 0.01849 0.100016 0.100136 High coords of bounding box: 0.999932 0.99983 0.999141 0.999815 Q3 RB: Incorrect rectangle bounds: -10pt RB: Incorrect rectangle count: -10pt RB: Code does not compile/run: -15pt Printout of MBRs at level 2: level=2 count=14 entry=0 childno=3 3 childcov= 26 998 360 897 0 997 1 972 entry=1 childno=31 31 childcov= 5 997 0 533 0 962 0 984 entry=2 childno=59 59 childcov= 0 975 19 998 6 995 27 999 entry=3 childno=65 65 childcov= 1 997 2 532 335 959 440 997 entry=4 childno=91 91 childcov= 31 998 589 999 2 998 12 981 entry=5 childno=101 101 childcov= 0 993 3 531 367 995 160 842 entry=6 childno=131 131 childcov= 30 993 538 999 26 989 0 980 entry=7 childno=149 149 childcov= 1 621 0 530 1 465 20 992 entry=8 childno=170 170 childcov= 34 926 783 999 290 997 21 978 entry=9 childno=182 182 childcov= 6 999 1 532 337 999 442 992 entry=10 childno=220 220 childcov= 0 999 5 529 341 997 3 438 entry=11 childno=236 236 childcov= 31 927 589 940 290 998 14 983 entry=12 childno=239 239 childcov= 291 999 1 530 1 457 22 996 entry=13 childno=265 265 childcov= 8 992 1 534 479 964 443 998 Count: Count of all data rectangles is 5000