11-741 - Information Retrieval
Jamie Callan
Yiming Yang
Due: Feb 11, 2010

Homework #2

Deadline: Feb 11, 11:59pm

In this homework, you will build a more efficient MapReduce (Hadoop) based indexer that extends the Program 1 you built for HW1. You will also build a MapReduce based program that reads inverted list fragments out of a sharded index and merges them.

Expect the cluster to be busier (longer wait times) as the deadline approaches, so start early. Going large scale is not easy: you will need time to debug on small datasets before scaling up to the large one, so start working on the indexer RIGHT AWAY. Also remember to leave a full day for the report; it carries a large share of the credit.
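
For a concrete picture of the merging step, here is a minimal, hypothetical sketch of a reducer that stitches inverted list fragments back together. It assumes a mapper has already re-keyed each fragment line by its word and passed the rest of the line ("DF" followed by TAB-separated docnos) through as the value; the class name and record layout are illustrative assumptions, not the required design.

  // MergeFragmentsReducer: a sketch only -- the names and the fragment layout are assumptions.
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class MergeFragmentsReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> fragments, Context context)
        throws IOException, InterruptedException {
      List<String> docnos = new ArrayList<String>();
      for (Text fragment : fragments) {
        // Each value is one fragment body: "DF \t docno \t docno ..."
        String[] fields = fragment.toString().split("\t");
        for (int i = 1; i < fields.length; i++) {
          docnos.add(fields[i]);          // skip fields[0], the fragment's DF
        }
      }
      Collections.sort(docnos);           // docnos from different shards may interleave
      StringBuilder out = new StringBuilder();
      out.append(docnos.size());          // merged (global) DF
      for (String docno : docnos) {
        out.append('\t').append(docno);
      }
      context.write(word, new Text(out.toString()));
    }
  }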

1. Learning points

2. Hand in

You must also turn in your source code, packaged as a .zip, .gz, or .tar file. The instructor will look at your source code, so make sure that it is legible, has reasonable documentation, and can be understood by others. This is a Computer Science class, so the instructor will actually care about your source code. The instructor will also run your code, so make sure that you include everything necessary to run it.

Please make it easy for the instructor to see how you have addressed each of the requirements described for each section.

3. Detailed assignment instructions

3.1 Background

Data:

3.2 Program 2: Secondary sort for simple indexing

Functionality: produce each word and its postings (the docnos the word appears in), using secondary sort to keep the docnos for each word sorted
Architecture:
Deployment
Deploy your Program 2 on the cluster using the full input data.
Make sure the docnos are sorted.

Include in the report,

  1. the first 10 documents of each inverted list for the sample words from HW1 Section 4.2, with the number of reducers fixed at 4
    Gather inverted lists for the same sample words used in HW1 Section 4.2.
    Hint: use the same hadoop fs -cat | grep trick from HW1 Section 4.2 to locate the sample words in your output directory.
    For frequent words, use head to truncate the long inverted list records, e.g.
    hadoop fs -cat /youroutput/part-00000 | grep -a "^WORD{Literal_TAB}" | head -n 1
    Here, keep the number of reducers for indexing fixed at 4. This will result in 4 shards (part-00000 to part-00003), and 4 inverted list fragments for each unique word.
    Override the toString() method of your data classes to format the output as follows (a minimal sketch appears right after this list):
    (each record ends with a newline character "\n", and fields within a record are separated by TABs)
    [word]	[DF]	[docno]			[docno]			...
    avatar	4	GX046-73-2232524	GX186-95-16464543	GX240-92-15755572	GX246-39-9037678
    avatar	4	GX001-35-5195992	GX006-47-3205930	GX036-51-9241581	GX173-01-16076052
    avatar	3	GX021-18-1156827	GX028-50-12367763	GX169-77-0344935
    avatar	1	GX241-62-5165601
    
    On the 10GB full dataset, the global DF of "avatar" is 12, distributed across 4 index shards.
    Make sure your indexer outputs the same inverted list for "avatar".
    
    The first 5 documents for the word "reduce" in each shard are:
    
    reduce	100	GX000-01-15897683	GX000-01-15979284	GX000-01-2893456	GX000-01-8011551	GX000-02-10992597
    reduce	100	GX000-02-11544820	GX000-02-15320763	GX000-02-15881606	GX000-03-3212167	GX000-03-5594674
    reduce	100	GX000-00-6435664	GX000-01-12642466	GX000-02-3622562	GX000-05-10797645	GX000-07-4432334
    reduce	100	GX000-00-4041288	GX000-02-5834661	GX000-03-7355341	GX000-04-8792333	GX000-06-13138007
    
    The DF values in each record above are all 100 because the original records are too long and have been broken into chunks of 100 docnos per record. Your counts may differ if you split the long records differently.
    
    Since there will be many records for a moderately frequent word, the easiest way to locate the first record of each shard is to grep each shard separately and use head to chop off the rest of the output, e.g. for the first shard: hadoop fs -cat output/part-00000 | grep -a "^WORD{Literal_TAB}" | head -n 1
    
  2. timings for indexing,
    You need to vary the #reduce tasks a bit, to see how the timing differs.
    You will be sharing the cluster with your fellow students, so expect your timings to vary with cluster load, and plan your runs accordingly.
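
For illustration only, here is a minimal sketch of a postings value class that satisfies the toString() requirement in item 1 above. It assumes the word itself is emitted as the Text output key, so TextOutputFormat's "key TAB value" line layout produces the format shown; the name PostingsWritable and its fields are assumptions, not part of any provided code.

  // PostingsWritable: illustrative only; assumes the word is emitted as a separate output key.
  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.io.Writable;

  public class PostingsWritable implements Writable {
    private List<String> docnos = new ArrayList<String>();   // kept sorted by the secondary sort

    public void add(String docno) { docnos.add(docno); }
    public void clear()           { docnos.clear(); }

    @Override public void write(DataOutput out) throws IOException {
      out.writeInt(docnos.size());
      for (String d : docnos) out.writeUTF(d);
    }
    @Override public void readFields(DataInput in) throws IOException {
      docnos.clear();
      int n = in.readInt();
      for (int i = 0; i < n; i++) docnos.add(in.readUTF());
    }

    // DF TAB docno TAB docno ...; the leading word and the trailing "\n" come from TextOutputFormat.
    @Override public String toString() {
      StringBuilder sb = new StringBuilder();
      sb.append(docnos.size());           // DF of this record = number of docnos it holds
      for (String d : docnos) sb.append('\t').append(d);
      return sb.toString();
    }
  }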

Sorting on the composite key <key1, key2> while grouping records that share the same key1 is called secondary sort. See a secondary sort example here.

Secondary sort is needed for indexing if you want to output the inverted list for each word as a single record while still keeping the docnos sorted.
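
If the linked example is unavailable, the following is a minimal sketch of the three pieces a Hadoop secondary sort typically needs: a composite key, a partitioner on the natural key, and a grouping comparator on the natural key. All class names below are illustrative assumptions, and each top-level class would normally live in its own .java file.

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.io.WritableComparator;
  import org.apache.hadoop.mapreduce.Partitioner;

  // Composite key (word, docno): the shuffle sorts on both fields, but partitioning and
  // grouping look only at the word, so each reduce() call sees one word's docnos in sorted order.
  public class TermDocPair implements WritableComparable<TermDocPair> {
    private final Text word  = new Text();
    private final Text docno = new Text();

    public void set(String w, String d) { word.set(w); docno.set(d); }
    public Text getWord()  { return word; }
    public Text getDocno() { return docno; }

    @Override public void write(DataOutput out) throws IOException { word.write(out); docno.write(out); }
    @Override public void readFields(DataInput in) throws IOException { word.readFields(in); docno.readFields(in); }

    @Override public int compareTo(TermDocPair o) {            // full sort order: word, then docno
      int cmp = word.compareTo(o.word);
      return (cmp != 0) ? cmp : docno.compareTo(o.docno);
    }
    @Override public int hashCode() { return word.hashCode(); }
    @Override public boolean equals(Object o) { return o instanceof TermDocPair && compareTo((TermDocPair) o) == 0; }
  }

  // Partition on the word only, so every (word, docno) pair for a word lands in the same shard.
  class TermPartitioner extends Partitioner<TermDocPair, Text> {
    @Override public int getPartition(TermDocPair key, Text value, int numPartitions) {
      return (key.getWord().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Group on the word only, so one reduce() call receives all docnos for a word, already sorted.
  class TermGroupingComparator extends WritableComparator {
    protected TermGroupingComparator() { super(TermDocPair.class, true); }
    @Override public int compare(WritableComparable a, WritableComparable b) {
      return ((TermDocPair) a).getWord().compareTo(((TermDocPair) b).getWord());
    }
  }

In the job driver these would be wired up with something like job.setMapOutputKeyClass(TermDocPair.class), job.setPartitionerClass(TermPartitioner.class), and job.setGroupingComparatorClass(TermGroupingComparator.class) (new mapreduce API; the old mapred API uses setOutputValueGroupingComparator instead).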

3.3 Program 3: Efficient Sharded Positional Indexer

After the two programs from the last HW, your indexer is just a few simple steps away.
Functionality: sharded and positional indexing
Architecture:
There are two main differences between this program and Program 2 from HW1:
Deployment
Run the indexer on full input data, and record timings.

Include in the report,

  1. timings for indexing,
    You need to vary the #reduces as in Program 2,
    compare with the timings from Program 2, and discuss differences.
  2. report the first 10 documents of each inverted list for the sample words,
    Hint: use the same hadoop fs -cat | grep trick in HW1 Section 4.2 to locate the sample words in your output directory.
    Here, keep the number of reducers for indexing fixed at 4. This will result in 4 shards, and 4 inverted list fragments for each unique word.
    Override the toString() method of your data classes to make your output look like this (see the sketch after this list):
    (each record ends with a newline character "\n", and fields within a record are separated by TABs)
      [word]	[DF]	[docno]			[TF]	[positions]			[docno]			[TF]	[positions]...
      avatar	4	GX046-73-2232524	4	571	623	655	710	GX186-95-16464543	1	5265	GX240-92-15755572	1	165	GX246-39-9037678	1	5046
      avatar	4	GX001-35-5195992	1	2370	GX006-47-3205930	2	16636	17687	GX036-51-9241581	1	2683	GX173-01-16076052	1	338
      avatar	3	GX021-18-1156827	1	384	GX028-50-12367763	7	1857	1859	1901	1936	1998	2023	2070	GX169-77-0344935	1	215
      avatar	1	GX241-62-5165601	1	80
    
    This is very similar to the output of Program 2, except that term occurrence positions are included here.
    Note that both DOCNOs and occurrence positions are sorted.
    
    The first 5 documents for the word "reduce" in each shard are:
    
    reduce	100	GX000-01-15897683	1	636	GX000-01-15979284	2	3746	4182	GX000-01-2893456	2	34	278	GX000-01-8011551	4	1111	1176	1684	1774	GX000-02-10992597	3	413	648	953
    reduce	100	GX000-02-11544820	2	18898	31448	GX000-02-15320763	1	809	GX000-02-15881606	2	93	187	GX000-03-3212167	2	396	461	GX000-03-5594674	1	107
    reduce	100	GX000-00-6435664	3	182	200	262	GX000-01-12642466	2	231	244	GX000-02-3622562	1	1980	GX000-05-10797645	1	136	GX000-07-4432334	1	201
    reduce	100	GX000-00-4041288	1	674	GX000-02-5834661	1	550	GX000-03-7355341	1	479	GX000-04-8792333	2	1892	2374	GX000-06-13138007	1	1923
    
    The DF values in each record above are all 100 because the original records are too long and have been broken into chunks of 100 docnos per record. Your counts may differ if you split the long records differently.
    
    Since there will be many records for a moderately frequent word, the easiest way to locate the first record of each shard is to grep each shard separately and use head to chop off the rest of the output, e.g. for the first shard: hadoop fs -cat output/part-00000 | grep -a "^WORD{Literal_TAB}" | head -n 1
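
As with Program 2, the format above can be produced by toString() methods on small helper classes. The sketch below is for illustration only; the names DocPosting and PositionalPostings are assumptions, and in a real job these classes would also implement Writable.

  import java.util.ArrayList;
  import java.util.List;

  // One document's entry: docno TAB TF TAB position TAB position ...
  public class DocPosting {
    private final String docno;
    private final List<Integer> positions = new ArrayList<Integer>();   // appended in order, so already sorted

    public DocPosting(String docno) { this.docno = docno; }
    public void addPosition(int pos) { positions.add(pos); }

    @Override public String toString() {
      StringBuilder sb = new StringBuilder();
      sb.append(docno).append('\t').append(positions.size());           // TF = number of positions
      for (int p : positions) sb.append('\t').append(p);
      return sb.toString();
    }
  }

  // One record: DF TAB docno TAB TF TAB positions ...; the word key and "\n" come from TextOutputFormat.
  class PositionalPostings {
    private final List<DocPosting> postings = new ArrayList<DocPosting>();  // docnos kept sorted by the secondary sort

    public void add(DocPosting p) { postings.add(p); }

    @Override public String toString() {
      StringBuilder sb = new StringBuilder();
      sb.append(postings.size());                                       // DF of this record
      for (DocPosting p : postings) sb.append('\t').append(p);
      return sb.toString();
    }
  }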
    

3.4 Some details about data classes:

3.5 Details about controller classes:

Other useful resources


Copyright 2010, Carnegie Mellon University.
Updated on Feb 23, 2010.
Le Zhao