11-741 - Information Retrieval
Jamie Callan
Yiming Yang
Due: Jan 26, 2010

Homework #1

Deadline: Jan 26, 11:59pm

The purpose of this homework is to get you started with coding in Hadoop. You will need to choose a local machine to use, install and configure software on it, run a sample MapReduce program, and develop two small MapReduce (Hadoop) programs. You will code in Java, use NetBeans as your development environment, and run your Hadoop jobs on a remote cluster running Linux. (If you are comfortable coding and debugging without an IDE, NetBeans is not necessary. But NetBeans plus the Hadoop plugin makes Hadoop programming a lot easier.)

Expect the cluster to be busier (longer wait times) as the deadline approaches, so start early. Installing and configuring all the software may take a while, so start RIGHT AWAY. Going large scale is not easy: you will need time to debug on a small dataset before scaling to the large one, so work on the programs as soon as possible. Starting early also gives you time to dig into Hadoop.

1. Learning points

2. Hand in

You must also turn in your source code, packaged as a .zip, .gz, or .tar file. The instructor will look at your source code, so make sure that it is readable and reasonably documented. This is a Computer Science class, so the instructor will actually care about your source code. The instructor will also run your code, so make sure that you include everything necessary to run it.

Please make it easy for the instructor to see how you have addressed each of the requirements described in each section.

3. Data

4. Detailed assignment instructions

4.1 Setting up:

Follow these instructions closely to
  1. set up your local machine and connect to the cluster,
  2. install necessary software (Hadoop, NetBeans, and the plugin),
  3. and configure your IDE (which also includes useful hints for coding and debugging).

4.2 Sample WordCount program

  1. Follow these instructions and get the default WordCount program working on a small text file (e.g. this file). Make sure you select the Hadoop-0.20.0 libraries (not the 0.18 version in the tutorial). A minimal sketch of the program appears after this list.
    Let's call the name of your project {PROJECT}.
  2. When you follow the tutorial to add the Hadoop library to your project, make sure you add only Hadoop-0.20.0; do NOT add the Hadoop-client library or anything else.
  3. Make sure you can use the workflow view to select input data, and that each panel in the workflow view (corresponding to a phase in the MapReduce framework) shows the right output.
  4. Make sure you can run your WordCount program by right-clicking on your project and selecting Run. (Before you do that, follow the instructions here to set up your project's main class and command-line arguments.) Use the 96KB test file for now; do not use the full input for testing.
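For orientation, here is a minimal sketch of a WordCount written against the 0.20 (org.apache.hadoop.mapreduce) API. The program generated by the tutorial may differ in its details (class names, use of a combiner, etc.), so treat this as a reference rather than the required code:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (token, 1) for every whitespace-delimited token in a line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts collected for each token.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count"); // the 0.20-era constructor
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional, but cuts shuffle traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}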
Deployment
Deploy the program (as a jar file) on the cluster using the full input (see the instructions about job deployment here).

Include in the report:
word counts on the full data for the terms "information", "age", "web", "retrieval", and "largescale".
Hint: to view the counts for certain words, simply grep the WordCount output files: hadoop fs -cat "YourOutputDirectory/*" | grep -a "^WORD{Literal_TAB}" This looks for lines that start with WORD followed by a tab (WordCount separates each word from its count with a tab). Enter {Literal_TAB} by pressing CTRL+V and then hitting TAB.
(Just so you can check your output: this default program counts the exact word "avatar" 9 times.)

4.3 Program 1: WordCount for TRECWEB files

Functionality: WordCount for TRECWEB format documents.
Architecture:
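(The original architecture notes are not reproduced here. As a rough sketch: assuming a record reader, such as the provided gov2 input record reader, that hands the mapper the full text of one <DOC>...</DOC> block per call, the mapper only needs to strip the markup and count the remaining tokens. The class name and key type below are illustrative, not part of the assignment.)

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper for TRECWEB records: assumes the input format delivers
// one whole TRECWEB document (a <DOC>...</DOC> block) as each map value.
public class TrecWebWordCountMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Crude markup removal: drop anything inside <...> tags, then tokenize.
    String doc = value.toString().replaceAll("<[^>]*>", " ");
    StringTokenizer itr = new StringTokenizer(doc);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken().toLowerCase()); // lowercase, to match the sample counts below
      context.write(word, ONE);
    }
  }
}

The reducer can remain the same summing reducer as in the sample WordCount; the real work of Program 1 is wiring in an input format that understands TRECWEB document boundaries instead of plain text lines.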
Deployment
Deploy your Program 1 on the cluster using the full input data.

Include in the report:

  1. word counts for the same sample words listed in Section 4.2.
    Just use the same hadoop fs -cat | grep trick from Section 4.2 to locate the sample words in your output directory.
    For example, if you use StringTokenizer to tokenize the document content and lowercase the words, then "reduce" appears 69149 times and "map" appears 135555 times in the corpus.
  2. timings for running the program. Do read the instructions for what timings to include and how to get them. You do NOT need to vary the number of maps and reduces for this homework; just use 4 reducers (see the one-line snippet after this list).
    (Just so you can check your output: with the gov2 input record reader and all tokens lowercased, the exact word "avatar" is counted 22 times.)
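Setting the reduce count is a single call during job setup. A minimal sketch under the 0.20 API, where job is the Job object configured in main (as in the WordCount sketch in Section 4.2):

// Fix the number of reduce tasks at 4, as required for this homework.
job.setNumReduceTasks(4);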

Have more questions? Please check the FAQ.



Other useful resources


Copyright 2010, Carnegie Mellon University.
Updated on Jan 13, 2010.
Le Zhao