11-741 - Information Retrieval
Jamie Callan
Yiming Yang |
Due: Jan 26, 2010 |
Homework #1
Deadline Jan 26 11:59pm
The purpose of this homework is to get you started with coding in Hadoop. You will need to choose a local machine to use, install and configure software on your local machine, run a sample MapReduce program and develop two small MapReduce (Hadoop) based programs. You will be coding in Java, using NetBeans as your development environment and running your Hadoop jobs on a remote cluster running Linux. (If you are comfortable coding and debugging without an IDE, NetBeans is not necessary. But the Hadoop plugin plus NetBeans makes Hadoop programming a lot easier.)
Expect the cluster to be busier (longer wait time) as it gets close to the deadline, so start early. Installing all the software and configuring them may take a while, so start RIGHT AWAY. Going large scale is not easy, you will need the time to debug on a small dataset and scale to the large one. So work on the programs as soon as possible. Start early also gives you time to dig into Hadoop.
1. Learning points
- Large Scale:
- Setting up development environment for Hadoop programming
- Running the sample WordCount program
- Basic Map/Reduce programming in Hadoop
- Creating your own WordCount program for TRECWEB format input
- Monitoring and debugging Hadoop jobs
2. Hand in
- report:
- collect word count statistics from the output of the sample WordCount program
- collect term statistics for sample terms using your Hadoop Programs 1
- discuss the Map/Reduce framework
- discuss the good and the bad of Map/Reduce
- good and bad about Hadoop's implementation
- try to make constructive suggestions
- you may compare with existing parallel computation frameworks.
- code:
- (code should be documented at necessary places to make sure that TA can understand it)
- one M/R program
- Program 1: WordCount for TRECWEB format
Include a) the source code (the src directory), b) the exact path to your jar file on the cluster, so that we can run your code.
You must also turn in your source code, packaged as a .zip, .gz, or .tar file. The instructor will look at your source code, so make sure that it is readable and has reasonable documentation. This is a Computer Science class lesson - the instructor will actually care about your source code. The instructor will also run your code, so make sure that you include everything necessary to run it.
Please make it easy for the instructor to see how you have addressed each of the requirements described for each section.
3. Data
- We will be using a full dataset of about 10GB of HTML texts in TRECWeb format, totally 372563 documents. It is a subset of GOV2 collection.
- It will be on HDFS of the Hadoop cluster: /user/data/gov2-10GB
- A smaller 1GB dataset of 23588 documents is also provided just for testing purposes: /user/data/gov2-1GB
- To test in the workflow view, an even smaller 96KB file of 9 documents is located here for download.
4. Detailed assignment instructions
4.1 Setting up:
Follow these instructions closely to
- setup your local machine and connect to the cluster,
- install necessary software (Hadoop, NetBeans, and the plugin),
- and configure your IDE (which also includes useful hints for coding and debugging).
4.2 Sample WordCount program
- Follow these instructions and get the default WordCount program working on a small text file (e.g. this file). Make sure you select Hadoop-0.20.0 libraries (not the 0.18 version in the tutorial).
Let's call the name of your project {PROJECT}.
- When you follow the tutorial to add Hadoop library to your project, make sure you only add Hadoop-0.20.0, do NOT add the Hadoop-client library or anything else.
- Make sure you can use the workflow view to select input data and each panel in the workflow view (corresponding to each phase in the MapReduce framework) is showing the right output.
- Make sure you can run your WordCount program by right clicking on your project and select run. (before you do that, follow instructions here to setup your project's main class and command-line arguments.) Use the 96KB test file now, do not use the full input to do the testing.
Deployment
Deploy the program (in a jar file) on the cluster using the full input (see instructions about job deployment here).
Include in the report,
word counts on the full data, for the terms: "information", "age", "web", "retrieval" and "largescale".
Hint: to view counts for certain words simply do a grep on the WordCount output files: hadoop fs -cat "YourOutputDirectory/*" | grep -a "^WORD{Literal_TAB}" This will look for lines starting with WORD and have a space following WORD. Enter {Literal_TAB} by pressing CTRL+V and then hitting TAB.
(Just for you to check your output, there are 9 counts of the exact word "avatar", using this default program.)
4.3 Program 1: WordCount for TRECWEB files
Functionality: WordCount for TRECWEB format documents.
- input: TRECWEB documents
- output: words and their Collection Term Frequency (CTF)
Architecture:
- For this program, create another project, and in it, create another Hadoop job class.
- InputRecordReader
- Background: we'll be using part of Cloud9's source code (edu.umd.cloud9.collection.gov2.Gov2DocumentInputFormat)
Follow the following steps:
- download the gzip source
- upload it onto your {LINUXSERVER}, in the {PROJECT}/src directory
- in Linux shell run: cd {PROJECT}/src and tar -zxf gov2inputformat.gz
(you may need to restart your NetBeans to see the new source code)
- right click on your project and select build to compile the new source code
- Now, notify Hadoop to use your InputRecordReader class, by choosing your class in the workflow view of your driver class, in the panel called "HadoopInput"
- it will automatically change the code in initJobConf() into: conf.setInputFormat(Gov2DocumentInputFormat.class);, (the plugin does not allow you to manuually change that in source code mode).
Hint: If you already chose the small test file as the input file in the workflow view, the Gov2InputFormat should give you 9 input records.
If you see any ClassNotFound error in the workflow view, build your project, or make sure your source code is in place.
- Mapper
- create the Mapper class: right clicking on your package -> New -> Other -> Hadoop MapReduce Components -> Hadoop Mapper (Stable) -> fill in input key type: LongWritable, input value: Gov2Document, output key: Text, output value: LongWritable.
This is very much the same as you created your Hadoop job class
- Get the textual content from the Gov2Document object. (You can get the docno as well.)
- the Cloud9's GOV2 InputFormat does some text cleaning for you, such as removing punctuation marks, removing HTML tags etc. You may simply lowercase it and use a StringTokenizer to tokenize the content into words.
- output <key=new Text(word), value=new LongWritable(1)> for each term occurrence
Hint: consult Hadoop's example WordCount program for how to write a token count Mapper
- hook your Mapper up with hadoop
- do it in the workflow view, in the HadoopMapper panel, you may need to scroll the horizontal bar to see the HadoopMapper panel
- Partitioner
- the default works
- Hadoop defaults to partition by the key from Mapper's output
- Combiner/Reducer
- again, the default works
- will accumulate counts, and output <key=word, value=count>
- OutputFormat
Just use the default org.apache.hadoop.mapred.TextOutputFormat, it's more human readable.
- Job configuration
You need to configure the number of mappers and reducers to use. The defaults are 1 and are given in the workflow view of your hadoop job class. You can also set them in the workflow view, or read it in from command line arguments as your input data and output path, and assign them using the right configuration method JobConf.setNumReduceTasks(n).
Because you will be sharing a small cluster of machines with your fellow students who will be working on the same deadlines as you do, make sure you set the number of reducers to be below 8.
Deployment
Deploy your Program 1 on the cluster using the full input data.
Include in the report,
- word counts for the same sample words listed in Section 4.2.
Just use the same hadoop fs -cat | grep trick in Section 4.2 to locate the sample words in your output directory.
For example, if you use StringTokenizer to tokenize the document content, and lowercase the words, then "reduce" appears 69149 times, and "map" appears 135555 times in the corpus.
- timings for running the program. Do read the instructions for what timings to include and how to get them. You do NOT need to vary the #maps and #reduces for this HW, just use 4 reducers.
(Just for you to check your output, there are 22 counts of the exact word "avatar", using the gov2 input record reader, and lowercase all tokens.)
Have more questions? Please check the FAQ
Other useful resources
- Hadoop MapReduce
This wiki includes the MapReduce execution process, and mostly all you'll need to know about coding for the homework, such as RecordReader for input record splitting.
- Cloud9 docs, includes useful coding, debugging, etc. information and trick
- Hadoop documentation includes example code and job configurations
- HDFS (Hadoop File System) shell commands
- Hadoop API
Make sure you are reading the right version of documentation, there are changes in API between versions.
- A step by step Hadoop developer tutorial
you can configure the NetBeans plugin to access the hadoop cluster on the Virtual Machine.
- VMware, if you want to install your own hadoop virtual machine
Copyright 2010, Carnegie Mellon University.
Updated on Jan 13, 2010.
Le Zhao