NetBeans IDE and Hadoop Plugin
- Coding:
- Follow the tutorial here closely to install the NetBeans plugin, create a Hadoop project, and run the default WordCount program on a small text file.
Read the tutorial up to "Walking Through Your First MapReduce Job in the Job Developer".
- Modify the created project to implement your assignment programs.
- NetBeans is very similar to Eclipse: it detects compilation errors in the code as you develop and suggests candidate fixes. Making full use of these functions will make your coding experience much easier.
- A few notes about the workflow view (accessible from your MapReduce main class source file):
- A common error in the workflow view is a Java ClassCast runtime exception: XXClass cannot be cast into XXClass.
This happens because the Workflow view does not automatically update its stages after you change your code. Refresh it explicitly by pushing the "force refresh" button beside your workflow view button, or by choosing the class again. This will eliminate the exception.
The data in each stage is also cached and not updated, which can cause the same kind of runtime exception; force a refresh in that case as well.
- The Workflow view also does not understand JobConf.setOutputValueGroupingComparator(..) in the code, so the results shown in the view may differ from what you get by actually running the code (via project->run, or by deploying on a remote machine).
- Before running, remember to set the main class (right click your project->properties->Run->Main Class) to be yourPackage.yourHadoopJobClass
- Debugging locally:
- When debugging locally, make sure you use a small test file as input.
- Right click on your project->Properties->compiling and enable "Generate Debugging info". Clean & Build the whole project after the change.
- If your program takes command-line arguments, make sure you set them (right click your project->properties->Run->Arguments & Main Class).
If you use the default template provided by the Hadoop plugin for NetBeans, the first command-line argument is the input data path for your Hadoop job (on HDFS) and the second is the output directory on HDFS.
Before you run your Hadoop job, make sure that the input exists and that the output directory does not exist.
- Use the workflow view to visualize results and identify problems.
The workflow view does not refresh itself after code changes.
Rebuild your project first, then force the view to refresh its stages by reloading the source data or one of the component M/R classes.
- To inspect the output from Map & sort when the workflow view is not working correctly (e.g. it does not recognize your comparator),
select IdentityReducer as the reducer (and as the combiner if needed) in the Workflow view, and run the whole program.
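The argument conventions above (first argument is the input path, second is the output directory, input must exist, output must not) can be checked before a local run. The following is a minimal sketch, not part of the plugin template: the class name `JobArgsCheck` and the method `validate` are hypothetical, and the check uses `java.nio.file`, so it only looks at the local file system. For paths on HDFS the same test would have to go through Hadoop's own file system API instead.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for local test runs; it is not provided by the Hadoop plugin.
class JobArgsCheck {

    /** Returns null when the arguments look usable, otherwise a description of the problem. */
    static String validate(String[] args) {
        if (args.length < 2) {
            return "expected two arguments: <input path> <output directory>";
        }
        Path input = Paths.get(args[0]);   // first argument: input data path
        Path output = Paths.get(args[1]);  // second argument: output directory
        if (!Files.exists(input)) {
            return "input does not exist: " + input;
        }
        if (Files.exists(output)) {
            return "output directory already exists: " + output;
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = validate(args);
        if (problem != null) {
            System.err.println(problem);
            System.exit(1);
        }
        System.out.println("arguments look fine");
    }
}
```

Calling `validate` at the top of your job's main method before submitting the job gives a clear message instead of a failure later in the run.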
- Running:
- Remember to build the project before deploying.
- Deploying locally for testing your application:
Right click on your project -> run. (You need to set up your project's main class and command-line arguments before doing this; see above for instructions.) If you are running the code from NetBeans, your input file should be on the local file system of {LINUXSERVER}.
- Deploying remotely on the Hadoop cluster:
Make sure you package your compiled class files into a jar; enable this in project->Properties->Build->Packaging->Build JAR after compiling. Then,
Monitoring Hadoop jobs running on a remote cluster
- An important tool for monitoring and debugging Hadoop jobs is the Web Interface.
- You will be able to see how your map and reduce tasks are finishing
- and the STDOUT, STDERR and SYSLOG output from your mappers and reducers.
- To determine why a map or reduce task failed, click on your job in the status page, click on the failed map or reduce tasks, click into one of the tasks, and in the "Task Logs" column click "Last 4KB". The SYSLOG output will show what is causing the problem.
- More documentation about the cluster can be found here.
Common failures on Hadoop
- Map or Reduce task failures
There can be multiple reasons for a task failure. The last 4KB of the log often shows what caused it. Possible reasons include:
- Out of memory. Mappers and reducers run under a preset memory limit; if they exceed the limit they will be killed.
- Timeout. Mappers and reducers need to report to the jobtracker once in a while that they are making progress. This happens automatically in output.collect(key, value);. But if one key takes too long to process and no records are output for over 600 seconds, that mapper or reducer will be killed.
This can happen, for example, at the reduce stage of inverted indexing for a very popular word. One way to solve this is to break the large record down into several smaller ones, for example one output record for every 100 docnos. Breaking large records down also helps avoid the out-of-memory problem. A small inconvenience of this solution is that the inverted list for a word will be split into several pieces, but as long as the docnos are still in order, this only costs a little more space for storing the duplicated keys.
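The chunking idea can be sketched in plain Java, without any Hadoop classes. The names here are hypothetical: `PostingChunker` and `emitInChunks` are illustrative, and the `Consumer` callback stands in for the point where a real reducer would call output.collect(key, value) once per piece. The docnos are read from an iterator and only one piece is buffered at a time, which is what keeps memory use bounded.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration of splitting one long inverted list into smaller records.
class PostingChunker {

    /**
     * Reads docnos in the order given and hands them to emit in pieces of at most
     * chunkSize entries. Returns the number of pieces emitted.
     */
    static int emitInChunks(Iterator<Integer> docnos, int chunkSize,
                            Consumer<List<Integer>> emit) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Integer> buffer = new ArrayList<>(chunkSize);
        int pieces = 0;
        while (docnos.hasNext()) {
            buffer.add(docnos.next());
            if (buffer.size() == chunkSize) {
                // In a reducer, this is where one output record would be written.
                emit.accept(new ArrayList<>(buffer));
                buffer.clear();
                pieces++;
            }
        }
        if (!buffer.isEmpty()) {
            // Emit the final, shorter piece.
            emit.accept(new ArrayList<>(buffer));
            pieces++;
        }
        return pieces;
    }

    public static void main(String[] args) {
        List<Integer> docnos = new ArrayList<>();
        for (int i = 1; i <= 250; i++) {
            docnos.add(i);
        }
        // Each printed line corresponds to one output record for the same word.
        emitInChunks(docnos.iterator(), 100,
                piece -> System.out.println("word\t" + piece.size() + " docnos starting at " + piece.get(0)));
    }
}
```

Because every piece produces an output record, a reducer written this way also reports progress regularly while it works through a very long list, which addresses the timeout described above.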