15-719 / 18-847B: Advanced Cloud Computing (Fall 2013)

Project 1 overview and FAQ

The goal of Project 1 is to learn how to write MapReduce jobs and run them within AWS's Elastic MapReduce framework. The project handout can be found here. It is due on 10/11/13 at 5pm.
When handing in your project, you MUST follow the steps in the e-mail sent to the 15-719 announce dlist regarding granting Raja permissions.

A tarball of your deliverables should be submitted to the following s3 bucket:


A tarball of your code should be submitted to:


Q1: When was the project handout last updated?

It was last updated on 10/12/2013.

Q2: Can I use Hadoop streaming for this project?

No, we want you to write actual MapReduce jobs in Java.

Q3: The project handout asks us to replace "%22"s with underscores. Should I also replace other percent-encoded values with underscores?

For the purposes of this project, only replace %22s with underscores.
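As a rough illustration of this rule, the sketch below replaces each literal "%22" in a title with an underscore and leaves every other percent-encoded sequence alone. The class name TitleCleaner, the method name, and the sample titles are made up for this example; they do not come from the handout.

```java
// Hypothetical helper, not part of the handout: applies the Q3 rule to one title.
final class TitleCleaner {
    private TitleCleaner() {}

    // String.replace performs a literal (non-regex) replacement of every occurrence,
    // so only the exact sequence "%22" is rewritten.
    static String replaceEncodedQuotes(String title) {
        return title.replace("%22", "_");
    }

    public static void main(String[] args) {
        // Sample titles are invented for illustration.
        System.out.println(replaceEncodedQuotes("%22Weird_Al%22_Yankovic")); // _Weird_Al__Yankovic
        System.out.println(replaceEncodedQuotes("Caf%C3%A9"));               // Caf%C3%A9 (unchanged)
    }
}
```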

Q4: Should I optimize my Hive table for the specific search terms that will be/have been e-mailed to me?

No, you should be able to return results for any search term.

Q5: Should I avoid days 16-30 of the June 2013 dataset by filtering them out line-by-line or by only inputting days 1-15 to my MapReduce program?

Please only input days 1-15. This is how we will test your program.

Q6: Can you create a separate S3 bucket with just days 1-15 of the June 2013 dataset?

Yes. We have uploaded files for just days 1-15 at s3://15719f13wikitraffic. To ensure consistency, please use files from this location as input to your program.

Q7: When will I get the search terms assigned to me?

They will be e-mailed out Tuesday night (10/08/13) or early Wednesday morning. Note that you do not need to know the specific search terms assigned to you to develop/test the project! Your code should work for any arbitrary search term.

Q8: For the last part of the project, should I worry about the case of the search terms? How should the search work?

Your searches should be case insensitive. A search should return information about any article whose title contains the given search term. So, a search for 'bara' would return information about the article Barack_Obama.
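The matching rule can be sketched in Java as a case-insensitive substring test. This only illustrates the intended semantics; it is not the Hive query itself, and the class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper illustrating the Q8 matching rule: a title matches
// when it contains the search term, ignoring case.
final class TitleSearch {
    private TitleSearch() {}

    static boolean matches(String title, String term) {
        // Lower-case both sides with a fixed locale so the result does not
        // depend on the machine's default locale.
        String normalizedTitle = title.toLowerCase(Locale.ROOT);
        String normalizedTerm = term.toLowerCase(Locale.ROOT);
        return normalizedTitle.contains(normalizedTerm);
    }

    public static void main(String[] args) {
        System.out.println(matches("Barack_Obama", "bara"));  // true
        System.out.println(matches("Barack_Obama", "OBAMA")); // true
        System.out.println(matches("Barack_Obama", "biden")); // false
    }
}
```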

Q9: How should I break ties? For example, what if two articles have the same number of page views and the handout asks me to rank by page views?

You should break ties using lexicographic order. The article that precedes the other in lexicographic order should come first. See the documentation on java.lang.String.compareTo() for a clear definition of lexicographic order.
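One way to express this tie-breaking rule is a comparator that orders by page views first and falls back to String.compareTo() on the title. The sketch below assumes that "rank by page views" means most-viewed first; the Article class, its fields, and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the Q9 tie-breaking rule.
final class ArticleRanking {
    private ArticleRanking() {}

    // Minimal stand-in for one output record; not taken from the handout.
    static final class Article {
        final String title;
        final long views;

        Article(String title, long views) {
            this.title = title;
            this.views = views;
        }
    }

    // Assumption: higher view counts rank first. Ties fall back to
    // String.compareTo, so the lexicographically smaller title comes first.
    static final Comparator<Article> RANK_ORDER = (a, b) -> {
        int byViews = Long.compare(b.views, a.views);
        return byViews != 0 ? byViews : a.title.compareTo(b.title);
    };

    // Returns the titles in ranked order without modifying the input list.
    static List<String> rankedTitles(List<Article> articles) {
        List<Article> sorted = new ArrayList<>(articles);
        sorted.sort(RANK_ORDER);
        List<String> titles = new ArrayList<>();
        for (Article article : sorted) {
            titles.add(article.title);
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Article> sample = new ArrayList<>();
        sample.add(new Article("Zebra", 10));
        sample.add(new Article("Apple", 10));
        sample.add(new Article("Mango", 25));
        System.out.println(rankedTitles(sample)); // [Mango, Apple, Zebra]
    }
}
```

Note that String.compareTo() compares UTF-16 code unit values, so it is case sensitive: an uppercase ASCII letter sorts before any lowercase one.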

Q10: Can I use a Hive interactive job to find search results or should I write a script instead?

You are free to use either, but I highly recommend writing a script and turning it in along with your source code.

Q11: In one version of the handout, the list of extensions to filter out includes "ico" and "txt", but these two items don't have periods in front of them like the other extensions. Should I treat them differently?

No, treat them as extensions. They should have a "." in front of them in the handout.
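To make the difference concrete, the sketch below tests for ".ico" and ".txt" as suffixes that include the period. It lists only the two extensions named in this question, not the handout's full list, and it assumes an exact, case-sensitive suffix match; the class and method names are hypothetical.

```java
// Hypothetical sketch for Q11: extensions are matched with their leading period.
final class ExtensionFilter {
    private ExtensionFilter() {}

    // Only the two extensions discussed in Q11; the handout's full list is
    // not reproduced here.
    private static final String[] EXCLUDED_EXTENSIONS = {".ico", ".txt"};

    // Assumption: a case-sensitive suffix match is what the handout intends.
    static boolean hasExcludedExtension(String title) {
        for (String extension : EXCLUDED_EXTENSIONS) {
            if (title.endsWith(extension)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasExcludedExtension("favicon.ico")); // true
        System.out.println(hasExcludedExtension("robots.txt"));  // true
        System.out.println(hasExcludedExtension("Mexico"));      // false
    }
}
```

With the period included, a title such as Mexico, which merely ends in the letters "ico", is not filtered out.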

Last updated: Mon Oct 28 20:09:17 -0400 2013