15-719: Advanced Cloud Computing (Fall 2014)

Project 1 overview and FAQ

The goal of project 1 is to learn to write simple iterative big data processing jobs (using MapReduce and Spark) and to learn to run them effectively within AWS's Elastic computing framework. By comparison between MapReduce and Spark, you will learn the strengths and weaknesses of the MapReduce framework.

The project handout is available here:

  1. Phase 1
  2. Phase 2
  3. Phase 3
The project presentation is available here. The presentation for phase 3 is available here. And for frequently asked questions to be answered here.

The project will not be done in groups; each student does this project on their own.

Readings helpful for project 1

1. Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation, Volume 6, 2004. See the lecture readings for [Dean04].
2. Zaharia, Matei, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot topics in cloud computing, 2010. See the lecture readings for [Zaharia10].
3. Amazon EMR developer guide.


Phase 1, is now due, Tuesday, September 23rd at 11:59PM Eastern
Phase 2, is now due, Wednesday, October 1st at 11:59PM Eastern

NEW!!! Phase 3, is now due, Friday, October 10th at 11:59PM Eastern. NEW!!!

Late Policy

NEW: Phase 3 will also have the same single day, 50% penalty, late policy (Sat Oct 11 at 11:59pm EST).

Phase 2 will have the same single day, 50% penalty, late policy.

The deadline for phase 1 of project 1 is in less than an hour.
If you miss the deadline but still submit before 11:59pm eastern on Wed Sept 24, you will receive a 50% penalty; that is, a perfect solution in 24 hours ends up being a bare pass.
If you want the 24-hour extension and 50% penalty, you must send email to 719-staff requesting the ability to submit tomorrow.
We will not grade twice and take the best. If you have request the extension you have taken the 50% penalty.
Note that this is the first and hardest phase of project 1, but it is not weighted with the majority of the points for project 1. Doing well on phase 2 and phase 3 can lead to a good final project 1 score. And understanding the phase 1 material will help in phase 2 and 3.


1. Are the AWS coupon codes out?
Yes. AWS coupon codes have been emailed to everyone. If you still haven't received it, please send an email to 719-staff.

2. What is Zlib? I do not seem to find a lot of information about it online?
Zlib is the default compression codec of Hadoop. That is, if you do not specify any compression codec in your MapReduce job configuration and just set compression to true, then Hadoop uses the Zlib codec.

3. Should I compress the intermediate map output or only the final output of a MapReduce job?
This is a design decision that is totally upto you. But for the compression section evaluation, you need not compress the intermediate map output.

4. Can I run the iterative job one after the other manually?
No. You should run a single jar (for the iterative PageRank part) which will take in the number of iterations as a command line argument and run for the specified number of iterations.

5. Do I need to filter the invalid urls in the value?
The correct way is to filter out (discard) all the urls which contain comma “,” in them as it is both invalid and might break the algorithm when you are using a comma as the separator.
For example, say we have the following key and value pairs
key: a.com value: b.com c.com c,def.com f.com
Output of the filtering job should be:
key: a.com value: b.com,c.com,f.com

If you did not filter them out and you do not want to change it now, it is still acceptable. In this case, the output of the filtering job will be
key: a.com value: b.com,c.com,c,def.com,f.com

If you have dealt with this in any other way, send an email to 719 staff.

6. How do I handle duplicate urls in key or value field in step 1 (filtering job)?
Say there are two keys with same value. Your filtering job should handle them and emit only one key url and consolidate the values.
For example, say we have the following key and value pairs
key: a.com value: b.com c.com def.com b.com c.com
key: a.com value: b.com c.com f.com def.com

Output of the filtering job should be
key: a.com value: b.com,c.com,f.com,def.com,f.com

Another acceptable output is
key: a.com value: b.com,c.com,def.com,b.com,c.com,b.com,c.com,f.com,def.com

7. Why do I need to influence (change) the number of reducers that run in step 1 (filtering job)?
The default block size in Hadoop is 128MB. It determines the number of mappers that are run. But some of the compression techniques (Zlib, Snappy) produces output which cannot be processed by different mappers (as they are encoded by the compression scheme and splitting them does not produce meaningful data). So now the number of output files (= number of reducers run) produced as a part of the step 1 matters as it determines the number of mappers that will run in the step 2. You need to run more number of mappers (so that each mapper can handle appox. one split of HDFS data) to parallelize the job and get better performance.

8. How should I handle dangling urls?
Dangling urls are the urls which: 1) does not belong to the initial list of keys and appear only on the values of one or more of the keys, OR 2) appear on the initial list of keys but does not link to any other url.
You should NOT filter them out and you should treat them just like other urls. The only difference is that they do not contribute pagerank values to other urls as they do not link to any of them.
Consider this example where the input dataset looks like this. Based on the algorithm in the handout the output should be as follows.
key: a.com value: b.com c.com
key: b.com value: c.com a.com

After filtering job (step 1) the output should be
key: a.com value: 1.0 b.com c.com
key: b.com value: 1.0 c.com a.com

After first iteration of the iterative pagerank (step 2) the output should be
key: a.com value: 0.5 b.com c.com
key: b.com value: 0.5 c.com a.com
key: c.com value: 1.0

After iteration 2, the output should be
key: a.com value: 0.25 b.com c.com
key: b.com value: 0.25 c.com a.com
key: c.com value: 0.5

Point to note here is that c.com is a dangling url and it is not filtered out, neither is its pagerank summed up (with pagerank from previous iteration) on each iteration.

9. Should I run the compression test only on the filtering job (step 1) or the whole application (filtering job and iterative pagerank job)?
You should run the compression test only on the filtering job (step 1). The Hive queries you run while evaluating the compression technique should be run on the output of the filtering job (step 1). Even though all the urls will have only pagerank of 1.0 at this point, it is okay because we need to test only the running time of the Hive queries when using compression and not the correctness or usefulness of the queries.

10. I did not get my set of specific urls for the project yet. What should I do?
Check your spam folder, as some email clients may filter out the email as it has a bunch of urls. If you cannot find it there, then send an email to the 719 staff mailing list.

11. How to handin?
You should handin your code and results by uploading to the bucket mentioned in the handout. While handing in, please make sure that you provide "Full control" access to the bucket as well as all the files inside the bucket individually to advancedcloudcomputing@gmail.com.
This should be done using the Bucket Explorer s3 client. This is needed otherwise it is not possible for us to see anything you submit.

Phase 2 - FAQ

12. How to compute pagerank for normal urls and dangling urls?
Step 1: Filter the input, remove duplicate urls and assign 1.0 as pagerank for all urls
Step 2: Compute pagerank for all the urls by summing up the pagerank contributions
Step 3: Count the total number of unique urls (including dangling urls)
Step 4: Compute the sum of all the pagerank values of all the dangling urls
Step 5: Update the pagerank of all the urls (including the dangling urls) with pagerank' = 0.15 + 0.85 * (sum of pagerank of dangling urls/total number of urls + pagerank)
Repeat steps 2 to 5 for 10 iterations.

After step 1,
key: a.com value: 1.0 b.com c.com
key: b.com value: 1.0 c.com a.com

After step 2,
key: a.com value: 0.5 b.com c.com
key: b.com value: 0.5 c.com a.com
key: c.com value: 1.0

Step 3 output: 3

Step 4 output: 1.0

After step 5,
key: a.com value: 0.858 b.com c.com
key: b.com value: 0.858 c.com a.com
key: c.com value: 1.28

13. Why is Shark not working on EC2?
There seems to be a comaptibility issue between Shark and Spark version > 1.0. This leads to the script which tries to install Shark on EC2 fail. An alternative is to use Spark verison < 1.0 to launch the cluster on EC2. You can download Spark version 0.9 from here. Another alternative is to use EMR to launch Spark and Shark cluster by using the EMR CLI or EMR GUI. Instructions on this can be found here.

Phase 3 - FAQ

14. Handin instructions
- Handin the tokens (traffic token and final token) as two separate text files named traffic_token.txt and final_token.txt. This should be inside the phase3_deliverables directory.
- Handin your performance score and total instance hours in a separate text file named performance.txt inside the phase3_deliverables directory.
- Your code should be inside a separate directory named "code" inside the phase3_deliverables directory.
- The report should be a pdf file named report.pdf inside the phase3_deliverables directory.
- Add the instance hour graph as an image file inside the phase3_deliverables directory.
The phase3_deliverables directory should be uploaded inside the s3 bucket that you used for the previous 2 phases.
Do not forget to give "full control" permission to phase3_deliverables folder and all the files inside it to advancedcloudcomputing@gmail.com.

15. How does the load generator work?
The load generator has a set of threads running, each of which sends a request to your web server waits for a response and sends another request. When the load increases, essentially more threads are created and each of them generates a request and waits for the response. This is why when your web server (or ELB) is slow, it gets only as many requests from the load balancer.

16. Hints on auto scaling
1. You can use load generator view-logs (the file which shows your performance score every minute) to detect spikes and set desired number of instances
2. Do not rely only on EC2 CPUUtil only. You can publish custom metrics to CloudWatch. You can also set more than one alarm.
3. Your alarms can change during the run (as long as it's in code)
4. Change in traffic is sharp. Do not scale one instance at a time
5. Cooldown should be set in such a way that you do not use an instance for just greater than 5 minutes

Last updated: 2014-10-08 23:27:25 -0400 [validate xhtml]