CMU 15-418 (Spring 2012) Final Project Report
Adventures in Monte Carlo: a Dumpling Simulation
Kiraz Baysal and Jen Solyanik
Summary
We implemented an NLP algorithm using MPI on Blacklight with the goal of speeding up the (very slow) computation. We managed to achieve near-linear speedup and complete all of our intended deliverables.
Background
Application:
Machine Learning is focused on building models that predict the correct result. In Natural Language Processing, such models are used to model human languages. The most naive models are built as a combination of events with probability weights (a more likely event is assigned a greater weight, as determined by the number of its occurrences in a given chunk of training data). Thus, when all possible events are evaluated on the model, the most likely one is assigned the greatest probability. This is used to learn patterns in data, which can then be used to predict the event immediately following. So, given the training data of Dr. Seuss' Green Eggs and Ham, a model would predict that the next word in the string "Green eggs and" would be "ham". However, if the correct event did not occur in the training data, its probability in the model is 0, and it can never be predicted in practice. Thus, with the same training data, the model would predict that the next word after "To be or not" should be "like" (due to the many occurrences of "I do not like" in the training data), rather than "to be". Clearly, this leads to very poor results.
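The zero-probability problem above can be seen in a few lines of code. This is our own illustrative sketch in Python (the project's codebase is C++), using a naive count-based bigram model over a tiny made-up corpus:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count word-pair occurrences to build a naive bigram model."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower, or None if the context
    was never seen (probability 0 under the naive model)."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

corpus = [
    "I do not like green eggs and ham",
    "I do not like them Sam I am",
]
model = train_bigrams(corpus)
print(predict_next(model, "and"))  # prints "ham"
print(predict_next(model, "not"))  # prints "like" -- even after "To be or not"
```

Any word pair absent from the training text gets probability 0, which is exactly the failure mode the CRP addresses.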

The Chinese Restaurant Process (CRP) is a stochastic process which models data clusters. It models the probability of random occurrences through a "Chinese Restaurant" scenario, in which there is an infinite number of restaurants with an infinite number of tables (each one, of course, with an infinite number of seats) and infinite patrons. At each time step, a new patron either enters an existing restaurant or a new one. Once there, they sit either at an occupied table, with probability proportional to the number of patrons already seated there, or at a new, unoccupied table, with probability proportional to a concentration parameter. Each table represents a cluster of data points (an event), and the people who sit at it, the corresponding weight.

The CRP avoids the problematic 0-probability weights by always assigning some probability to a new patron sitting down at an empty table, i.e. to a previously unseen event. It can also be used as a clustering algorithm, discovering data-dependent clusters from which a more sophisticated model can be built. Thus, CRPs build "smart" models with better results than the standard count-based models.
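The table-seating rule can be sketched directly. This is our own minimal Python illustration of the standard single-restaurant CRP seating step (the concentration parameter alpha and the function name are our assumptions, not taken from the project's code):

```python
import random

def crp_seat(table_counts, alpha, rng=random.random):
    """Seat one new customer under the CRP: join an occupied table
    with probability proportional to its occupancy, or open a new
    table with probability proportional to alpha."""
    n = sum(table_counts)
    r = rng() * (n + alpha)
    for i, c in enumerate(table_counts):
        if r < c:
            table_counts[i] += 1   # join an existing cluster
            return i
        r -= c
    table_counts.append(1)         # open a new table: a brand-new cluster
    return len(table_counts) - 1

random.seed(0)
tables = []
for _ in range(10):
    crp_seat(tables, alpha=1.0)
print(tables)  # table occupancies; they always sum to 10
```

Because a new table always has probability alpha / (n + alpha) > 0, no event is ever assigned zero probability.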

Algorithm:
The algorithm is split into three distinct sections: loading the data (not a trivial task), running the CRP algorithm, and then evaluating the results using perplexity (a concept from information theory; our goal was to keep the perplexity value constant across all trials).

The data we worked with was larger than 1GB and contained over 6 million lines of sentences for the algorithm to train on. When looking at our initial timings, we saw that much of the slowdown was due to reading the input file. We decided to distribute this data over all processors.

The key data structure used was WordID, which tracks each word's ID along with the input data and the prior probabilities used to calculate subsequent steps. The key operations on WordID were reads and writes, as well as the calculation of the respective probabilities.

The input to the problem was training data, test data, and the number of trials. The output we cared about was the perplexity of the algorithm on the testing data, which told us how well the algorithm performed.
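Perplexity, the output measure mentioned above, has a standard information-theoretic definition: the exponential of the average negative log-probability the model assigns to the test events (lower is better). A small sketch of that standard formula, which may differ in detail from the exact computation in the project's codebase:

```python
import math

def perplexity(probs):
    """Perplexity over test events: exp of the mean negative
    log-probability. A model that spreads probability uniformly
    over V outcomes scores exactly V."""
    assert all(p > 0 for p in probs), "zero probabilities make perplexity blow up"
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Assigning 1/4 to each of four test words gives perplexity ~4.0.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

The assertion also shows why the naive zero-probability models fail this metric outright: a single 0 makes the log diverge.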

The most computationally expensive part of the program was reading and analyzing the training data.
Approach
We used MPI on Blacklight to parallelize the code. We began with the codebase in this repository. The original code used the Boost API for MPI, but for the most part we worked directly with the same standard MPI interface as in previous assignments.

The code we decided to parallelize had three main parts:
  1. Reading input: this is where the training data is read in. For the purposes of the project, this data is very large: the training data we used had over 6 million lines of text and was larger than 1GB.
  2. Analyzing data: this is where the magic happens. Each line of data is sent through the CRP algorithm.
  3. Calculating result accuracy: separate test data is analyzed using the results from step 2, and returns a single number showing the accuracy of step 2.
Steps 2 and 3 make use of a data structure called WordID.

We began by parallelizing step 2, and implemented several different versions of it. We first parallelized over the number of trials (this was done before we had access to the training data, so we focused on that approach first), and then we parallelized over lines of text. Once we found out the size of the training data we would be working with, we knew the latter option would be much more helpful. We confirmed this by running both versions and measuring a much better speedup when parallelizing over the data.

Our next step was to tune the amount of communication in the project. We tried to eliminate communication completely, to communicate many lines of data at a time in batches, and to communicate at every step. When comparing the speedup we obtained against the accuracy of our results (measured by comparing the value returned by the non-parallel version with the value returned by the parallel version), it was clear that no communication was necessary. Essentially, communication did not improve our accuracy while causing a considerable slowdown. Thus, we decided to go with the simplest solution and move on to the next step.

The next step was to parallelize part 1 (reading input). Although we did not expect it to be, this proved to be the most challenging part of our project. We tried many different implementations; we discuss the main ones here. First, we had each processor scan every line of the input and insert a line into the data structure only if that line was assigned to it, using a simple interleaving scheme. We also tried to implement a blocking scheme, but it quickly became clear that this was not possible without an extra pass to count the lines in the input, giving that job to one processor and then communicating the result. Since this did not speed up our read time, we decided to stick with interleaving.
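The interleaving scheme above amounts to a round-robin filter on line indices. A sketch of the idea in Python (the real project is C++/MPI; `rank` and `nprocs` stand in for the MPI rank and communicator size, and the function name is ours):

```python
def interleave_lines(lines, rank, nprocs):
    """Each processor scans every line but keeps only the lines
    whose index maps to it under round-robin interleaving."""
    return [line for i, line in enumerate(lines) if i % nprocs == rank]

lines = [f"sentence {i}" for i in range(10)]
print(interleave_lines(lines, rank=0, nprocs=4))  # indices 0, 4, 8
print(interleave_lines(lines, rank=1, nprocs=4))  # indices 1, 5, 9
```

Note that every processor still reads the whole file; interleaving only decides which lines each one keeps, which is why I/O remained a bottleneck.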

When running a small profiler (provided by Blacklight), we realized we were suffering considerable I/O blocking which was slowing our code down (~500ms). We then realized that having every processor read every line was not very efficient, and came up with multiple ways to fix this. The first was to run a bash script to split the one large input file into a number of smaller files and then have each processor read its own file. However, this script was very slow. Next, we decided to simulate what the batch script was doing in our own code with MPI. We had one processor look at every line, build a data structure mapping each processor ID to the lines of data destined for that processor, and then send and receive these specific lines using MPI. Each processor then only needed to tokenize the strings it received over MPI. This helped our speedups considerably. However, the I/O blocking was still present, taking up to 90% of the total run time. To hide it, we increased the number of trials by a factor of 100 to really see the effects of our parallelization.
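The single-reader scheme boils down to one rank building a map from each destination rank to its block of lines; in the real code those blocks would then be shipped out with MPI sends and receives, which we omit here. A Python sketch of just the mapping logic (the block-wise partitioning and the function name are our assumptions about the approach, not code from the project):

```python
def map_lines_to_ranks(lines, nprocs):
    """Built by the single reader (rank 0): contiguous blocks of
    lines keyed by destination rank, ready to be sent with MPI."""
    blocks = {r: [] for r in range(nprocs)}
    per_rank = (len(lines) + nprocs - 1) // nprocs  # ceiling division
    for i, line in enumerate(lines):
        blocks[min(i // per_rank, nprocs - 1)].append(line)
    return blocks

blocks = map_lines_to_ranks([f"line {i}" for i in range(10)], nprocs=4)
for r, b in blocks.items():
    print(r, len(b))  # 10 lines over 4 ranks: sizes 3, 3, 3, 1
```

Only rank 0 touches the file, so the per-processor work drops to tokenizing its own received block.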

In the end, we only needed to keep the changes we made to step 1: when reading the test and training data, if we split the text across processors so that each processor gets n/numProcs lines instead of n (n being the total number of lines in the input data), then steps 2 and 3 automatically see a significant speedup, since each processor's share of the data becomes considerably smaller. Any further parallelization attempts would not make much sense and would take away from the accuracy of the program.
Results
For our results, we measured how long each step of the code took in seconds.

We had two main goals, and achieved both of them.

Deliverables:
  1. Software engineering quality: because this is an open source project, our code needs to integrate cleanly into the existing code base and remain usable.
  2. Reasonable models: this will be evaluated by comparing the performance of our distributed implementation against the single-processor implementation we began with.
Table without I/O Block (trials=10):


Table with I/O Block (trials=1000):


Note: because we were measuring whole seconds, the number 1 shown in the calculation step is not necessarily exactly 1 second; it could be anywhere between 1 and 1.5 seconds. This is of course true for all of the numbers in the table, but it only becomes important to note for that instance.

Graph of second table (including all the different parts of the program):


Explanation of graph: the read times do not form a straight line because they are not reliable; they depend on when the program is run and on what else is running alongside it. The numbers change drastically with each run, and the ones we put in the table are averages over several runs.

Our speedup was limited by I/O blocking; however, we were able to hide that limitation by using more trials. It should be noted that the number of trials we used to hide our limitations (1000) is a reasonable number of trials for the purposes of a Monte Carlo simulation.

There is still a lot of room for improvement in eliminating the I/O blocking delays entirely, so that column, too, could achieve perfect speedup.
References
We spent a significant amount of time reading and internalizing the following papers:
List of Work
Kiraz:

Jen: