1. Preface
The goal of these notes is to introduce the reader to the following:

- the basic concepts of parallel computing,

- some basic parallel algorithm design principles and techniques,

- real-world performance and efficiency concerns in writing parallel software and techniques for dealing with them, and

- parallel programming in C++.
For parallel programming in C++, we use a library, called PASL, that we have been developing over the past 5 years. The implementation of the library uses advanced scheduling techniques to run parallel programs efficiently on modern multicores and provides a range of utilities for understanding the behavior of parallel programs.
PASL stands for Parallel Algorithm Scheduling Library. It also sounds a bit like the French phrase "pas seul" (pronounced "pasole"), meaning "not alone".
We expect the material in this book, together with PASL, to allow the reader to write performant parallel programs at a relatively high level (essentially at the same level as sequential C++ code) without having to worry too much about lower-level details such as machine-specific optimizations, which might otherwise be necessary.
All code associated with this book can be found at the Github repository linked by the following URL:
This codebase includes the examples presented in the book; see the file minicourse/examples.hpp.
Some of the material in this book is based on the course 15-210, co-taught with Guy Blelloch at CMU.
This book does not focus on the design and analysis of parallel algorithms. The interested reader can find more details on this topic in this book.
2. C++ Background
The material is entirely based on C++ and a library for writing parallel programs in C++. We use recent features of C++ such as closures or lambda expressions and templates. A deep understanding of these topics is not necessary to follow the course notes, because we explain them at a high level as we go, but such prior knowledge might be helpful; some pointers are provided below.
2.1. Template metaprogramming
Templates are C++'s way of providing for parametric polymorphism, which allows using the same code at multiple types. For example, in modern functional languages such as the ML family or Haskell, you can write a function $\lambda~x.x$ as an identity function that returns its argument for any type of $x$. You don’t have to write the function at every type at which you plan to apply it. Since functional languages such as ML and Haskell rely on type inference and have powerful type systems, they can infer from your code the most general type (within the constraints of the type system). For example, the function $\lambda~x.x$ can be given the type $\forall \alpha. \alpha \rightarrow \alpha$. This type says that the function works for any type $\alpha$ and, given an argument of type $\alpha$, it returns a value of type $\alpha$.
C++ provides for polymorphism with templates. In its most basic form, a template is a class declaration or a function declaration, which is explicitly stated to be polymorphic, by making explicit the type variable. Since C++ does not in general perform type inference (in a rigorous sense of the word), it requires some help from the programmer.
For example, the code below defines an array class that is parametric in the type of its elements. The declaration template <class T> says that the declaration of class array, which follows, is parameterized by the identifier T. The definition of class array then uses T as a type variable. For example, the array defines a pointer to element sequences of type T, and the sub function returns an element of type T.
template <class T>
class array {
public:
  array (int size) { a = new T[size]; }
  T sub (int i) { return a[i]; }
private:
  T* a;
};
Note that the only part of the syntax template <class T> that is changeable is the identifier T. In other words, you should think of the syntax template <class T> as a binding form that allows you to pick an identifier (in this case T). You might ask why the type identifier/variable T is a class. This is a good question. The authors find it most helpful not to think much about such questions, especially in the context of the C++ language.

Once defined, a template class can be instantiated with different type variables by using the < > syntax. For example, we can define different arrays such as the following.
array<int> myFavoriteNumbers(7);
array<char*> myFavoriteNames(7);
Again, since C++ does not perform type inference for class instances, the C++ compiler expects the programmer to eliminate parametricity explicitly by specifying the argument type.
It is also possible to define polymorphic or generic functions. For example, the following declaration defines a generic identity function.
template <class T> T identity(T x) { return x;}
Once defined, this function can be used without explicitly specializing it at various types. In contrast to templated classes, C++ does provide some type inference for calls to templated functions. So generic functions can be specialized implicitly, as shown in the examples below.
int i = identity(3);
const char* s = identity("template programming can be ugly");
This brief summary of templates should suffice for the purposes of the material covered in this book. Templates are covered in significant detail by many books, blogs, and discussion boards. We refer the interested reader to those sources for further information.
2.2. Lambda expressions
The C++11 reference provides good documentation on lambda expressions.
3. Chapter: Fork-join parallelism
Fork-join parallelism, a fundamental model in parallel computing, dates back to 1963 and has since been widely used. In fork-join parallelism, computations create opportunities for parallelism by branching at certain points that are specified by annotations in the program text.
Each branching point forks the control flow of the computation into two or more logical threads. When control reaches the branching point, the branches start running. When all branches complete, the control joins back to unify the flows from the branches. Results computed by the branches are typically read from memory and merged at the join point. Parallel regions can fork and join recursively in the same manner that divide-and-conquer programs split and join recursively. In this sense, fork-join is the divide and conquer of parallel computing.
As we will see, it is often possible to extend an existing language with support for fork-join parallelism by providing libraries or compiler extensions that support a few simple primitives. Such extensions to a language make it easy to derive a sequential program from a parallel program by syntactically substituting the parallelism annotations with corresponding serial annotations. This in turn enables reasoning about the semantics or the meaning of parallel programs by essentially "ignoring" parallelism.
PASL is a C++ library that enables writing implicitly parallel programs. In PASL, fork-join is expressed by application of the fork2() function. The function expects two arguments: one for each of the two branches. Each branch is specified by a C++ lambda expression.

In the sample code below, the first branch writes the value 1 into the cell b1 and the second writes 2 into b2; at the join point, the sum of the contents of b1 and b2 is written into the cell j.
long b1 = 0;
long b2 = 0;
long j = 0;
fork2([&] {
  // first branch
  b1 = 1;
}, [&] {
  // second branch
  b2 = 2;
});
// join point
j = b1 + b2;
std::cout << "b1 = " << b1 << "; b2 = " << b2 << "; ";
std::cout << "j = " << j << ";" << std::endl;
Output:
b1 = 1; b2 = 2; j = 3;
When this code runs, the two branches of the fork-join are both run to completion. The branches may or may not run in parallel (i.e., on different cores). In general, whether any two such branches run in parallel is decided by the PASL runtime system. The join point is scheduled to run by the PASL runtime only after both branches complete. Before both branches complete, the join point is effectively blocked. Later, we will explain in more detail the scheduling algorithms that PASL uses to handle such load balancing and synchronization duties.
In fork-join programs, a thread is a sequence of instructions that does not contain calls to fork2(). A thread is essentially a piece of sequential computation. The two branches passed to fork2() in the example above correspond to two independent threads. Moreover, the statement following the join point (i.e., the continuation) is also a thread.
Note

If the syntax in the code above is unfamiliar, it might be a good idea to review the notes on lambda expressions in C++11. In a nutshell, the two branches of fork2() are provided as lambda expressions where all free variables are passed by reference.
Note

Fork-join of arbitrary arity is readily derived by repeated application of binary fork-join; in this sense, binary fork-join is universal, because it is powerful enough to express fork-join of arbitrary arity.
All writes performed by the branches of the binary fork-join are guaranteed by the PASL runtime to be committed to memory before the join statement runs. In terms of our code snippet, all writes performed by the two branches of fork2 are committed to memory before the join point is scheduled. The PASL runtime guarantees this property by using a local barrier. Such barriers are efficient, because they involve just a single dynamic synchronization point between at most two processors.
In the example below, both writes into b1 and b2 are guaranteed to be performed before the print statement.
long b1 = 0;
long b2 = 0;
fork2([&] {
  b1 = 1;
}, [&] {
  b2 = 2;
});
std::cout << "b1 = " << b1 << "; b2 = " << b2 << std::endl;
Output:
b1 = 1; b2 = 2
PASL provides no guarantee on the visibility of writes between any two parallel branches. In the code just above, for example, writes performed by the first branch (e.g., the write to b1) may or may not be visible to the second, and vice versa.
3.1. Parallel Fibonacci
Now we have all the tools we need to describe our first parallel code: the recursive Fibonacci function. Although impractical as a program because of its inefficiency, this example is the "hello world" program of parallel computing.
Recall that the $n^{th}$ Fibonacci number is defined by the recurrence relation

$F(n) = F(n-1) + F(n-2)$

with base cases

$F(0) = 0, \quad F(1) = 1.$
Let us start by considering a sequential algorithm. Following the definition of Fibonacci numbers, we can write the code for (inefficiently) computing the $n^{th}$ Fibonacci number as follows. This function is inefficient because the algorithm takes exponential time, whereas there exist dynamic programming solutions that take linear time.
long fib_seq (long n) {
  long result;
  if (n < 2) {
    result = n;
  } else {
    long a, b;
    a = fib_seq(n-1);
    b = fib_seq(n-2);
    result = a + b;
  }
  return result;
}
To write a parallel version, we remark that the two recursive calls are completely independent: they do not depend on each other (neither uses a piece of data generated or written by the other). We can therefore perform the recursive calls in parallel. In general, any two independent functions can be run in parallel. To indicate that two functions can be run in parallel, we use fork2().
long fib_par(long n) {
  long result;
  if (n < 2) {
    result = n;
  } else {
    long a, b;
    fork2([&] {
      a = fib_par(n-1);
    }, [&] {
      b = fib_par(n-2);
    });
    result = a + b;
  }
  return result;
}
3.2. Incrementing an array, in parallel
Suppose that we wish to map an array to another by incrementing each element by one. We can write the code for a function map_incr that performs this computation serially.
void map_incr(const long* source, long* dest, long n) {
  for (long i = 0; i < n; i++)
    dest[i] = source[i] + 1;
}
The code below illustrates an example use of map_incr.
const long n = 4;
long xs[n] = { 1, 2, 3, 4 };
long ys[n];
map_incr(xs, ys, n);
for (long i = 0; i < n; i++)
  std::cout << ys[i] << " ";
std::cout << std::endl;
Output:
2 3 4 5
This is not a parallel algorithm, but it is not difficult to give a parallel algorithm for incrementing an array. The code for such an algorithm is given below.
void map_incr_rec(const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  if (n == 0) {
    // do nothing
  } else if (n == 1) {
    dest[lo] = source[lo] + 1;
  } else {
    long mid = (lo + hi) / 2;
    fork2([&] {
      map_incr_rec(source, dest, lo, mid);
    }, [&] {
      map_incr_rec(source, dest, mid, hi);
    });
  }
}
It is easy to see that this algorithm has $O(n)$ work and $O(\log{n})$ span.
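These bounds follow from the usual divide-and-conquer recurrences; as a sketch, writing $W$ for work and $S$ for span:

```latex
% Work: both halves must be counted, plus constant work at the fork.
W(n) = W(\lceil n/2 \rceil) + W(\lfloor n/2 \rfloor) + O(1)
     \implies W(n) = O(n)
% Span: the halves run in parallel, so only the deeper branch counts.
S(n) = \max\{\, S(\lceil n/2 \rceil),\; S(\lfloor n/2 \rfloor) \,\} + O(1)
     \implies S(n) = O(\log n)
```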
3.3. The sequential elision
In the Fibonacci example, we started with a sequential algorithm and derived a parallel algorithm by annotating independent functions. It is also possible to go the other way and derive a sequential algorithm from a parallel one. As you have probably guessed, this direction is easier, because all we have to do is remove the calls to the fork2 function. The sequential elision of our parallel Fibonacci code can be written by replacing the call to fork2() with a statement that performs the two calls (the arguments of fork2()) sequentially, as follows.
long fib_par(long n) {
  long result;
  if (n < 2) {
    result = n;
  } else {
    long a, b;
    ([&] {
      a = fib_par(n-1);
    })();
    ([&] {
      b = fib_par(n-2);
    })();
    result = a + b;
  }
  return result;
}
Note

Although this code is slightly different from the sequential version that we wrote, it is not too far off: the only difference is the creation and application of the lambda expressions. An optimizing compiler for C++ can easily "inline" such computations. Indeed, after an optimizing compiler applies certain optimizations, the performance of this code is the same as that of fib_seq.
The sequential elision is often useful for debugging and for optimization. It is useful for debugging because it is usually easier to find bugs in sequential runs of parallel code than in parallel runs of the same code. It is useful in optimization because the sequentialized code helps us to isolate the purely algorithmic overheads that are introduced by parallelism. By isolating these costs, we can more effectively pinpoint inefficiencies in our code.
3.4. Executing fork-join algorithms
We defined fork-join programs as a subclass of multithreaded programs. Let’s see more precisely how we can "map" a fork-join program to a multithreaded program. As our running example, let’s use map_incr_rec, whose code is reproduced below.
void map_incr_rec(const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  if (n == 0) {
    // do nothing
  } else if (n == 1) {
    dest[lo] = source[lo] + 1;
  } else {
    long mid = (lo + hi) / 2;
    fork2([&] {
      map_incr_rec(source, dest, lo, mid);
    }, [&] {
      map_incr_rec(source, dest, mid, hi);
    });
  }
}
Since a fork-join program does not explicitly manipulate threads, it is not immediately clear what a thread refers to. To define threads, we can partition a fork-join computation into pieces of serial computation, each of which constitutes a thread. What we mean by a serial computation is a computation that runs serially and does not involve any synchronization with other threads except at the start and at the end. More specifically, for fork-join programs, we can define a piece of serial computation to be a thread if it executes without performing parallel operations (fork2) except perhaps as its last action. When partitioning the computation into threads, it is important for threads to be maximal; technically, a thread can be as small as a single instruction.
[Figure: the dag for an execution of map_incr_rec. Each thread is an execution of map_incr_rec up to the fork2, or the continuation of the fork2, which is empty, and is annotated with the interval of the input array that it operates on (its argument).]

The figure above illustrates the dag for an execution of map_incr_rec. We partition each invocation of this function into two threads, labeled "M" and "C" respectively. The thread labeled $M\lbrack i,j \rbrack$ corresponds to the part of the invocation of map_incr_rec with arguments lo and hi set to $i$ and $j$ respectively; this first part covers the execution up to and including the fork2, or all of the function if this is a base case. The second thread corresponds to the "continuation" of the fork2, which in this case includes no computation.
Based on this dag, we can create another dag, where each thread is replaced by the sequence of instructions that it represents. This would give us a picture similar to the dag we drew before for general multithreaded programs. Such a dag representation, where we represent each instruction by a vertex, gives us a direct way to calculate the work and span of the computation. If we want to calculate work and span on the dag of threads, we can label each vertex with a weight that corresponds to the number of instructions in that thread.
Note

The term thread is very much overused in computer science. There are system threads, which are threads that are known to the operating system and which can perform a variety of operations. For example, the Pthreads library enables creating such system threads and programming with them. There are also many libraries for programming with user-level threads, which are threads that exist at the application level but that the operating system does not know about. Then there are threads that are much more specific, such as those that we have defined for fork-join programs. Such threads can be mapped to system or user-level threads, but since they are more specific, they are usually implemented in a custom fashion, usually in the user/application space. For this reason, some authors prefer to use a different term for such threads, e.g., spark, strand, task.
Let’s observe a few properties of fork-join computations and their dags.

- The computation dag of a fork-join program applied to an input unfolds dynamically as the program executes. For example, when we run map_incr_rec with an input with $n$ elements, the dag initially contains just the root vertex (thread) corresponding to the first call to map_incr_rec, but it grows as the execution proceeds.

- An execution of a fork-join program can generate a massive number of threads. For example, our map_incr_rec function generates approximately $4n$ threads for an input with $n$ elements.

- The work/span of each thread can vary from a small amount to a very large amount, depending on the algorithm. In our example, each thread either performs a conditional, sometimes an addition, and a fork operation, or performs no actual computation (continuation threads).
Suppose now we are given a computation dag and we wish to execute the dag by mapping each thread to one of the $P$ processors that are available on the hardware. To do this, we can use an online scheduling algorithm.
The following is a schedule for the dag shown in this Figure assuming that each thread takes unit time.
Time Step | Processor 1 | Processor 2
1         | M [0,8)     | _
2         | M [0,4)     | M [4,8)
3         | M [0,2)     | M [4,6)
4         | M [0,1)     | M [4,5)
5         | M [1,2)     | M [5,6)
6         | C [0,2)     | C [4,6)
7         | M [2,4)     | M [6,8)
8         | M [2,3)     | M [6,7)
9         | M [3,4)     | M [7,8)
10        | C [2,4)     | C [6,8)
11        | C [0,4)     | C [4,8)
12        | C [0,8)     | _
4. Race Conditions
A race condition is any behavior in a program that is determined by some feature of the system that cannot be controlled by the program, such as the timing of the execution of instructions. In PASL, a race condition can occur whenever two parallel branches access the same location in memory and at least one of the two branches performs a write. When this situation occurs, the behavior of the program may be undefined, leading to (often) buggy behavior. We used the word "may" here because certain programs use race conditions in a careful way as a means to improve performance. Special attention to race conditions is crucial because they introduce an especially pernicious form of error that can confound beginners and experts alike.
Race conditions can be highly problematic because, owing to their nondeterministic behavior, they are often extremely hard to debug. To make matters worse, it is also quite easy to create race conditions without even knowing it.
In the code below, both branches of fork2 are writing into b. What should the output of this program be?
long b = 0;
fork2([&] {
  b = 1;
}, [&] {
  b = 2;
});
std::cout << "b = " << b << std::endl;
At the time of the print, the contents of b is determined by the last write. Thus, depending on which of the two branches performs the write last, we can see both possibilities:
Output:
b = 1
Output:
b = 2
Consider the following alternative implementation of the Fibonacci function. By "inlining" the plus operation in both branches, the programmer got rid of the addition operation after the fork2.
long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      result += fib_par_racy(n-1);
    }, [&] {
      result += fib_par_racy(n-2);
    });
  }
  return result;
}
This code is not correct because it has a race condition.
As the example shows, separate threads update the value of result. It might look like this is not a race condition because the update consists of an addition operation, but the addition reads the value of result and then writes to result. The race condition might be easier to see if we expand out the applications of the += operator.
long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      long a1 = fib_par_racy(n-1);
      long a2 = result;
      result = a1 + a2;
    }, [&] {
      long b1 = fib_par_racy(n-2);
      long b2 = result;
      result = b1 + b2;
    });
  }
  return result;
}
When written in this way, it is clear that these two parallel threads are not independent: they both read result and write to result. Thus the outcome depends on the order in which these reads and writes are performed, as shown in the next example.
The following table takes us through one possible execution trace of the call fib_par_racy(2). The number to the left of each instruction describes the time at which the instruction is executed. Note that since this is a parallel program, multiple instructions can be executed at the same time. The particular execution that we have in this example gives us a bogus result: the result is 0, not 1 as it should be.
Time step | Thread 1             | Thread 2
1         | a1 = fib_par_racy(1) | b1 = fib_par_racy(0)
2         | a2 = result          | b2 = result
3         | result = a1 + a2     | _
4         | _                    | result = b1 + b2
The reason we get a bogus result is that both threads read the initial value of result at the same time and thus do not see each other's write. In this example, the second thread "wins the race" and writes into result. The value 1 written by the first thread is effectively lost by being overwritten by the second thread.
4.1. Synchronization Hardware
Since mutual exclusion is a common problem in computer science, many hardware systems provide specific synchronization operations that can help solve instances of the problem. These operations may allow, for example, testing the contents of a (machine) word and then modifying it, perhaps by swapping it with another word. Such operations are sometimes called atomic read-modify-write operations, or RMW for short.
A handful of different RMW operations have been proposed. They include operations such as load-link/store-conditional, fetch-and-add, and compare-and-swap. They typically take a memory location x and a value v, and replace the value stored at x with f(x,v). For example, the fetch-and-add operation takes the location x and an increment amount, and atomically increments the value at that location by the specified amount, i.e., f(x,v) = *x + v.
The compare-and-swap operation takes the location x and a pair of values (a,b) as its second argument, and stores b into x if the value in x is a, i.e., f(x,(a,b)) = if *x = a then b else *x; the operation returns a boolean indicating whether it successfully stored a new value in x. Compare-and-swap is a reasonably powerful synchronization operation: it can be used by arbitrarily many threads to agree (reach consensus) on a value. This instruction is therefore frequently provided by modern parallel architectures such as Intel’s x86.
In C++, the atomic class can be used to perform synchronization. Objects of this type are guaranteed to be free of race conditions; in fact, in C++, they are the only objects that are guaranteed to be free of race conditions. The contents of an atomic object can be accessed by load operations, updated by store operations, and also updated by compare_exchange_weak and compare_exchange_strong operations, the latter of which implement the compare-and-swap operation.
Access to the contents of any given cell is achieved by the load() and store() methods.
std::atomic<bool> flag;
flag.store(false);
std::cout << flag.load() << std::endl;
flag.store(true);
std::cout << flag.load() << std::endl;
Output:
0
1
The key operation that helps with race conditions is the compare-and-exchange operation.
std::atomic<bool> flag;
flag.store(false);
bool expected = false;
bool was_successful = flag.compare_exchange_strong(expected, true);
std::cout << "was_successful = " << was_successful
          << "; flag = " << flag.load() << std::endl;
bool expected2 = false;
bool was_successful2 = flag.compare_exchange_strong(expected2, true);
std::cout << "was_successful2 = " << was_successful2
          << "; flag = " << flag.load() << std::endl;
Output:
was_successful = 1; flag = 1
was_successful2 = 0; flag = 1
As another example use of the atomic class, recall our Fibonacci example with the race condition. In that example, the race condition arises because of concurrent writes to the result variable. We can eliminate this kind of race condition by using different memory locations, or by using an atomic class and a compare_exchange_strong operation.
The following implementation of Fibonacci is not safe because the variable result is shared and updated by multiple threads.
long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      result += fib_par_racy(n-1);
    }, [&] {
      result += fib_par_racy(n-2);
    });
  }
  return result;
}
We can solve this problem by declaring result to be an atomic type and using a standard busy-waiting protocol based on compare-and-swap.
long fib_par_atomic(long n) {
  std::atomic<long> result(0);
  if (n < 2) {
    result.store(n);
  } else {
    fork2([&] {
      long r = fib_par_atomic(n-1);
      // Atomically update result.
      while (true) {
        long exp = result.load();
        bool flag = result.compare_exchange_strong(exp, exp+r);
        if (flag) { break; }
      }
    }, [&] {
      long r = fib_par_atomic(n-2);
      // Atomically update result.
      while (true) {
        long exp = result.load();
        bool flag = result.compare_exchange_strong(exp, exp+r);
        if (flag) { break; }
      }
    });
  }
  return result.load();
}
The idea behind the solution is to load the current value of result and atomically update result only if it has not been modified (by another thread) since it was loaded. This guarantees that result is always updated (read and modified) correctly without missing an update from another thread.
The example above illustrates a typical use of the compareandswap operation. In this particular example, we can probably prove our code is correct. But this is not always as easy due to a problem called the "ABA problem."
4.2. ABA problem
While reasonably powerful, compare-and-swap suffers from the so-called ABA problem. To see this, consider the following scenario where a shared variable result is updated by multiple threads in parallel: a thread, say $T$, reads result and stores its current value, say 2, in current. In the meantime, some other thread also reads result and performs some operations on it, setting it back to 2 after it is done. Now thread $T$ takes its turn again and attempts to store a new value into result by using 2 as the old value, and it succeeds, because the value stored in result appears not to have changed. The trouble is that the value has actually changed and has been changed back to the value that it used to be. Compare-and-swap was not able to detect this change because it relies only on a simple, shallow notion of equality. If, for example, the value stored in result were a pointer, the fact that the pointer remains the same does not mean that the values accessible from the pointer have not been modified; if, for example, the pointer led to a tree structure, an update deep in the tree could leave the pointer unchanged, even though the tree has changed.
This problem is called the ABA problem because it involves cycling the atomic memory between the three values $A$, $B$, and again $A$. The ABA problem is an important limitation of compare-and-swap: the load followed by the compare-and-swap is not atomic as a whole, but it can behave as if it were atomic when the equality test on the subject memory cell suffices for correctness.
In the Fibonacci example above, the ABA problem may happen (if the counter is incremented and decremented again between a load and the compare-and-swap), but it is impossible to observe because it is harmless. If, however, the compare-and-swap were on a memory object with references, the ABA problem could have observable effects.
The ABA problem can be exploited to give seemingly correct implementations that are in fact incorrect. To reduce the chances of bugs due to the ABA problem, memory objects subject to compare-and-swap are usually tagged with an additional field that counts the number of updates. This solves the basic problem, but only up to a point, because the counter itself can also wrap around. The load-link/store-conditional operation solves this problem by performing the write only if the memory location has not been updated since the last read (load), but its practical implementations are hard to come by.
5. Chapter: Executing parallel algorithms
Implicit parallelism allows writing parallel programs at a high level of abstraction. In this section, we discuss techniques for executing such parallel programs on shared-memory hardware such as multicore computers. As our running example, we use map_incr_rec, whose code is reproduced below.
void map_incr_rec(const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  if (n == 0) {
    // do nothing
  } else if (n == 1) {
    dest[lo] = source[lo] + 1;
  } else {
    long mid = (lo + hi) / 2;
    fork2([&] {
      map_incr_rec(source, dest, lo, mid);
    }, [&] {
      map_incr_rec(source, dest, mid, hi);
    });
  }
}
The basic idea is to partition a computation, that is, a run of a parallel algorithm on a specified input, into pieces of serial computation, called threads, and to map them to available processors while observing the dependencies between them. The task of mapping the threads to available processors is called thread scheduling, or simply scheduling.
We call a piece of serial computation a thread if it executes without performing parallel operations (fork2) except perhaps as its last action. The term thread is short for user-level thread (as opposed to an operating-system thread). When partitioning the computation into threads, it is important for threads to be maximal, to minimize scheduling overhead (technically, a thread can be as small as a single instruction).
When scheduling a parallel computation, it is important that we don’t alter the intended meaning of the computation. Specifically, if a thread depends on another thread, because, for example, it reads a piece of data generated by the latter, it cannot be executed before the thread that it depends on. We can conservatively approximate such dependencies by observing the fork2 expressions and organizing dependencies consistently with them. More specifically, we can represent a computation as a graph where each vertex represents a thread and each edge represents a dependency. Vertices and edges are created by execution of fork2. Each fork2 creates two threads (vertices) corresponding to the two branches and inserts an edge between each branch and the thread that performs the fork2; in addition, each fork2 creates a join or continuation thread (vertex) that depends on the two branches. Since such a graph cannot contain cycles, it is a Directed Acyclic Graph (DAG).
Figure [fig:parallelincdag] illustrates the DAG for an execution of map_incr_rec. We partition each invocation of this function into two threads, labeled "M" and "C" respectively, excluding the continuation of fork2, which is empty; each vertex is annotated with the interval of the input array that it operates on (its argument). The thread labeled $M \lbrack i,j \rbrack$ corresponds to the part of the invocation of map_incr_rec with arguments lo and hi set to $i$ and $j$ respectively; this first part corresponds to the part of the execution up to and including the fork2, or to all of the function if this is a base case. The thread labeled $C \lbrack i,j \rbrack$ corresponds to the "continuation" of the fork2, which in this case includes no computation.
There is an important connection between computation DAGs and work and span. Suppose that we assign to each vertex a weight of at least $1$ that corresponds to the work of that vertex (since threads are serial, the work and span of a vertex are the same). We can then calculate the total weight of the DAG by summing the weights of all vertices, and its total depth by taking the heaviest directed path. The total weight of the DAG corresponds to the work of the computation, and its depth corresponds to the span. In our example, each vertex has weight $O(1)$. Thus, for an array with $n$ elements, the total weight (work) is $O(n)$ and the depth (span) is $O(\log{n})$.
Having talked about DAGs, we are now ready to talk about how to map parallel computations to actual hardware so as to minimize their run time, i.e., scheduling. But before we move on to scheduling, let us observe a few properties of implicitly parallel computations.

- The computation DAG of a parallel algorithm applied to an input unfolds dynamically as the algorithm executes. For example, when we run map_incr_rec on an input with $n$ elements, the DAG initially contains just the root vertex (thread) corresponding to the first call to map_incr_rec, but it grows as the execution proceeds.

- An execution of a parallel algorithm can generate a massive number of threads. For example, our map_incr_rec function generates approximately $4n$ threads for an input with $n$ elements.

- The work and span of each thread can vary from a small amount to a very large amount, depending on the algorithm. In our example, each thread either performs a conditional, sometimes an addition, and a fork operation, or performs no actual computation at all (continuation threads).
Suppose now we are given a computation DAG and we wish to execute the DAG by mapping each thread to one of the $P$ processors available on the hardware. When a thread is mapped to a processor, it will be executed, requiring time proportional to the work (weight) of the thread. Once the processor completes the thread, we can map another thread to that processor, so that the processor does not idle unnecessarily. As we map threads to processors, it is important that we observe the dependencies between threads, i.e., that we do not execute a thread before its parents have completed.
The goal of scheduling is to minimize critical resources such as time. Computing the shortest schedule for a DAG turns out to be highly nontrivial; in fact, the related decision problem is NP-complete. It is possible, however, to give a good approximation algorithm for the offline version of the problem that generates a 2-factor approximation. Such an approximation yields a schedule for a given DAG within a factor of two of the shortest schedule. In the online version of the problem, where the DAG unfolds as the computation executes, we do not know the DAG a priori, and we have to account for scheduling costs such as migrating threads between processors and finding work. To execute parallel programs, we need a solution to this online version of the problem.
An online scheduler, or simply a scheduler, is an algorithm that performs scheduling by mapping threads to available processors. For example, if only one processor is available, a scheduler can map all threads to that one processor. If two processors are available, then the scheduler can divide the threads between the two processors as evenly as possible, in an attempt to keep both processors as busy as possible by load balancing.
The following is a valid schedule for the DAG shown in the figure, assuming that each thread takes unit time (an underscore marks an idle processor).
| Time Step | Processor 1 | Processor 2 |
|-----------|-------------|-------------|
| 1         | M [0,8)     | _           |
| 2         | M [0,4)     | M [4,8)     |
| 3         | M [0,2)     | M [4,6)     |
| 4         | M [0,1)     | M [4,5)     |
| 5         | M [1,2)     | M [5,6)     |
| 6         | C [0,2)     | C [4,6)     |
| 7         | M [2,4)     | M [6,8)     |
| 8         | M [2,3)     | M [6,7)     |
| 9         | M [3,4)     | M [7,8)     |
| 10        | C [2,4)     | C [6,8)     |
| 11        | C [0,4)     | C [4,8)     |
| 12        | C [0,8)     | _           |
We say that a scheduler is greedy if, whenever there is a processor available and a thread ready to be executed, then the scheduler assigns the thread to the processor and starts running the thread immediately. Greedy schedulers have a nice property that is summarized by the following theorem.
Theorem (Greedy Scheduling). A greedy scheduler executes a computation with work $W$ and span $S$ on $P$ processors in time $ T_P \leq \frac{W}{P} + S. $

This simple statement is powerful. To see this, note that the time to execute the computation is at least $\frac{W}{P}$, because we have a total of $W$ work; the best possible execution strategy is to divide it evenly among the processors. Furthermore, the execution time cannot be less than $S$, since $S$ represents the longest chain of sequential dependencies. Thus we have: $ T_P \geq \max\left(\frac{W}{P},S\right). $
This means that a greedy scheduler yields a schedule that is within a factor of two of optimal: $\frac{W}{P} + S$ is never more than twice $\max(\frac{W}{P},S)$. Furthermore, when $\frac{W}{P} \gg S$, the difference between the greedy scheduler and the optimal scheduler is very small. In fact, we can rewrite the bound above in terms of the average parallelism $\mathbb{P} = W/S$ as follows: $ T_P \leq \frac{W}{P} + S = \frac{W}{P} + \frac{W}{\mathbb{P}} = \frac{W}{P}\left(1 + \frac{P}{\mathbb{P}}\right). $
Therefore, as long as $\mathbb{P} \gg P$ (the parallelism is much greater than the number of processors), we obtain near-perfect speedup. (Speedup is $W/T_P$, and perfect speedup would be $P$.)
The quantity $\mathbb{P}$, sometimes called average parallelism, is usually quite large, because it typically grows polynomially with the input size.
We can give a simple greedy scheduler by using a queue of threads. At the start of the execution, the scheduler places the root of the DAG into the queue and then repeats the following step until the queue becomes empty: for each idle processor, take the vertex at the front of the queue and assign it to that processor; let each processor run for one step; if, at the end of the step, there is a vertex in the DAG all of whose parents have completed their execution, insert that vertex at the tail of the queue.