## 1. Preface

The goal of these notes is to introduce the reader to the following.

1. The basic concepts of parallel computing,

2. Some basic parallel algorithm design principles and techniques,

3. Real-world performance and efficiency concerns in writing parallel software and techniques for dealing with them, and

4. Parallel programming in C++.

For parallel programming in C++, we use a library, called PASL, that we have been developing over the past 5 years. The implementation of the library uses advanced scheduling techniques to run parallel programs efficiently on modern multicores and provides a range of utilities for understanding the behavior of parallel programs.

PASL stands for Parallel Algorithm Scheduling Library. It also sounds a bit like the French phrase "pas seul" (pronounced "pa-sole"), meaning "not alone".

We expect that the material in this book, together with PASL, will allow the reader to write performant parallel programs at a relatively high level (essentially at the same level as sequential C++ code) without having to worry too much about lower-level details such as machine-specific optimizations, which might otherwise be necessary.

All code associated with this book can be found at the Github repository linked by the following URL:

This code base includes the examples presented in the book; see the file minicourse/examples.hpp.

Some of the material in this book is based on the course, 15-210, co-taught with Guy Blelloch at CMU.

This book does not focus on the design and analysis of parallel algorithms. The interested reader can find more details on this topic in this book.

## 2. C++ Background

The material is entirely based on C++ and a library for writing parallel programs in C++. We use recent features of C++ such as lambda expressions (closures) and templates. A deep understanding of these topics is not necessary to follow the course notes, because we explain them at a high level as we go, but such prior knowledge might be helpful; some pointers are provided below.

### 2.1. Template metaprogramming

Templates are C++'s way of providing for parametric polymorphism, which allows using the same code at multiple types. For example, in modern functional languages such as the ML family or Haskell, you can write a function $\lambda~x.x$ as an identity function that returns its argument for any type of $x$. You don’t have to write the function at every type that you plan to apply. Since functional languages such as ML and Haskell rely on type inference and have powerful type systems, they can infer from your code the most general type (within the constraints of the type system). For example, the function $\lambda~x.x$ can be given the type $\forall \alpha. \alpha \rightarrow \alpha$. This type says that the function works for any type $\alpha$ and given an argument of type $\alpha$, it returns a value of type $\alpha$.

C++ provides for polymorphism with templates. In its most basic form, a template is a class or function declaration that is explicitly marked as polymorphic by making the type variable explicit. Since C++ does not in general perform type inference (in a rigorous sense of the word), it requires some help from the programmer.

For example, the following code defines an array class that is parametric in the type of its elements. The declaration template <class T> says that the declaration of class array, which follows, is parameterized by the identifier T. The definition of class array then uses T as a type variable. For example, the array stores a pointer to a sequence of elements of type T, and the sub function returns an element of type T.

template <class T>
class array {
public:
  array (int size) { a = new T[size]; }
  T sub (int i) { return a[i]; }

private:
  T* a;
};

Note that the only part of the syntax template <class T> that is changeable is the identifier T. In other words, you should think of the syntax template <class T> as a binding form that allows you to pick an identifier (in this case T). You might ask why the type identifier/variable T is a class. This is a good question. The authors find it most helpful to not think much about such questions, especially in the context of the C++ language.

Once defined, a template class can be instantiated at different types by using the < > syntax. For example, we can define different arrays such as the following.

array<int> myFavoriteNumbers(7);
array<char*> myFavoriteNames(7);

Again, since C++ does not perform type inference for class instances, the C++ compiler expects the programmer to eliminate parametricity explicitly by specifying the type argument.

It is also possible to define polymorphic or generic functions. For example, the following declaration defines a generic identity function.

template <class T>
T identity(T x) { return x;}

Once defined, this function can be used without explicitly specializing it at various types. In contrast to templated classes, C++ does provide some type inference for calls to templated functions. So generic functions can be specialized implicitly, as shown in the examples below.

int i = identity(3);
const char* s = identity("template programming can be ugly");

This brief summary of templates should suffice for the purposes of the material covered in this book. Templates are covered in significant detail by many books, blogs, and discussion boards. We refer the interested reader to those sources for further information.

### 2.2. Lambda expressions

The C++11 reference provides good documentation on lambda expressions.
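For readers who have not seen them before, the following small example (ours, not taken from PASL) shows the two forms of lambda expression used throughout these notes: a lambda that captures the surrounding local variables by reference with [&], and a lambda that takes an explicit parameter.

#include <iostream>

int main() {
  long x = 10;
  // Captures x by reference: the body reads whatever value x has when called.
  auto f = [&] { return x + 1; };
  // Takes an explicit parameter and captures nothing.
  auto g = [] (long y) { return 2 * y; };
  std::cout << f() << " " << g(21) << std::endl;  // prints "11 42"
  return 0;
}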

## 3. Chapter: Fork-join parallelism

Fork-join parallelism, a fundamental model in parallel computing, dates back to 1963 and has been widely used ever since. In fork-join parallelism, computations create opportunities for parallelism by branching at certain points that are specified by annotations in the program text.

Each branching point forks the control flow of the computation into two or more logical threads. When control reaches the branching point, the branches start running. When all branches complete, the control joins back to unify the flows from the branches. Results computed by the branches are typically read from memory and merged at the join point. Parallel regions can fork and join recursively in the same manner that divide and conquer programs split and join recursively. In this sense, fork join is the divide and conquer of parallel computing.

As we will see, it is often possible to extend an existing language with support for fork-join parallelism by providing libraries or compiler extensions that support a few simple primitives. Such extensions to a language make it easy to derive a sequential program from a parallel program by syntactically substituting the parallelism annotations with corresponding serial annotations. This in turn enables reasoning about the semantics or the meaning of parallel programs by essentially "ignoring" parallelism.

PASL is a C++ library that enables writing implicitly parallel programs. In PASL, fork join is expressed by application of the fork2() function. The function expects two arguments: one for each of the two branches. Each branch is specified by one C++ lambda expression.

Example 1. Fork join

In the sample code below, the first branch writes the value 1 into the cell b1 and the second writes the value 2 into the cell b2; at the join point, the sum of the contents of b1 and b2 is written into the cell j.

long b1 = 0;
long b2 = 0;
long j  = 0;

fork2([&] {
// first branch
b1 = 1;
}, [&] {
// second branch
b2 = 2;
});
// join point
j = b1 + b2;

std::cout << "b1 = " << b1 << "; b2 = " << b2 << "; ";
std::cout << "j = " << j << ";" << std::endl;

Output:

b1 = 1; b2 = 2; j = 3;

When this code runs, the two branches of the fork join are both run to completion. The branches may or may not run in parallel (i.e., on different cores). In general, whether any two such branches run in parallel is decided by the PASL runtime system. The join point is scheduled to run by the PASL runtime only after both branches complete. Before both branches complete, the join point is effectively blocked. Later, we will explain in more detail the scheduling algorithms that PASL uses to handle such load balancing and synchronization duties.

In fork-join programs, a thread is a sequence of instructions that do not contain calls to fork2(). A thread is essentially a piece of sequential computation. The two branches passed to fork2() in the example above correspond, for example, to two independent threads. Moreover, the statement following the join point (i.e., the continuation) is also a thread.

 Note If the syntax in the code above is unfamiliar, it might be a good idea to review the notes on lambda expressions in C++11. In a nutshell, the two branches of fork2() are provided as lambda-expressions where all free variables are passed by reference.
 Note Fork join of arbitrary arity is readily derived by repeated application of binary fork join, as sketched below; in this sense, binary fork join is universal.
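For instance, a three-way fork join can be expressed by nesting two binary fork joins. The helper below is only a sketch of this idea (fork3 is our name, not a PASL primitive); it assumes only the fork2() function described above.

template <class Branch1, class Branch2, class Branch3>
void fork3(const Branch1& branch1, const Branch2& branch2, const Branch3& branch3) {
  fork2([&] {
    branch1();
  }, [&] {
    // The remaining two branches are themselves a binary fork join.
    fork2([&] {
      branch2();
    }, [&] {
      branch3();
    });
  });
}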

The PASL runtime guarantees that all writes performed by the branches of a binary fork join are committed to memory before the join statement runs. In terms of our code snippet, all writes performed by the two branches of fork2 are committed to memory before the join point is scheduled. The PASL runtime guarantees this property by using a local barrier. Such barriers are efficient, because they involve just a single dynamic synchronization point between at most two processors.

Example 2. Writes and the join statement

In the example below, both writes into b1 and b2 are guaranteed to be performed before the print statement.

long b1 = 0;
long b2 = 0;

fork2([&] {
b1 = 1;
}, [&] {
b2 = 2;
});

std::cout << "b1 = " << b1 << "; b2 = " << b2 << std::endl;

Output:

b1 = 1; b2 = 2

PASL provides no guarantee on the visibility of writes between any two parallel branches. In the code just above, for example, writes performed by the first branch (e.g., the write to b1) may or may not be visible to the second, and vice versa.

### 3.1. Parallel Fibonacci

Now we have all the tools we need to describe our first parallel code: the recursive Fibonacci function. Although too inefficient to be useful in practice, this example is the "hello world" program of parallel computing.

Recall that the $n^{th}$ Fibonacci number is defined by the recurrence relation

$$\begin{array}{lcl} F(n) & = & F(n-1) + F(n-2) \end{array}$$

with base cases

$$F(0) = 0, \, F(1) = 1$$

Let us start by considering a sequential algorithm. Following the definition of Fibonacci numbers, we can write the code for (inefficiently) computing the $n^{th}$ Fibonacci number as follows. This function for computing the Fibonacci numbers is inefficient because the algorithm takes exponential time, whereas there exist dynamic programming solutions that take linear time.

long fib_seq (long n) {
  long result;
  if (n < 2) {
    result = n;
  } else {
    long a, b;
    a = fib_seq(n-1);
    b = fib_seq(n-2);
    result = a + b;
  }
  return result;
}

To write a parallel version, we remark that the two recursive calls are completely independent: they do not depend on each other (neither uses a piece of data generated or written by another). We can therefore perform the recursive calls in parallel. In general, any two independent functions can be run in parallel. To indicate that two functions can be run in parallel, we use fork2().

long fib_par(long n) {
  long result;
  if (n < 2) {
    result = n;
  } else {
    long a, b;
    fork2([&] {
      a = fib_par(n-1);
    }, [&] {
      b = fib_par(n-2);
    });
    result = a + b;
  }
  return result;
}

### 3.2. Incrementing an array, in parallel

Suppose that we wish to map an array to another by incrementing each element by one. We can write the code for a function map_incr that performs this computation serially.

void map_incr(const long* source, long* dest, long n) {
  for (long i = 0; i < n; i++)
    dest[i] = source[i] + 1;
}
Example 3. Using map_incr.

The code below illustrates an example use of map_incr.

const long n = 4;
long xs[n] = { 1, 2, 3, 4 };
long ys[n];
map_incr(xs, ys, n);
for (long i = 0; i < n; i++)
std::cout << ys[i] << " ";
std::cout << std::endl;

Output:

2 3 4 5

This sequential loop exposes no parallelism, but it is not difficult to give a parallel algorithm for incrementing an array. The code for such an algorithm is given below.

void map_incr_rec(const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  if (n == 0) {
    // do nothing
  } else if (n == 1) {
    dest[lo] = source[lo] + 1;
  } else {
    long mid = (lo + hi) / 2;
    fork2([&] {
      map_incr_rec(source, dest, lo, mid);
    }, [&] {
      map_incr_rec(source, dest, mid, hi);
    });
  }
}

It is easy to see that this algorithm has $O(n)$ work and $O(\log{n})$ span.
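To see why, note that outside of its two recursive calls, each invocation of map_incr_rec performs only a constant amount of work. Writing $W(n)$ and $S(n)$ for the work and span on a range of $n$ elements, we obtain

$$\begin{array}{lcl} W(n) & = & W(\lceil n/2 \rceil) + W(\lfloor n/2 \rfloor) + O(1) \\ S(n) & = & \max(S(\lceil n/2 \rceil), S(\lfloor n/2 \rfloor)) + O(1) \end{array}$$

because the two recursive calls run in parallel: their work adds up, while only the larger of their spans counts. These recurrences solve to $W(n) = O(n)$ and $S(n) = O(\log{n})$.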

### 3.3. The sequential elision

In the Fibonacci example, we started with a sequential algorithm and derived a parallel algorithm by annotating independent functions. It is also possible to go the other way and derive a sequential algorithm from a parallel one. As you have probably guessed, this direction is easier, because all we have to do is remove the calls to the fork2 function. The sequential elision of our parallel Fibonacci code can be written by replacing the call to fork2() with a statement that performs the two calls (the arguments of fork2()) sequentially, as follows.

long fib_par(long n) {
  long result;
  if (n < 2) {
    result = n;
  } else {
    long a, b;
    ([&] {
      a = fib_par(n-1);
    })();
    ([&] {
      b = fib_par(n-2);
    })();
    result = a + b;
  }
  return result;
}
 Note Although this code is slightly different from the sequential version that we wrote, it is not too far away, because the only difference is the creation and application of the lambda expressions. An optimizing compiler for C++ can easily "inline" such computations. Indeed, after an optimizing compiler applies certain optimizations, the performance of this code is the same as the performance of fib_seq.

The sequential elision is often useful for debugging and for optimization. It is useful for debugging because it is usually easier to find bugs in sequential runs of parallel code than in parallel runs of the same code. It is useful in optimization because the sequentialized code helps us to isolate the purely algorithmic overheads that are introduced by parallelism. By isolating these costs, we can more effectively pinpoint inefficiencies in our code.
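One way to take advantage of this in practice is to select between fork2() and its sequential elision when the program is compiled, so that the same source code can be built either way. The helper below is a hypothetical sketch of this idea (fork2_or_elide and SEQUENTIAL_ELISION are our names, not part of PASL); it assumes only the fork2() function described above.

template <class Branch1, class Branch2>
void fork2_or_elide(const Branch1& branch1, const Branch2& branch2) {
#ifdef SEQUENTIAL_ELISION
  // Sequential elision: run the two branches one after the other.
  branch1();
  branch2();
#else
  // Parallel version: let the PASL runtime decide how to run the branches.
  fork2(branch1, branch2);
#endif
}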

### 3.4. Executing fork-join algorithms

We defined fork-join programs as a subclass of multithreaded programs. Let's see more precisely how we can "map" a fork-join program to a multithreaded program. As our running example, let's use map_incr_rec, whose code is reproduced below.

void map_incr_rec(const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  if (n == 0) {
    // do nothing
  } else if (n == 1) {
    dest[lo] = source[lo] + 1;
  } else {
    long mid = (lo + hi) / 2;
    fork2([&] {
      map_incr_rec(source, dest, lo, mid);
    }, [&] {
      map_incr_rec(source, dest, mid, hi);
    });
  }
}

Since a fork-join program does not explicitly manipulate threads, it is not immediately clear what a thread refers to. To define threads, we can partition a fork-join computation into pieces of serial computation, each of which constitutes a thread. What we mean by a serial computation is a computation that runs serially and also does not involve any synchronization with other threads except at the start and at the end. More specifically, for fork-join programs, we can define a piece of serial computation to be a thread if it executes without performing parallel operations (fork2) except perhaps as its last action. When partitioning the computation into threads, it is important for threads to be maximal; technically a thread can be as small as a single instruction.

A thread is a maximal computation consisting of a sequence of instructions that do not contain calls to fork2() except perhaps at the very end.

Figure 1. Dag for parallel increment on an array of $8$ elements: each vertex corresponds to a call to map_incr_rec (excluding the fork2) or to the continuation of a fork2, which is empty, and is annotated with the interval of the input array that it operates on (its argument).

The Figure above illustrates the dag for an execution of map_incr_rec. We partition each invocation of this function into two threads, labeled "M" and "C" respectively. The thread labeled $M\lbrack i,j \rbrack$ corresponds to the part of the invocation of map_incr_rec with arguments lo and hi set to $i$ and $j$ respectively; this first part covers the execution up to and including the fork2, or all of the function if this is a base case. The second thread corresponds to the "continuation" of the fork2, which in this case includes no computation.

Based on this dag, we can create another dag, where each thread is replaced by the sequence of instructions that it represents. This would give us a picture similar to the dag we drew before for general multithreaded programs. Such a dag representation, where we represent each instruction by a vertex, gives us a direct way to calculate the work and span of the computation. If we want to calculate work and span on the dag of threads, we can label each vertex with a weight that corresponds to the number of instructions in that thread.

Let’s observe a few properties of fork-join computations and their dags.

1. The computation dag of a fork-join program applied to an input unfolds dynamically as the program executes. For example, when we run map_incr_rec on an input with $n$ elements, the dag initially contains just the root vertex (thread) corresponding to the first call to map_incr_rec, but it grows as the execution proceeds.

2. An execution of a fork-join program can generate a massive number of threads. For example, our map_incr_rec function generates approximately $4n$ threads for an input with $n$ elements.

3. The work/span of each thread can vary from a small amount to a very large amount, depending on the algorithm. In our example, each thread either performs a conditional, sometimes an addition, and a fork operation, or performs no actual computation at all (continuation threads).

Suppose now we are given a computation dag and we wish to execute the dag by mapping each thread to one of the $P$ processors available on the hardware. To do this, we can use an online scheduling algorithm.

Example 4. An example 2-processor schedule

The following is a schedule for the dag shown in this Figure assuming that each thread takes unit time.

| Time Step | Processor 1 | Processor 2 |
| --- | --- | --- |
| 1 | M [0,8) | _ |
| 2 | M [0,4) | M [4,8) |
| 3 | M [0,2) | M [4,6) |
| 4 | M [0,1) | M [4,5) |
| 5 | M [1,2) | M [5,6) |
| 6 | C [0,2) | C [4,6) |
| 7 | M [2,4) | M [6,8) |
| 8 | M [2,3) | M [6,7) |
| 9 | M [3,4) | M [7,8) |
| 10 | C [2,4) | C [6,8) |
| 11 | C [0,4) | C [4,8) |
| 12 | C [0,8) | _ |

Exercise: Enabling Tree

Draw the enabling tree for the schedule above.

Example 5. Centralized scheduler illustrated: the state of the queue and the dag after step 4. Completed vertices are drawn in grey (shaded).
Example 6. Distributed scheduler illustrated: the state of the queue and the dag after step 4. Completed vertices are drawn in grey (shaded).

## 4. Race Conditions

A race condition is any behavior in a program that is determined by some feature of the system that cannot be controlled by the program, such as the timing of the execution of instructions. In PASL, a race condition can occur whenever two parallel branches access the same location in memory and at least one of the two branches performs a write. When this situation occurs, the behavior of the program may be undefined, leading to (often) buggy behavior. We used the word "may" here because certain programs use race conditions in a careful way as a means to improve performance. Special attention to race conditions is crucial because they introduce an especially pernicious form of error that can confound beginners and experts alike.

Race conditions can be highly problematic because, owing to their nondeterministic behavior, they are often extremely hard to debug. To make matters worse, it is also quite easy to create race conditions without even knowing it.

Example 7. Writing to the same location in parallel.

In the code below, both branches of fork2 write into b. What should the output of this program be?

long b = 0;

fork2([&] {
b = 1;
}, [&] {
b = 2;
});

std::cout << "b = " << std::endl;

At the time of the print, the contents of b is determined by the last write. Thus, depending on which of the two branches performs its write last, we can see either of the two possibilities:

Output:

b = 1

Output:

b = 2
Example 8. Fibonacci

Consider the following alternative implementation of the Fibonacci function. By "inlining" the plus operation into both branches, the programmer got rid of the addition operation after the fork2.

long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      result += fib_par_racy(n-1);
    }, [&] {
      result += fib_par_racy(n-2);
    });
  }
  return result;
}

This code is not correct because it has a race condition.

As the example shows, separate threads are updating the value result, but it might look like this is not a race condition, because the update consists of an addition operation, which reads the value and then writes to it. The race condition might be easier to see if we expand out the applications of the += operator.

long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      long a1 = fib_par_racy(n-1);
      long a2 = result;
      result = a1 + a2;
    }, [&] {
      long b1 = fib_par_racy(n-2);
      long b2 = result;
      result = b1 + b2;
    });
  }
  return result;
}

When written in this way, it is clear that these two parallel threads are not independent: they both read result and write to result. Thus the outcome depends on the order in which these reads and writes are performed, as shown in the next example.

Example 9. Execution trace of a race condition

The following table takes us through one possible execution trace of the call fib_par_racy(2). The number to the left of each instruction describes the time at which the instruction is executed. Note that since this is a parallel program, multiple instructions can be executed at the same time. The particular execution that we have in this example gives us a bogus result: the result is 0, not 1 as it should be.

| Time step | First branch | Second branch |
| --- | --- | --- |
| 1 | a1 = fib_par_racy(1) | b1 = fib_par_racy(0) |
| 2 | a2 = result | b2 = result |
| 3 | result = a1 + a2 | _ |
| 4 | _ | result = b1 + b2 |

The reason we get a bogus result is that both threads read the initial value of result before either of them writes, and thus they do not see each other's write. In this example, the second thread "wins the race" and writes into result last. The value 1 written by the first thread is effectively lost, overwritten by the second thread.

### 4.1. Synchronization Hardware

Since mutual exclusion is a common problem in computer science, many hardware systems provide specific synchronization operations that can help solve instances of the problem. These operations may allow, for example, testing the contents of a (machine) word then modifying it, perhaps by swapping it with another word. Such operations are sometimes called atomic read-modify-write or RMW, for short, operations.

A handful of different RMW operations have been proposed. They include operations such as load-link/store-conditional, fetch-and-add, and compare-and-swap. They typically take the memory location x and a value v, and replace the value stored at x with f(x,v). For example, the fetch-and-add operation takes the location x and an increment amount, and atomically increments the value at that location by the specified amount, i.e., f(x,v) = *x + v.

The compare-and-swap operation takes the location x and a pair of values (a,b) as its second argument, and stores b into x if the value stored in x is a, i.e., f(x,(a,b)) = if *x = a then b else *x; the operation returns a boolean indicating whether it successfully stored a new value in x. Compare-and-swap is a reasonably powerful synchronization operation: it can be used by arbitrarily many threads to agree (reach consensus) on a value. This instruction is therefore frequently provided by modern parallel architectures such as Intel's X86.

In C++, the atomic class can be used to perform synchronization. Objects of this type are guaranteed to be free of race conditions; in fact, in C++, they are the only objects that are guaranteed to be free from race conditions. The contents of an atomic object can be accessed by load operations, updated by store operations, and also updated by compare_exchange_weak and compare_exchange_strong operations, which implement the compare-and-swap operation.

Example 10. Accessing the contents of atomic memory cells

Access to the contents of any given cell is achieved by the load() and store() methods.

std::atomic<bool> flag;

flag.store(false);
flag.store(true);
std::cout << flag.load() << std::endl;

Output:

0
1

The key operation that helps with race conditions is the compare-and-exchange (compare-and-swap) operation.

Definition: compare and swap

When executed with a target atomic object, an expected value expected, and a new value new, this operation performs the following steps, atomically:

1. Read the contents of target.

2. If the contents equal the contents of expected, then write new into target and return true.

3. Otherwise, copy the contents of target into expected and return false.

Example 11. Reading and writing atomic objects
std::atomic<bool> flag;

flag.store(false);
bool expected = false;
bool was_successful = flag.compare_exchange_strong(expected, true);
std::cout << "was_successful = " << was_successful << "; flag = " << flag.load() << std::endl;
bool expected2 = false;
bool was_successful2 = flag.compare_exchange_strong(expected2, true);
std::cout << "was_successful2 = " << was_successful2 << "; flag = " <<
flag.load() << std::endl;

Output:

was_successful = 1; flag = 1
was_successful2 = 0; flag = 1

As another example use of the atomic class, recall our Fibonacci example with the race condition. In that example, the race condition arises because of concurrent writes to the result variable. We can eliminate this kind of race condition by using different memory locations, or by declaring result as an atomic type and updating it with a compare_exchange_strong operation.

Example 12. Fibonacci

The following implementation of Fibonacci is not safe because the variable result is shared and updated by multiple threads.

long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      result += fib_par_racy(n-1);
    }, [&] {
      result += fib_par_racy(n-2);
    });
  }
  return result;
}

We can solve this problem by declaring result to be an atomic type and using a standard busy-waiting protocol based on compare-and-swap.


long fib_par_atomic(long n) {
  std::atomic<long> result(0);
  if (n < 2) {
    result.store(n);
  } else {
    fork2([&] {
      long r = fib_par_atomic(n-1);
      // Atomically update result.
      while (true) {
        long exp = result.load();
        bool flag = result.compare_exchange_strong(exp, exp+r);
        if (flag) {break;}
      }
    }, [&] {
      long r = fib_par_atomic(n-2);
      // Atomically update result.
      while (true) {
        long exp = result.load();
        bool flag = result.compare_exchange_strong(exp, exp+r);
        if (flag) {break;}
      }
    });
  }
  return result.load();
}

The idea behind the solution is to load the current value of result and atomically update result only if it has not been modified (by another thread) since it was loaded. This guarantees that the result is always updated (read and modified) correctly without missing an update from another thread.
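The busy-waiting loop above implements, in effect, a fetch-and-add. On std::atomic integer types, C++ provides this directly as the fetch_add member function. The following variant is our own sketch for illustration (it assumes the same fork2 primitive and is not the approach used above); it replaces the explicit compare-and-swap loop with fetch_add.

long fib_par_fetch_add(long n) {
  std::atomic<long> result(0);
  if (n < 2) {
    result.store(n);
  } else {
    fork2([&] {
      // fetch_add atomically adds its argument to result.
      result.fetch_add(fib_par_fetch_add(n-1));
    }, [&] {
      result.fetch_add(fib_par_fetch_add(n-2));
    });
  }
  return result.load();
}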

The example above illustrates a typical use of the compare-and-swap operation. In this particular example, we can probably prove our code is correct. But this is not always as easy due to a problem called the "ABA problem."

### 4.2. ABA problem

While reasonably powerful, compare-and-swap suffers from the so-called ABA problem. To see this, consider the following scenario where a shared variable result is updated by multiple threads in parallel: a thread, say $T$, reads result and stores its current value, say 2, in current. In the meantime some other thread also reads result and performs some operations on it, setting it back to 2 after it is done. Now thread $T$ takes its turn again and attempts to store a new value into result by using 2 as the old value, and it succeeds, because the value stored in result appears not to have changed. The trouble is that the value has actually changed and has been changed back to the value that it used to be. Compare-and-swap was not able to detect this change because it relies only on a simple, shallow notion of equality. If, for example, the value stored in result was a pointer, the fact that the pointer remains the same does not mean that the values accessible from the pointer have not been modified; if, for example, the pointer led to a tree structure, an update deep in the tree could leave the pointer unchanged, even though the tree has changed.

This problem is called the ABA problem, because it involves cycling the atomic memory between the values $A$, $B$, and again $A$. The ABA problem is an important limitation of compare-and-swap: the operation checks only for (shallow) equality of values, so an update protocol built on it behaves correctly only when that equality test on the subject memory cell suffices for correctness.

In an example such as the one above, where the atomic memory holds a simple counter, the ABA problem may happen (if the counter were incremented and then decremented back in between a load and a compare-and-swap), but it is impossible to observe because it is harmless. If, however, the compare-and-swap were on a memory object with references, the ABA problem could have observable effects.

The ABA problem can be exploited to give seemingly correct implementations that are in fact incorrect. To reduce the chances of bugs due to the ABA problem, memory objects subject to compare-and-swap are usually tagged with an additional field that counts the number of updates. This solves the basic problem, but only up to a point, because the counter itself can wrap around. The load-link/store-conditional operation solves this problem by performing the write only if the memory location has not been updated since the last read (load), but practical implementations of it are hard to come by.
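The following is a rough sketch of this tagging technique (the names and layout are ours, for illustration only). It packs a 32-bit value and a 32-bit update counter into a single 8-byte object, which we assume the hardware can compare-and-swap atomically.

#include <atomic>
#include <cstdint>

// A value paired with a counter that is incremented on every update, so that
// an A -> B -> A change remains visible to compare-and-swap via the counter.
struct tagged_value {
  uint32_t value;
  uint32_t version;
};

std::atomic<tagged_value> cell(tagged_value{0, 0});

void write_value(uint32_t v) {
  tagged_value expected = cell.load();
  tagged_value desired;
  do {
    desired.value = v;
    desired.version = expected.version + 1;
    // On failure, compare_exchange_strong reloads expected and we retry.
  } while (!cell.compare_exchange_strong(expected, desired));
}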

## 5. Chapter: Executing parallel algorithms

Implicit parallelism allows writing parallel programs at a high level of abstraction. In this section, we discuss techniques for executing such parallel programs on hardware shared-memory computers such as multicore computers. As our running example, we use map_incr_rec, whose code is reproduced below.

void map_incr_rec(const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  if (n == 0) {
    // do nothing
  } else if (n == 1) {
    dest[lo] = source[lo] + 1;
  } else {
    long mid = (lo + hi) / 2;
    fork2([&] {
      map_incr_rec(source, dest, lo, mid);
    }, [&] {
      map_incr_rec(source, dest, mid, hi);
    });
  }
}

The basic idea is to partition a computation, that is, a run of a parallel algorithm on a specified input, into pieces of serial computation, called threads, and map them to available processors while observing the dependencies between them. The task of mapping the threads to available processors is called thread scheduling or simply scheduling.

We call a piece of serial computation a thread if it executes without performing parallel operations (fork2) except perhaps as its last action. The term thread is short for user-level thread (as opposed to an operating-system thread). When partitioning the computation into threads, it is important for threads to be maximal, to minimize scheduling overhead (technically a thread can be as small as a single instruction).

A thread is a maximal computation consisting of a sequence of instructions that do not contain calls to fork2() except perhaps as its last action.

When scheduling a parallel computation, it is important that we don't alter the intended meaning of the computation. Specifically, if a thread depends on another thread, because, for example, it reads a piece of data generated by the latter, it cannot be executed before the thread that it depends on. We can conservatively approximate such dependencies by observing the fork2 expressions and by organizing dependencies consistently with them. More specifically, we can represent a computation as a graph where each vertex represents a thread and each edge represents a dependency. Vertices and edges are created by the execution of fork2. Each fork2 creates two threads (vertices) corresponding to the two branches and inserts an edge between each branch and the thread that performs the fork2; in addition, each fork2 creates a join or continuation thread (vertex) that depends on the two branches. Since such a graph cannot contain cycles, it is a Directed Acyclic Graph (DAG).

Figure 2. DAG for parallel increment on an array of $8$ elements: each vertex corresponds to a call to map_incr_rec (excluding the fork2) or to the continuation of a fork2, which is empty, and is annotated with the interval of the input array that it operates on (its argument).

Figure [fig:parallel-inc-dag] illustrates the DAG for an execution of map_incr_rec. We partition each invocation of this function into two threads, labeled "M" and "C" respectively. The thread labeled $M \lbrack i,j \rbrack$ corresponds to the part of the invocation of map_incr_rec with arguments lo and hi set to $i$ and $j$ respectively; this first part covers the execution up to and including the fork2, or all of the function if this is a base case. The second thread corresponds to the "continuation" of the fork2, which in this case includes no computation.

There is an important connection between computation DAGs and work and span. Suppose that we assign to each vertex a weight of at least $1$ that corresponds to the work of that vertex (since threads are serial, work and span for each vertex are the same). We can then calculate the total weight and total depth of the DAG by summing up weights. The total weight of the DAG corresponds to the work of the computation and the depth corresponds to its span. In our example, each vertex has weight $O(1)$. Thus for an array with $n$ elements, the total weight (work) is $O(n)$ and the depth (span) is $O(\log{n})$.

Having talked about DAGs, we are now ready to talk about how to map parallel computations to actual hardware so as to minimize their run time, i.e., scheduling. But before we move on to scheduling, let us observe a few properties of implicitly parallel computations.

1. The computation DAG of a parallel algorithm applied to an input unfolds dynamically as the algorithm executes. For example, when we run map_incr_rec on an input with $n$ elements, the DAG initially contains just the root vertex (thread) corresponding to the first call to map_incr_rec, but it grows as the execution proceeds.

2. An execution of a parallel algorithm can generate a massive number of threads. For example, our map_incr_rec function generates approximately $4n$ threads for an input with $n$ elements.

3. The work/span of each thread can vary from a small amount to a very large amount, depending on the algorithm. In our example, each thread either performs a conditional, sometimes an addition, and a fork operation, or performs no actual computation at all (continuation threads).

Suppose now we are given a computation DAG and we wish to execute the DAG by mapping each thread to one of the $P$ processors available on the hardware. When a thread is mapped to a processor, it will be executed, requiring time proportional to the work (weight) of the thread. Once the processor completes the thread, we can map another thread to the processor, so that the processor does not idle unnecessarily. As we map threads to processors, it is important that we observe the dependencies between threads, i.e., we don't execute a thread before its parents.

Definition: Scheduling

The (thread) scheduling problem requires assigning to each thread in a given DAG a processor number and a time step such that

1. each thread is assigned to a unique processor for as many consecutive steps as its weight,

2. no thread is executed before its ancestors in the DAG, and

3. no processor is assigned more than one thread at a time.

The goal of scheduling is to minimize critical resources such as time. Computing the shortest schedule for a DAG turns out to be highly nontrivial. In fact, the related decision problem is NP-complete. It is possible, however, to give a good approximation algorithm for the offline version of the problem that produces a 2-factor approximation. Such an approximation yields a schedule for a given DAG within a factor of two of the shortest schedule. In the online version of the problem, where the DAG unfolds as the computation executes, we don't know the DAG a priori and we have to account for the costs of scheduling, such as migrating threads between processors and finding work. To execute parallel programs, we need a solution to this online version of the problem.

An online scheduler or a simply a scheduler is an algorithm that performs scheduling by mapping threads to available processors. For example, if only one processor is available, a scheduler can map all threads to that one processor. If two processors are available, then the scheduler can divide the threads between the two processors as evenly as possible in an attempt to keep the two processors as busy as possible by load balancing.

Example 13. An example 2-processor schedule

The following is a valid schedule for the DAG shown in this Figure assuming that each thread takes unit time.

| Time Step | Processor 1 | Processor 2 |
| --- | --- | --- |
| 1 | M [0,8) | _ |
| 2 | M [0,4) | M [4,8) |
| 3 | M [0,2) | M [4,6) |
| 4 | M [0,1) | M [4,5) |
| 5 | M [1,2) | M [5,6) |
| 6 | C [0,2) | C [4,6) |
| 7 | M [2,4) | M [6,8) |
| 8 | M [2,3) | M [6,7) |
| 9 | M [3,4) | M [7,8) |
| 10 | C [2,4) | C [6,8) |
| 11 | C [0,4) | C [4,8) |
| 12 | C [0,8) | _ |

We say that a scheduler is greedy if, whenever there is a processor available and a thread ready to be executed, then the scheduler assigns the thread to the processor and starts running the thread immediately. Greedy schedulers have a nice property that is summarized by the following theorem.

Theorem: Greedy Scheduling Principle

If a computation is run on $P$ processors using a perfect greedy scheduler that incurs no costs in creating, locating, and moving threads, then the total time (clock cycles) for running the computation $T_P$ is bounded by

$$T_P < \frac{W}{P} + S.$$

Here $W$ is the work of the computation, and $S$ is the span of the computation (both measured in units of clock cycles).

This simple statement is powerful. To see this, note that the time to execute the computation is at least $\frac{W}{P}$ because we have a total of $W$ work. As such, the best possible execution strategy is to divide it evenly among the processors. Furthermore, execution time cannot be less than $S$ since $S$ represents the longest chain of sequential dependencies. Thus we have: $T_P \geq \max\left(\frac{W}{P},S\right).$

This means that a greedy scheduler yields a schedule that is within a factor of two of optimal: $\frac{W}{P} + S$ is never more than twice $\max(\frac{W}{P},S)$. Furthermore, when $\frac{W}{P} \gg S$, the difference between the greedy scheduler and the optimal scheduler is very small. In fact, we can rewrite the equation above in terms of the average parallelism $\mathbb{P} = W/S$ as follows:

$$\begin{array}{rcl} T_P & < & \frac{W}{P} + S \\ & = & \frac{W}{P} + \frac{W}{\mathbb{P}}\\ & = & \frac{W}{P}\left(1 + \frac{P}{\mathbb{P}}\right) \end{array}$$

Therefore, as long as $\mathbb{P} \gg P$ (the parallelism is much greater than the number of processors), we obtain near-perfect speedup. (Speedup is $W/T_P$, and perfect speedup would be $P$.)

The quantity $\mathbb{P}$, the average parallelism, is usually quite large, because it typically grows polynomially with the input size.
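To get a feel for these bounds, consider a hypothetical computation with $W = 10^9$ operations, $S = 10^6$ operations, and $P = 16$ processors (these numbers are made up for illustration). The average parallelism is $\mathbb{P} = W/S = 1000 \gg 16$, and the greedy bound gives

$$T_{16} < \frac{10^9}{16} + 10^6 = 6.25 \times 10^7 + 10^6 = 6.35 \times 10^7,$$

so the speedup $W/T_{16}$ exceeds $10^9 / (6.35 \times 10^7) \approx 15.7$, which is close to the perfect speedup of $16$.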

Example 14. Scheduler with a global thread queue.

We can give a simple greedy scheduler by using a queue of threads. At the start of the execution, the scheduler places the root of the DAG into the queue and then repeats the following step until the queue becomes empty: for each idle processor, take the vertex at the front of the queue and assign it to that processor; let each processor run for one step; if, at the end of the step, there is a vertex in the DAG whose parents have all completed their execution, insert that vertex at the tail of the queue.
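The following is a small, self-contained sketch of this idea, written as a simulation over an explicit DAG of unit-cost threads (it illustrates the greedy principle and is not PASL's actual scheduler). A global queue holds the ready vertices; at each step, up to P of them are assigned to processors, and a vertex becomes ready as soon as its last unfinished parent completes.

#include <queue>
#include <vector>

// Each vertex records the vertices that depend on it and how many of its
// parent vertices have not yet completed.
struct vertex {
  std::vector<int> successors;
  int unfinished_parents;
};

// Simulate a greedy schedule of `dag`, starting at `root`, on P processors,
// assuming that every thread takes unit time. Returns the number of steps.
int greedy_schedule(std::vector<vertex>& dag, int root, int P) {
  std::queue<int> ready;   // global queue of ready vertices
  ready.push(root);
  int steps = 0;
  while (!ready.empty()) {
    steps++;
    // Assign at most P ready vertices to the idle processors for this step.
    std::vector<int> running;
    for (int p = 0; p < P && !ready.empty(); p++) {
      running.push_back(ready.front());
      ready.pop();
    }
    // At the end of the step, the assigned vertices have completed; any
    // vertex whose parents have now all completed is inserted at the tail.
    for (int v : running)
      for (int s : dag[v].successors)
        if (--dag[s].unfinished_parents == 0)
          ready.push(s);
  }
  return steps;
}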