v1.0 2016-01

Copyright: Umut A. Acar

1. Preface

The goal of this book is to cover the fundamental concepts of parallel computing, including models of computation, parallel algorithms, and techniques for implementing and evaluating parallel algorithms.

Our primary focus will be hardware shared-memory parallelism.

For implementations, we use a C++ library, called PASL, which we have been developing over the past 5 years. PASL stands for Parallel Algorithm Scheduling Library. It also sounds a bit like the French phrase "pas seul" (pronounced "pa-sole"), meaning "not alone".

The library provides several scheduling algorithms for executing parallel programs on modern multicores and provides a range of utilities for inspecting the empirical behavior of parallel programs. We expect the material in this book to allow the reader to write performant parallel programs at a relatively high level (essentially at the level of plain C++ code) without having to worry too much about lower-level details such as machine-specific optimizations, which might otherwise be necessary.

All code discussed in this book can be found in the GitHub repository linked by the following URL:

This codebase includes the examples presented in the book; see the file minicourse/examples.hpp.

Some of the material in this book is based on the course 15-210, co-taught with Guy Blelloch at CMU. The interested reader can find more details on this material in the textbook for that course.

The starting point for this book was an earlier book on PASL.

2. Administrative Matters

The course combines theory and practice. We will try to ask the following two questions.

  1. Does it work in practice?

  2. Does it work in theory?

As you know this is a graduate class. This means that I don’t care about your grade. But if you are sitting here, you might as well work for an A.

Grading will be based on some assignments, one midterm exam, and one project. We shall make time so that you can devote a good chunk of your semester to the project. You will also receive a participation score, which amounts to 20% of the grade.

For each lecture, I will make some notes and we shall make them available for you to comment on, perhaps via a GitHub repository.

3. Preliminaries

3.1. Processors, Processes, and Threads

We assume a machine model that consists of a shared memory accessed by a number of processors, usually written as $P$. The shared memory is readable and writable by all processors.

We assume an operating system or kernel that allows us to create processes. The kernel schedules processes on the available processors in a way that is mostly out of our control, with one exception: the kernel allows us to create any number of processes and pin them on the available processors as long as no more than one process is pinned on a processor.

We define a thread to be a piece of sequential computation whose boundaries, i.e., its start and end points, are defined on a case-by-case basis, usually based on the programming model. In reality, there are different notions of threads. For example, a system-level thread is created by a call to the kernel and scheduled by the kernel much like a process. A user-level thread is created by the application program and is scheduled by the application; user-level threads are invisible to the kernel. A common property of all threads is that they perform a sequential computation. In this class, we will usually talk about user-level threads. In the literature, you will encounter many different terms for a user-level thread, such as "fiber", "spark", "strand", etc.

For our purposes in this book, an application, a piece of runnable software, can only create threads, not processes. We will assume that we can assign to an application any number of processes to be used for execution. If an application runs all by itself (without any other application running at the same time) and all of its processes are pinned, then we refer to such an execution as occurring in dedicated mode.

Note
For now, we leave the details of the memory consistency model unspecified.

3.2. C++ Background

This book is entirely based on C++ and a library for writing parallel programs in C++. We use recent features of C++ such as closures or lambda expressions and templates. A deep understanding of these topics is not necessary to follow the course notes, because we explain them at a high level as we go, but such prior knowledge might be helpful; some pointers are provided below.

Template metaprogramming

Templates are C++'s way of providing for parametric polymorphism, which allows using the same code at multiple types. For example, in modern functional languages such as the ML family or Haskell, you can write a function $\lambda~x.x$ as an identity function that returns its argument for any type of $x$. You don't have to write the function separately for every type at which you plan to apply it. Since functional languages such as ML and Haskell rely on type inference and have powerful type systems, they can infer from your code the most general type (within the constraints of the type system). For example, the function $\lambda~x.x$ can be given the type $\forall \alpha. \alpha \rightarrow \alpha$. This type says that the function works for any type $\alpha$ and, given an argument of type $\alpha$, it returns a value of type $\alpha$.

C++ provides for polymorphism with templates. In its most basic form, a template is a class declaration or a function declaration, which is explicitly stated to be polymorphic, by making explicit the type variable. Since C++ does not in general perform type inference (in a rigorous sense of the word), it requires some help from the programmer.

For example, the following code defines an array class that is parametric in the type of its elements. The declaration template <class T> says that the declaration of class array, which follows, is parameterized by the identifier T. The definition of class array then uses T as a type variable. For example, the class defines a pointer to element sequences of type T, and the sub function returns an element of type T.

template <class T>
class array {
  public:
   array (int size) { a = new T[size]; }
   T sub (int i) { return a[i]; }

  private:
    T* a;
};

Note that the only part of the syntax template <class T> that is changeable is the identifier T. In other words, you should think of the syntax template <class T> as a binding form that allows you to pick an identifier (in this case T). You might ask why the type identifier/variable T is a class. This is a good question. The authors find it most helpful to not think much about such questions, especially in the context of the C++ language.

Once defined, a template class can be instantiated with different types by using the < > syntax. For example, we can define different arrays such as the following.

array<int> myFavoriteNumbers(7);
array<char*> myFavoriteNames(7);

Again, since C++ does not perform type inference for class instances, the C++ compiler expects the programmer to eliminate parametricity explicitly by specifying the argument type.

It is also possible to define polymorphic or generic functions. For example, the following declaration defines a generic identity function.

template <class T>
T identity(T x) { return x;}

Once defined, this function can be used without explicitly specializing it at various types. In contrast to templated classes, C++ does provide some type inference for calls to templated functions. So generic functions can be specialized implicitly, as shown in the examples below.

int i = identity(3);
const char* s = identity("template programming can be ugly");

This brief summary of templates should suffice for the purposes of the material covered in this book. Templates are covered in significant detail by many books, blogs, and discussion boards. We refer the interested reader to those sources for further information.

Lambda expressions

The C++11 reference provides good documentation on lambda expressions.
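
As a minimal illustration (not taken from the PASL examples), the sketch below shows the form of lambda expression used throughout this book: the [&] capture clause captures surrounding local variables by reference, so the lambda body can read and update them directly.

#include <iostream>

int main() {
  long x = 10;
  // [&] captures x by reference; calling incr updates the original x.
  auto incr = [&] (long d) { x = x + d; };
  incr(5);
  std::cout << "x = " << x << std::endl;  // prints: x = 15
  return 0;
}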

4. Introduction

This class is motivated by recent advances in computer architecture that have put sequential hardware on the path to extinction. Due to fundamental architectural limitations, the sequential performance of processors has not been increasing since 2005. In fact, performance has been decreasing by some measures, because hardware manufacturers have been simplifying processors by simplifying the handling of instruction-level parallelism.

Moore's law, however, continues to hold, at least for the time being, enabling hardware manufacturers to squeeze an increasing number of transistors into the same chip area. The result, not surprisingly, has been increased parallelism, that is, more processors. In particular, manufacturers have been producing multicore chips, where each chip consists of a number of processors that fit snugly into a small area, making communication between them fast and efficient. You can read more about the history of modern multicore computers.

This simple change in hardware has led to dramatic changes in computing. Parallel computing, once a niche domain for computational scientists, is now an everyday reality. Essentially all computing devices, ranging from mobile phones to laptops and desktop computers, are now parallel computers.

This change was anticipated by computer scientists. There was in fact much work on parallel algorithms in the 1980s and 1990s. The models of computation assumed then turned out to be unrealistic, which makes it somewhat of a challenge to use the algorithms from that era. Some of the ideas, however, transcend those earlier models and can still be used today to design and implement parallel algorithms on modern architectures.

The goal of this class is to give an introduction to the theory and the practice of parallel computing. Specifically, we will cover the following topics.

  1. Multithreaded computation

  2. Work and span

  3. Offline scheduling

  4. Structured or implicit parallel computation

    1. Fork-join, async-finish, nested parallelism.

    2. Futures.

    3. Parallelism versus concurrency

  5. Parallel algorithms for sequences

  6. Online scheduling: work-stealing algorithms and their analysis

  7. Parallel algorithms for trees

  8. Parallel algorithms for graphs

  9. Concurrency

5. Chapter: Multithreading, Parallelism, and Concurrency

The term multithreading refers to computing with multiple threads of control. Once created, a thread performs a computation by executing a sequence of instructions, as specified by the program, until it terminates. A multithreaded computation starts by executing a main thread, which is the thread at which the execution starts. A thread can create or spawn another thread and can synchronize with other threads by using a variety of synchronization constructs such as locks, mutexes, synchronization variables, and semaphores.

5.1. DAG Representation

A multithreaded computation can be represented by a dag, a directed acyclic graph, of vertices. The figure below shows an example multithreaded computation and its dag. Each vertex represents the execution of an instruction, such as an addition, a multiplication, a memory operation, a (thread) spawn operation, or a synchronization operation. A vertex representing a spawn operation has outdegree two. A synchronization operation waits for an operation belonging to another thread to complete, and thus a vertex representing a synchronization operation has indegree two. Recall that a dag represents a partial order. Thus the dag of the computation represents the partial ordering of the dependencies between the instructions in the computation.

A multithreaded computation.
Figure 1. A multithreaded computation.

Throughout this book, we make two assumptions about the structure of the dag:

  1. Each vertex has outdegree at most two.

  2. The dag has exactly one root vertex with indegree zero and one final vertex with outdegree zero. The root is the first instruction of the root thread.

The outdegree assumption naturally follows from the fact that each vertex represents an instruction, which can create at most one thread.

5.2. Cost Model: Work and Span

For analyzing the efficiency and performance of multithreaded programs, we use several cost measures, the most important of which are work and span. We define the work of a computation as the number of vertices in the dag and the span as the length of the longest path in the dag. In the example dag above, the work is $15$ and the span is $9$.

5.3. Execution and Scheduling

The execution of a multithreaded computation executes the vertices in the dag of the computation in some order that is consistent with the partial order specified by the dag; that is, if vertices $u,v$ are ordered in the dag, then the execution orders them in the same way.

Multithreaded programs are executed by using a scheduler that assigns vertices of the dag to processes.

Definition: Execution Schedule

Given a dag $G$, an (execution) schedule for $G$ is a function from processes and (time) steps to instructions of the dag such that

  1. if a vertex $u$ is ordered before another vertex $v$ in $G$, then $u$ is executed at an earlier time step than $v$, and

  2. each vertex in $G$ is executed exactly once.

The length of a schedule is the number of steps in the schedule.

The first condition ensures that a schedule observes the dependencies in the dag. Specifically, for each arc $(u,v)$ in the dag, the vertex $u$ is executed before vertex $v$.

For any step in the execution, we call a vertex ready if all the ancestors of the vertex in the dag are executed prior to that step. Similarly, we say that a thread is ready if it contains a ready vertex. Note that a thread can contain only one ready vertex at any time.

Example 1. Schedule

An example schedule with 3 processes. The length of this schedule is $10$.

Time Step | Process 1 | Process 2 | Process 3
1         | M1        |           |
2         | M2        |           |
3         | M3        | A1        |
4         | A2        |           |
5         | B1        | A3        |
6         | B2        | A4        |
7         | B3        | M4        |
8         | A5        | M5        |
9         | A6        |           |
10        | M6        |           |

Fact: Scheduling Invariant

Consider a computation dag $G$ and consider an execution using any scheduling algorithm. At any time during the execution, color the vertices that are executed as blue and the others as red.

  1. The blue vertices induce a blue sub-dag of $G$ that is connected and that has the same root as $G$.

  2. The red vertices induce a red sub-dag of $G$ that is connected.

  3. All the vertices of $G$ are in either the blue or the red sub-dag. In other words, the blue and red vertices partition the dag into two sub-dags.

5.4. Scheduling Lower Bounds

Theorem: Lower bounds

Consider any multithreaded computation with work $W$ and span $S$ and $P$ processes. The following lower bounds hold.

  1. Every execution schedule has length at least $\frac{W}{P}$.

  2. Every execution schedule has length at least $S$.

The first lower bound follows from the simple observation that a schedule can execute at most $P$ instructions at a time. Since all vertices must be executed, a schedule has length at least $\frac{W}{P}$. The second lower bound follows from the observation that the schedule cannot execute a vertex before its ancestors, and thus the length of the schedule is at least as long as any path in the dag, which can be as large as the span $S$.
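
As a quick check, consider the example dag and schedule given earlier, which have work $W = 15$ and span $S = 9$. With $P = 3$ processes, the two lower bounds say that any schedule has length at least $\max(\frac{15}{3}, 9) = 9$; the 3-process schedule shown above has length $10$, which is consistent with these bounds.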

5.5. Offline Scheduling

Having established a lower bound, we now move on to establish an upper bound for the offline scheduling problem, where we are given a dag and wish to find an execution schedule that minimizes the run time. It is known that the related decision problem is NP-complete but that a 2-approximation is relatively easy. We shall consider two distinct schedulers: the level-by-level scheduler and the greedy scheduler.

A level-by-level schedule is a schedule that executes the instructions of a given dag in level order, where the level of a vertex is the longest distance from the root of the dag to that vertex. More specifically, the vertices at level 0 are executed first, followed by the vertices at level 1, and so on.

Theorem: Offline Level-by-Level Schedule

For a dag with work $W$ and span $S$, the length of a level-by-level $P$-process schedule is at most $\lfloor \frac{W}{P} \rfloor + S$.

Proof

Let $W_i$ denote the work of the instructions at level $i$. These instructions can be executed in $\lceil \frac{W_i}{P} \rceil$ steps. Thus the total time is

\[ \sum_{i=1}^{S}{\left\lceil \frac{W_i}{P} \right\rceil} \le \sum_{i=1}^{S}{\left( \left\lfloor \frac{W_i}{P} \right\rfloor + 1 \right)} \le \left\lfloor \frac{W}{P} \right\rfloor + S \]

This theorem, known as Brent's theorem, was proved by Brent in 1974. It shows that the lower bound can be approximated within a factor of $2$.

Brent's theorem has later been generalized to all greedy schedules. A greedy schedule is a schedule that never leaves a process idle unless there are no ready vertices. In other words, greedy schedules keep processes as busy as possible by greedily assigning ready vertices.

Theorem: Offline Greedy Schedule

Consider a dag $G$ with work $W$ and span $S$. Any greedy $P$-process schedule has length at most $\frac{W}{P} + S \cdot \frac{P-1}{P}$.

Proof

Consider any greedy schedule with length $T$.

For each step $1 \le i \le T$ and for each process, collect a token: the token goes into the work bucket if the process executes a vertex in that step; otherwise the process is idle and the token goes into the idle bucket.

Since each token in the work bucket corresponds to an executed vertex, there are exactly $W$ tokens in that bucket.

We will now bound the number of tokens in the idle bucket by $S \cdot (P-1)$. Observe that at any step in the execution schedule, there is a ready vertex to be executed (because otherwise the execution is complete). This means that at each step, at most $P-1$ processes can contribute to the idle bucket. Furthermore, at each step where there is at least one idle process, we know that the number of ready vertices is less than the number of available processes. Note now that at such a step, all the ready vertices have no incoming edges in the red sub-dag consisting of the vertices that are not yet executed, and all the vertices that have no incoming edges in the red sub-dag are ready. Thus executing all the ready vertices at that step reduces by one the length of every path that originates at one of these vertices and ends at the final vertex. This means that the span of the red sub-dag is reduced by one, because every path with length equal to the span must originate at a ready vertex. Since the red sub-dag is initially equal to the dag $G$, its span is initially $S$, and thus there are at most $S$ steps at which some process is idle. As a result, the total number of tokens in the idle bucket is at most $S \cdot (P-1)$.

Since we collect exactly $P$ tokens in each step, the total number of tokens is $P$ times the length of the schedule, and the bound thus follows.
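
As a quick check, we can apply the theorem to the example dag used earlier, which has $W = 15$ and $S = 9$. For $P = 3$ processes, the bound is $\frac{15}{3} + 9 \cdot \frac{2}{3} = 5 + 6 = 11$ steps; the 3-process schedule presented earlier has length $10$, which is consistent with this bound (and with the lower bounds of the previous section).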

Exercise

Show that the bounds for Brent's level-by-level scheduler and for any greedy scheduler are within a factor of $2$ of optimal.

5.6. Online Scheduling

In offline scheduling, we are given a dag and are interested in finding a schedule with minimal length. When executing a multithreaded program, however, we don't have full knowledge of the dag. Instead, the dag unfolds as we run the program. Furthermore, we are interested not only in minimizing the length of the schedule but also the work and time it takes to compute the schedule. These two additional conditions define the online scheduling problem.

An online scheduler, or simply a scheduler, is an algorithm that solves the online scheduling problem by mapping threads to available processes. For example, if only one process is available, a scheduler can map all threads to that one process. If two processes are available, then the scheduler can divide the threads between the two as evenly as possible in an attempt to keep both processes as busy as possible by load balancing.

There are many different online scheduling algorithms, but they all operate similarly. We can outline a typical scheduling algorithm as follows.

Typical Online Scheduling Algorithm

The algorithm maintains a work pool consisting of ready threads and executes them. Execution starts with the root thread in the pool. It ends when the final vertex is executed. In order to minimize the cost of computing the schedule, the algorithm executes a thread until there is a need for synchronization with other threads.

To obtain work, a process removes a thread from the pool and executes its ready vertex. We refer to the thread executed by a process as its assigned thread. When executed, the ready vertex can make the next vertex of the thread ready, which then also gets executed, and so on, until one of the following synchronization actions occurs.

  1. Die: The process executes the last vertex of the thread, causing the thread to die. The process then obtains other work.

  2. Block: The assigned vertex executes but the next vertex does not become ready. This blocks the thread and thus the process obtains other work.

  3. Enable: The assigned vertex makes ready the continuation of the vertex and unblocks another previously blocked thread by making a vertex from that thread ready. In this case, the process inserts one of the two threads (either one) into the work pool and continues to execute the other.

  4. Spawn: The assigned vertex spawns another thread. As in the previous case, the process inserts one thread into the work pool and continues to execute the other.

These actions are not mutually exclusive. For example, a thread may spawn/enable a thread and die. In this case, the process performs the corresponding steps for each action.

Exercise: Scheduling Invariant

Convince yourself that the scheduling invariant holds in online scheduling.

For a given schedule generated by an online scheduling algorithm, we can define a tree of vertices, which tells us, for each vertex, the vertex that enabled it.

Definition: Enabling Tree

Consider the execution of a dag. If the execution of a vertex $u$ enables another vertex $v$, then we call the edge $(u,v)$ an enabling edge and we call $u$ the enabling parent of $v$. For simplicity, we simply use the term parent instead of enabling parent.

Note that any vertex other than the root vertex has one enabling parent. Thus the subgraph induced by the enabling edges is a rooted tree that we call the enabling tree.

Example 2. Scheduler with a global thread queue.

We can give a simple greedy scheduler by using a queue of threads. At the start of the execution, the scheduler places the root thread into the queue and then repeats the following step until the queue becomes empty: for each idle process, take the thread at the front of the queue and assign it to the process; let each process run for one step; if at the end of the step there are new ready threads, insert them at the tail of the queue.
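
The following is a minimal sketch of this centralized scheduler (it is not the PASL implementation). It assumes a hypothetical thread type whose run_step function executes one vertex, returns any threads that become ready as a result, and sets a done flag when the thread dies; for simplicity, the sketch also assumes that threads never block.

#include <deque>
#include <vector>

// Hypothetical thread type: run_step executes one vertex of the thread,
// returns the newly ready threads, and sets done when the thread has died.
struct thread {
  std::vector<thread*> (*run_step)(thread*);
  bool done;
};

void centralized_scheduler(thread* root, int P) {
  std::deque<thread*> queue = { root };       // global thread queue
  std::vector<thread*> assigned(P, nullptr);  // thread assigned to each process
  bool running = true;
  while (running) {
    // Assign threads from the front of the queue to idle processes.
    for (int p = 0; p < P; p++)
      if (assigned[p] == nullptr && ! queue.empty()) {
        assigned[p] = queue.front();
        queue.pop_front();
      }
    // Let each busy process run for one step; newly ready threads are
    // inserted at the tail of the queue.
    running = false;
    for (int p = 0; p < P; p++) {
      thread* t = assigned[p];
      if (t == nullptr)
        continue;
      running = true;
      for (thread* u : t->run_step(t))
        queue.push_back(u);
      if (t->done)
        assigned[p] = nullptr;
    }
    running = running || ! queue.empty();
  }
}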

The centralized scheduler with the global thread queue is a greedy scheduler that generates a greedy schedule under the assumption that the queue operations take zero time and that the dag is given. This algorithm, however, does not work well for online scheduling, because the operations on the queue take time. In fact, since the thread queue is global, the algorithm can only insert and remove one thread at a time. For this reason, centralized schedulers do not scale beyond a handful of processors.

Definition: Scheduling friction.

No matter how efficient a scheduler is, there is a real cost to creating threads, inserting them into and deleting them from queues, and performing load balancing. We refer to these costs cumulatively as scheduling friction, or simply as friction.

There has been much research on the problem of reducing friction in scheduling. This research shows that distributed scheduling algorithms can work quite well. In a distributed algorithm, each processor has its own queue and primarily operates on that queue. A load-balancing technique is then used to balance the load among the processors by redistributing threads, usually on an as-needed basis. This strategy ensures that processors can operate in parallel to obtain work from their queues.

A specific kind of distributed scheduling technique that can lead to schedules that are close to optimal is work stealing. In a work-stealing scheduler, processors work on their own queues as long as there is work in them, and otherwise go "steal" work from other processors by removing the thread at the tail end of their queue. It has been proven that randomized work-stealing algorithms, where idle processors randomly select processors to steal from, deliver schedules that are close to optimal in expectation (in fact, with high probability) and furthermore incur minimal friction. Randomized work-stealing schedulers can also be implemented efficiently in practice. PASL uses a scheduling algorithm that is based on work stealing. We consider work stealing in greater detail in a future chapter.
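
The following is a minimal, non-concurrent sketch of the two queue disciplines just described; it is not the PASL scheduler, the thread type is left abstract, and a real implementation requires a concurrent deque and termination detection.

#include <cstdlib>
#include <deque>

struct thread;  // left abstract in this sketch

struct worker {
  std::deque<thread*> q;

  void push(thread* t) { q.push_front(t); }

  thread* pop_own() {              // owner takes work from the head of its queue
    if (q.empty()) return nullptr;
    thread* t = q.front();
    q.pop_front();
    return t;
  }

  thread* steal() {                // a thief removes the thread at the tail end
    if (q.empty()) return nullptr;
    thread* t = q.back();
    q.pop_back();
    return t;
  }
};

thread* acquire(worker workers[], int P, int self) {
  thread* t = workers[self].pop_own();
  while (t == nullptr) {           // idle: try stealing from a random victim
    int victim = std::rand() % P;  // (termination detection omitted)
    if (victim != self)
      t = workers[victim].steal();
  }
  return t;
}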

5.7. Writing Multithreaded Programs: Pthreads

Multithreaded programs can be written using a variety of language abstractions and interfaces. One of the most widely used interfaces is the POSIX Threads, or Pthreads, interface, which specifies a programming interface for the C language, standardized as IEEE POSIX 1003.1c. Pthreads provides a rich interface that enables the programmer to create multiple threads of control that can synchronize by using nearly the whole range of the synchronization facilities mentioned above.

Hello world with Pthreads

An example Pthreads program is shown below. The main thread (executing the function main) creates 8 child threads and terminates. Each child in turn runs the function helloWorld and immediately terminates. Since the main thread does not wait for the children to terminate, it may terminate before the children do, depending on how threads are scheduled on the available processors.

#include <iostream>
#include <cstdlib>
#include <pthread.h>

using namespace std;

#define NTHREADS 8

void *helloWorld(void *threadid)
{
   long tid;
   tid = (long)threadid;
   cout << "Hello world! It is me, 00" << tid << endl;
   pthread_exit(NULL);
}

int main ()
{
   pthread_t threads[NTHREADS];
   int error;
   int i;
   for( i=0; i < NTHREADS; i++ ){
      cout << "main: creating thread 00" << i << endl;
      error = pthread_create(&threads[i], NULL, helloWorld, (void *) i);
      if (error) {
         cout << "Error: unable to create thread," << error << endl;
         exit(-1);
      }
   }
   pthread_exit(NULL);
}

When executed, this program may print the following.

main: creating thread 000
main: creating thread 001
main: creating thread 002
main: creating thread 003
main: creating thread 004
main: creating thread 005
main: creating thread 006
main: creating thread 007
Hello world! It is me, 000
Hello world! It is me, 001
Hello world! It is me, 002
Hello world! It is me, 003
Hello world! It is me, 004
Hello world! It is me, 005
Hello world! It is me, 006
Hello world! It is me, 007

That, however, would be unlikely; a more likely output would look like this:

main: creating thread 000
main: creating thread 001
main: creating thread 002
main: creating thread 003
main: creating thread 004
main: creating thread 005
main: creating thread 006
main: creating thread 007
Hello world! It is me, 000
Hello world! It is me, 001
Hello world! It is me, 006
Hello world! It is me, 003
Hello world! It is me, 002
Hello world! It is me, 005
Hello world! It is me, 004
Hello world! It is me, 007

And it may even look like this:

main: creating thread 000
main: creating thread 001
main: creating thread 002
main: creating thread 003
Hello world! It is me, 000
Hello world! It is me, 001
Hello world! It is me, 003
Hello world! It is me, 002
main: creating thread 004
main: creating thread 005
main: creating thread 006
main: creating thread 007
Hello world! It is me, 006
Hello world! It is me, 005
Hello world! It is me, 004
Hello world! It is me, 007

5.8. Writing Multithreaded Programs: Structured or Implicit Multithreading

Interfaces such as Pthreads enable the programmer to create a wide variety of multithreaded computations that can be structured in many different ways. Large classes of interesting multithreaded computations, however, can be expressed using a more structured approach, where threads are restricted in the way that they synchronize with other threads. One such interesting class of computations is fork-join computations, where a thread can spawn or "fork" another thread, or "join" with another existing thread. Joining a thread is the only mechanism through which threads synchronize. The figure below illustrates a fork-join computation. The main thread forks thread A, which then spawns thread B. Thread B then joins thread A, which then joins thread M.

A multithreaded, fork-join computation.
Figure 2. A multithreaded fork-join computation.

In addition to fork-join, there are other interfaces for structured multithreading, such as async-finish and futures. These interfaces are adopted in many programming languages: the Cilk language is primarily based on fork-join but also has some limited support for async-finish; the X10 language is primarily based on async-finish but also supports futures; the Haskell language provides support for fork-join and futures as well as others; the Parallel ML language, as implemented by the Manticore project, is primarily based on fork-join parallelism. Such languages are sometimes called implicitly parallel.

The class of computations that can be expressed as fork-join and async-finish programs is sometimes called nested parallel. The term "nested" refers to the fact that a parallel computation can be nested within another parallel computation. This is in contrast to flat parallelism, where a parallel computation can only perform sequential computations in parallel. Flat parallelism used to be a common technique but is becoming increasingly less prominent.

5.9. Parallelism versus concurrency

Structured multithreading offers important benefits both in terms of efficiency and expressiveness. Using programming constructs such as fork-join and futures, it is usually possible to write parallel programs such that the program accepts a "sequential semantics" but executes in parallel. The sequential semantics enables the programmer to treat the program as a serial program for the purposes of correctness. A run-time system then creates threads as necessary to execute the program in parallel. This approach offers in some ways the best of both worlds: the programmer can reason about correctness sequentially but the program executes in parallel. The benefit of structured multithreading in terms of efficiency stems from the fact that threads are restricted in the way that they communicate, which makes it possible to implement an efficient run-time system.

More precisely, consider some sequential language such as the untyped (pure) lambda calculus and its sequential dynamic semantics specified as a strict, small-step transition relation. We can extend this language with structured multithreading by enriching its syntax with "fork-join" and "futures" constructs. We can then extend the dynamic semantics of the language in two ways: 1) trivially ignore these constructs and execute serially as usual, and 2) execute in parallel by creating parallel threads. We can then show that these two semantics are in fact identical, i.e., that they produce the same value for the same expressions. In other words, we can extend a rich programming language with fork-join and futures and still give the language a sequential semantics. This shows that structured multithreading is nothing but an efficiency and performance concern; it can be ignored from the perspective of correctness.

We use the term parallelism to refer to the idea of computing in parallel by using such structured multithreading constructs. As we shall see, we can write parallel algorithms for many interesting problems. While parallel algorithms and applications constitute a large class, they do not cover all applications. Specifically, applications that can be expressed by using richer forms of multithreading, such as the one offered by Pthreads, do not always accept a sequential semantics. In such concurrent applications, threads can communicate and coordinate in complex ways to accomplish the intended result. A classic concurrency example is the "producer-consumer problem," where a consumer and a producer thread coordinate by using a fixed-size buffer of items. The producer fills the buffer with items and the consumer removes items from the buffer, and they coordinate to make sure that the buffer is never filled beyond its capacity. We can use operating-system-level processes instead of threads to implement similar concurrent applications.
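
To make the producer-consumer example concrete, the following is a minimal sketch using C++ standard-library threads (rather than Pthreads or PASL): a bounded buffer protected by a mutex, with condition variables used by the two threads to coordinate. The buffer capacity and item counts are arbitrary choices for illustration.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

const std::size_t capacity = 4;       // maximum number of items in the buffer
std::queue<int> buffer;
std::mutex m;
std::condition_variable not_full, not_empty;

void producer(int n) {
  for (int i = 0; i < n; i++) {
    std::unique_lock<std::mutex> lock(m);
    not_full.wait(lock, [] { return buffer.size() < capacity; });
    buffer.push(i);                   // produce an item
    not_empty.notify_one();
  }
}

void consumer(int n) {
  for (int i = 0; i < n; i++) {
    std::unique_lock<std::mutex> lock(m);
    not_empty.wait(lock, [] { return ! buffer.empty(); });
    int item = buffer.front();        // consume an item
    buffer.pop();
    not_full.notify_one();
    std::printf("consumed %d\n", item);
  }
}

int main() {
  std::thread p(producer, 100);
  std::thread c(consumer, 100);
  p.join();
  c.join();
  return 0;
}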

In summary, parallelism is a property of the hardware or the software platform where the computation takes place, whereas concurrency is a property of the application. Pure parallelism can be ignored for the purposes of correctness; concurrency cannot be ignored for understanding the behavior of the program.

Parallelism and concurrency are orthogonal dimensions in the space of all applications. Some applications are concurrent, some are not. Many concurrent applications can benefit from parallelism. For example, a browser, which is a concurrent application itself, may use a parallel algorithm to perform certain tasks. On the other hand, there is often no need to add concurrency to a parallel application, because this unnecessarily complicates software. It can, however, lead to improvements in efficiency.

The following quote from Dijkstra suggests pursuing the approach of making parallelism just a matter of execution (not one of semantics), which is the goal of much of the work on the development of programming languages today. Note that in this particular quote, Dijkstra does not mention that parallel algorithm design requires thinking carefully about parallelism, which is one aspect where parallel and serial computations differ.

From the past terms such as "sequential programming" and "parallel programming" are still with us, and we should try to get rid of them, for they are a great source of confusion. They date from the period that it was the purpose of our programs to instruct our machines, now it is the purpose of the machines to execute our programs. Whether the machine does so sequentially, one thing at a time, or with considerable amount of concurrency, is a matter of implementation, and should not be regarded as a property of the programming language.
Selected Writings on Computing: A Personal Perspective (EWD 508)
— Edsger W. Dijkstra

6. Chapter: Fork-join parallelism

Fork-join parallelism, a fundamental model in parallel computing, dates back to 1963 and has since been widely used in parallel computing. In fork join parallelism, computations create opportunities for parallelism by branching at certain points that are specified by annotations in the program text.

Each branching point forks the control flow of the computation into two or more logical threads. When control reaches the branching point, the branches start running. When all branches complete, the control joins back to unify the flows from the branches. Results computed by the branches are typically read from memory and merged at the join point. Parallel regions can fork and join recursively in the same manner that divide and conquer programs split and join recursively. In this sense, fork join is the divide and conquer of parallel computing.

As we will see, it is often possible to extend an existing language with support for fork-join parallelism by providing libraries or compiler extensions that support a few simple primitives. Such extensions to a language make it easy to derive a sequential program from a parallel program by syntactically substituting the parallelism annotations with corresponding serial annotations. This in turn enables reasoning about the semantics or the meaning of parallel programs by essentially "ignoring" parallelism.

PASL is a C++ library that enables writing implicitly parallel programs. In PASL, fork join is expressed by application of the fork2() function. The function expects two arguments: one for each of the two branches. Each branch is specified by one C++ lambda expression.

Example 3. Fork join

In the sample code below, the first branch writes the value 1 into the cell b1 and the second 2 into b2; at the join point, the sum of the contents of b1 and b2 is written into the cell j.

long b1 = 0;
long b2 = 0;
long j  = 0;

fork2([&] {
  // first branch
  b1 = 1;
}, [&] {
  // second branch
  b2 = 2;
});
// join point
j = b1 + b2;

std::cout << "b1 = " << b1 << "; b2 = " << b2 << "; ";
std::cout << "j = " << j << ";" << std::endl;

Output:

b1 = 1; b2 = 2; j = 3;

When this code runs, the two branches of the fork join are both run to completion. The branches may or may not run in parallel (i.e., on different cores). In general, the choice of whether or not any two such branches are run in parallel is made by the PASL runtime system. The join point is scheduled to run by the PASL runtime only after both branches complete. Before both branches complete, the join point is effectively blocked. Later, we will explain in more detail the scheduling algorithms that PASL uses to handle such load balancing and synchronization duties.

In fork-join programs, a thread is a sequence of instructions that does not contain calls to fork2(). A thread is essentially a piece of sequential computation. The two branches passed to fork2() in the example above correspond, for example, to two independent threads. Moreover, the statement following the join point (i.e., the continuation) is also a thread.

Note
If the syntax in the code above is unfamiliar, it might be a good idea to review the notes on lambda expressions in C++11. In a nutshell, the two branches of fork2() are provided as lambda-expressions where all free variables are passed by reference.
Note
Fork join of arbitrary arity is readily derived by repeated application of binary fork join. As such, binary fork join is universal because it is powerful enough to generalize to fork join of arbitrary arity.
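
For instance, a ternary fork (a hypothetical fork3 helper, not part of the PASL interface) can be written with two nested calls to fork2, assuming, as in the examples above, that fork2 accepts arbitrary callables.

template <class F1, class F2, class F3>
void fork3(const F1& f1, const F2& f2, const F3& f3) {
  fork2([&] {
    f1();
  }, [&] {
    // The remaining two branches are forked by a nested binary fork join.
    fork2(f2, f3);
  });
}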

All writes performed by the branches of the binary fork join are guaranteed by the PASL runtime to commit all of the changes that they make to memory before the join statement runs. In terms of our code snippet, all writes performed by two branches of fork2 are committed to memory before the join point is scheduled. The PASL runtime guarantees this property by using a local barrier. Such barriers are efficient, because they involve just a single dynamic synchronization point between at most two processors.

Example 4. Writes and the join statement

In the example below, both writes into b1 and b2 are guaranteed to be performed before the print statement.

long b1 = 0;
long b2 = 0;

fork2([&] {
  b1 = 1;
}, [&] {
  b2 = 2;
});

std::cout << "b1 = " << b1 << "; b2 = " << b2 << std::endl;

Output:

b1 = 1; b2 = 2

PASL provides no guarantee on the visibility of writes between any two parallel branches. In the code just above, for example, writes performed by the first branch (e.g., the write to b1) may or may not be visible to the second, and vice versa.
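
For instance, in the hypothetical variant below, the read of b1 in the second branch may observe either the initial value 0 or the value 1 written by the first branch, so the final value of b2 is not determined.

long b1 = 0;
long b2 = 0;

fork2([&] {
  b1 = 1;
}, [&] {
  // This read may see either 0 or 1: PASL does not guarantee that the
  // other branch's write to b1 is visible here.
  b2 = b1 + 1;
});
// At the join point, b1 is 1, but b2 may be either 1 or 2.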

6.1. Parallel Fibonacci

Now, we have all the tools we need to describe our first parallel code: the recursive Fibonacci function. Although useless as a program because of efficiency issues, this example is the "hello world" program of parallel computing.

Recall that the $n^{th}$ Fibonacci number is defined by the recurrence relation

$$ \begin{array}{lcl} F(n) & = & F(n-1) + F(n-2) \end{array} $$

with base cases

$$ F(0) = 0, \, F(1) = 1 $$

Let us start by considering a sequential algorithm. Following the definition of Fibonacci numbers, we can write the code for (inefficiently) computing the $n^{th}$ Fibonacci number as follows. This function is inefficient because the algorithm takes exponential time, whereas there exist dynamic programming solutions that take linear time.

long fib_seq (long n) {
  long result;
  if (n < 2) {
    result = n;
  } else {
    long a, b;
    a = fib_seq(n-1);
    b = fib_seq(n-2);
    result = a + b;
  }
  return result;
}

To write a parallel version, we remark that the two recursive calls are completely independent: they do not depend on each other (neither uses a piece of data generated or written by the other). We can therefore perform the recursive calls in parallel. In general, any two independent functions can be run in parallel. To indicate that two functions can be run in parallel, we use fork2().

long fib_par(long n) {
  long result;
  if (n < 2) {
    result = n;
  } else {
    long a, b;
    fork2([&] {
      a = fib_par(n-1);
    }, [&] {
      b = fib_par(n-2);
    });
    result = a + b;
  }
  return result;
}

6.2. Incrementing an array, in parallel

Suppose that we wish to map an array to another by incrementing each element by one. We can write the code for a function map_incr that performs this computation serially.

void map_incr(const long* source, long* dest, long n) {
  for (long i = 0; i < n; i++)
    dest[i] = source[i] + 1;
}
Example 5. Using map_incr.

The code below illustrates an example use of map_incr.

const long n = 4;
long xs[n] = { 1, 2, 3, 4 };
long ys[n];
map_incr(xs, ys, n);
for (long i = 0; i < n; i++)
  std::cout << ys[i] << " ";
std::cout << std::endl;

Output:

2 3 4 5

This is not a good parallel algorithm but it is not difficult to give a parallel algorithm for incrementing an array. The code for such an algorithm is given below.

void map_incr_rec(const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  if (n == 0) {
    // do nothing
  } else if (n == 1) {
    dest[lo] = source[lo] + 1;
  } else {
    long mid = (lo + hi) / 2;
    fork2([&] {
      map_incr_rec(source, dest, lo, mid);
    }, [&] {
      map_incr_rec(source, dest, mid, hi);
    });
  }
}

It is easy to see that this algorithm has $O(n)$ work and $O(\log{n})$ span.
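
To see this, let $W(n)$ and $S(n)$ denote the work and span of map_incr_rec on a range of $n$ elements, and note that each call does constant work outside of the two recursive calls, which operate on halves of the range and run in parallel:

$$ W(n) = 2\,W(n/2) + O(1), \qquad S(n) = S(n/2) + O(1), $$

which solve to $W(n) = O(n)$ and $S(n) = O(\log{n})$.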

6.3. The sequential elision

In the Fibonacci example, we started with a sequential algorithm and derived a parallel algorithm by annotating independent functions. It is also possible to go the other way and derive a sequential algorithm from a parallel one. As you have probably guessed this direction is easier, because all we have to do is remove the calls to the fork2 function. The sequential elision of our parallel Fibonacci code can be written by replacing the call to fork2() with a statement that performs the two calls (arguments of fork2()) sequentially as follows.

long fib_par(long n) {
  long result;
  if (n < 2) {
    result = n;
  } else {
    long a, b;
    ([&] {
      a = fib_par(n-1);
    })();
    ([&] {
      b = fib_par(n-2);
    })();
    result = a + b;
  }
  return result;
}
Note
Although this code is slightly different from the sequential version that we wrote, it is not too far away, because the only difference is the creation and application of the lambda expressions. An optimizing compiler for C++ can easily "inline" such computations. Indeed, after an optimizing compiler applies certain optimizations, the performance of this code is the same as the performance of fib_seq.

The sequential elision is often useful for debugging and for optimization. It is useful for debugging because it is usually easier to find bugs in sequential runs of parallel code than in parallel runs of the same code. It is useful in optimization because the sequentialized code helps us to isolate the purely algorithmic overheads that are introduced by parallelism. By isolating these costs, we can more effectively pinpoint inefficiencies in our code.

6.4. Executing fork-join algorithms

We defined fork-join programs as a subclass of multithreaded programs. Let's see more precisely how we can "map" a fork-join program to a multithreaded program. As our running example, let's use map_incr_rec, whose code is reproduced below.

void map_incr_rec(const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  if (n == 0) {
    // do nothing
  } else if (n == 1) {
    dest[lo] = source[lo] + 1;
  } else {
    long mid = (lo + hi) / 2;
    fork2([&] {
      map_incr_rec(source, dest, lo, mid);
    }, [&] {
      map_incr_rec(source, dest, mid, hi);
    });
  }
}

Since a fork-join program does not explicitly manipulate threads, it is not immediately clear what a thread refers to. To define threads, we can partition a fork-join computation into pieces of serial computation, each of which constitutes a thread. What we mean by a serial computation is a computation that runs serially and that does not involve any synchronization with other threads except at the start and at the end. More specifically, for fork-join programs, we can define a piece of serial computation to be a thread if it executes without performing parallel operations (fork2) except perhaps as its last action. When partitioning the computation into threads, it is important for threads to be maximal; technically, a thread can be as small as a single instruction.

Definition: Thread

A thread is a maximal computation consisting of a sequence of instructions that do not contain calls to fork2() except perhaps at the very end.

Dag for parallel increment
Figure 3. Dag for parallel increment on an array of $8$ elements: Each vertex corresponds to a call to map_incr_rec, excluding the fork2 or the continuation of fork2 (which is empty), and is annotated with the interval of the input array that it operates on (its argument).

The figure above illustrates the dag for an execution of map_incr_rec. We partition each invocation of this function into two threads, labeled "M" and "C" respectively. The thread labeled $M\lbrack i,j \rbrack$ corresponds to the part of the invocation of map_incr_rec with arguments lo and hi set to $i$ and $j$ respectively; this first part corresponds to the part of the execution up to and including the fork2, or all of the function if this is a base case. The second thread corresponds to the "continuation" of the fork2, which in this case includes no computation.

Based on this dag, we can create another dag, where each thread is replaced by the sequence of instructions that it represents. This would give us a picture similar to the dag we drew before for general multithreaded programs. Such a dag representation, where we represent each instruction by a vertex, gives us a direct way to calculate the work and span of the computation. If we want to calculate work and span on the dag of threads, we can label each vertex with a weight that corresponds to the number of instructions in that thread.

Note
The term thread is very much overused in computer science. There are system threads, which are threads that are known to the operating system and which can perform a variety of operations. For example, the Pthreads library enables creating such system threads and programming with them. There are also many libraries for programming with user-level threads, which are threads that exist at the application level but that the operating system does not know about. Then there are threads that are much more specific, such as those that we have defined for fork-join programs. Such threads can be mapped to system or user-level threads, but since they are more specific, they are usually implemented in a custom fashion, usually in the user/application space. For this reason, some authors prefer to use a different term for such threads, e.g., spark, strand, task.

Let’s observe a few properties of fork-join computations and their dags.

  1. The computation dag of a fork-join program applied to an input unfolds dynamically as the program executes. For example, when we run map_incr_rec on an input with $n$ elements, the dag initially contains just the root vertex (thread) corresponding to the first call to map_incr_rec, but it grows as the execution proceeds.

  2. An execution of a fork-join program can generate a massive number of threads. For example, our map_incr_rec function generates approximately $4n$ threads for an input with $n$ elements.

  3. The work/span of each thread can vary from a small amount to a very large amount, depending on the algorithm. In our example, each thread performs either a conditional, sometimes an addition, and a fork operation, or performs no actual computation (continuation threads).

Suppose now that we are given a computation dag and we wish to execute the dag by mapping each thread to one of the $P$ processors available on the hardware. To do this, we can use an online scheduling algorithm.

Example 6. An example 2-processor schedule

The following is a schedule for the dag shown in this Figure assuming that each thread takes unit time.

Time Step | Processor 1 | Processor 2
1         | M [0,8)     |
2         | M [0,4)     | M [4,8)
3         | M [0,2)     | M [4,6)
4         | M [0,1)     | M [4,5)
5         | M [1,2)     | M [5,6)
6         | C [0,2)     | C [4,6)
7         | M [2,4)     | M [6,8)
8         | M [2,3)     | M [6,7)
9         | M [3,4)     | M [7,8)
10        | C [2,4)     | C [6,8)
11        | C [0,4)     | C [4,8)
12        | C [0,8)     | _

Exercise: Enabling Tree

Draw the enabling tree for the schedule above.

Example 7. Centralized scheduler illustrated: the state of the queue and the dag after step 4. Completed vertices are drawn in grey (shaded).
Centralized scheduler illustrated
Example 8. Distributed scheduler illustrated: the state of the queue and the dag after step 4. Completed vertices are drawn in grey (shaded).
Distributed scheduler illustrated

7. Chapter: Structured Parallelism with Async-Finish

The "async-finish" approach offers another mechanism for structured parallelism. It is similar to fork-join parallelism but is more flexible, because it allows essentially any number of parallel branches to synchronize at the same point called a "finish". Each "async" creates a separate parallel branch, which by construction must signal a "finish."

As with fork-join, it is often possible to extend an existing language with support for async-finish parallelism by providing libraries or compiler extensions that support a few simple primitives. Such extensions to a language make it easy to derive a sequential program from a parallel program by syntactically substituting the parallelism annotations with corresponding serial annotations. This in turn enables reasoning about the semantics or the meaning of parallel programs by essentially "ignoring" parallelism, e.g., via the sequential elision mechanism.

7.1. Parallel Fibonacci via Async-Finish

Recall that the $n^{th}$ Fibonacci number is defined by the recurrence relation

$$ \begin{array}{lcl} F(n) & = & F(n-1) + F(n-2) \end{array} $$

with base cases

$$ F(0) = 0, \, F(1) = 1 $$

To write a parallel version, we remark that the two recursive calls are completely independent: they do not depend on each other (neither uses a piece of data generated or written by the other). We can therefore perform the recursive calls in parallel. In general, any two independent functions can be run in parallel. To indicate that two functions can be run in parallel, we use async(). It is a requirement that each async takes place in the context of a finish. All async'ed computations that take place in the context of a finish synchronize at that finish, i.e., the finish completes only when all the async'ed computations complete. The important point about a finish is that it can serve as the synchronization point for an arbitrary number of async'ed computations. This is not very useful in Fibonacci, because there are only two computations to synchronize at a time, but it will be useful in our next example.

void fib_par(long n, long& result) {
  if (n < 2) {
    result = n;
  } else {
    long a, b;
    finish([&] {
      async([&] {
        fib_par(n-1, a);
      });
      async([&] {
        fib_par(n-2, b);
      });
    });
    result = a + b;
  }
}

7.2. Incrementing an array, in parallel

Recall our example for mapping an array to another by incrementing each element by one. We can write the code for a function map_incr that performs this computation serially.

void map_incr(const long* source, long* dest, long n) {
  for (long i = 0; i < n; i++)
    dest[i] = source[i] + 1;
}

Using async-finish, we can write the code for parallel array increment by using only a single finish block and async'ing all the parallel computations inside that finish.

void map_incr_rec_aux (const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  if (n == 0) {
    // do nothing
  } else if (n == 1) {
    dest[lo] = source[lo] + 1;
  } else {
    long mid = (lo + hi) / 2;
    async([&] {
      map_incr_rec_aux(source, dest, lo, mid);
    });
    async([&] {
      map_incr_rec_aux(source, dest, mid, hi);
    });
  }
}


void map_incr_rec (const long* source, long* dest, long lo, long hi) {
  finish([&] {
    map_incr_rec_aux(source, dest, lo, hi);
  });
}

It is helpful to compare this to fork-join, where we had to synchronize parallel branches in pairs. Here, we are able to synchronize them all at a single point.

Dag for parallel increment using async-finish
Figure 4. Dag for parallel increment on an array of $8$ elements using async-finish: Each vertex corresponds to a call to map_incr_rec_aux, excluding the async or the continuation of async (which is empty), and is annotated with the interval of the input array that it operates on (its argument).

The figure above illustrates the dag for an execution of map_incr_rec. Each invocation of map_incr_rec_aux corresponds to a thread labeled "M". The thread labeled $M\lbrack i,j \rbrack$ corresponds to the invocation of map_incr_rec_aux with arguments lo and hi set to $i$ and $j$ respectively. Note that all threads synchronize at the single finish node.

Based on this dag, we can create another dag, where each thread is replaced by the sequence of instructions that it represents. This would give us a picture similar to the dag we drew before for general multithreaded programs. Such a dag representation, where we represent each instruction by a vertex, gives us a direct way to calculate the work and span of the computation. If we want to calculate work and span on the dag of threads, we can label each vertex with a weight that corresponds to the number of instruction in that thread.

Using async-finish does not alter the asymptotic work and span of the computation compared to fork-join, which remain $O(n)$ work and $O(\log{n})$ span. In practice, however, the async-finish version creates fewer threads (by about a factor of two), which can make a difference.

8. Chapter: Structured Parallelism with Futures

Futures were first used for expressing parallelism in the context of functional languages, because they permit a parallel computation to be a first-class value. Such a value can be treated just like any other value in the language. For example, it can be placed into a data structure or passed to other functions as an argument. Another crucial difference between futures and fork-join or async-finish parallelism is synchronization: in the latter two, synchronization is guided by "control dependencies", which are made apparent by the code. By inspecting the code, we can identify the synchronization points of a piece of parallel computation, which correspond to join and finish. With futures, synchronization can be more complex, because it is based on "data dependencies": when the result of a parallel computation is needed, the programmer can demand that the computation be completed. If the computation has not completed, then it will be executed to completion. If it has completed, its result will be used. Using futures, we can parallelize essentially any piece of computation, even if it depends on other parallel computations. Recall that when using fork-join and async-finish, we had to ensure that the parallel computations being spawned are indeed independent. This is not necessary with futures.

To indicate a parallel computation, we shall use the future construct, which takes an expression as an argument and starts a parallel computation that will compute the value of that expression sometime in the future. The construct returns a "future" value representing the computation, which can be used just like any other value of the language. When the time comes to use the value of the future, we demand its completion by using the force construct.

8.1. Parallel Fibonacci via Futures

Recall that the $n^{th}$ Fibonacci number is defined by the recurrence relation

$$ \begin{array}{lcl} F(n) & = & F(n-1) + F(n-2) \end{array} $$

with base cases

$$ F(0) = 0, \, F(1) = 1. $$

We can write a parallel version of Fibonacci using futures as follows.

long fib_par(long n) {
  if (n < 2) {
    return n;
  } else {
    long^ a, b;
    a = future([&] { return fib_par(n-1); });
    b = future([&] { return fib_par(n-2); });
    return force(a) + force(b);
  }
}

Note that this code is very similar to the fork-join version. In fact, we can directly map any use of fork-join parallelism to futures.
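To make the pseudocode concrete, the sketch below renders the same algorithm with standard C++ futures; std::async stands in for future and get() stands in for force. This is not PASL's interface, and spawning one task per recursive call is far too costly to be a practical parallel Fibonacci; the sketch only shows the shape of the construct.

#include <future>

// fib_future: the future/force pseudocode above rendered with standard C++.
// std::async plays the role of `future`; get() plays the role of `force`.
// For illustration only: one task per recursive call is very expensive.
long fib_future(long n) {
  if (n < 2)
    return n;
  std::future<long> a = std::async(std::launch::async, fib_future, n - 1);
  std::future<long> b = std::async(std::launch::async, fib_future, n - 2);
  return a.get() + b.get();
}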

Much the same way that we represent fork-join and async-finish computations with a dag of threads, we can also represent future-based computations with a dag. The idea is to spawn a subdag for each future and for each force add an edge from the terminus of that subdag to the vertex that corresponds to the force. For example, for the Fibonacci function, the dags with futures are identical to those with fork-join. As we shall see, however, we can create more interesting dags with futures.

8.2. Incrementing an array, in parallel

Recall our example for mapping an array to another by incrementing each element by one. We can write the code for a function map_incr that performs this computation serially.

void map_incr(const long* source, long* dest, long n) {
  for (long i = 0; i < n; i++)
    dest[i] = source[i] + 1;
}

We can write a parallel version of this code using futures by following the same approach that we did for Fibonacci. But let’s consider something slightly different. Suppose that the array is given to us as an array of future values. We can write a serial version of map_incr for such an array as follows.

void future_map_incr(const long^* source, long^* dest, long n) {
  for (long i = 0; i < n; i++)
    dest[i] = future [&] {force (source[i]) + 1};
}

The function future_map_incr takes an array of futures of long, written long^*, and writes its results into another array of the same type. To compute each element of the destination, the function forces the corresponding element of the source inside a future.
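For readers who want to try this outside PASL, here is a rough sketch of the same function using std::shared_future<long> from the C++ standard library as a stand-in for the long^ notation; PASL's actual future type and primitives may differ.

#include <future>
#include <vector>

// Sketch of future_map_incr: an "array of futures" is represented as a
// vector of std::shared_future<long>. Each destination element forces the
// corresponding source element inside its own asynchronous task.
void future_map_incr(const std::vector<std::shared_future<long>>& source,
                     std::vector<std::shared_future<long>>& dest,
                     long n) {
  for (long i = 0; i < n; i++)
    dest[i] = std::async(std::launch::async, [&source, i] {
      // Force the i-th source element, then increment it.
      return source[i].get() + 1;
    }).share();
}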

8.3. Futures and Pipelining

Futures enable expressing parallelism at a very fine level of granularity, at the level of individual data dependencies. This in turn allows parallelizing computations at a similarly fine grain, enabling a technique called pipelining.

The idea behind pipelining is to decompose a task $T$ into a sequence of smaller subtasks $T_0, \ldots, T_k$ to be performed in that order and overlap the computation of multiple tasks by starting another instance of the same kind of task as soon as the first subtask completes. As a result, if we have a collection of the same kinds of tasks that can be decomposed in the same way, we can have multiple of them "in flight" without having to wait for one to complete.

When computing sequentially, pipelining does not help performance because we can perform one computation at each time step. But when using parallel hardware, we can keep multiple processors busy by pipelining.

Suppose as an example that we have a sequence of tasks $T^0, T^1, T^2, T^3, T^4, T^5$ that we wish to compute. Suppose furthermore that these tasks depend on each other, that is, a later task uses the results of an earlier task. Thus it might seem that there is no parallelism that we can exploit. Upon further inspection, however, we may realize that we can partition each task $T^i$ into subtasks $T_0^i, T_1^i, T_2^i, T_3^i, T_4^i, T_5^i$ such that subtask $j$ of task $i$ is used by subtask $j+1$ of task $i+1$. We can then execute these tasks in parallel as shown below. Thus in the "steady state," where we have a large supply of such tasks, we can increase performance by a factor of $6$, the number of subtasks into which each task decomposes.

Processor   Step 0    Step 1    Step 2    Step 3    Step 4    Step 5

$P_0$       $T_0^0$   $T_1^0$   $T_2^0$   $T_3^0$   $T_4^0$   $T_5^0$
$P_1$                 $T_0^1$   $T_1^1$   $T_2^1$   $T_3^1$   $T_4^1$
$P_2$                           $T_0^2$   $T_1^2$   $T_2^2$   $T_3^2$
$P_3$                                     $T_0^3$   $T_1^3$   $T_2^3$
$P_4$                                               $T_0^4$   $T_1^4$
$P_5$                                                         $T_0^5$

This idea of pipelining turns out to be quite important in some algorithms, leading sometimes to asymptotic improvements in run time. It can, however, be painful to design the algorithm to take advantage of pipelining, especially if all we have at our disposal is fork-join and async-finish parallelism, which require parallel computations to be independent. When using these constructs, we might therefore have to redesign the algorithm so that independent computations can be structurally separated and spawned in parallel. On the other hand, futures make it trivial to express pipelined algorithms because we can express data dependencies and ignore how exactly the individual computations may need to be executed in parallel. For example, in our hypothetical example, all we have to do is create a future for each subtask and force the relevant subtask as needed, leaving it to the scheduler to take advantage of the parallelism made available.

As a more concrete example, let’s go back to our array-increment function, generalize it to an array map, and assume that we have another function mk_array that populates the contents of the source array.


long f (long i) {
  ...
}

long g (long i) {
  ...
}

void mk_array (long^* source, long n) {
  for (long i = 0; i < n; i++)
    source[i] = future ( [&] { f(i) } );
}

void future_map_g (const long^* source, long^* dest, long n) {
  for (long i = 0; i < n; i++)
    dest[i] = future ( [&] {g (force (source[i]))} );
}

main (long n) {
  long^* source = alloc (long, n);
  long^* dest = alloc (long, n);
  mk_array (source, n);
  future_map_g (source, dest, n);

}

In this example, the $i^{th}$ element of the destination array depends only on the $i^{th}$ element of the source; thus as soon as that element is available, the function g can be invoked. This allows us to pipeline the execution of the functions mk_array and future_map_g.
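The following self-contained sketch shows this pipelining in plain C++, with std::shared_future standing in for the future/force pseudocode above; the bodies of f and g are hypothetical placeholders.

#include <future>
#include <vector>

long f(long i) { return i * i; }   // hypothetical stage-1 work
long g(long x) { return x + 1; }   // hypothetical stage-2 work

int main() {
  long n = 8;
  std::vector<std::shared_future<long>> source(n), dest(n);
  // Stage 1: populate the source array, one future per element.
  for (long i = 0; i < n; i++)
    source[i] = std::async(std::launch::async, f, i).share();
  // Stage 2: each g(source[i]) can start as soon as f(i) finishes,
  // independently of the other elements, so the two stages overlap.
  for (long i = 0; i < n; i++)
    dest[i] = std::async(std::launch::async, [&source, i] {
      return g(source[i].get());
    }).share();
  long last = dest[n - 1].get();   // force the final result
  (void) last;
  return 0;
}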

While we shall not discuss this in detail, it is possible to improve the asymptotic complexity of certain algorithms by using futures and pipelining.

An important research question regarding futures is their scheduling. One challenge is contention. When using futures, it is easy to create dag vertices whose out-degree is non-constant. The question is how such dag vertices can be represented so that they do not create a bottleneck, while still allowing the small-degree case, which is the common one, to proceed efficiently. Another challenge is data locality. Async-finish and fork-join programs can be scheduled to exhibit good data locality, for example, using work stealing. With futures, this is trickier. It can be shown, for example, that even a single scheduling decision can have a large negative impact on the data locality of the computation.

9. Critical Sections and Mutual Exclusion

In a multithreaded program, a critical section is a part of the program that may not be executed by more than one thread at the same time. Critical sections typically contain code that alters shared objects, such as shared (e.g., global) variables. In other words, a critical section requires mutual exclusion: only one thread can be inside the critical section at any time.

Since only one thread can be inside a critical section at a time, threads must coordinate to make sure that they don’t enter the critical section at the same time. If threads do not coordinate and multiple threads enter the critical section at the same time, we say that a race condition occurs, because the outcome of the program depends on the relative timing of the threads, and thus can vary from one execution to another. Race conditions are sometimes benign but usually not so, because they can lead to incorrect behavior. Spectacular examples of race conditions' effects include the "Northeast Blackout" of 2003, which affected 45 million people in the US and 10 million people in Canada.

It can be extremely difficult to find a race condition, because of the non-determinacy of execution. A race condition may lead to an incorrect behavior only a tiny fraction of the time, making it extremely difficult to observe and reproduce. For example, the software fault that led to the Northeast blackout took software engineers "weeks of poring through millions of lines of code and data to find it" according to one of the companies involved.

The problem of designing algorithms or protocols for ensuring mutual exclusion is called the mutual exclusion problem or the critical section problem. There are many ways of solving instances of the mutual exclusion problem. Broadly, we can distinguish two categories: spin locks and blocking locks. The idea in spin locks is to busy-wait until the critical section is clear of other threads. Solutions based on blocking locks are similar, except that instead of busy-waiting, threads simply block. When the critical section is clear, a blocked thread receives a signal that allows it to proceed. The term mutex, short for "mutual exclusion," is sometimes used to refer to a lock.
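To make the spin-lock idea concrete, here is a minimal sketch written with the C++ atomic operations introduced later in this chapter. It is an illustration only, not production-quality code (a real lock would also consider memory ordering and back-off).

#include <atomic>

// A minimal spin lock: lock() busy-waits until it can atomically change
// `taken` from false to true, which grants entry to the critical section;
// unlock() clears the flag.
class spin_lock {
  std::atomic<bool> taken;
public:
  spin_lock() : taken(false) {}
  void lock() {
    bool expected = false;
    // compare_exchange_strong writes the observed value into `expected`
    // when it fails, so we reset it before retrying.
    while (!taken.compare_exchange_strong(expected, true))
      expected = false;
  }
  void unlock() {
    taken.store(false);
  }
};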

Mutual exclusion problems have been studied extensively in several areas of computer science. For example, in operating systems research, processes, which like threads are independent threads of control, usually but not always belonging to different programs, can share certain system resources. To enable such sharing safely and efficiently, researchers have proposed various forms of locks, such as semaphores, which admit both busy-waiting and blocking semantics. Another class of synchronization primitives, called condition variables, enables blocking synchronization by conditioning on the value of a variable.

9.1. Parallelism and Mutual Exclusion

In parallel programming, mutual exclusion problems do not have to arise. For example, if we program in a purely functional language extended with structured multithreading primitives such as fork-join and futures, programs remain purely functional and mutual-exclusion problems, and hence race conditions, do not arise. If we program in an imperative language, however, where memory is always a shared resource, even when it is not intended to be so, threads can easily share memory objects, even unintentionally, leading to race conditions.

Writing to the same location in parallel

In the code below, both branches of fork2 are writing into b. What should then the output of this program be?

long b = 0;

fork2([&] {
  b = 1;
}, [&] {
  b = 2;
});

std::cout << "b = " << b << std::endl;

At the time of the print, the contents of b are determined by the last write. Thus, depending on which of the two branches performs its write last, we can see either of the two possibilities:

Output:

b = 1

Output:

b = 2
Example 9. Fibonacci

Consider the following alternative implementation of the Fibonacci function. By "inlining" the plus operation in both branches, the programmer got rid of the addition operation after the fork2.

long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      result += fib_par_racy(n-1);
    },[&] {
      result += fib_par_racy(n-2);
    });
  }
  return result;
}

This code is not correct because it has a race condition.

As the example shows, separate threads update the variable result. It might look like this is not a race condition because the update consists of an addition operation, which reads the value and then writes to result. The race condition might be easier to see if we expand out the applications of the += operator.

long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      long a1 = fib_par_racy(n-1);
      long a2 = result;
      result = a1 + a2;
    },[&] {
      long b1 = fib_par_racy(n-2);
      long b2 = result;
      result = b1 + b2;
    });
  }
  return result;
}

When written in this way, it is clear that these two parallel threads are not independent: they both read result and write to result. Thus the outcome depends on the order in which these reads and writes are performed, as shown in the next example.

Example 10. Execution trace of a race condition

The following table takes us through one possible execution trace of the call fib_par_racy(2). The number to the left of each instruction describes the time at which the instruction is executed. Note that since this is a parallel program, multiple instructions can be executed at the same time. The particular execution that we have in this example gives us a bogus result: the result is 0, not 1 as it should be.

Time step   Thread 1                 Thread 2

1           a1 = fib_par_racy(1)     b1 = fib_par_racy(0)
2           a2 = result              b2 = result
3           result = a1 + a2         _
4           _                        result = b1 + b2

The reason we get a bogus result is that both threads read the initial value of result at the same time and thus do not see each other's writes. In this example, the second thread "wins the race" and writes into result. The value 1 written by the first thread is effectively lost, overwritten by the second thread.

9.2. Synchronization Hardware

Since mutual exclusion is a common problem in computer science, many hardware systems provide specific synchronization operations that can help solve instances of the problem. These operations may allow, for example, testing the contents of a (machine) word then modifying it, perhaps by swapping it with another word. Such operations are sometimes called atomic read-modify-write or RMW, for short, operations.

A handful of different RMW operations have been proposed. They include operations such as load-link/store-conditional, fetch-and-add, and compare-and-swap. They typically take a memory location x and a value v, and replace the value stored at x with f(x,v). For example, the fetch-and-add operation takes the location x and an increment amount v, and atomically increments the value at that location by the specified amount, i.e., f(x,v) = *x + v.
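For concreteness, the following snippet shows fetch-and-add as exposed by C++'s std::atomic; this is standard C++ and is independent of PASL.

#include <atomic>
#include <iostream>

int main() {
  std::atomic<long> x(10);
  // fetch_add atomically replaces *x with *x + 5 and returns the old value.
  long old = x.fetch_add(5);
  std::cout << old << " " << x.load() << std::endl;  // prints: 10 15
  return 0;
}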

The compare-and-swap operation takes the location x and a pair of values (a,b) as its second argument, and stores b into x if the value in x is a, i.e., f(x,(a,b)) = if *x = a then b else *x; the operation returns a Boolean indicating whether it successfully stored a new value in x. The operation "compare-and-swap" is a reasonably powerful synchronization operation: it can be used by arbitrarily many threads to agree (reach consensus) on a value. This instruction is therefore frequently provided by modern parallel architectures such as Intel’s X86.

In C++, the atomic class can be used to perform synchronization. Objects of this type are guaranteed to be free of race conditions; in fact, in C++, they are the only objects that are guaranteed to be free from race conditions. The contents of an atomic object can be accessed by load operations, updated by store operations, and also updated by the compare_exchange_weak and compare_exchange_strong operations, which implement the compare-and-swap operation.

Example 11. Accessing the contents of atomic memory cells

Access to the contents of any given cell is achieved by the load() and store() methods.

std::atomic<bool> flag;

flag.store(false);
std::cout << flag.load() << std::endl;
flag.store(true);
std::cout << flag.load() << std::endl;

Output:

0
1

The key operation that helps avoid race conditions is the compare-and-exchange (compare-and-swap) operation.

Definition: compare and swap

When executed with a target atomic object, an expected value expected, and a new value new, this operation performs the following steps, atomically:

  1. Read the contents of target.

  2. If the contents equal the value of expected, then write new into the target and return true.

  3. Otherwise, write the observed contents of the target into expected and return false.

Example 12. Reading and writing atomic objects
std::atomic<bool> flag;

flag.store(false);
bool expected = false;
bool was_successful = flag.compare_exchange_strong(expected, true);
std::cout << "was_successful = " << was_successful << "; flag = " << flag.load() << std::endl;
bool expected2 = false;
bool was_successful2 = flag.compare_exchange_strong(expected2, true);
std::cout << "was_successful2 = " << was_successful2 << "; flag = " << flag.load() << std::endl;

Output:

was_successful = 1; flag = 1
was_successful2 = 0; flag = 1

As another example use of the atomic class, recall our Fibonacci example with the race condition. In that example, the race condition arises because of concurrent writes to the result variable. We can eliminate this kind of race condition by using separate memory locations for the two writes, or by making result an atomic object and updating it with a compare_exchange_strong operation.

Example 13. Fibonacci

The following implementation of Fibonacci is not safe because the variable result is shared and updated by multiple threads.

long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      result += fib_par_racy(n-1);
    },[&] {
      result += fib_par_racy(n-2);
    });
  }
  return result;
}

We can solve this problem by declaring result to be an atomic type and using a standard busy-waiting protocol based on compare-and-swap.


long fib_par_atomic(long n) {
  std::atomic<long> result(0);
  if (n < 2) {
    result.store(n);
  } else {
    fork2([&] {
      long r = fib_par_atomic(n-1);
      // Atomically update result.
      while (true) {
        long exp = result.load();
        bool flag = result.compare_exchange_strong(exp, exp+r);
        if (flag) {break;}
      }
    },[&] {
      long r = fib_par_atomic(n-2);
      // Atomically update result.
      while (true) {
        long exp = result.load();
        bool flag = result.compare_exchange_strong(exp, exp+r);
        if (flag) {break;}
      }
    });
  }
  return result.load();
}

The idea behind the solution is to load the current value of result and atomically update result only if it has not been modified (by another thread) since it was loaded. This guarantees that the result is always updated (read and modified) correctly without missing an update from another thread.

The example above illustrates a typical use of the compare-and-swap operation. In this particular example, we can probably prove our code is correct. But this is not always as easy due to a problem called the "ABA problem."
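As an aside, because the update in this particular example is just an addition, the busy-waiting loop can be avoided altogether by using an atomic fetch-and-add, which performs the whole read-modify-write in one atomic step. A sketch, assuming the same fork2 primitive used throughout this book:

#include <atomic>

// fetch_add performs the read-modify-write atomically, so no retry loop
// is needed when the update is a plain addition.
long fib_par_fetch_add(long n) {
  std::atomic<long> result(0);
  if (n < 2) {
    result.store(n);
  } else {
    fork2([&] {
      result.fetch_add(fib_par_fetch_add(n-1));
    }, [&] {
      result.fetch_add(fib_par_fetch_add(n-2));
    });
  }
  return result.load();
}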

9.3. ABA problem

While reasonably powerful, compare-and-swap suffers from the so-called ABA problem. To see this, consider the following scenario where a shared variable result is updated by multiple threads in parallel: a thread, say $T$, reads result and stores its current value, say 2, in current. In the meantime some other thread also reads result and performs some operations on it, setting it back to 2 after it is done. Now, thread $T$ takes its turn again and attempts to store a new value into result, using 2 as the old value, and succeeds, because the value stored in result appears not to have changed. The trouble is that the value has actually changed and then been changed back to the value it used to be. Compare-and-swap cannot detect this change because it relies only on a simple, shallow notion of equality. If, for example, the value stored in result were a pointer, the fact that the pointer remains the same does not mean that the values reachable from the pointer have not been modified; if the pointer led to a tree structure, an update deep in the tree could leave the pointer unchanged, even though the tree has changed.

This problem is called the ABA problem, because it involves cycling the contents of the atomic memory cell through the values $A$, $B$, and again $A$. The ABA problem is an important limitation of compare-and-swap: the load followed by the compare-and-swap is not atomic as a whole, but it can behave as if it were atomic, provided that equality of the contents of the memory cell suffices for correctness.

In the example above, the ABA problem may happen (if the counter is incremented and decremented again between a load and a store), but it is impossible to observe because it is harmless. If, however, the compare-and-swap were on a memory object containing references, the ABA problem could have observable effects.

The ABA problem can be exploited to give seemingly correct implementations that are in fact incorrect. To reduce the chances of bugs due to the ABA problem, memory objects subject to compare-and-swap are usually tagged with an additional field that counts the number of updates. This solves the basic problem, but only up to a point, because the counter itself can also wrap around. The load-link/store-conditional operation solves this problem by performing the write only if the memory location has not been updated since the last read (load), but its practical implementations are hard to come by.
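To make the tagging idea concrete, the sketch below packs a 32-bit value together with a 32-bit update counter into a single 64-bit atomic word, so that a compare-and-swap fails if either the value or the update count has changed since the load. Real implementations vary; the names used here are ours.

#include <atomic>
#include <cstdint>

// A tagged memory cell: the low 32 bits hold the value, the high 32 bits
// hold a counter that is bumped on every successful update.
std::atomic<uint64_t> cell(0);

bool tagged_cas(uint32_t expected_value, uint32_t new_value) {
  uint64_t old_word = cell.load();
  uint32_t old_value = (uint32_t) old_word;
  uint32_t old_count = (uint32_t) (old_word >> 32);
  if (old_value != expected_value)
    return false;
  uint64_t new_word = ((uint64_t) (old_count + 1) << 32) | new_value;
  // Fails if either the value or the counter changed since the load.
  return cell.compare_exchange_strong(old_word, new_word);
}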

10. Chapter: Experimenting with PASL

We are now going to study the practical performance of our parallel algorithms written with PASL on multicore computers.

To be concrete with our instructions, we assume that our username is pasl and that our home directory is /home/pasl/. You need to replace these settings with your own where appropriate.

10.1. Obtain source files

Let’s start by downloading the PASL sources. The PASL sources that we are going to use are part of a branch that we created specifically for this course. You can access the sources either via the tarball linked by the github webpage or, if you have git, via the command below.

$ cd /home/pasl
$ git clone -b edu https://github.com/deepsea-inria/pasl.git

10.2. Software Setup

You can skip this section if you are using a computer already set up by us or if you have installed an image file containing our software. To skip this part and use installed binaries, see the heading "Starting with installed binaries", below.

10.2.1. Check for software dependencies

Currently, the software associated with this course supports Linux only. Any machine that is configured with a recent version of Linux and has access to at least two processors should be fine for the purposes of this course. Before we can get started, however, the following packages need to be installed on your system.

Software dependency (version): nature of dependency

  • gcc (>= 4.9.0): required to build PASL binaries

  • php (>= 5.3.10): required by PASL makefiles to build PASL binaries

  • ocaml (>= 4.0.0): required to build the benchmarking tools (i.e., pbench and pview)

  • R (>= 2.4.1): required by benchmarking tools to generate reports in bar plot and scatter plot form

  • latex (recent): optional; required by benchmarking tools to generate reports in tabular form

  • git (recent): optional; can be used to access PASL source files

  • tcmalloc (>= 2.2): optional; may be useful to improve performance of PASL binaries

  • hwloc (recent): optional; might be useful to improve performance on large systems with NUMA (see below)

The rest of this section explains what the optional software dependencies are and how to configure PASL to use them. We are going to assume that all of these software dependencies have been installed in the folder /home/pasl/Installs/.

10.2.2. Use a custom parallel heap allocator

At the time of writing this document, the system-default implementations of malloc and free that are provided by Linux distributions do not scale well with even moderately large amounts of concurrent allocations. Fortunately, organizations such as Google and Facebook have implemented their own scalable allocators that serve as drop-in replacements for malloc and free. We have observed the best results from Google’s allocator, namely tcmalloc. Using tcmalloc for your own experiments is easy. Just add to the /home/pasl/pasl/minicourse folder a file named settings.sh with the following contents.

Example 14. Configuration to select tcmalloc

We assume that the package that contains tcmalloc, namely gperftools, has been installed already in the folder /home/pasl/Installs/gperftools-install/. The following lines need to be in the settings.sh file in the /home/pasl/pasl/minicourse folder.

USE_ALLOCATOR=tcmalloc
TCMALLOC_PATH=/home/pasl/Installs/gperftools-install/lib/

Also, the dynamic linker needs to be instructed where to find tcmalloc.

export LD_PRELOAD=/home/pasl/Installs/gperftools-install/lib/libtcmalloc.so

This assignment can be issued either at the command line or in the environment loader script, e.g., ~/.bashrc.

Warning
Changes to the settings.sh file take effect only after recompiling the binaries.

10.2.3. Use hwloc

If your system has a non-uniform memory architecture (i.e., NUMA), then you may improve performance of PASL applications by using optional support for hwloc, which is a library that reports detailed information about the host system, such as NUMA layout. Currently, PASL leverages hwloc to configure the NUMA allocation policy for the program. The particular policy that works best for our applications is round-robin NUMA page allocation. Do not worry if that term is unfamiliar: the policy simply interleaves memory pages across the NUMA nodes, which evens out memory-access costs and makes the machine behave more like a uniform-memory machine.

Example 15. How to know whether my machine has NUMA

Run the following command.

$ dmesg | grep -i numa

If the output that you see is something like the following, then your machine has NUMA. Otherwise, it probably does not.

[    0.000000] NUMA: Initialized distance table, cnt=8
[    0.000000] NUMA: Node 4 [0,80000000) + [100000000,280000000) -> [0,280000000)

We are going to assume that hwloc has been installed already and is located at /home/pasl/Installs/hwloc-install/. To configure PASL to use hwloc, add the following lines to the settings.sh file in the /home/pasl/pasl/minicourse folder.

Example 16. Configuration to use hwloc
USE_HWLOC=1
HWLOC_PATH=/home/pasl/Installs/hwloc-install/

10.3. Starting with installed binaries

At this point, you have either installed all the necessary software to work with PASL, or it has been installed for you. In either case, make sure that your PATH variable makes the software visible. For setting up your PATH variable on the andrew.cmu domain, see below.

10.3.1. Specific set up for the andrew.cmu domain

We have installed much of the needed software on andrew.cmu.edu. So you need to go through a relatively minimal set up.

First, set up your PATH variable to refer to the right directories. Using the C shell (csh):

setenv PATH  /opt/rh/devtoolset-3/root/usr/bin:/usr/lib64/qt-3.3/bin:/usr/lib64/ccache:/usr/local/bin:/bin:/usr/bin:./

The part added to the default PATH on andrew is

/opt/rh/devtoolset-3/root/usr/bin

It is important that this is at the beginning of the PATH variable. To make interaction easier, we also added the relative path ./ to the PATH variable.

10.3.2. Fetch the benchmarking tools (pbench)

We are going to use two command-line tools to help us to run experiments and to analyze the data. These tools are part of a library that we developed, which is named pbench. The pbench sources are available via github.

$ cd /home/pasl
$ git clone https://github.com/deepsea-inria/pbench.git

The tarball of the sources can be downloaded from the github page.

10.3.3. Build the tools

The following command builds the tools, namely prun and pplot. The former handles the collection of data and the latter the human-readable output (e.g., plots, tables, etc.).

$ make -C /home/pasl/pbench/

Make sure that the build succeeded by checking the pbench directory for the files prun and pplot. If these files do not appear, then the build failed.

10.3.4. Create aliases

We recommend creating the following aliases.

$ alias prun '/home/pasl/pbench/prun'
$ alias pplot '/home/pasl/pbench/pplot'

It will be convenient for you to make these aliases persistent, so that next time you log in, the aliases will be set. Add the commands above to your shell configuration file.

10.3.5. Visualizer Tool

When we are tuning our parallel algorithms, it can be helpful to visualize their processor utilization over time, just in case there are patterns that help to assign blame to certain regions of code. Later, we are going to use the utilization visualizer that comes packaged along with PASL. To build the tool, run the following make command.

$ make -C /home/pasl/pasl/tools/pview pview

Let us create an alias for the tool.

$ alias pview '/home/pasl/pasl/tools/pview/pview'

We recommend that you make this alias persistent by putting it into your shell configuration file (as you did above for the pbench tools).

10.4. Using the Makefile

PASL comes equipped with a Makefile that can generate several different kinds of executables. These different kinds of executables and how they can be generated are described below for a benchmark program pgm.

  • baseline: build the baseline with command make pgm.baseline

  • elision: build the sequential elision with command make pgm.elision

  • optimized: build the optimized binary with command make pgm.opt

  • log: build the log binary with command make pgm.log

  • debug: build the debug binary with the command make pgm.dbg

To speed up the build process, add to the make command the option -j (e.g., make -j pgm.opt). This option enables make to parallelize the build process. Note that, if the build fails, the error messages that are printed to the terminal may be somewhat garbled. As such, it is better to use -j only after the debugging process is complete.

10.5. Task 1: Run the baseline Fibonacci

We are going to start our experimentation with three different instances of the same program, namely bench. This program serves as a "driver" for the benchmarks that we have implemented. These implementations are good parallel codes that we expect to deliver good performance. We first build the baseline version.

$ cd /home/pasl/pasl/minicourse
$ make bench.baseline
Warning
The command-line examples that we show here assume that you have . in your $PATH. If not, you may need to prefix command-line calls to binaries with ./ (e.g., ./bench.baseline).

The file extension .baseline means that every benchmark in the binary uses the sequential-baseline version of the specified algorithm.

We can now run the baseline for one of our benchmarks, say Fibonacci by using the -bench argument to specify the benchmark and the -n argument to specify the input value for the Fibonacci function.

$ bench.baseline -bench fib -n 39

On our machine, the output of this run is the following.

exectime 0.556
utilization 1.0000
result 63245986

The three lines above provide useful information about the run.

  • The exectime indicates the wall-clock time in seconds that is taken by the benchmark. In general, this time measures only the time taken by the benchmark under consideration. It does not include the time taken to generate the input data, for example.

  • The utilization relates to the utilization of the processors available to the program. In the present case, for a single-processor run, the utilization is by definition 100%. We will return to this measure soon.

  • The result field reports a value computed by the benchmark. In this case, the value is the $39^{th}$ Fibonacci number.

10.6. Task 2: Run the sequential elision of Fibonacci

The .elision extension means that parallel algorithms (not sequential baseline algorithms) are compiled. However, all instances of fork2() are erased as described in an earlier chapter.

$ make bench.elision
$ bench.elision -bench fib -n 39

The run time of the sequential elision in this case is similar to the run time of the sequential baseline because the two are similar codes. However, for most other algorithms, the baseline will typically be at least a little faster.

exectime 0.553
utilization 1.0000
result 63245986

10.7. Task 3: Run parallel Fibonacci

The .opt extension means that the program is compiled with full support for parallel execution. Unless specified otherwise, however, the parallel binary uses just one processor.

$ make bench.opt
$ bench.opt -bench fib -n 39

The output of this program is similar to the output of the previous two programs.

exectime 0.553
utilization 1.0000
result 63245986

Because our machine has 40 processors, we can run the same application using all available processors. Before running this command, please adjust the -proc option to match the number of cores that your machine has. Note that you can use any number of cores up to the number you have available. You can use nproc or lscpu to determine the number of cores your machine has.

$ bench.opt -bench fib -n 39 -proc 40

We see from the output of the 40-processor run that our program ran faster than the sequential runs. Moreover, the utilization field tells us that approximately 86% of the total time spent by the 40 processors was spent performing useful work, not idling.

exectime 0.019
utilization 0.8659
result 63245986
Warning
PASL allows the user to select the number of processors by the -proc key. The maximum value for this key is the number of processors that are available on the machine. PASL raises an error if the programmer asks for more processors than are available.

10.8. Measuring performance with "speedup"

We may ask at this point: What is the improvement that we just observed from the parallel run of our program? One common way to answer this question is to measure the "speedup".

Definition: $P$-processor speedup

The speedup on $P$ processors is the ratio $T_B/T_P$, where the term $T_B$ represents the run time of the sequential baseline program and the term $T_P$ the time measured for the $P$-processor run.

Important
The importance of selecting a good baseline

Note that speedup is defined with respect to a baseline program. How exactly should this baseline program be chosen? One option is to take the sequential elision as a baseline. The speedup curve with such a baseline can be helpful in determining the scalability of a parallel algorithm, but it can also be misleading, especially if speedups are taken as an indicator of good performance, which they are not, because they are only relative to a specific baseline. For speedups to be a valid indication of good performance, they must be calculated against an optimized implementation of the best serial algorithm (for the same problem).

The speedup at a given number of processors is a good starting point on the way to evaluating the scalability of the implementation of a parallel algorithm. The next step typically involves considering speedups taken from varying numbers of processors available to the program. The data collected from such a speedup experiment yields a speedup curve, which is a curve that plots the trend of the speedup as the number of processors increases. The shape of the speedup curve provides valuable clues for performance and possibly for tuning: a flattening curve suggests lack of parallelism; a curve that arcs up and then downward suggests that processors may be wasting time by accessing a shared resource in an inefficient manner (e.g., false sharing); a speedup curve with a constant slope indicates at least some scaling.

Example 17. Speedup for our run of Fibonacci on 40 processors

The speedup $T_B/T_{40}$ equals $0.556/0.019 = 29.26$x. Although not linear (i.e., 40x), this speedup is decent considering factors such as: the capabilities of our machine; the overheads relating to parallelism; and the small size of the problem compared to the computing power that our machine offers.

10.8.1. Generate a speedup plot

Let us see what a speedup curve can tell us about our parallel Fibonacci program. We need to first get some data. The following command performs a sequence of runs of the Fibonacci program for varying numbers of processors. You can now run the command yourself.

$ prun speedup -baseline "bench.baseline" -parallel "bench.opt -proc 1,10,20,30,40" -bench fib -n 39

Here is another example on a 24-core machine.

$ prun speedup -baseline "bench.baseline" -parallel "bench.opt -proc 1,4,8,16,24" -bench fib -n 39

Run the following command to generate the speedup plot.

$ pplot speedup

If successful, the command generates a file named plots.pdf. The output should look something like the speedup plot shown below.

Starting to generate 1 charts.
Produced file plots.pdf.
Figure 5. Speedup curve for the computation of the 39th Fibonacci number.

The plot shows that our Fibonacci application scales well, up to about twenty processors. As expected, at twenty processors, the curve dips downward somewhat. We know that the problem size is the primary factor leading to this dip. How much does the problem size matter? The speedup plot in the Figure below shows the trend clearly. As the problem size grows, the speedup improves, until at the calculation of the $45^{th}$ Fibonacci number the speedup curve is close to linear.

Figure 6. Speedup plot showing speedup curves at different problem sizes.
Note
The prun and pplot tools have many more features than those demonstrated here. For details, see the documentation provided with the tools in the file named README.md.
Warning
Noise in experiments

The run time that a given parallel program takes to solve the same problem can vary noticeably because of certain effects that are not under our control, such as OS scheduling, cache effects, paging, etc. We can treat such variation in our experiments as random noise. Noise can be a problem for us because it can lead us to draw incorrect conclusions when, say, comparing the performance of two algorithms that perform roughly the same. To deal with randomness, we can perform multiple runs for each data point that we want to measure and consider the mean over these runs. The prun tool enables taking multiple runs via the -runs argument. Moreover, the pplot tool by default shows mean values for any given set of runs and optionally shows error bars. The documentation for these tools gives more detail on how to use the statistics-related features.
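For example, assuming the same benchmark as above, a speedup experiment that averages over five runs per data point might be launched along the following lines (see the pbench README for the authoritative syntax).

$ prun speedup -baseline "bench.baseline" -parallel "bench.opt -proc 1,10,20,30,40" -bench fib -n 39 -runs 5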

10.8.2. Superlinear speedup

Suppose that, on our 40-processor machine, the speedup that we observe is larger than 40x. It might sound improbable or even impossible. But it can happen. Ordinary circumstances should preclude such a superlinear speedup, because, after all, we have only forty processors helping to speed up the computation. Superlinear speedups often indicate that the sequential baseline program is suboptimal. This situation is easy to check: just compare its run time with that of the sequential elision. If the sequential elision is faster, then the baseline is suboptimal. Other factors can cause superlinear speedup: sometimes parallel programs running on multiple processors with private caches benefit from the larger cache capacity. These issues are, however, outside the scope of this course. As a rule of thumb, superlinear speedups should be regarded with suspicion and the cause should be investigated.

10.9. Visualize processor utilization

The 29x speedup that we just calculated for our Fibonacci benchmark was a little disappointing, and the 86% processor utilization of the run leaves 14% room for improvement. We should be suspicious that, although seemingly large, the problem size that we chose, that is, $n = 39$, was probably a little too small to yield enough work to keep all the processors well fed. To put this hunch to the test, let us examine the utilization of the processors in our system. We need to first build a binary that collects and outputs logging data.

$ make bench.log

We run the program with the new binary in the same fashion as before.

$ bench.log -bench fib -proc 40 -n 39

The output looks something like the following.

exectime 0.019
launch_duration 0.019
utilization     0.8639
thread_send     205
thread_exec     4258
thread_alloc    2838
utilization 0.8639
result 63245986

We need to explain what the new fields mean.

  • The thread_send field tells us that 205 threads were exchanged between processors for the purpose of load balancing;

  • the thread_exec field that 4258 threads were executed by the scheduler;

  • the thread_alloc field that 2838 threads were freshly allocated.

Each of these fields can be useful for tracking down inefficiencies. The number of freshly allocated threads can be a strong indicator because in C++ thread allocation costs can sometimes add up to a significant total. In the present case, however, none of the new values shown above are highly suspicious, considering that they are all at most in the thousands.

Since we have not yet found the problem, let us look at the visualization of the processor utilization using our pview tool. To get the necessary logging data, we need to run our program again, this time passing the argument --pview.

$ bench.log -bench fib -n 39 -proc 40 --pview

When the run completes, a binary log file named LOG_BIN should be generated in the current directory. Every time we run with --pview this binary file is overwritten. To see the visualization of the log data, we call the visualizer tool from the same directory.

$ pview

The output we see on our 40-processor machine is shown in the Figure below. The window shows one bar per processor. Time goes from left to right. Idle time is represented by red and time spent busy with work by grey. You can zoom in on any part of the plot by clicking on the region with the mouse. To reset to the original plot, press the space bar. From the visualization, we can see that most of the time, particularly in the middle, all of the processors keep busy. However, there is a lot of idle time at the beginning and end of the run. This pattern suggests that there just is not enough parallelism in the early and late stages of our Fibonacci computation.

Figure 7. Utilization plot for computation of 39th Fibonacci number.

10.10. Strong versus weak scaling

We are pretty sure that our Fibonacci program is not scaling as well as it could. But poor scaling on one particular input for $n$ does not necessarily mean there is a problem with the scalability of our parallel Fibonacci program in general. What is important is to know more precisely what it is that we want our Fibonacci program to achieve. To this end, let us consider a distinction that is important in high-performance computing: the distinction between strong and weak scaling. So far, we have been studying the strong-scaling profile of the computation of the $39^{th}$ Fibonacci number. In general, strong scaling concerns how the run time varies with the number of processors for a fixed problem size. Sometimes strong scaling is either too ambitious, owing to hardware limitations, or not necessary, because the programmer is happy to live with a looser notion of scaling, namely weak scaling. In weak scaling, the programmer considers a fixed-size problem per processor. We are going to consider something similar to weak scaling. In the Figure below, we have a plot showing how processor utilization varies with the input size. The situation dramatically improves from 12% idle time for the $39^{th}$ Fibonacci number down to 5% idle time for the $41^{st}$ and finally to 1% for the $45^{th}$. At just 1% idle time, the utilization is excellent.

Figure 8. How processor utilization of Fibonacci computation varies with input size.

The scenario that we just observed is typical of multicore systems. For computations that perform relatively little work, such as the computation of the $39^{th}$ Fibonacci number, properties that are specific to the hardware, OS, and PASL load-balancing algorithm can noticeably limit processor utilization. For computations that perform lots of highly parallel work, such limitations are barely noticeable, because processors spend most of their time performing useful work. Let us return to the largest Fibonacci instance that we considered, namely the computation of the $45^{th}$ Fibonacci number, and consider its utilization plot.

$ bench.log -bench fib -n 45 -proc 40 --pview
$ pview

The utilization plot is shown in the Figure below. Compared to the utilization plot we saw in the Figure above for n=39, the red regions are much less prominent overall and the idle regions at the beginning and end are barely noticeable.

Figure 9. Utilization plot for computation of 45th Fibonacci number.

10.11. Chapter Summary

We have seen in this chapter how to build, run, and evaluate our parallel programs. Concepts that we have seen, such as speedup curves, are going to be useful for evaluating the scalability of our future solutions. Strong scaling is the gold standard for a parallel implementation. But as we have seen, weak scaling is a more realistic target in most cases.

11. Chapter: Work efficiency

In many cases, a parallel algorithm which solves a given problem performs more work than the fastest sequential algorithm that solves the same problem. This extra work deserves careful consideration for several reasons. First, since it performs additional work with respect to the serial algorithm, a parallel algorithm will generally require more resources such as time and energy. By using more processors, it may be possible to reduce the time penalty, but only by using more hardware resources.

For example, if an algorithm performs an $O(\log{n})$-factor more work than the serial algorithm, then, assuming that the constant factor hidden by the asymptotic notation is $1$, when $n = 2^{20}$, it will perform $20$ times more actual work, consuming $20$ times more energy than the serial algorithm. Assuming perfect scaling, we can reduce the time penalty by using more processors. For example, with $40$ processors, the algorithm may require half the time of the serial algorithm.

Sometimes, a parallel algorithm has the same asymptotic complexity as the best serial algorithm for the problem but has larger constant factors. This is common because scheduling friction, especially the cost of creating threads, can be significant. In addition to friction, parallel algorithms can incur more communication overhead than serial algorithms because data and processors may be placed far apart in hardware. For example, it is not unusual for a parallel algorithm to incur a $10-100\times$ overhead over a similar serial algorithm because of scheduling friction and communication.

These considerations motivate considering the "work efficiency" of parallel algorithms. Work efficiency is a measure of the extra work performed by a parallel algorithm with respect to the serial algorithm. We define two types of work efficiency: asymptotic work efficiency and observed work efficiency. The former relates the asymptotic performance of a parallel algorithm to that of the fastest sequential algorithm. The latter relates the running time of a parallel algorithm to that of the fastest sequential algorithm.

Definition: asymptotic work efficiency

An algorithm is asymptotically work efficient if the work of the algorithm is the same as the work of the best known serial algorithm.

Example 18. Asymptotic work efficiency
  • A parallel algorithm that comparison-sorts $n$ keys in span $O(\log^3 n)$ and work $O(n \log n)$ is asymptotically work efficient because its work matches that of the best known sequential comparison-based sorting algorithm. A parallel sorting algorithm that performs $O(n\log^2{n})$ work, however, is not asymptotically work efficient.

  • The parallel array increment algorithm that we consider in an earlier Chapter is asymptotically work efficient, because it performs linear work, which is optimal (any sequential algorithm must perform at least linear work).

To assess the practical effectiveness of a parallel algorithm, we define observed work efficiency, parameterized by a value $r$.

Definition: observed work efficiency

A parallel algorithm that runs in time $T_1$ on a single processor has an observed work efficiency factor of $r$ if $r = \frac{T_1}{T_{seq}}$, where $T_{seq}$ is the time taken by the fastest known sequential algorithm.

Example 19. Observed work efficiency
  • A parallel algorithm that runs $10\times$ slower on a single processor than the fastest sequential algorithm has an observed work efficiency factor of $10$. We consider such algorithms unacceptable, as they are too slow and wasteful.

  • A parallel algorithm that runs $1.1\times$ to $1.2\times$ slower on a single processor than the fastest sequential algorithm has an observed work efficiency factor of roughly $1.2$. We consider such algorithms to be acceptable.

Example 20. Observed work efficiency of parallel increment

To obtain this measure, we first run the baseline version of our parallel-increment algorithm.

$ bench.baseline -bench map_incr -n 100000000
exectime 0.884
utilization 1.0000
result 2

We then run the parallel algorithm, which is the same exact code as map_incr_rec. We build this code by using the special optfp "force parallel" file extension. This special file extension forces parallelism to be exposed all the way down to the base cases. Later, we will see how to use this special binary mode for other purposes.

$ make bench.optfp
$ bench.optfp -bench map_incr -n 100000000
exectime 45.967
utilization 1.0000
result 2

Our algorithm has an observed work efficiency factor of roughly $52\times$ ($45.967 / 0.884 \approx 52$). Such poor observed work efficiency suggests that the parallel algorithm would require well over an order of magnitude more energy and that it would not run faster than the serial algorithm even when using fewer than $52$ processors.

In practice, observed work efficiency is a major concern: the whole effort of parallel computing is wasted if parallel algorithms consistently require more work than the best sequential algorithms. In other words, in parallel computing, both asymptotic complexity and constant factors matter.

Based on these discussions, we define a good parallel algorithm as follows.

Definition: good parallel algorithm

We say that a parallel algorithm is good if it has the following three characteristics:

  1. it is asymptotically work efficient;

  2. it is observably work efficient;

  3. it has low span.

11.1. Improving work efficiency with granularity control

It is common for a parallel algorithm to be asymptotically and/or observably work inefficient. It is often possible to improve work efficiency, however, by observing that the overheads of parallelism grow with the amount of parallelism exposed; work efficiency can therefore be improved by limiting parallelism to what the hardware can actually exploit.

For example, a parallel algorithm that performs linear work and has logarithmic span has average parallelism on the order of tens of thousands even for a small input size of one million (roughly $10^6 / \log_2(10^6) \approx 50{,}000$). For such a small problem size, we usually would not need to employ thousands of processors. It would be sufficient to limit the parallelism so as to feed tens of processors and, as a result, reduce the impact of excess parallelism on work efficiency.

In many parallel algorithms such as the algorithms based on divide-and-conquer, there is a simple way to achieve this goal: switch from parallel to sequential algorithm when the problem size falls below a certain threshold. This technique is sometimes called coarsening or granularity control.

But which code should we switch to? One idea is simply to switch to the sequential elision, which is always available in PASL. If, however, the parallel algorithm is asymptotically work inefficient, this would be ineffective; in such cases, we can specify a separate sequential algorithm for small instances.

Optimizing the practical efficiency of a parallel algorithm by controlling its parallelism is sometimes called optimization, sometimes it is called performance engineering, and sometimes performance tuning or simply tuning. In the rest of this document, we use the term "tuning."

Example 21. Tuning the parallel array-increment function

We can apply coarsening to the map_incr_rec code by switching to the sequential algorithm when the input size falls below an established threshold.

long threshold = Some Number;

void map_incr_rec(const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  if (n <= threshold) {
    for (long i = lo; i < hi; i++)
      dest[i] = source[i] + 1;
  } else {
    long mid = (lo + hi) / 2;
    fork2([&] {
      map_incr_rec(source, dest, lo, mid);
    }, [&] {
      map_incr_rec(source, dest, mid, hi);
    });
  }
}
Note
Even in sequential algorithms, it is not uncommon to revert to a different algorithm for small instances of the problem. For example, it is well known that insertion sort is faster than other sorting algorithms for very small inputs containing 30 keys or fewer. Many optimized sorting algorithms therefore revert to insertion sort when the input size falls within that range.
Example 22. Observed work efficiency of tuned array increment

As can be seen below, after some tuning, the map_incr_rec program becomes highly work efficient. In fact, there is barely a difference between the serial and the parallel runs. The tuning is actually done automatically here by using the automatic-granularity-control technique described in the next chapter.

$ bench.baseline -bench map_incr -n 100000000
exectime 0.884
utilization 1.0000
result 2
$ bench.opt -bench map_incr -n 100000000
exectime 0.895
utilization 1.0000
result 2

In this case, we have $r = \frac{T_1}{T_{seq}} = \frac{0.895}{0.884} = 1.012$ observed work efficiency. Our parallel program on a single processor is one percent slower than the sequential baseline. Such work efficiency is excellent.

11.2. Determining the threshold

The basic idea behind coarsening or granularity control is to revert to a fast serial algorithm when the input size falls below a certain threshold. To determine the optimal threshold, we can simply perform a search by running the code with different threshold settings.

While this approach can help find the right threshold on the particular machine on which we performed the search, there is no guarantee that the same threshold will work on another machine. In fact, there are examples in the literature that show that such optimizations are not portable, i.e., a piece of code optimized for a particular architecture may behave poorly on another.

In the general case, determining the right threshold is even more difficult. To see the difficulty consider a generic (polymorphic), higher-order function such as map that takes a sequence and a function and applies the function to the sequence. The problem is that the threshold depends both on the type of the elements of the sequence and the function to be mapped over the sequence. For example, if each element itself is a sequence (the sequence is nested), the threshold can be relatively small. If, however, the elements are integers, then the threshold will likely need to be relatively large. This makes it difficult to determine the threshold because it depends on arguments that are unknown at compile time. Essentially the same argument applies to the function being mapped over the sequence: if the function is expensive, then the threshold can be relatively small, but otherwise it will need to be relatively large.

As we describe in the next chapter, it is sometimes possible to determine the threshold completely automatically.

12. Chapter: Automatic granularity control

There has been significant research into determining the right threshold for a particular algorithm. This problem, known as the granularity-control problem, turns out to be a rather difficult one, especially if we wish to find a technique that can ensure close-to-optimal performance across different architectures. In this section, we present a technique for automatically controlling granularity by using asymptotic cost functions.

12.1. Complexity functions

Our automatic granularity-control technique requires assistance from the application programmer: for each parallel region of the program, the programmer must annotate the region with a complexity function, which is simply a C++ function that returns the asymptotic work cost of running that region of code.

Example 23. Complexity function of map_incr_rec

An application of our map_incr_rec function on a given range $[lo, hi)$ of an array has work cost $cn = c (hi - lo)$ for some constant $c$. As such, the following lambda expression is one valid complexity function for our parallel map_incr_rec program.

auto map_incr_rec_complexity_fct = [&] (long lo, long hi) {
  return hi - lo;
};

In general, the value returned by the complexity function need only be precise with respect to the asymptotic complexity class of the associated computation. The following lambda expression is another valid complexity function for function map_incr_rec. The complexity function above is preferable, however, because it is simpler.

const long k = 123;
auto map_incr_rec_complexity_fct2 = [&] (long lo, long hi) {
  return k * (hi - lo);
};

More generally, suppose that we know that a given algorithm has work cost of $W = n + \log n$. Although it would be fine to assign to this algorithm exactly $W$, we could just as well assign to the algorithm the cost $n$, because the second term is dominated by the first. In other words, when expressing work costs, we only need to be precise up to the asymptotic complexity class of the work.
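As a further illustration, a hypothetical parallel mergesort operating on the range $[lo, hi)$ might use a complexity function along the following lines; the name is ours and the function is only a sketch.

#include <algorithm>
#include <cmath>

// Hypothetical complexity function for a parallel mergesort on [lo, hi):
// the work is Theta(n log n), and precision up to the asymptotic
// complexity class is all that is required.
auto mergesort_complexity_fct = [] (long lo, long hi) {
  long n = std::max(hi - lo, 2L);
  return (long) (n * std::log2((double) n));
};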

12.2. Controlled statements

In PASL, a controlled statement, or cstmt, is an annotation in the program text that activates automatic granularity control for a specified region of code. In particular, a controlled statement behaves as a C++ statement that has the special ability to choose on the fly whether or not the computation rooted at the body of the statement spawns parallel threads. To support such automatic granularity control, PASL uses a prediction algorithm to map the asymptotic work cost (as returned by the complexity function) to actual processor cycles. When the predicted number of processor cycles for a particular instance of the controlled statement falls below a threshold (determined automatically for the specific machine), that instance is sequentialized by turning off the ability to spawn parallel threads for its execution. If the predicted processor-cycle count is above the threshold, then the statement instance is executed in parallel.

In other words, the reader can think of a controlled statement as a statement that executes in parallel when the benefits of parallel execution outweigh its cost, and that otherwise executes sequentially, much as the sequential elision of its body would. We note that while this sequential execution is similar to a sequential elision, it is not exactly the same, because every call to fork2 must still check whether it should create parallel threads or run sequentially. Thus the execution may differ from the sequential elision in terms of performance, but not in terms of behavior or semantics.
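The decision logic can be pictured roughly as follows. This is a conceptual sketch only, not PASL's actual implementation; predict_cycles, run_sequentially, and kappa are hypothetical names standing in for the run-time machinery.

// Conceptual sketch; the names predict_cycles, run_sequentially, and kappa
// are hypothetical and do not appear in the PASL interface.
template <class Complexity_fct, class Body>
void cstmt_model(controller_type& contr,
                 const Complexity_fct& complexity, const Body& body) {
  long work = complexity();                    // asymptotic work estimate
  double cycles = contr.predict_cycles(work);  // hypothetical: estimated actual cost
  if (cycles < kappa)                          // kappa: machine-specific threshold
    run_sequentially(body);                    // hypothetical: fork2 calls do not spawn
  else
    body();                                    // fork2 calls may spawn parallel threads
}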

Example 24. Array-increment function with automatic granularity control

The code below uses a controlled statement to automatically select, at run time, the threshold size for our parallel array-increment function.

controller_type map_incr_rec_contr("map_incr_rec");

void map_incr_rec(const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  cstmt(map_incr_rec_contr, [&] { return n; }, [&] {
    if (n == 0) {
      // do nothing
    } else if (n == 1) {
      dest[lo] = source[lo] + 1;
    } else {
      long mid = (lo + hi) / 2;
      fork2([&] {
        map_incr_rec(source, dest, lo, mid);
      }, [&] {
        map_incr_rec(source, dest, mid, hi);
      });
    }
  });
}

The controlled statement takes three arguments, whose requirements are specified below, and returns nothing (i.e., void). The effectiveness of the granularity controller may be compromised if any of the requirements are not met.

  • The first argument is a reference to the controller object. The controller object is used by the controlled statement to collect profiling data from the program as the program runs. Every controller object is initialized with a string label (in the code above "map_incr_rec"). The label must be unique to the particular controller. Moreover, the controller must be declared as a global variable.

  • The second argument is the complexity function. The type of the return value should be long.

  • The third argument is the body of the controlled statement. The return type of the controlled statement should be void.

When the controlled statement chooses sequential evaluation for its body, the effect is similar to what happens in the code above when the input size falls below the threshold: the body and the recursion tree rooted at it are sequentialized. When the controlled statement chooses parallel evaluation, the calls to fork2() create parallel threads.

12.2.1. Granularity control with alternative sequential bodies

It is not unusual for a divide-and-conquer algorithm to switch to a different algorithm at the leaves of its recursion tree. For example, sorting algorithms, such as quicksort, may switch to insertion sort at small problem sizes. In the same way, it is not unusual for parallel algorithms to switch to different sequential algorithms for handling small problem sizes. Such switching can be beneficial especially when the parallel algorithm is not asymptotically work efficient.

To provide such algorithmic switching, PASL provides an alternative form of controlled statement that accepts a fourth argument: the alternative sequential body. This alternative form behaves essentially the same way as the original described above, except that when the PASL run time decides to sequentialize a particular instance of the controlled statement, it falls through to the provided alternative sequential body instead of the "sequential elision."
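For example, a parallel quicksort might fall back to the standard library sort when an instance is sequentialized. The following sketch is ours, written in the style of the controlled statements above; the controller name, the complexity estimate, and the partitioning code are illustrative assumptions, not code from the PASL examples.

#include <algorithm>
#include <cmath>

controller_type quicksort_contr("quicksort");

void quicksort(long* xs, long lo, long hi) {
  long n = hi - lo;
  auto complexity = [&] {
    // expected work of quicksort is n log n
    return (long)(n * std::log2((double)std::max(n, 2L)));
  };
  cstmt(quicksort_contr, complexity, [&] {
    if (n <= 1)
      return;
    long pivot = xs[lo + n/2];
    // three-way partition around the pivot
    long* m1 = std::partition(xs + lo, xs + hi, [&] (long x) { return x < pivot; });
    long* m2 = std::partition(m1, xs + hi, [&] (long x) { return x == pivot; });
    fork2([&] { quicksort(xs, lo, m1 - xs); },
          [&] { quicksort(xs, m2 - xs, hi); });
  }, [&] {
    // alternative sequential body: switch to a serial algorithm
    std::sort(xs + lo, xs + hi);
  });
}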

Example 25. Array-increment function with automatic granularity control and sequential body
controller_type map_incr_rec_contr("map_incr_rec");

void map_incr_rec(const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  cstmt(map_incr_rec_contr, [&] { return n; }, [&] {
    if (n == 0) {
      // do nothing
    } else if (n == 1) {
      dest[lo] = source[lo] + 1;
    } else {
      long mid = (lo + hi) / 2;
      fork2([&] {
        map_incr_rec(source, dest, lo, mid);
      }, [&] {
        map_incr_rec(source, dest, mid, hi);
      });
    }
  }, [&] {
    for (long i = lo; i < hi; i++)
      dest[i] = source[i] + 1;
  });
}

Even though the parallel and sequential array-increment algorithms are algorithmically identical, except for the calls to fork2(), there is still an advantage to using the alternative sequential body: the sequential code does not pay the parallelism overheads due to fork2(). Even when eliding fork2(), the run-time system has to perform a conditional branch to check whether the context of the fork2() call is parallel or sequential. Because the cost of these conditional branches adds up, the version with the sequential body is going to be more work efficient. Another reason why a sequential body may be more efficient is that it can be written more simply, for example by using a for-loop instead of recursion, which will be faster in practice.

Note
Recommended style for programming with controlled statements

In general, we recommend that the code of the parallel body be written so as to be completely self-contained, at least in the sense that the parallel body contains the logic necessary to handle the recursion all the way down to the base cases. The code for map_incr_rec follows this style: the parallel body handles the cases where n is zero or one (the base cases) as well as the case where n is greater than one (the recursive case). Put differently, it should be the case that, if the parallelism-specific annotations (including the alternative sequential body) are erased, the resulting program is a correct program.

We recommend this style because such parallel codes can be debugged, verified, and tuned, in isolation, without relying on alternative sequential codes.
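Concretely, erasing the parallelism annotations (the controller, the cstmt wrapper, the calls to fork2, and the alternative sequential body) from the map_incr_rec code above leaves the following ordinary sequential program, which can be tested and debugged on its own; the _elided suffix in the name is ours.

void map_incr_rec_elided(const long* source, long* dest, long lo, long hi) {
  long n = hi - lo;
  if (n == 0) {
    // do nothing
  } else if (n == 1) {
    dest[lo] = source[lo] + 1;
  } else {
    long mid = (lo + hi) / 2;
    map_incr_rec_elided(source, dest, lo, mid);
    map_incr_rec_elided(source, dest, mid, hi);
  }
}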

12.3. Controlled parallel-for loops

Let us add one more component to our granularity-control toolkit: the parallel-for loop. By using this loop construct, we can avoid expressing recursion trees explicitly over and over again. For example, the following function performs the same computation as the example function we defined in the first lecture, but it is much more compact and readable. Moreover, this code takes advantage of our automatic granularity control, which can replace the parallel-for with a serial for loop when parallelism is not worthwhile.

loop_controller_type map_incr_contr("map_incr");

void map_incr(const long* source, long* dest, long n) {
  parallel_for(map_incr_contr, (long)0, n, [&] (long i) {
    dest[i] = source[i] + 1;
  });
}

Underneath, the parallel-for loop uses a divide-and-conquer routine whose structure is similar to that of our map_incr_rec. Because the parallel-for loop generates a recursion tree of logarithmic height, the map_incr routine just above has the same span as the map_incr routine that we defined earlier: $\log n$, where $n$ is the size of the input array. A sketch of such a routine appears below.
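The following sketch shows one way such a parallel-for could be built from the primitives introduced earlier; it is our own illustration, and PASL's actual parallel_for (which takes a loop_controller_type) may differ in its details.

// Sketch only: a range-splitting parallel-for built from cstmt and fork2.
// The complexity function assumes constant work per iteration.
template <class Body>
void parallel_for_sketch(controller_type& contr, long lo, long hi,
                         const Body& body) {
  long n = hi - lo;
  cstmt(contr, [&] { return n; }, [&] {
    if (n == 1) {
      body(lo);
    } else if (n > 1) {
      long mid = (lo + hi) / 2;
      fork2([&] { parallel_for_sketch(contr, lo, mid, body); },
            [&] { parallel_for_sketch(contr, mid, hi, body); });
    }
  }, [&] {
    // sequentialized instance: plain serial loop over the range
    for (long i = lo; i < hi; i++)
      body(i);
  });
}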

Notice that the map_incr code above specifies no complexity function. The reason is that this particular instance of the parallel-for loop implicitly defines one: the implicit complexity function reports a linear-time cost for any given range of the iteration space of the loop. In other words, it assumes that the body of the loop performs a constant amount of work per iteration. Of course, this assumption does not hold in general. If we want to specify the complexity function explicitly, we can use the form shown in the example below, where the complexity function is passed to the parallel-for loop as the second argument. The complexity function takes as arguments the bounds of the range $[lo, hi)$. In this case, the complexity is linear in the number of iterations, so the function simply returns the number of iterations.

loop_controller_type map_incr_contr("map_incr");

void map_incr(const long* source, long* dest, long n) {
  auto linear_complexity_fct = [] (long lo, long hi) {
    return hi-lo;
  };
  parallel_for(map_incr_contr, linear_complexity_fct, (long)0, n, [&] (long i) {
    dest[i] = source[i] + 1;
  });
}

The following code snippet shows a more interesting case for the complexity function. Here we multiply a dense $n \times n$ matrix by a dense vector of length $n$. The outer loop iterates over the rows of the matrix. The complexity function assigns to each row-wise iteration a cost proportional to the length of the row, namely $n$.

loop_controller_type dmdvmult_contr("dmdvmult");

// mtx: nxn dense matrix, vec: length n dense vector
// dest: length n dense vector
void dmdvmult(double* mtx, double* vec, double* dest, long n) {
  auto compl_fct = [&] (long lo, long hi) {
    return (hi-lo)*n;
  };
  parallel_for(dmdvmult_contr, compl_fct, (long)0, n, [&] (long i) {
    // ddotprod (defined elsewhere) computes the dot product of row i of mtx
    // with vec and stores the result in dest[i]
    ddotprod(mtx, vec, dest, n, i);
  });
}
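The helper ddotprod is not defined in this excerpt. A plausible definition consistent with the call above is the following; the signature is our assumption, not taken from the PASL code base.

// Assumed helper: compute the dot product of row i of the n x n matrix mtx
// with vec, storing the result in dest[i].
void ddotprod(const double* mtx, const double* vec, double* dest, long n, long i) {
  double sum = 0.0;
  for (long j = 0; j < n; j++)
    sum += mtx[i*n + j] * vec[j];
  dest[i] = sum;
}
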
Example 26. Speedup for matrix multiply

Matrix multiplication has been widely used as an example for parallel computing since the early days of the field. There are good reasons for this. First, matrix multiplication is a key operation that can be used to solve many interesting problems. Second, it is an expensive computation whose work grows with the cube of the matrix dimension, so it can become very costly even with modest inputs.

Fortunately, matrix multiplication can be parallelized relatively easily, in the same style as the code shown above. The figure below shows the speedup for a sample run of this code. Observe that the speedup is rather good, achieving close to perfect utilization.

Figure 10. Speedup plot for matrix multiplication for $25000 \times 25000$ matrices.

While parallel matrix multiplication delivers excellent speedups, this is not the case for many other algorithms on modern multicore machines, where computations can quickly become limited by the available memory bandwidth. Matrix multiplication does not suffer as much from memory-bandwidth limitations because it performs a significant amount of work per memory operation: it touches $\Theta(n^2)$ memory cells while performing $\Theta(n^3)$ arithmetic operations, so each cell is reused on the order of $n$ times.