v1.0 2016-01

Author: Umut A. Acar

1. Chapter: Multithreading, Parallelism, and Concurrency

The term multithreading refers to computing with multiple threads of control where all threads share the same memory. Once created, a thread performs a computation by executing a sequence of instructions, as specified by the program, until it terminates. A multithreaded computation starts by executing a main thread, which is the thread at which the execution starts. A thread can create or spawn another thread and synchronize with other threads by using a variety of synchronization constructs such as locks, mutexes, synchronization variables, and semaphores.

1.1. DAG Representation

A multithreaded computation can be represented by a dag (Directed Acyclic Graph) of vertices. The figure below shows an example multithreaded computation and its dag. Each vertex represents the execution of an instruction, such as an addition, a multiplication, a memory operation, a (thread) spawn operation, or a synchronization operation. A vertex representing a spawn operation has outdegree two. A synchronization operation waits for an operation belonging to a thread to complete, and thus a vertex representing a synchronization operation has indegree at least two. Recall that a dag represents a partial order. Thus the dag of the computation represents the partial ordering of the dependencies between the instructions in the computation. Perhaps the simplest multithreaded program is a sequential program, which can be represented as a chain of (totally ordered) vertices.

Note
Multithreaded computing can be viewed as a natural generalization of sequential computing in the following sense: in sequential computing, a computation can be defined as a totally ordered set of instructions, where the ordering corresponds to the sequential execution order. In contrast, in multithreaded computing, a computation can be viewed as a partially ordered set of instructions (as specified by the dag), where the instructions may be executed in any order compatible with the specified partial order.
A multithreaded computation.
Figure 1. A multithreaded computation.

Throughout this book, we make two assumptions about the structure of the dag:

  1. Each vertex has outdegree at most two.

  2. The dag has exactly one root vertex with indegree zero and one final vertex with outdegree zero. The root is the first instruction of the root thread.

The outdegree assumption follows naturally from the fact that each vertex represents an instruction, which can create at most one thread.

1.2. Cost Model: Work and Span

For analyzing the efficiency and performance of multithreaded programs, we use several cost measures; the most important ones are work and span. We define the work of a computation as the number of vertices in the dag and the span as the length of the longest path in the dag. In the example dag above, the work is $15$ and the span is $9$.

1.3. Execution and Scheduling

The execution of a multithreaded computation executes the vertices in the dag of the computation in some order that is consistent with the partial order specified by the dag; that is, if vertices $u,v$ are ordered in the dag, then the execution orders them in the same way.

Multithreaded programs are executed by using a scheduler that assigns vertices of the dag to processes.

Definition: Execution Schedule

Given a dag $G$, an (execution) schedule for $G$ is a function from processes and (time) steps to instructions of the dag such that

  1. if a vertex $u$ is ordered before another vertex $v$ in $G$, then $u$ is executed before $v$, and

  2. each vertex in $G$ is executed exactly once.

The length of a schedule is the number of steps in the schedule.

The first condition ensures that a schedule observes the dependencies in the dag. Specifically, for each arc $(u,v)$ in the dag, the vertex $u$ is executed before vertex $v$.

For any step in the execution, we call a vertex ready if all the ancestors of the vertex in the dag are executed prior to that step. Similarly, we say that a thread is ready if it contains a ready vertex. Note that a thread can contain only one ready vertex at any time.

Example 1. Schedule

An example schedule with 3 processes. The length of this schedule is $10$.

Time Step   Process 1   Process 2   Process 3
1           M1
2           M2
3           M3          A1
4                       A2
5           B1          A3
6           B2          A4
7           B3          M4
8           A5          M5
9           A6
10          M6

Fact: Scheduling Invariant

Consider a computation dag $G$ and consider an execution using any scheduling algorithm. At any time during the execution, color the vertices that are executed as blue and the others as red.

  1. The blue vertices induce a blue sub-dag of $G$ that is connected and that has the same root as $G$.

  2. The red vertices induce a red sub-dag of $G$ that is connected.

  3. All the vertices of $G$ are in either the blue or the red sub-dag. In other words, the blue and red vertices partition the dag into two sub-dags.

1.4. Scheduling Lower Bounds

Theorem: Lower bounds

Consider any multithreaded computation with work $W$ and span $S$ and $P$ processes. The following lower bounds hold.

  1. Every execution schedule has length at least $\frac{W}{P}$.

  2. Every execution schedule has length at least $S$.

The first lower bound follows from the simple observation that a schedule can execute at most $P$ instructions per step. Since all vertices must be executed, a schedule has length at least $\frac{W}{P}$. The second lower bound follows from the observation that the schedule cannot execute a vertex before its ancestors, and thus the length of the schedule is at least as large as any path in the dag, which can be as long as the span $S$.
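As a quick check against the example dag and schedule above, which have work $W = 15$, span $S = 9$, and $P = 3$ processes, the two lower bounds give

\[ \max\left( \frac{15}{3},\; 9 \right) = \max(5, 9) = 9, \]

so no schedule can be shorter than $9$ steps, and the 10-step schedule shown earlier is within one step of the best possible.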

1.5. Offline Scheduling

Having established a lower bound, we now move on to establish an upper bound for the offline scheduling problem, where we are given a dag and wish to find an execution schedule that minimizes the run time. It is known that the related decision problem is NP-complete but that a 2-approximation is relatively easy. We shall consider two distinct schedulers: the level-by-level scheduler and the greedy scheduler.

A level-by-level schedule is a schedule that executes the instructions in a given dag in level order, where the level of a vertex is the length of the longest path from the root of the dag to that vertex. More specifically, the vertices in level 0 are executed first, followed by the vertices in level 1, and so on.

Theorem: Offline Level-by-Level Schedule

For a dag with work $W$ and span $S$, the length of a level-by-level schedule is at most $\lfloor \frac{W}{P} \rfloor + S$.

Proof

Let $W_i$ denote the work of the instructions at level $i$. These instructions can be executed in $\lceil \frac{W_i}{P} \rceil$ steps. Thus the total time is

\[ \sum_{i=1}^{S}{\left\lceil \frac{W_i}{P} \right\rceil} \le \sum_{i=1}^{S}{\left( \left\lfloor \frac{W_i}{P} \right\rfloor + 1 \right)} \le \left\lfloor \frac{W}{P} \right\rfloor + S \]

This theorem, called Brent’s theorem, was proved by Brent in 1974. It shows that the lower bound can be approximated within a factor of $2$.

Brent’s theorem has later been generalized to all greedy schedules. A greedy schedule is a schedule that never leaves a process idle unless there are no ready vertices. In other words, greedy schedules keep processes as busy as possible by greedily assigning ready vertices.

Theorem: Offline Greedy Schedule

Consider a dag $G$ with work $W$ and span $S$. Any greedy $P$-process schedule has length at most $\frac{W}{P} + S \cdot \frac{P-1}{P}$.

Proof

Consider any greedy schedule with length $T$.

For each step $1 \le i \le T$, and for each process that is scheduled at that step, collect a token. The token goes to the work bucket if the process executes a vertex in that step, otherwise the process is idle and the token goes to an idle bucket.

Since each token in the work bucket corresponds to an executed vertex, there are exactly $W$ tokens in that bucket.

We will now bound the number of tokens in the idle bucket by $S \cdot (P-1)$. Observe that at any step in the execution schedule, there is a ready vertex to be executed (because otherwise the execution is complete). This means that at each step, at most $P-1$ processes can contribute to the idle bucket. Furthermore, at each step where there is at least one idle process, we know that the number of ready vertices is less than the number of available processes, and thus the greedy schedule executes all of the ready vertices at that step.

Note now that at such a step, all the ready vertices have no incoming edges in the red sub-dag consisting of the vertices that are not yet executed, and all the vertices that have no incoming edges in the red sub-dag are ready. Thus executing all the ready vertices at the step reduces by one the length of every path that originates at one of these vertices and ends at the final vertex. This means that the span of the red sub-dag is reduced by one, because every path with length equal to the span must originate at a ready vertex.

Since the red sub-dag is initially equal to the dag $G$, its span is $S$, and thus there are at most $S$ steps at which some process is idle. As a result, the total number of tokens in the idle bucket is at most $S \cdot (P-1)$.

Since exactly $P$ tokens are collected at each step, the total number of tokens is $P$ times the length of the schedule, and the bound follows.
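Spelling out this last step: if $T$ is the length of the schedule and $I$ is the total number of idle tokens, then counting tokens per step gives

\[ P \cdot T = W + I \le W + S \cdot (P-1), \]

and dividing by $P$ yields $T \le \frac{W}{P} + S \cdot \frac{P-1}{P}$, as claimed.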

Exercise

Show that the bounds for Brent’s level-by-level scheduler and for any greedy scheduler are within a factor of $2$ of optimal.

1.6. Online Scheduling

In offline scheduling, we are given a dag and are interested in finding a schedule with minimal length. When executing a multithreaded program, however, we don’t have full knowledge of the dag. Instead, the dag unfolds as we run the program. Furthermore, we are interested not only in minimizing the length of the schedule but also the work and time it takes to compute the schedule. These two additional conditions define the online scheduling problem.

An online scheduler, or simply a scheduler, is an algorithm that solves the online scheduling problem by mapping threads to available processes. For example, if only one processor is available, a scheduler can map all threads to that one processor. If two processors are available, then the scheduler can divide the threads between the two processors as evenly as possible in an attempt to keep the two processors as busy as possible by load balancing, which involves migrating pieces of work between processors so as to minimize idle time.

There are many different online-scheduling algorithms but these algorithms all operate similarly. We can outline a typical scheduling algorithm as follows.

Typical Online Scheduling Algorithm

The algorithm maintains a work pool, consisting of ready threads, and executes them. Execution starts with the root thread in the pool. It ends when the final vertex is executed. The algorithm executes a thread without pre-emption, i.e., until the thread terminates or blocks to synchronize with other threads.

To obtain work, a process removes a thread from the pool and executes its ready vertex. We refer to the thread executed by a process as its assigned thread. When executed, the ready vertex can make the next vertex of the thread ready, which then also gets executed, and so on, until one of the following synchronization actions occurs.

  1. Die: The process executes the last vertex of the thread, causing the thread to die. The process then obtains other work.

  2. Block: The assigned vertex executes but the next vertex does not become ready. This blocks the thread and thus the process obtains other work.

  3. Enable: The assigned vertex makes ready the continuation of the vertex and unblocks another previously blocked thread by making a vertex from that thread ready. In this case, the process inserts one of the two ready threads (either one) into the work pool and continues to execute the other.

  4. Spawn: The assigned vertex spawns another thread. As in the previous case, the process inserts one thread into the work pool and continues to execute the other.

These actions are not mutually exclusive. For example, a thread may spawn/enable a thread and die. In this case, the process performs the corresponding steps for each action.

Exercise: Scheduling Invariant

Convince yourself that the scheduling invariant holds in online scheduling.

For a given schedule generated by an online scheduling algorithm, we can define a tree of vertices, which tells us, for each vertex, the vertex that enabled it.

Definition: Enabling Tree

Consider the execution of a dag. If the execution of a vertex $u$ enables another vertex $v$, then we call the edge $(u,v)$ an enabling edge and we call $u$ the enabling parent of $v$. For simplicity, we simply use the term parent instead of enabling parent.

Note that any vertex other than the root vertex has one enabling parent. Thus the subgraph induced by the enabling edges is a rooted tree that we call the enabling tree.

Example 2. Scheduler with a global thread queue.

We can give a simple greedy scheduler by using a queue of threads. At the start of the execution, the scheduler places the root thread into the queue and then repeats the following step until the queue becomes empty: for each idle process, take the thread at the front of the queue and assign it to the process; let each process run for one step; if, at the end of the step, there are new ready threads, then insert them at the tail of the queue.

The centralized scheduler with the global thread queue is a greedy scheduler that generates a greedy schedule under the assumption that the queue operations take zero time and that the dag is given. This algorithm, however, does not work well for online scheduling, where the operations on the queue take time. In fact, since the thread queue is global, the algorithm can only insert and remove one thread at a time. For this reason, centralized schedulers do not scale beyond a handful of processors.
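To make the bottleneck concrete, here is a much simplified sketch of such a centralized pool. It is written with std::thread, std::mutex, and std::function rather than the thread model of the text, and it runs each task to completion instead of stepping threads one instruction at a time; the names Task, enqueue, and workerLoop are hypothetical. The point is that every insertion and removal goes through the single queueLock, which serializes the scheduler itself.

#include <atomic>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Task = std::function<void()>;

std::queue<Task> centralQueue;     // the single, global queue of ready tasks
std::mutex queueLock;              // serializes every insert and remove
std::atomic<int> outstanding(0);   // tasks enqueued but not yet completed
std::atomic<long> sum(0);          // example shared result

void enqueue(Task t) {
  outstanding++;
  std::lock_guard<std::mutex> guard(queueLock);
  centralQueue.push(std::move(t));
}

void workerLoop() {
  while (outstanding.load() > 0) {  // stop once all enqueued tasks are done
    Task t;
    {
      std::lock_guard<std::mutex> guard(queueLock);
      if (centralQueue.empty()) continue;   // idle step: no ready task
      t = std::move(centralQueue.front());
      centralQueue.pop();
    }
    t();            // execute the assigned task
    outstanding--;
  }
}

int main() {
  for (long i = 0; i < 8; i++) {
    enqueue([i] { sum += i; });     // a task could itself call enqueue to add work
  }
  std::vector<std::thread> workers;
  for (int p = 0; p < 3; p++) workers.emplace_back(workerLoop);
  for (auto &w : workers) w.join();
  std::cout << "sum = " << sum.load() << std::endl;   // prints sum = 28
  return 0;
}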

Definition: Scheduling friction.

No matter how efficient a scheduler is, there is a real cost to creating threads, inserting them into and deleting them from queues, and performing load balancing. We refer to these costs cumulatively as scheduling friction, or simply as friction.

There has been much research on the problem of reducing friction in scheduling. This research shows that distributed scheduling algorithms can work quite well. In a distributed algorithm, each processor has its own queue and primarily operates on its own queue. A load-balancing technique is then used to balance the load among the existing processors by redistributing threads, usually on an as-needed basis. This strategy ensures that processors can operate in parallel to obtain work from their queues.

A specific kind of distributed scheduling technique that can lead to schedules that are close to optimal is work stealing. In a work-stealing scheduler, processors work on their own queues as long as there is work in them, and if not, they "steal" work from other processors by removing the thread at the tail end of the victim’s queue. It has been proven that randomized work-stealing algorithms, where idle processors randomly select processors to steal from, deliver close-to-optimal schedules in expectation (in fact, with high probability) and furthermore incur minimal friction. Randomized work-stealing schedulers can also be implemented efficiently in practice. PASL uses a scheduling algorithm that is based on work stealing. We consider work stealing in greater detail in a future chapter.

1.7. Writing Multithreaded Programs

Multithreaded programs can be written using a variety of language abstractions and interfaces. In this section, we briefly outline one of the most widely used interfaces, the POSIX Threads, or Pthreads, interface, which specifies a programming interface for the C language as part of the IEEE POSIX 1003.1c standard. Pthreads provide a rich interface that enables the programmer to create multiple threads of control that can synchronize by using nearly the whole range of the synchronization facilities mentioned above. There are many other threading libraries, all of which provide similar facilities.

Hello world with Pthreads

An example Pthreads program is shown below. The main thread (executing function main) creates 8 child threads and terminates. Each child in turn runs the function helloWorld and immediately terminates. Since the main thread does not wait for the children to terminate, it may terminate before the children do, depending on how threads are scheduled on the available processors.

#include <iostream>
#include <cstdlib>
#include <pthread.h>

using namespace std;

#define NTHREADS 8

void *helloWorld(void *threadid)
{
   long tid;
   tid = (long)threadid;
   cout << "Hello world! It is me, 00" << tid << endl;
   pthread_exit(NULL);
}

int main ()
{
   pthread_t threads[NTHREADS];
   int error;
   int i;
   for( i=0; i < NTHREADS; i++ ){
      cout << "main: creating thread 00" << i << endl;
      error = pthread_create(&threads[i], NULL, helloWorld, (void *) i);
      if (error) {
         cout << "Error: unable to create thread," << error << endl;
         exit(-1);
      }
   }
   pthread_exit(NULL);
}

When executed, this program may print the following.

main: creating thread 000
main: creating thread 001
main: creating thread 002
main: creating thread 003
main: creating thread 004
main: creating thread 005
main: creating thread 006
main: creating thread 007
Hello world! It is me, 000
Hello world! It is me, 001
Hello world! It is me, 002
Hello world! It is me, 003
Hello world! It is me, 004
Hello world! It is me, 005
Hello world! It is me, 006
Hello world! It is me, 007

But that would be unlikely; a more likely output would look like this:

main: creating thread 000
main: creating thread 001
main: creating thread 002
main: creating thread 003
main: creating thread 004
main: creating thread 005
main: creating thread 006
main: creating thread 007
Hello world! It is me, 000
Hello world! It is me, 001
Hello world! It is me, 006
Hello world! It is me, 003
Hello world! It is me, 002
Hello world! It is me, 005
Hello world! It is me, 004
Hello world! It is me, 007

Or it may look like this:

main: creating thread 000
main: creating thread 001
main: creating thread 002
main: creating thread 003
Hello world! It is me, 000
Hello world! It is me, 001
Hello world! It is me, 003
Hello world! It is me, 002
main: creating thread 004
main: creating thread 005
main: creating thread 006
main: creating thread 007
Hello world! It is me, 006
Hello world! It is me, 005
Hello world! It is me, 004
Hello world! It is me, 007

The Pthreads library provides a rich interface for synchronizing between threads, e.g.,

  1. a thread $t_1$ can join with another thread $t_2$ by blocking its execution until $t_2$ terminates (see the sketch after this list),

  2. threads can synchronize via mutex variables, e.g., a thread can lock a mutex, which, if already locked, causes the thread to block,

  3. threads can synchronize via condition variables, which are closely related to locks.
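For instance, the hello-world example above could wait for its children by joining each one before main returns. The following is a minimal sketch under that assumption; error handling is elided for brevity.

#include <iostream>
#include <pthread.h>

#define NTHREADS 8

void *helloWorld(void *threadid)
{
   long tid = (long) threadid;
   std::cout << "Hello world! It is me, 00" << tid << std::endl;
   pthread_exit(NULL);
}

int main ()
{
   pthread_t threads[NTHREADS];
   for (long i = 0; i < NTHREADS; i++) {
      // Error handling elided; pthread_create returns nonzero on failure.
      pthread_create(&threads[i], NULL, helloWorld, (void *) i);
   }
   // Join: block until each child terminates before main returns.
   for (long i = 0; i < NTHREADS; i++) {
      pthread_join(threads[i], NULL);
   }
   return 0;
}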

2. Critical Sections and Mutual Exclusion

In a multithreaded program, a critical section is a part of the program that may not be executed by more than one thread at the same time. Critical sections typically contain code that alters shared objects, such as shared (e.g., global) variables. This means that a critical section requires mutual exclusion: only one thread can be inside the critical section at any time.

Since only one thread can be inside a critical section at a time, threads must coordinate to make sure that they don’t enter the critical section at the same time. If threads do not coordinate and multiple threads enter the critical section at the same time, we say that a race condition occurs, because the outcome of the program depends on the relative timing of the threads, and thus can vary from one execution to another. Race conditions are sometimes benign but usually not so, because they can lead to incorrect behavior. Spectacular examples of the effects of race conditions include the "Northeast Blackout" of 2003, which affected 45 million people in the US and 10 million people in Canada.

It can be extremely difficult to find a race condition, because of the non-determinacy of execution. A race condition may lead to an incorrect behavior only a tiny fraction of the time, making it extremely difficult to observe and reproduce. For example, the software fault that led to the Northeast blackout took software engineers "weeks of poring through millions of lines of code and data to find it" according to one of the companies involved.

The problem of designing algorithms or protocols for ensuring mutual exclusion is called the mutual exclusion problem or the critical section problem. There are many ways of solving instances of the mutual exclusion problem, but broadly we can distinguish two categories: spin locks and blocking locks. The idea in spin locks is to busy-wait until the critical section is clear of other threads. Solutions based on blocking locks are similar, except that instead of busy-waiting, a thread simply blocks, or stops executing. When the critical section is clear, a blocked thread receives a signal that allows it to proceed. The term mutex, short for "mutual exclusion", is sometimes used to refer to a lock.

Mutual exclusion problems have been studied extensively in the context of several areas of computer science.

  1. In operating systems research, processes, which like threads are independent threads of control, belonging usually but not always to different programs, can share certain system resources. To enable such sharing safely and efficiently, researchers have proposed various forms of locks such as semaphores, which accept both busy-waiting and blocking semantics. Another class of locks, called condition variables, enables blocking synchronization by conditioning on the value of a variable.

  2. In the area of concurrency, researchers have designed and implemented special concurrent data structures that ensure mutual exclusion without requiring locks or blocking by using special synchronization operations, called read-modify-write operations, provided by the hardware.

In these notes, we will mostly focus on the second approach.

2.1. Synchronization Hardware

Since mutual exclusion is a common problem in computer science, many hardware systems provide specific synchronization operations that can help solve instances of the problem. These operations may allow, for example, testing the contents of a (machine) word and then modifying it, perhaps by swapping it with another word. Such operations are sometimes called atomic read-modify-write, or RMW for short, operations.

A handful of different RMW operations have been proposed. They include operations such as load-link/store-conditional, fetch-and-add, and compare-and-swap. They typically take a memory location x and a value v, and replace the value stored at x with f(x,v). For example, the fetch-and-add operation takes the location x and an increment amount, and atomically increments the value at that location by the specified amount, i.e., f(x,v) = *x + v.
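As a small illustration (not part of the text above), C++ exposes fetch-and-add directly on std::atomic through the fetch_add member function:

#include <atomic>
#include <iostream>

int main() {
  std::atomic<long> x(10);
  // fetch_add atomically computes f(x,v) = *x + v, stores it, and returns the old value.
  long old = x.fetch_add(5);
  std::cout << "old = " << old << "; new = " << x.load() << std::endl;  // old = 10; new = 15
  return 0;
}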

The compare-and-swap operation takes the location x and a pair of values (a,b), and stores b into x if the value stored at x is a, i.e., f(x,(a,b)) = if *x = a then b else *x; the operation returns a Boolean indicating whether it successfully stored a new value in x. Compare-and-swap is a reasonably powerful synchronization operation: it can be used by arbitrarily many threads to agree (reach consensus) on a value. This instruction is therefore frequently provided by modern parallel architectures such as Intel’s X86.

In C++, the atomic class can be used to perform synchronization. Objects of this type are guaranteed to be free of race conditions; in fact, in C++, they are the only objects that are guaranteed to be free from race conditions. The contents of an atomic object can be accessed by load operations, updated by store operations, and also updated by the compare_exchange_weak and compare_exchange_strong operations, which implement the compare-and-swap operation.

Example 3. Accessing the contents of atomic memory cells

Access to the contents of any given cell is achieved by the load() and store() methods.

std::atomic<bool> flag;

flag.store(false);
std::cout << flag.load() << std::endl;
flag.store(true);
std::cout << flag.load() << std::endl;

Output:

0
1

The key operation that helps avoid race conditions is the compare-and-exchange operation.

Definition: compare and exchange

When executed with a target atomic object, an expected cell, and a new value new, this operation performs the following steps, atomically:

  1. Read the contents of target.

  2. If the contents equal the contents of expected, then write new into target and return true.

  3. Otherwise, return false after updating the contents of the cell expected with the contents of target (the exchange).

Example 4. Reading and writing atomic objects
std::atomic<bool> flag;

flag.store(false);
bool expected = false;
bool was_successful = flag.compare_exchange_strong(expected, true);
std::cout << "was_successful = " << was_successful
                                << "; flag = " << flag.load()
                                        << "; expected = " << expected
                                        << std::endl;

bool expected2 = false;
bool was_successful2 = flag.compare_exchange_strong(expected2, true);
std::cout << "was_successful2 = " << was_successful2
                                        << "; flag = " <<       flag.load()
                                        << "; expected = " << expected2
                                        << std::endl;

Output:

was_successful = 1; flag = 1; expected = 0
was_successful2 = 0; flag = 1; expected2 = 1
Example 5. Spin Locks with Compare-And-Exchange

We can implement a spin lock using compare-and-exchange. Spin locks can ensure exclusive access to a shared resource such as a memory cell or a file. For example, suppose that multiple threads want to print to the standard IO. We could write such a program as follows.

#include <iostream>
#include <cstdlib>
#include <pthread.h>
#include <atomic>

using namespace std;

#define NTHREADS 8

std::atomic<bool> BigLock;

struct args {
  int tid;
};

void *hello(void *a)
{
  args* myargs = (args*) a;
  int tid = myargs->tid;

  cout << "Hello world! It is me, Thread " << tid << endl;

  pthread_exit(NULL);
}

int main ()
{
  pthread_t threads[NTHREADS];

  for(int i=0; i < NTHREADS; i++ ){
    args* a = new args;
    a->tid = i;

    cout << "main: creating thread 00" << i << endl;

    int error = pthread_create(&threads[i], NULL, hello, (void *) a);
    if (error) {
      cout << "Error: unable to create thread," << error << endl;
      exit(-1);
    }
  }
  pthread_exit(NULL);
}

Since the standard IO is a shared resource, multiple threads can write to it at the same time, resulting in garbled output. The example run below uses just two processors.

bash-3.2$ g++ -std=c++11  biglock-driver.cpp
bash-3.2$ ./a.out
main: creating thread 000
main: creating thread 001
main: creating thread 002
Hello world! It imsHHa eeimllnell:,oo    cTwwrhooerrraelltaddid!!n   g0II
ttt  hiirsse  ammdee ,,0  0TT3hh
rreeaadmdH a e1i2l
n
l:o  cwroeraltdi!n gI tt hirse amde ,0 0T4h
read 3
Hmealilno:  wcorreladt!i nIgt  tihsr emaed,  0T0h5r
ead 4
Hello world! It is me, Thread 5
main: creating thread 006
main: creating thread 007
Hello world! It is me, Thread 6
Hello world! It is me, Thread 7
bash-3.2$

To ensure that each thread gets exclusive access to the standard IO, we can use a spin lock: each thread waits for the other threads to exit the critical section in which they access the standard IO. To wait, a thread simply "spins" while checking the lock. Before entering its critical section, each thread takes the lock, and upon exiting it releases the lock.

std::atomic<bool> BigLock;


/* Take BigLock */
void takeBigLock () {
  while (1) {
    bool flag = false;
    if (BigLock.compare_exchange_strong(flag, true)) {
      return;
    }
  }
}

/* Release BigLock */
void releaseBigLock () {
  while (1) {
    bool flag = true;
    if (BigLock.compare_exchange_strong(flag, false)) {
      return;
    }
  }
}

void *hello(void *a)
{
  args* myargs = (args*) a;
  int tid = myargs->tid;

  takeBigLock ();
  cout << "Hello world! It is me, Thread " << tid << endl;
  releaseBigLock ();

  pthread_exit(NULL);
}

int main ()
{
  pthread_t threads[NTHREADS];
  BigLock.store(false);

  for(int i=0; i < NTHREADS; i++ ){
    args* a = new args;
    a->tid = i;

    takeBigLock ();
    cout << "main: creating thread 00" << i << endl;
    releaseBigLock ();
    int error = pthread_create(&threads[i], NULL, hello, (void *) a);
    if (error) {
      cout << "Error: unable to create thread," << error << endl;
      exit(-1);
    }
  }
  pthread_exit(NULL);
}

Here is an example run of the program.


bash-3.2$ g++ -std=c++11  biglock-driver.cpp
bash-3.2$ ./a.out
main: creating thread 000
main: creating thread 001
main: creating thread 002
Hello world! It is me, Thread 0
main: creating thread 003
Hello world! It is me, Thread 1
main: creating thread 004
Hello world! It is me, Thread 3
Hello world! It is me, Thread 2
main: creating thread 005
Hello world! It is me, Thread 4
main: creating thread 006
Hello world! It is me, Thread 5
main: creating thread 007
Hello world! It is me, Thread 6
Hello world! It is me, Thread 7

Note that the threads do not necessarily run in order, even on two cores, but their messages are printed correctly, without being garbled.

2.2. Nonblocking Concurrent Data Structures

In some cases, we have multiple threads operating on a shared data structure such that the updates of one thread become visible to the others. We refer to such data structures as concurrent data structures. For example, we may have multiple threads using a stack or a queue data structure to send and receive items between each other.

The crux of the problem of designing a concurrent data structure is ensuring correctness without penalizing performance. Indeed, if we don’t care about performance, it is trivial to implement a concurrent data structure by using a lock for the data structure: any thread that wants to use the data structure takes the lock, uses the data structure, and releases the lock afterwards. In some cases, this is all we can do; e.g., the accesses to the standard IO in our prior example cannot be handled differently without using a different, perhaps finer-grained, interface for IO. But in many cases, there is a lot that can be done to ensure good performance.
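To make the coarse-grained approach concrete, here is a minimal sketch of a lock-based integer stack, written with std::mutex and std::stack (which are not otherwise used in the text). Every operation takes the same lock, so it is correct but serializes all accesses.

#include <mutex>
#include <stack>

// A coarse-grained locked stack: one lock guards every operation.
class LockedStack {
public:
  void push (int v) {
    std::lock_guard<std::mutex> guard(lock);
    items.push(v);
  }

  // Returns false if the stack is empty; otherwise pops the top into v.
  bool pop (int &v) {
    std::lock_guard<std::mutex> guard(lock);
    if (items.empty()) {
      return false;
    }
    v = items.top();
    items.pop();
    return true;
  }

private:
  std::stack<int> items;
  std::mutex lock;
};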

One of the most commonly used techniques is to use non-blocking atomic read-modify-write operations, such as compare-and-exchange, to implement concurrent data structures. In this approach, threads never block on a lock, so a thread cannot prevent another thread from making progress by taking a lock and holding it; hence the name "non-blocking". A non-blocking data structure is called lock-free if system-wide progress can be guaranteed, even as threads are suspended. A non-blocking data structure is called wait-free if per-thread progress can also be guaranteed.

2.2.1. A non-blocking stack data structure

Suppose that we wish to implement a concurrent stack data structure that can be shared by multiple threads. The code below shows a standard serial "integer" stack that can hold integer values in its nodes.

#include <iostream>
#include <cstdlib>

using namespace std;

class Node {
public:
  int value;
  Node* next;

  Node (int v) {
    value = v;
    next = NULL;
  }

  Node (int v, Node* u) {
    value = v;
    next = u;
  }
};

class Stack {
public:
  Node* top;

  Stack () {
    top = NULL;
  };

  int pop ();
  void push (int v);
};

int Stack::pop () {
  if (top == NULL) {
    cout << "Stack is Empty" << endl;
    return -12;
  }
  else {
    int oldTop = top->value;
    top = top->next;
    return oldTop;
  }
}

void Stack::push (int value) {
  top = new Node (value,top);
  return ;
}

Allowing such a data structure to be used by multiple threads leads to race conditions, and thus accesses must be protected by enforcing mutual exclusion. An example where multiple threads operate on a shared stack is shown below.

#include <iostream>
#include <cstdlib>
#include <pthread.h>
#include <unistd.h>
#include <atomic>
#include "nb-stack.h"

using namespace std;

#define NTHREADS 8
#define NPUSHPOP 2

struct args {
  int tid;
  Stack* s;
};


void *pushPop(void *a)
{
  args* myargs = (args*) a;
  int tid = myargs->tid;
  Stack* sharedStack = myargs->s;

  cout << "Hello world! It is me, 00" << tid << endl;

  for (int i = 0; i < NPUSHPOP; ++i) {
    int j = NPUSHPOP * tid + i;
    sharedStack->push (NPUSHPOP * tid + i);
    sleep (1);
    int k = sharedStack->pop ();
    sleep (1);
  }

  pthread_exit(NULL);
}

int main ()
{
  pthread_t threads[NTHREADS];
  Stack* sharedStack = new Stack ();

  int rc;
  for(int i=0; i < NTHREADS; i++ ){
    args* a = new args;
    a->tid = i;
    a->s = sharedStack;

    int error = pthread_create(&threads[i], NULL, pushPop, (void *) a);
    if (error) {
      cout << "Error: unable to create thread," << error << endl;
      exit(-1);
    }
  }
  pthread_exit(NULL);
}

The easiest way to ensure mutual exclusion is to use a lock to guard accesses to the shared data structure, as in our Standard IO example discussed earlier. Such a lock, however, serializes all accesses to the data structure, turning the data structure into a bottleneck. Instead, we would like to allow the threads to operate concurrently but without breaking the correctness of the stack.

We can use the compare-and-exchange operation that we saw earlier to achieve this. The idea is to identify the parts of the code that require mutually exclusive access to certain data and make sure that such data are updated atomically. The code below shows an implementation of stacks that attempts to guarantee mutual exclusion. In function push, we first set up the new node, u, that we want to add by using the current value of top. We then update top, but only if it has not changed since we read it, by using compare_exchange_strong. Similarly, in function pop, we read the current value of top into a variable oldTop and then compute the new value for top, next. We then replace top with next by using compare_exchange_strong, but only if top has not changed since we last read it.

#include <iostream>
#include <cstdlib>
#include <atomic>

using namespace std;

class Node {
public:
  int value;
  Node* next;

  Node (int v) {
    value = v;
    next = NULL;
  }

};

class Stack {
public:
  std::atomic<Node*> top;

  Stack () {
    top = NULL;
  };

  int pop ();
  void push (int v);
};

int Stack::pop () {
  while (1) {

    /* take picture */
    Node* oldTop = top.load();

    if (oldTop == NULL) {
      cout << "Stack is Empty" << endl;
      return -12;
    }

    int oldTopValue = oldTop->value;

    /* local change */
    Node* next = oldTop->next;

    /* compare and commit */
    if (top.compare_exchange_strong(oldTop,next)) {
      return oldTopValue;
    }
  }
}

void Stack::push(const int value)
{
  Node *u = new Node(value);
  while (1) {
    u->next = top.load();
    if (top.compare_exchange_strong(u->next, u)) {
      break;
    }
  }
}

The example above illustrates a typical use of the compare-and-swap operation. Can we prove our code to be correct? It might appear so, but actually this is not the case. This data structure is in fact incorrect, due to a problem called the "ABA problem."

2.2.2. ABA problem

While reasonably powerful, the compare-and-swap operation suffers from the so-called ABA problem. To see this, consider the following scenario, where a shared variable top is updated by multiple threads in parallel.

  1. Thread $T_1$ reads top and stores its current value, say node M, in oldTop. It then continues to compute next as the next node in the stack, let’s say N. Before $T_1$ can complete, however, another thread steps in.

  2. Immediately after this, another thread $T_2$ reads top, performs a pop operation to remove node M, and pushes a new node O onto the stack. At this point, the stack looks like O::N, with the top of the stack pointing to node O, which points to node N. Thread $T_2$ then pushes M back onto the stack, so the stack now looks like M::O::N.

  3. Thread $T_1$ now steps in to continue its pop operation by attempting to run compare_exchange_strong. The operation succeeds because node M is at the top of the stack, which matches the oldTop of $T_1$. The pop operation thus completes, setting top to next, which is N. This is, however, incorrect, because node O has disappeared from the stack.

In summary, compare-and-swap was not able to detect the change in the stack structure, because it relies only on a simple "shallow" notion of equality. The fact that top and oldTop have the same node as their values does not mean that the stack has not been modified!

This problem is called the ABA problem, because it involves cycling the atomic memory cell between three values: $A$, $B$, and again $A$. The ABA problem is an important limitation of compare-and-swap: the operation itself is not atomic in the true sense of the word but imitates atomicity by relying on a shallow equality test. If it can be ensured that the shallow equality test on the subject memory cell suffices for correctness, the imitation is useful; if not, then the operation cannot be considered atomic.

The ABA problem can lead to seemingly correct implementations that are in fact incorrect. To reduce the chances of bugs due to the ABA problem, memory objects subject to compare-and-swap are usually tagged with an additional field that counts the number of updates. This solves the basic problem, but only up to a point, because the counter itself can also wrap around. The load-link/store-conditional operation, a.k.a. LL/SC, solves this problem in principle by performing the write only if the memory location has not been updated since the last read (load). Unfortunately, practical implementations of LL/SC that actually meet its specification are hard to come by.

Below is a refinement of our non-blocking stack code that mitigates the ABA problem. The main difference between the two pieces of code is that the refined version keeps a counter along with the top of the stack, counting the number of times the stack is popped. This makes it possible to determine whether the stack has been modified since it was read by a thread. The approach, however, does not fully solve the ABA problem, because the counter may overflow and wrap around. This is, however, considered highly unlikely, and this solution is used quite frequently in practice.

#include <iostream>
#include <cstdlib>
#include <atomic>

using namespace std;

class Node {
public:

  int value;
  Node* next;

  Node (int v) {
    value = v;
    next = NULL;
  }

  Node (int v, Node* u) {
    value = v;
    next = u;
  }
};

class Stack {
public:
  struct state_t {
    Node* top;
    int64_t nPops;
  };

  std::atomic<state_t> state;

  Stack () {
    state_t s;
    s.top = NULL;
    s.nPops = 0;
    state.store (s);
  };

  int pop ();
  void push (int v);
};


int Stack::pop () {
  state_t oldState = state.load();


    // ALTERNATE: the load above would suffice on its own,
    // because oldState is updated by compare_exchange_strong on failure.

  while (1) {

    // This won't be needed in the ALTERNATE
    oldState = state.load ();

    Node* oldTop = oldState.top;

    if (oldTop == NULL) {
      cout << "Stack is Empty" << endl;
      return -12;
    }

    int oldTopValue = oldTop->value;
    Node* next = oldTop->next;

    /* new state */
    state_t newState;
    newState.top = next;
    newState.nPops = oldState.nPops+1;

    if (state.compare_exchange_strong(oldState,newState)) {
      return oldTopValue;
    }
  }
}

void Stack::push(const int value)
{
  Node *u = new Node(value);

  while (1) {
    state_t oldState = state.load ();
    u->next = oldState.top;

    /* new state */
    state_t newState;
    newState.top = u;
    newState.nPops = oldState.nPops;
    if (state.compare_exchange_strong(oldState, newState)) {
      break;
    }
  }
}

2.3. Memory Management

Another difficulty in concurrent data structures, and more generally in multithreaded programs, is ensuring correct usage of dynamically allocated memory. The challenge is to make sure that there are no "dangling references" to objects that are manually freed. This can be difficult, because in a multithreaded program there can be many different references to an object, making it difficult to find the "safe points" at which an object may be freed.

As a concrete example, imagine a shared stack data structure and suppose that a thread is in the middle of a pop operation: it has read the top node M but has not completed the operation yet. Now some other thread comes along, pops the node M, and returns its value after freeing the node. At this point, there is a dangling reference to the freed node M, because the first thread is still holding a reference to it in its variable oldTop.
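The sketch below, which is not from the text, modifies the non-blocking pop from the previous section so that it frees the popped node; the comments mark exactly the hazard just described. The name unsafePop is hypothetical.

#include <atomic>

struct Node { int value; Node* next; };

std::atomic<Node*> top(nullptr);    // top of a shared stack, as before

bool unsafePop (int &result) {
  while (true) {
    Node* oldTop = top.load();      // another thread may hold the same pointer
    if (oldTop == nullptr) {
      return false;
    }
    Node* next = oldTop->next;      // may read freed memory if another thread
                                    // has already popped and deleted this node
    if (top.compare_exchange_strong(oldTop, next)) {
      result = oldTop->value;
      delete oldTop;                // other threads may still hold a dangling
                                    // reference to the node freed here
      return true;
    }
  }
}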

3. Chapter: Structured Multithreading

Multithreading interfaces such as Pthreads enable the programmer to create a wide variety of multithreaded computations that can be structured in many different ways. Multithreaded programs, however, can be extremely difficult to reason about, to get right (correct), and to make efficient, because doing so can require reasoning about exponentially many interleavings of code.

Consider a toy multithreaded program consisting of 16 threads, each of which runs a 100-line program. To reason about the correctness of the program, we might have to reason about at least $16^{100}$ different interleavings. This is a huge number, larger than the number of atoms in the known universe. All evidence suggests that human beings are not able to reason about such a large number of possibilities. Perhaps for this reason, multithreaded programs are known to be notoriously difficult to get right. Common problems include

  1. reasoning about the correctness of shared state and the absence of race conditions,

  2. reasoning about efficiency of concurrent data structures, and

  3. reasoning about correctness of memory management in non-GC’ed programming languages.

For example, over the past two decades researchers have developed many concurrent data structures, ranging from simple counters to more complex structures such as concurrent queues and stacks. Such data structures have proved to be very difficult to prove correct; in fact, many were found to be incorrect after publication. The asymptotic efficiency of realistic concurrent data structures has also turned out to be difficult to establish, with relatively few tight bounds.

Fortunately, large classes of interesting multithreaded computations can be written using a more structured approach, where threads are restricted in the way that they synchronize with other threads. One such interesting class of computations is fork-join computations, where a thread can spawn or "fork" another thread, or "join" with another existing thread. Joining a thread is the only mechanism through which threads synchronize. The figure below illustrates a fork-join computation. The main thread forks thread A, which then spawns thread B. Thread B then joins thread A, which then joins thread M.

A multithreaded, fork-join computation.
Figure 2. A multithreaded fork-join computation.

In addition to fork-join, there are other interfaces for structured multithreading such as async-finish, and futures. As in fork-join, these approaches provide the programmer with a simple interface for expressing concurrency by restricting the manner in which threads can synchronize.

One way to provide language support for structured multithreading is to simply use a more general threading library such as Pthreads. Although correct, this approach would be grossly inefficient, because structured multithreading can be implemented more efficiently. Indeed, structured multithreading has been adopted and implemented specifically in many programming languages: the Cilk language is primarily based on fork-join but also has some limited support for async-finish; the X10 language is primarily based on async-finish but also supports futures; Fork Join Java and Habanero Java extend the Java language with structured multithreading; the Haskell language provides support for fork-join and futures as well as others; and the Parallel ML language, as implemented by the Manticore project and by the Steel project at CMU, is primarily based on fork-join parallelism. Such languages are sometimes called implicitly parallel languages.

The class of computations that can be expressed as fork-join and async-finish programs is sometimes called nested parallel. The term "nested" refers to the fact that a parallel computation can be nested within another parallel computation. This is as opposed to flat parallelism, where a parallel computation can only perform sequential computations in parallel. Flat parallelism used to be a common technique but is becoming increasingly less prominent.

The class of computations that can be expressed with futures is sometimes called pipelined parallelism. In this course, we will not discuss futures further.

3.1. Parallelism versus concurrency

Structured multithreading offers important benefits both in terms of efficiency and expressiveness. Using programming constructs such as fork-join and futures, it is possible to write parallel programs such that the program accepts a "sequential semantics" but executes in parallel. The sequential semantics enables the programmer to treat the program as a serial program for the purposes of correctness. A run-time system then creates threads as necessary to execute the program in parallel. Since threads synchronize only in specific ways, the run-time system can optimize the creation and scheduling of threads so as to maximize efficiency. Structured multithreading thus offers in some ways the best of both worlds: the programmer can reason about correctness sequentially, but the program executes in parallel.

More precisely, consider a purely functional sequential language such as the untyped (pure) lambda calculus and its sequential dynamic semantics, specified as a strict, small-step transition relation. We can extend this language with structured multithreading by enriching its syntax with "fork-join" and "futures" constructs. We can now extend the dynamic semantics of the language in two ways: 1) trivially ignore these constructs and execute serially as usual, or 2) execute in parallel by creating parallel threads. With some care, we can establish that these two semantics are identical, i.e., they produce the same value for the same expressions. In other words, we can extend a rich programming language with fork-join and futures and still give the language a sequential semantics. This shows that structured multithreading is nothing but an efficiency and performance concern; it can be ignored from the perspective of correctness.

We use the term parallelism to refer to the idea of computing in parallel by using such structured multithreading constructs. As we shall see, we can write parallel algorithms for many interesting problems. Although parallel algorithms or applications constitute a large class, they don’t cover all applications. Specifically, applications that can be expressed by using richer forms of multithreading, such as the one offered by Pthreads, do not always accept a sequential semantics. In such concurrent applications, threads can communicate and coordinate in complex ways to accomplish the intended result. One such example is the multithreaded program that we considered for using a shared non-blocking stack. Another example, which is a classic concurrency example, is the "producer-consumer problem", where a consumer and a producer thread coordinate by using a fixed-size buffer of items. The producer fills the buffer with items and the consumer removes items from the buffer, and they coordinate to make sure that the buffer is never filled beyond its capacity. Operating-system-level processes sometimes communicate similarly in some concurrent applications.

In summary, parallelism is a property of the hardware or the software platform where the computation takes place, whereas concurrency is a property of the application. Pure parallelism can be ignored for the purposes of correctness; concurrency cannot be ignored for understanding the behavior of the program.

Parallelism and concurrency are orthogonal dimensions in the space of all applications. Some applications are concurrent, some are not. Many concurrent applications can benefit from parallelism. For example, a browser, which is a concurrent application itself, may use a parallel algorithm to perform certain tasks. On the other hand, there is usually no need to add concurrency to a parallel application, because this unnecessarily complicates software. It can, however, lead to improvements in efficiency.

The following quote from Dijkstra suggests pursuing the approach of making parallelism just a matter of execution (not one of semantics), which is the goal of much of the work on the development of programming languages today. Note that in this particular quote, Dijkstra does not mention that parallel algorithm design requires thinking carefully about work and span, as opposed to just work, as in sequential computing.

From the past terms such as "sequential programming" and "parallel programming" are still with us, and we should try to get rid of them, for they are a great source of confusion. They date from the period that it was the purpose of our programs to instruct our machines, now it is the purpose of the machines to execute our programs. Whether the machine does so sequentially, one thing at a time, or with considerable amount of concurrency, is a matter of implementation, and should not be regarded as a property of the programming language.
Selected Writings on Computing: A Personal Perspective (EWD 508)
— Edsger W. Dijkstra

3.2. Parallelism and Mutual Exclusion

In parallel programming, mutual exclusion problems do not have to arise. For example, if we program in a purely functional language extended with structured multithreading primitives such as fork-join and futures, programs remain purely functional and mutual-exclusion problems, and hence race conditions, do not arise. If we program in an imperative language, however, where memory is always a shared resource, even when it is not intended to be so, threads can easily share memory objects, even unintentionally, leading to race conditions.

Example 6. Writing to the same location in parallel.

In the code below, both branches of fork2 are writing into b. What should then the output of this program be?

long b = 0;

fork2([&] {
  b = 1;
}, [&] {
  b = 2;
});

std::cout << "b = " << std::endl;

At the time of the print, the contents of b are determined by the last write. Thus, depending on which of the two branches performs its write last, we can see both possibilities:

Output:

b = 1

Output:

b = 2
Example 7. Fibonacci

Consider the following alternative implementation of the Fibonacci function. By "inlining" the plus operation in both branches, the programmer got rid of the addition operation after the fork2.

long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      result += fib_par_racy(n-1);
    },[&] {
      result += fib_par_racy(n-2);
    });
  }
  return result;
}

This code is not correct because it has a race condition.

As the example shows, separate threads are updating the value result, and it might look like this is not a race condition, because the update consists of an addition operation, which reads the value and then writes to it. The race condition might be easier to see if we expand out the applications of the += operator.

long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      long a1 = fib_par_racy(n-1);
      long a2 = result;
      result = a1 + a2;
    },[&] {
      long b1 = fib_par_racy(n-2);
      long b2 = result;
      result = b1 + b2;
    });
  }
  return result;
}

When written in this way, it is clear that these two parallel threads are not independent: they both read result and write to result. Thus the outcome depends on the order in which these reads and writes are performed, as shown in the next example.

Example 8. Execution trace of a race condition

The following table takes us through one possible execution trace of the call fib_par_racy(2). The number to the left of each instruction describes the time at which the instruction is executed. Note that since this is a multithreaded program, multiple instructions can be executed at the same time. The particular execution that we have in this example gives us a bogus result: the result is 0, not 1 as it should be.

Time step   Thread 1                Thread 2
1           a1 = fib_par_racy(1)    b1 = fib_par_racy(0)
2           a2 = result             b2 = result
3           result = a1 + a2        _
4           _                       result = b1 + b2

The reason we get a bogus result is that both threads read the initial value of result before either of them writes to it, and thus they do not see each other's writes. In this example, the second thread "wins the race" and writes into result last. The value 1 written by the first thread is effectively lost by being overwritten by the second thread.

In the example, the race condition arises because of concurrent writes to the result variable. We can eliminate this kind of race condition either by using separate memory locations for the two writes, or by using an atomic class and a compare_exchange_strong operation, as shown in the sketch and the example below.
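The first fix, using separate memory locations, might look like the following sketch, which assumes the same fork2 construct used in the examples above; each branch writes to its own local variable and the addition happens after the join.

long fib_par(long n) {
  long result;
  if (n < 2) {
    result = n;
  } else {
    long a, b;
    fork2([&] {
      a = fib_par(n-1);
    }, [&] {
      b = fib_par(n-2);
    });
    // Both branches have joined here, so the addition is race free.
    result = a + b;
  }
  return result;
}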

Example 9. Fibonacci

The following implementation of Fibonacci is not safe because the variable result is shared and updated by multiple threads.

long fib_par_racy(long n) {
  long result = 0;
  if (n < 2) {
    result = n;
  } else {
    fork2([&] {
      result += fib_par_racy(n-1);
    },[&] {
      result += fib_par_racy(n-2);
    });
  }
  return result;
}

We can solve this problem by declaring result to be an atomic type and using a standard busy-waiting protocol based on compare-and-swap.


long fib_par_atomic(long n) {
  atomic<long> result(0);
  if (n < 2) {
    result.store(n);
  } else {
    fork2([&] {
      long r = fib_par_atomic(n-1);
      // Atomically update result.
      while (true) {
        long exp = result.load();
        bool flag = result.compare_exchange_strong(exp,exp+r);
        if (flag) {break;}
      }
    },[&] {
      long r = fib_par_atomic(n-2);
      // Atomically update result.
      while (true) {
        long exp = result.load();
        bool flag = result.compare_exchange_strong(exp,exp+r);
        if (flag) {break;}
      }
    });
  }
  return result;
}

The idea behind the solution is to load the current value of result and atomically update result only if it has not been modified (by another thread) since it was loaded. This guarantees that the result is always updated (read and modified) correctly without missing an update from another thread.

4. Chapter: Scheduling Multithreaded Programs with Work Stealing

The work-stealing algorithm is a solution to the online scheduling problem, where we are given $P$ workers (a.k.a. processors or processes) and a computation dag that unfolds dynamically during execution, and are asked to construct an execution schedule with minimal length, while spending as little work and time on scheduling itself as possible.

We consider multithreaded computations represented as dags as described in an earlier chapter. To streamline the analysis, we assume without loss of generality that the root vertex has a single child. This assumption eliminates a few corner cases from the analysis.

4.1. Work-Stealing Algorithm

In work stealing, each worker maintains a deque (doubly ended queue) of vertices. Each worker tries to work on its local deque as much as possible. Execution starts with the root vertex in the deque of one of the workers. It ends when the final vertex is executed.

A work-stealing scheduler operates as described by our generic scheduling algorithm, but instead of operating on threads, it operates on vertices of the dag. To obtain work, a worker pops the vertex at the bottom of its deque and executes it. We refer to the vertex being executed by a worker as the assigned vertex. When executed, the assigned vertex can make other vertices ready, which are then pushed onto the bottom end of the deque in an arbitrary order. If a worker finds its deque empty, then the worker becomes a thief. A thief picks a victim worker at random and attempts to steal a vertex from the victim by popping a vertex off the top of the victim’s deque. The thief performs steal attempts until it successfully steals a vertex, at which point the thief goes back to work and the stolen vertex becomes its assigned vertex.

The pseudo-code for the algorithm is shown below. All deques are initially empty and the root vertex is assigned to worker zero. The algorithm operates in rounds. In each round, a worker executes its assigned vertex if it has one, pushes the newly enabled vertices onto its deque, and obtains a new assigned vertex from its deque. If the round starts with no assigned vertex, then the worker becomes a thief and performs a steal attempt. Note that a steal attempt starts and completes in the same round.

Such a steal attempt can fail if

  1. the victim’s deque is empty, or

  2. contention between workers occurs and the vertex targeted by the thief is executed by the worker that owns the deque or is stolen by another thief.

For the analysis of the algorithm, we shall assume that each instruction and each deque operation executes in a single step. As a result, each iteration of the loop, i.e., a round, completes in a constant number of steps.

// Assign root to worker zero.
assignedVertex = NULL
if (self == WorkerZero) {
  assignedVertex = rootVertex
}

// Run scheduling loop.
while (computationDone == false) {

  // Execute assigned vertex, if any.
  if (assignedVertex != NULL) {
    (nChildren, child1, child2) = execute (assignedVertex)

    // Push newly enabled children, if any, onto the bottom of the deque.
    if (nChildren == 1) {
      self.pushBottom (child1)
    }
    else if (nChildren == 2) {
      self.pushBottom (child1)
      self.pushBottom (child2)
    }
    assignedVertex = self.popBottom ()
  }
  else {
    // Make a steal attempt.
    victim = randomWorker ()
    assignedVertex = victim.popTop ()
  }
}
Note
Note that when a vertex enables two vertices, they are both pushed onto the bottom of the deque in an unspecified order. The analysis holds for any such order. However, realistic implementations operate at the granularity of threads, as defined, for example, in an earlier chapter and by the generic scheduling algorithm. To make this algorithm consistent with such implementations, we would push the vertex that is next in the thread last, so that it is executed next. Pushing and popping that vertex is of course unnecessary and should be avoided in an implementation. For the purposes of the analysis, this adds only a small constant factor that we do not account for.

4.1.1. Deque Specification

The deque supports three methods:

  1. pushBottom, which pushes a vertex onto the bottom end of the deque.

  2. popBottom, which removes and returns the vertex at the bottom end of the deque if there is one, or returns NULL otherwise.

  3. popTop, which removes and returns the vertex at the top end of the deque if there is one, or returns NULL if the deque is empty.

For the analysis, we assume that these operations take place atomically. We can think of them as starting by taking a lock for the deque, performing the desired operation, and releasing the lock.

For the analysis, we shall assume that all these operations take constant time and in fact complete in one step. This assumption is somewhat unrealistic, because it is not known whether popTop can be implemented in constant time. But a relaxed version of popTop, which is allowed to return NULL if another concurrent operation removes the top vertex from the deque, admits a constant-time implementation. This relaxed version suffices for our purposes.
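To make the specification concrete, the following is a minimal lock-based sketch of the deque in C++, written in the spirit of the description above: each operation acquires a lock, performs its work, and releases the lock. It is only an illustration of the interface (the class and member names are ours); practical work-stealing schedulers use more efficient, non-blocking deque implementations.

#include <deque>
#include <mutex>

struct Vertex; // a vertex of the computation dag

class WorkDeque {
  std::deque<Vertex*> items; // front = top, back = bottom
  std::mutex lock;
public:
  // Push a vertex onto the bottom end of the deque.
  void pushBottom(Vertex* v) {
    std::lock_guard<std::mutex> guard(lock);
    items.push_back(v);
  }
  // Remove and return the bottom vertex, or NULL if the deque is empty.
  Vertex* popBottom() {
    std::lock_guard<std::mutex> guard(lock);
    if (items.empty()) { return nullptr; }
    Vertex* v = items.back();
    items.pop_back();
    return v;
  }
  // Remove and return the top vertex, or NULL if the deque is empty.
  Vertex* popTop() {
    std::lock_guard<std::mutex> guard(lock);
    if (items.empty()) { return nullptr; }
    Vertex* v = items.front();
    items.pop_front();
    return v;
  }
};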

4.1.2. Work sequence of a worker

Consider the execution of the work stealing algorithm and let $q$ be any worker. We define the work sequence of $q$ as the sequence of vertices defined by the assigned vertex of $q$ followed by the vertices in its deque ordered from bottom to top. If a vertex is in the work sequence of a worker, then we say that it belongs to that worker.

Example 10. Work sequence

Consider the deque for a worker shown below, along with the assigned vertex. The work sequence for this worker is $\langle v_0, v_1, v_2, v_3 \rangle$.

Now, if the worker completes executing $v_0$, which enables no new children, then the work sequence consists of the vertices in the deque, i.e., $\langle v_1, v_2, v_3 \rangle$. If the worker later removes $v_1$ from its deque and starts working on it, i.e., $v_1$ becomes the assigned vertex, then the work sequence remains the same, i.e., $\langle v_1, v_2, v_3 \rangle$.

4.1.3. Enabling Tree

At any step in an execution, we say that a vertex is ready if all of its ancestors in the dag have been executed prior to that step. If the execution of a vertex $u$ makes another vertex $v$ ready, then we say that $u$ enables $v$, and call the edge $(u,v)$ an enabling edge. We call $u$ the designated parent of $v$. Every vertex except for the root has a designated parent. Therefore the subgraph of the dag consisting of the enabling edges forms a rooted tree, called the enabling tree. Note that each execution can have a different enabling tree.
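A scheduler can identify enabling edges on the fly: when a vertex finishes executing, it decrements a join counter of each of its successors in the dag, and the vertex whose decrement brings a counter to zero is the enabler, i.e., the designated parent. The sketch below records this information; the types, fields, and function names are illustrative and not part of the algorithm as specified in this chapter.

#include <atomic>
#include <vector>

struct Vertex {
  std::vector<Vertex*> successors;   // out-edges of this vertex in the dag
  std::atomic<int> joinCounter{0};   // set to the vertex's indegree when created
  Vertex* designatedParent = nullptr;
};

// Called when vertex u finishes executing; returns the vertices that u enables.
std::vector<Vertex*> notifySuccessors(Vertex* u) {
  std::vector<Vertex*> enabled;
  for (Vertex* v : u->successors) {
    // If u performs the last decrement, then u enables v.
    if (v->joinCounter.fetch_sub(1) == 1) {
      v->designatedParent = u;  // (u, v) is an enabling edge
      enabled.push_back(v);
    }
  }
  return enabled;
}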

4.1.4. An Example Run with Work Stealing

Consider the following computation dag.

The following drawings illustrate a 2-worker execution of this dag using work stealing. Time flows left to right and top to bottom.


The enabling tree for this execution is shown below.

4.1.5. Structural Lemma

Lemma[Structural Lemma]

Consider any point in an execution of the work-stealing algorithm after the execution of the root vertex. Let $v_0, v_1, \ldots, v_k$ denote the work sequence of any worker. Let $u_0, u_1, \ldots, u_k$ be the sequence consisting of the designated parents of these vertices, in the same order. Then, for $1 \le i \le k$, the vertex $u_i$ is an ancestor of $u_{i-1}$ in the enabling tree. Moreover, while we may have $u_0 = u_1$, for any $2 \le i \le k$ we have $u_{i-1} \neq u_i$; that is, beyond the first pair the ancestor relationship is proper.

Structural lemma illustrated.


The proof of the structural lemma is by induction on the number of rounds of the execution.

Consider the first round. At initialization, before the beginning of the first round, all deques are empty and the root vertex is assigned to worker zero but has not yet been executed. The root vertex is then executed. By assumption, the root vertex has a single child $v$, which becomes enabled, is pushed onto the deque, and is popped again, becoming the assigned vertex at the beginning of the second round. At any point in time after the execution of the root, worker zero's work sequence consists of just $v$. The designated parent of $v$ is the root vertex and the lemma holds trivially.

For the inductive case, assume that the lemma holds up to the beginning of some later round. We will show that it holds at any point during the round and also after the completion of the round.

Consider any worker and its deque. We have two cases to consider.

Case 1: There is an assigned vertex, $v_0$, which is executed.

By the definition of work sequences, we know that $v_1, \ldots, v_k$ are the vertices in the deque. Let $u_1, \ldots, u_k$ be their designated parents. By induction, we know that $u_i$ is an ancestor of $u_{i-1}$ in the enabling tree and that the ancestor relationship is proper except possibly for $i = 1$, where we may have $u_0 = u_1$. Immediately after the execution of the assigned vertex, the work sequence of the worker consists of exactly the vertices in the deque and the lemma follows.

Structural lemma illustrated after the assigned vertex is executed.

After the execution of the assigned vertex $v_0$, we have several sub-cases to consider.

Case 1.1: execution of $v_0$ enables no children.

Since the deque remains the same, the lemma holds trivially.

Case 1.2: execution of $v_0$ enables one child $x$, which is pushed onto the bottom of the deque. In this case, $v_0$ becomes the designated parent of $x$, and the new parent sequence is $v_0, u_1, \ldots, u_k$. By induction $u_1$ is an ancestor of $u_0$, and $u_0$, being the designated parent of $v_0$, is a proper ancestor of $v_0$; hence $u_1$ is a proper ancestor of $v_0$ and the lemma holds.

Structural lemma illustrated after the assigned vertex enables one child.

Case 1.3: execution of $v_0$ enables two children $x,y$, which are pushed onto the bottom of the deque in an arbitrary order.

In this case, $v_0$ becomes the designated parent of both $x$ and $y$, and the new parent sequence is $v_0, v_0, u_1, \ldots, u_k$. The first two parents are equal, which the lemma permits, and, as in the previous case, $u_1$ is a proper ancestor of $v_0$. The lemma holds.

Structural lemma illustrated after the assigned vertex enables two children.

After the execution of the assigned vertex completes and the children are pushed, the worker pops the vertex at the bottom of the deque. There are two cases to consider.

  1. If the deque is empty, then the worker finds no vertex in its deque and there is no assigned vertex at the end of the round, thus the work sequence is empty and the lemma holds trivially.

  2. If the deque is not empty, then the vertex at the bottom of the deque becomes the new assigned vertex. The lemma holds trivially because making the bottom vertex the assigned vertex has no impact on the work sequence of the worker and thus the correctness of the lemma.

Case 2: A successful steal takes place and removes the top vertex from the deque. In this case, the victim worker loses its top vertex, which becomes the assigned vertex of the thief. The work sequence of the victim loses its last (topmost) element, and it is easy to see that the lemma continues to hold for the victim. When the stolen vertex is assigned, the work sequence of the thief consists of just the stolen vertex, and the lemma holds for the thief.

4.2. Analysis

The strategy of the analysis is to assign a potential to each ready vertex and show that the total potential decreases geometrically as the execution proceeds.

4.2.1. Weights

If $d(u)$ is the depth of a vertex $u$ in the enabling tree, then we define the weight of $u$, written $w(u)$, as $w(u) = S - d(u)$, where $S$ is the span. The root has weight $S$. Intuitively, the weight of a vertex is its distance from the completion of the computation.

A crucial corollary of the structural lemma is that the weights of the vertices decrease from top to bottom.

Corollary

Consider the work sequence of a worker, $v_0, v_1, v_2, \ldots, v_k$. We have $w(v_0) \le w(v_1) < w(v_2) < \cdots < w(v_{k-1}) < w(v_k)$.
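As a small worked instance of the corollary, suppose (purely for illustration) that $k = 3$, that the designated parents satisfy $u_0 = u_1$ at depth $3$ in the enabling tree, and that $u_2$ and $u_3$ are at depths $2$ and $1$, respectively. Since each $v_i$ is a child of $u_i$ in the enabling tree, the vertices of the work sequence are at depths $4, 4, 3, 2$, and therefore

$$ w(v_0) = S - 4 \le w(v_1) = S - 4 < w(v_2) = S - 3 < w(v_3) = S - 2. $$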

4.2.2. Balls and Bins Game

One crucial fact behind the analysis is a probabilistic lemma, called the Balls and Bins Lemma. This lemma proves something relatively intuitive: if you throw as many balls as there are bins, chances are good that you will have a ball in at least a constant fraction of the bins, because the chance that all balls land in a small number of bins is low.

Lemma[Balls and Bins]

Suppose that $P$ balls are thrown uniformly and randomly into $P$ bins, where bin $1 \le i \le P$ has weight $W_i \ge 0$, and $W = \sum_{i=1}^{P}{W_i}$. For each bin, define the random variable

$$ X_i = \left\{ \begin{array}{ll} W_i & \mbox{if a ball lands in bin}~i \\ 0 & \mbox{otherwise} \end{array} \right. $$

If $X = \sum_{i=1}^{P}{X_i}$, then for any $\beta, 0 < \beta < 1$, we have $P \lbrack X \ge \beta W \rbrack > 1 - \frac{1}{(1-\beta)e}$.

Proof

The proof of the lemma is a relatively straightforward application of Markov’s inequality. Consider the random variable $W_i - X_i$. This random variable takes on the value $0$ if a ball lands in bin $i$ and $W_i$ otherwise. Thus, we have

$$ \begin{array}{lll} E \lbrack W_i - X_i \rbrack & = & W_i \cdot (1-1/P)^P, \mbox{and} \\ E \lbrack W - X \rbrack & = & W \cdot (1-1/P)^P. \end{array} $$

We know that $\lim_{P \rightarrow \infty} (1-1/P)^P = 1/e$ and furthermore that the derivative of this function is non-negative (the function is non-decreasing). Thus we know that $(1-1/P)^P \le 1/e$.

It follows that $ E\lbrack W - X \rbrack \le W / e $.

Since the random variable $W - X$ is a non-negative random variable, we can apply Markov’s inequality:

$$ P \lbrack W-X > (1-\beta) W \rbrack \le \frac{E \lbrack W-X \rbrack}{(1-\beta)W}. $$

It follows that

$$ P \lbrack X < \beta W \rbrack < \frac{1}{(1-\beta)e}, $$

and thus

$$ P \lbrack X \ge \beta W \rbrack > 1 - \frac{1}{(1-\beta)e}. $$
Example 11. An application of the lemma

Let’s calculate the probability for $\beta = 1/2$. By the lemma, we know that if $P$ balls are thrown into $P$ bins, then the probability that the total weight of the bins that have a ball in them is at least half the total weight is $P \lbrack X \ge \frac{W}{2} \rbrack > 1 - \frac{1}{0.5e} = 1 - \frac{2}{e}$. Since $e > 2.71$, this quantity is at least $1 - 2/2.71 > 0.25$. We thus conclude that, using the Balls and Bins Lemma, we can "collect" at least half of the weight with probability at least $0.25$.
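The lemma can also be checked empirically. The following small, self-contained program (an illustration only, with arbitrarily chosen parameters) throws $P$ balls into $P$ randomly weighted bins many times and estimates the probability that the bins receiving at least one ball carry at least half of the total weight; by the lemma, the estimate should exceed $0.25$.

#include <cstdio>
#include <random>
#include <vector>

int main() {
  const int P = 64;              // number of balls and bins (chosen for the experiment)
  const int trials = 100000;
  std::mt19937 gen(42);
  std::uniform_int_distribution<int> pick(0, P - 1);
  std::uniform_real_distribution<double> weight(0.0, 1.0);

  int successes = 0;
  for (int t = 0; t < trials; t++) {
    // Assign arbitrary non-negative weights to the bins.
    std::vector<double> W(P);
    double total = 0.0;
    for (int i = 0; i < P; i++) { W[i] = weight(gen); total += W[i]; }
    // Throw P balls uniformly at random.
    std::vector<bool> hit(P, false);
    for (int b = 0; b < P; b++) { hit[pick(gen)] = true; }
    // Sum the weight of the bins that received at least one ball.
    double X = 0.0;
    for (int i = 0; i < P; i++) { if (hit[i]) X += W[i]; }
    if (X >= 0.5 * total) { successes++; }
  }
  printf("estimated probability that X >= W/2: %f (lemma guarantees > 0.25)\n",
         (double)successes / trials);
  return 0;
}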

4.2.3. Bound in terms of Work and Steal Attempts

Lemma[Work and Steal Bound]

Consider any multithreaded computation with work $W$. The $P$-worker execution time is $O(W/P + A/P)$ steps where $A$ is the number of steal attempts.

Proof

Consider the execution in terms of rounds and imagine two buckets, a work bucket and a steal-attempt bucket, to which workers add tokens as follows.

  1. If a vertex is executed in that round, then the worker places a token into the work bucket.

  2. If a steal attempt takes place in that round, then the worker places a token into the steal-attempt bucket.

There are exactly $A$ tokens in the steal-attempt bucket and exactly $W$ tokens in the work bucket, so the total number of tokens is exactly $W + A$. Since in each round a worker either executes a vertex or attempts a steal, and since each round takes a constant number of steps, the $P$-worker execution time is $T_P = O(\frac{W + A}{P})$.

Note
The proof assumes that each instruction, including each deque operation, takes a (fixed) constant number of steps, because it assumes that each round contributes a token to the work bucket or to the steal-attempt bucket. If this assumption is not valid, then we might need to change the notion of rounds so that they are long enough for steals to complete.

4.2.4. Bounding the Number of Steal Attempts

Our analysis will use a potential-function based method. We shall divide the computation into phases, each of which decreases the potential by a constant factor.

We start with a few definitions. Consider some round $i$.

  1. Let $R_i$ denote the set of ready vertices in the beginning of that round.

  2. A vertex in $R_i$ is either assigned to a worker or is in a deque. Let $R_i(q)$ denote the set of ready vertices belonging to a worker $q$ at the beginning of round $i$; these are exactly the vertices in the work sequence of that worker.

  3. For each vertex $v \in R_i$, we define its potential as

    1. $\phi_i(v) = 3^{2w(v) - 1}$, if $v$ is assigned, or

    2. $\phi_i(v) = 3^{2w(v)}$, otherwise.

Note that the potential of a vertex is always a natural number.

  1. The potential of worker $q$ is $\Phi_i(q) = \sum_{v \in R_i(q)}{\phi_i(v)}.$

  2. We write $H_i$ (mnemonic for "Hood") for the set of workers whose deques are empty at the beginning of round $i$. We write $\Phi_i(H_i)$ for the total potential of the workers $H_i$, $\Phi_i(H_i) = \sum_{q \in H_i}{\Phi_i(q)}$.

  3. We write $D_i$ for the set of other workers. We write $\Phi_i(D_i)$ for the total potential of the workers $D_i$, $\Phi_i(D_i) = \sum_{q \in D_i}{\Phi_i(q)}$.

  4. Define the total potential at round $i$, written $\Phi_i$ as $\Phi_i = \sum_{v \in R_i}{\phi_i(v)}$. We can write the total potential in round $i$ as follows $\Phi_i = \Phi_i(H_i) + \Phi_i(D_i)$.
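As a quick illustration of these definitions, consider (hypothetically) a computation with span $S = 3$ and a ready vertex $v$ at depth $1$ in the enabling tree, so that $w(v) = 2$. While $v$ sits in a deque its potential is $3^{2 \cdot 2} = 81$; once $v$ is assigned its potential drops to $3^{2 \cdot 2 - 1} = 27$, a decrease of $2/3 \cdot 81 = 54$.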

Definition: Beginning and termination potential

At the beginning of the computation, the only ready vertex is the root, which has a weight of $S$, because it is also the root of the enabling tree. Therefore the potential in the beginning of the computation is $3^{2S-1}$. At the end of the computation, there are no ready vertices and thus the potential is zero.

Lemma[Top-Heavy Deques]

Consider any round $i$ and any worker $q \in D_i$. The topmost vertex in $q$'s deque contributes at least $3/4$ of the potential of worker $q$.

Proof

This lemma follows directly from the structural lemma. The case in which the topmost vertex $v$ contributes the least to the potential of worker $q$ is when the deque contains only $v$ and the assigned vertex $x$ has the same designated parent as $v$. In this case, both vertices have the same depth and thus we have

$$ \Phi_i(q) = \phi_i(v) + \phi_i(x) = 3^{2w(v)} + 3^{2w(x)-1} = 3^{2w(v)} + 3^{2w(v)-1} = (4/3) \phi_i(v). $$
Lemma[Vertex Assignment]

Consider any round $i$ and let $v$ be a vertex that is ready but not assigned at the beginning of that round. Suppose that the scheduler assigns $v$ to a worker in that round. In this case, the potential decreases by $2/3 \cdot \phi_i(v)$.

Proof

This is a simple consequence of the definition of the potential function:

$$ \phi_{i}(v) - \phi_{i+1}(v) = 3^{2w(v)} - 3^{2w(v)-1} = 2/3 \cdot \phi_i(v). $$
Lemma[Decreasing Total Potential]

Total potential does not increase from one round to the next, i.e., $\Phi_{i+1} \le \Phi_i$.

Proof

Let’s now consider how the potential changes during each round. There are two cases to consider based on scheduler actions.

Case 1: Suppose that the scheduler assigns a vertex $v$ to a worker. By the Vertex Assignment Lemma, we know that the potential decreases by $2/3 \cdot \phi_i(v)$. Since $\phi_i(v)$ is positive, the potential decreases.

Note that this calculation holds regardless of where in the deque the vertex $v$ is. Specifically, it could have been the bottom or the top vertex.

Case 2: suppose that the scheduler executes a vertex $v$ and pushes its children onto the deque. There are two sub-cases to consider.

Case 2.1: suppose that the scheduler pushes onto the deque the only child $x$ of a vertex $v$. Assuming that the child stays in the deque until the beginning of the next round, the potential decreases by

$$ \phi_{i}(v) - \phi_{i+1}(x) = 3^{2w(v)-1} - 3^{2w(v)-2} =3^{2w(v)-1}(1 - 1/3) = 2/3 \cdot \phi_i(v). $$

Case 2.2: suppose that the scheduler pushes onto the deque two children $x,y$ of a vertex $v$. Assuming that the children remain in the deque until the beginning of the next round, then potential decreases by

$$ \begin{array}{lll} \phi_i(v) - \phi_{i+1}(x) - \phi_{i+1}(y) & = & 3^{2w(v)-1} - 3^{2w(v)-2} - 3^{2w(v)-2} \\ & = & 3^{2w(v)-1} (1 - 1/3 - 1/3) \\ & = & 1/3 \cdot \phi_i(v). \end{array} $$

Since $\phi_i(v)$ is positive, the potential decreases in both cases. Note that it is safe to assume that the children remain in the deque until the next round, because the assignment of a vertex only decreases the potential further.

In each round each worker performs none, one, or both of these actions. Thus the potential never increases.

We have thus established that the potential decreases, but this by itself does not suffice. We also need to show that it decreases by a significant amount. This is our next step in the analysis: we show that after $P$ steal attempts the total potential decreases by a constant fraction with constant probability.

Lemma[$P$ Steal Attempts]

Consider any round $i$ and any later round $j$ such that at least $P$ steal attempts occur in rounds from $i$ (inclusive) to $j$ (exclusive). Then, we have

$$ Pr \lbrack \Phi_i - \Phi_j \ge \frac{1}{4} \Phi_i(D_i) \rbrack > \frac{1}{4}. $$
Proof

First we use the Top-Heavy Deques Lemma to establish the following: if a steal attempt targets a worker with a nonempty deque as its victim, then the potential decreases by at least half of the potential of the victim.

Consider any worker $q$ in $D_i$ and let $v$ denote the vertex at the top of its deque at round $i$. By Top-Heavy Deques Lemma, we have $\phi_i(v) \ge \frac{3}{4} \Phi_i(q)$.

Consider any steal attempt that occurs at round $k \ge i$.

  1. Suppose that this steal attempt is successful, with popTop returning a vertex. There are two sub-cases, both of which follow from the Vertex Assignment Lemma.

    1. If the returned vertex is $v$, then after round $k$, vertex $v$ has been assigned and possibly executed. Thus, the potential has decreased by at least $\frac{2}{3}\, \phi_i(v)$.

    2. If the returned vertex is not $v$, then $v$ is already assigned and possibly executed. Thus, the potential has decreased by at least $\frac{2}{3}\, \phi_i(v)$.

  2. Suppose that the steal attempt is not successful, and popTop returns NULL. In this case, we know that $q$'s deque was empty during the popTop, or that some other popTop or popBottom operation returned $v$. In either case, vertex $v$ has been assigned and possibly executed by the end of round $k$, and thus the potential has decreased by at least $\frac{2}{3}\, \phi_i(v)$.

Thus, we conclude that if a thief targets a worker $q \in D_i$ as victim at round $k \ge i$, then the potential decreases by at least

$$ \frac{2}{3}\, \phi_i(v) \ge \frac{2}{3} \cdot \frac{3}{4} \Phi_i(q) = \frac{1}{2} \Phi_i(q). $$

Second, we use the Balls and Bins Lemma to establish the total decrease in potential.

Consider now all $P$ workers and the $P$ steal attempts that occur at or after round $i$. For each worker $q$ in $D_i$, assign a weight of $\frac{1}{2}\Phi_i(q)$, and for each worker in $H_i$, assign a weight of $0$. The total weight is thus $W = \frac{1}{2} \Phi_i(D_i)$. Using the Balls and Bins Lemma with $\beta = 1/2$, we conclude that the potential decreases by at least $\beta W = \frac{1}{4} \Phi_i(D_i)$ with probability greater than $1 - \frac{1}{(1-\beta) e} > \frac{1}{4}$.

Important
For this lemma to hold, it is crucial that a steal attempt does not fail unless the deque is empty or the vertex targeted at the time is popped from the deque in some other way. This is why we required the popTop operation called by a worker to fail only if the top vertex is removed from the deque by another worker.
Theorem[Run-Time Bound]

Consider any multithreaded computation with work $W$ and span $S$, and execute it with non-blocking work stealing with $P$ workers in a dedicated environment. The execution time is

  1. $O(W/P + S)$ in expectation, and

  2. $O(W/P + S + \lg(1/\epsilon))$ with probability at least $1-\epsilon$ for any $\epsilon > 0$.

Proof

The Work and Steal Bound Lemma bounds the execution time in terms of the number of steal attempts. We shall now bound the number of steal attempts.

We break the execution into phases of $\Theta(P)$ steal attempts. We show that with constant probability, a phase causes the potential to drop by a constant factor, and since the potential starts at $\Phi_0 = 3^{2S -1}$, ends at zero, and is always a natural number, we can bound the number of phases.

The first phase begins at round $1$ and ends at the first round $k$ such that at least $P$ steal attempts occur during the interval $\lbrack 1,k \rbrack$. The second phase begins at round $k + 1$, and so on.

Consider a phase $[i,j)$, where the next phase begins at round $j$. We show that $\Pr \lbrack \Phi_j \le \frac{3}{4} \Phi_i \rbrack > \frac{1}{4}$.

Recall that the potential can be partitioned as $\Phi_i = \Phi_i(H_i) + \Phi_i(D_i)$.

  1. Since the phase contains at least $P$ steal attempts, by the $P$ Steal Attempts Lemma we know that $P \lbrack \Phi_i - \Phi_j \ge \frac{1}{4} \Phi_i(D_i) \rbrack > \frac{1}{4}$.

  2. We need to show that the potential also drops by a constant fraction of $\Phi_i(H_i)$. Consider some worker $q$ in $H_i$. If $q$ does not have an assigned vertex, then $\Phi_i(q) = 0$. If $q$ has an assigned vertex $v$, then $\Phi_i(q) = \phi_i(v)$. In this case, worker $q$ executes $v$ at round $i$ and the potential decreases by at least $\frac{1}{3} \phi_i(v)$, by an argument included in the proof of the Decreasing Total Potential Lemma.

Summing over all workers in $H_i$, we have $\Phi_i - \Phi_j \ge \frac{1}{3} \Phi_i(H_i)$.

Thus we conclude that $P \lbrack \Phi_i - \Phi_j \ge \frac{1}{4} \Phi_i \rbrack > \frac{1}{4}$. In other words, we have established that in any phase, the potential drops by at least a quarter with probability greater than $\frac{1}{4}$.

Define a phase to be successful if it causes the potential to decrease by at least a quarter. We just established that a phase is successful with probability at least $\frac{1}{4}$. Since the potential starts at $\Phi_0 = 3^{2S -1}$, ends at zero, and is always an integer, the number of successful phases is at most $(2S-1)\, \log_{4/3}{3} < 8 S$.

The number of phases needed to obtain a successful phase is geometrically distributed with expectation at most $4$. Therefore, the total expected number of phases is at most $32S$, i.e., $O(S)$. Since each phase contains $O(P)$ steal attempts, the expected number of steal attempts is $O(SP)$.

We now establish the high-probability bound.

Suppose that the execution takes $n = 32S + m$ phases. Each phase succeeds with probability at least $p = \frac{1}{4}$, so the expected number of successes is at least $np = 8S + m/4$. Let $X$ be the number of successes. By the Chernoff bound

$$ P \lbrack X < np - a \rbrack < e^{-a^2/(2np)}, $$

with $a = m/4$. Thus if we choose $m = 32S + 16 \ln{1/\epsilon}$, then we have

$$ P \lbrack X < 8 S \rbrack < e^{-(m/4)^2 / (16S + m/2)} \le e^{-(m/4)^2 / (m/2 + m/2)} = e^{-m/16} \le e^{-16\ln{1/\epsilon}/16} = \epsilon. $$

Thus the probability that the execution takes $64S + 16 \ln(1/\epsilon)$ phases or more is less than $\epsilon$.

We conclude that the number of steal attempts is $O((S + \lg(1/\epsilon))\, P)$ with probability at least $1 - \epsilon$. Plugging these bounds on the number of steal attempts into the Work and Steal Bound Lemma yields the stated time bounds.

4.3. Chapter Notes

The material presented here is adapted from the original paper on non-blocking work stealing.

In this chapter, we consider dedicated environments, and simplify and streamline the proof for this case. The original paper considers multiprogrammed environments.

5. Index