% Part III Notes on Parallelism for 6.001, Fall 1988
% Nikhil, December, 1988

\documentstyle[12pt]{article}

% HORIZONTAL MARGINS
% Left margin 1 inch (0 + 1)
\setlength{\oddsidemargin}{0in}
% Text width 6.5 inch (so right margin 1 inch).
\setlength{\textwidth}{6.5in}
% ----------------
% VERTICAL MARGINS
% Top margin 0.5 inch (-0.5 + 1)
\setlength{\topmargin}{-0.5in}
% Head height 0.25 inch (where page headers go)
\setlength{\headheight}{0.25in}
% Head separation 0.25 inch (between header and top line of text)
\setlength{\headsep}{0.25in}
% Text height 9 inch (so bottom margin 1 in)
\setlength{\textheight}{9in}
% ----------------
% PARAGRAPH INDENTATION
\setlength{\parindent}{0in}
% SPACE BETWEEN PARAGRAPHS
\setlength{\parskip}{\medskipamount}
% ----------------
% STRUTS
% HORIZONTAL STRUT.  One argument (width).
\newcommand{\hstrut}[1]{\hspace*{#1}}
% VERTICAL STRUT. Two arguments (offset from baseline, height).
\newcommand{\vstrut}[2]{\rule[#1]{0in}{#2}}
% ----------------
% EMPTY BOXES OF VARIOUS WIDTHS, FOR INDENTATION
\newcommand{\hm}{\hspace*{1em}}
\newcommand{\hmm}{\hspace*{2em}}
\newcommand{\hmmm}{\hspace*{3em}}
\newcommand{\hmmmm}{\hspace*{4em}}
% ----------------
% VARIOUS CONVENIENT WIDTHS RELATIVE TO THE TEXT WIDTH, FOR BOXES.
\newlength{\hlessmm}
\setlength{\hlessmm}{\textwidth}
\addtolength{\hlessmm}{-2em}

\newlength{\hlessmmmm}
\setlength{\hlessmmmm}{\textwidth}
\addtolength{\hlessmmmm}{-4em}
% ----------------
% ``TIGHTLIST'' ENVIRONMENT (no para space between items, small indent)
\newenvironment{tightlist}%
{\begin{list}{$\bullet$}{%
    \setlength{\topsep}{0in}
    \setlength{\partopsep}{0in}
    \setlength{\itemsep}{0in}
    \setlength{\parsep}{0in}
    \setlength{\leftmargin}{1.5em}
    \setlength{\rightmargin}{0in}
    \setlength{\itemindent}{0in}
}
}%
{\end{list}
}
% ----------------
% CODE FONT (e.g. {\cf x := 0}).
\newcommand{\cf}{\footnotesize\tt}
% ----------------
% INSTRUCTION POINTER
\newcommand{\IP}{$\bullet$}
\newcommand{\goesto}{$\longrightarrow$}
% ----------------------------------------------------------------
% LISP CODE DISPLAYS.
% Lisp code displays are enclosed between \bid and \eid.
% Most characters are taken verbatim, in typewriter font,
% Except:
%  Commands are still available (beginning with \)
%  Math mode is still available (beginning with $)

\outer\def\beginlisp{%
  \begin{list}{$\bullet$}{%
    \setlength{\topsep}{0in}
    \setlength{\partopsep}{0in}
    \setlength{\itemsep}{0in}
    \setlength{\parsep}{0in}
    \setlength{\leftmargin}{1.5em}
    \setlength{\rightmargin}{0in}
    \setlength{\itemindent}{0in}
  }\item[]
  \obeyspaces
  \obeylines \footnotesize\tt}

\outer\def\endlisp{%
  \end{list}
  }

{\obeyspaces\gdef {\ }}

% ----------------
% ILLUSTRATIONS
% This command should specify a NEWT directory for ps files for illustrations.
\def\psfileprefix{/usr/nikhil/parle/}
\def\illustration#1#2{
\vbox to #2{\vfill\special{psfile=\psfileprefix#1.ps hoffset=-72 voffset=-45}}} 

% \illuswidth is used to set up boxes around illustrations.
\newlength{\illuswidth}
\setlength{\illuswidth}{\textwidth}
\addtolength{\illuswidth}{-7pt}

% ----------------------------------------------------------------
% SCHEME CLOSURES AND PROCEDURES

% CLOSURES: TWO CIRCLES BESIDE EACH OTHER; LEFT ONE POINTS DOWN TO CODE (arg 1)
% RIGHT ONE POINTS RIGHT TO ENVIRONMENT (arg 2)
\newcommand{\closure}[2]{%
\begin{tabular}[t]{l}
\raisebox{-1.5ex}{%
  \setlength{\unitlength}{0.2ex}
  \begin{picture}(25,15)(0,-7)
   \put( 5,5){\circle{10}}
   \put( 5,5){\circle*{1}}
   \put( 5,5){\vector(0,-1){10}}
   \put(15,5){\circle{10}}
   \put(15,5){\circle*{1}}
   \put(15,5){\vector(1,0){12}}
  \end{picture}}
  \fbox{\footnotesize #2} \\
%
\hspace*{0.8ex} \fbox{\footnotesize #1}
\end{tabular}
}

% PROCEDURES: BOX CONTAINING PARAMETERS (arg 1) AND BODY (arg 2)
\newcommand{\proc}[2]{%
\begin{tabular}{l}
params: #1 \\
body: #2 \\
\end{tabular}
}

% ----------------------------------------------------------------
% HERE BEGINS THE DOCUMENT

\begin{document}

\begin{center}
MASSACHUSETTS INSTITUTE OF TECHNOLOGY \\
Department of Electrical Engineering and Computer Science \\
6.001 Structure and Interpretation of Computer Programs \\
Fall Semester, 1988

{\Large\bf Parallel Programs and Machines}

{\Large\bf Part III}

R.~S.~Nikhil

December 3, 1988
\end{center}

% ----------------------------------------------------------------

\section{Introduction}

In Part I of these notes,  we looked at the following topics:
\begin{tightlist}

\item How the choice of algorithm affects the amount of parallelism
available.

\item The danger of side-effects in a parallel computation model.

\item The concept of a multiprocessor as an ensemble of register sets sharing
a common memory for code, stacks and data structures.

\item The concept of one processor ``spawning'' another, giving it an initial
stack that informs it about what it must do.

\item How to arrange for the arguments of a combination \mbox{\cf (e1 ...
eN)} to be evaluated in parallel, i.e., spawn processors to evaluate {\cf
e2}, ..., {\cf eN}, continue evaluating {\cf e1}, and the last of these {\cf
N} processors (parent and {\cf N}$-1$ children) continues to perform the
application.

\end{tightlist}

In Part II,   we looked at the following topics:
\begin{tightlist}

\item
 The parallelism available from ``non-strict'' evaluation, i.e., allowing the
processor that evaluates the operator to carry on ahead to {\cf
APPLY-DISPATCH} even though the argument-evaluations may not have terminated
yet.

\item
 The programming implications of non-strict evaluation.

\item
 Connections with streams, {\cf FORCE} and {\cf DELAY} and normal-order
evaluation.

\item
 Primitives for implementing non-strictness.

\item
 A parallel, explicit-control evaluator that implements non-strictness.

\end{tightlist}

In this part of the notes (Part III), we will look at:
\begin{tightlist}

\item
 Orders of growth, revisited, under parallel models of computation.

\item
 Languages with explicit, as opposed to implicit, parallelism, i.e., the
programmer tells the machine exactly what is to be done in parallel.

\item
 Side-effects, revisited: how to tame them (or at least live with them) in
parallel models of computation.

\item
 Real parallel machines, where we have finite resources and can't conjure up
processors at will.

\end{tightlist}

% ----------------------------------------------------------------

\section{Orders of growth, revisited}

When we first learned about orders of growth,  we made certain statements.
For example, given the following recursive procedure to count
atoms in a tree:
\beginlisp
(define count-atoms-r (lambda (lst)
    (if (null? lst)
        0
        (if (atom? lst)
            1
            (+ (count-atoms-r (car lst))
               (count-atoms-r (cdr lst)))))))
\endlisp
we said that it would take $O(n)$ time and $O(\log n)$ space, where $n$ was
the number of nodes in the tree.   We must now realize that these statements
are {\em relative to the underlying execution model\/}, i.e., the time and
space complexities are valid for our sequential execution model.

In our parallel machine models, on the other hand,  we can see that the same
program will take $O(n)$ space, and can complete in as little as $O(\log n)$
time!  This is because all the recursive calls will unfold in parallel so
that there is one computation in progress at every node,  and the time taken
is proportional only to the depth of the tree.
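For a balanced tree, these claims can be made roughly quantitative.  Writing
$T_{seq}$ and $T_{par}$ for the sequential and parallel running times, we have
(approximately):
\[
T_{seq}(n) = 2\,T_{seq}(n/2) + O(1) = O(n)
\qquad\qquad
T_{par}(n) = T_{par}(n/2) + O(1) = O(\log n)
\]
since, in the parallel model, the two recursive calls at a node proceed
simultaneously, so the elapsed time grows by only $O(1)$ per level of the
tree.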

Another example--- the sum of the square-roots of the numbers in a list:
\beginlisp
(define sum-of-sqrts (lambda (lst)
    (sum-of-sqrts-loop lst 0)))
\null
(define sum-of-sqrts-loop (lambda (lst s)
    (if (null? lst)
        s
        (sum-of-sqrts-loop (cdr lst)
                           (+ s (sqrt (car lst)))))))
\endlisp
 In all three evaluators, this iterative computation should take $O(n)$ time,
where $n$ is the length of the list.  In both the sequential and the
parallel, strict evaluators, it should take constant space (i.e., $O(1)$).
However, in the parallel, non-strict evaluator, it will take $O(n)$ space.
This is because, in the non-strict evaluator, the recursive call can race
ahead, traversing the entire list, selecting all the {\cf car}'s, and
initiating all the {\cf sqrt}'s even before the first {\cf sqrt} computation
has finished, i.e.,  all the {\cf sqrt} computations can proceed in parallel.
Thus, an iteration is not necessarily ``finished'' before the next one
starts, and so, we cannot reclaim or reuse the space that it occupies.

The last point is an illustration of a general axiom about parallelism:
Except when the computation is inherently sequential, you can usually buy
more parallelism with more space.

% ----------------------------------------------------------------
\section{Explicit, instead of implicit parallelism}

In both the evaluators that we have seen thus far, the parallelism is
expressed implicitly.  The programming language did not change in moving from
a sequential to a parallel execution model.\footnote{
 We did introduce one new construct, the {\cf LETREC} block, but that was
independent of the parallelism issue--- that construct is useful even in a
sequential implementation.
}
 The programmer writes programs in ordinary Scheme syntax, and it is
implicitly understood that for every expression that is a combination (i.e.,
application), all components may be evaluated concurrently.

The parallelism that resulted was what is sometimes called ``fine-grained''
parallelism, i.e., a parallel process is spawned even to evaluate something
as small and trivial as the number 23.

Suppose we are given the expression:
\beginlisp
(+ 23 34)
\endlisp
In the parallel evaluators we have seen, we spawn two tasks, one each for
evaluating {\cf 23} and {\cf 34}; the parent task evaluates {\cf +}, and
later, the application is performed.   If we count the instructions
that are executed in the parallel evaluators, we might find that many
of the instructions are ``overhead'' instructions for spawning tasks and
synchronization, and there are few instructions actually doing the
evaluation work.  It is possible that the sequential evaluator, not having
these overheads, would actually have done the job faster.

Thus, one may draw the conclusion that until a subexpression is ``large
enough'' to justify the overhead, it is not worth spawning off a separate
processor to evaluate it.  One might imagine a processor, when given a
combination to evaluate, taking a decision, for each argument, whether to
evaluate it in line or to spawn another processor to do it.

Unfortunately, this judgement is not easy to make:
\begin{itemize}

\item In the case of constants it may be simple but, in general, it is not
possible to examine an arbitrary expression and judge how ``expensive'' it
is, because it may involve calls to arbitrary procedures whose complexity is
unpredictable.

\item The ``overhead'', against which we decide whether to spawn or not,
depends on the machine model, the available instructions, the method of
spawning, how cleverly the evaluator program and/or the compiler was written,
etc.  It is very difficult to come up with a clean and coherent model of all
these parameters.

\end{itemize}

A possible solution to this problem is the following.  Instead of implicit
parallelism, where the evaluator decides when to spawn off a process, we
could have explicit parallelism, where we shift the responsibility to the
programmer.  We assume the normal, sequential interpretation of Scheme, and
introduce new constructs in the language by which the programmer can specify
exactly what is to be done in parallel.

\section{Explicit Parallelism: {\tt FUTURE} and {\tt TOUCH}}

We have already pointed out the connection between non-strictness and {\cf
FORCE} and {\cf DELAY} (in Part II of these notes).  In particular, we
remarked that when we see a program in which, at one point, we say {\cf
(DELAY $e$)}, producing an object $d$ which we later examine by saying {\cf
(FORCE $d$)}, there is {\em nothing in the program itself that indicates
exactly when $e$ gets evaluated!\/} It could have happened at {\em any\/} time
between the evaluation of the two forms--- in particular, it could have
happened independently, in parallel.

Using this idea as a springboard, we introduce two new constructs in
Scheme.\footnote{
 These constructs are borrowed from a language called Multilisp,  developed
by Professor Robert Halstead and his research group at MIT.
}
The expression:
\beginlisp
(FUTURE $e$)
\endlisp
does the following:
\begin{tightlist}

\item Creates a ``promise'' for the value of expression $e$;

\item Spawns off a processor that will compute the expression $e$
in the current environment and store the result in the ``promise'', and

\item Returns a reference $f$ to the ``promise''.

\end{tightlist}
We will call $f$ a ``future'', or a ``future reference'', and we will call
the value of $e$ the ``value of the future''.

Given a future $f$, the expression:
\beginlisp
(TOUCH $f$)
\endlisp
 returns the value of the future, waiting, if necessary, for its computation
to be completed.

Thus, {\cf FUTURE} is similar to {\cf DELAY}, and {\cf TOUCH} is similar to a
memo-ized {\cf FORCE}.  In fact, {\cf FUTURE} and {\cf TOUCH} are
semantically identical to {\cf DELAY} and {\cf FORCE}, i.e., there is no way
to distinguish them in the program.\footnote{
 As usual, there is a difference if there are side-effects, but we defer that
question until Section \ref{side-effects-redux}.
}
 We use different names merely to suggest an operational difference.  {\cf
FUTURE} actually spawns off a processor, and the expression evaluates
concurrently, whereas {\cf DELAY} does no evaluation at all.  {\cf
TOUCH} does no evaluation, and simply waits for the concurrent processor to
finish, whereas {\cf FORCE} actually evaluates the expression.  In other
words, the only difference is in when the expression is scheduled to be
evaluated.
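To make the operational difference concrete, compare the following two
fragments (a sketch; {\cf long-computation} stands for any expensive
expression):
\beginlisp
(define d (DELAY (long-computation)))   ; no evaluation yet
...                                     ; other work
(FORCE d)                               ; evaluation happens here
\null
(define f (FUTURE (long-computation)))  ; evaluation begins now, concurrently
...                                     ; other work, overlapped with it
(TOUCH f)                               ; wait, if necessary, for the value
\endlisp
In both cases the value obtained is the same; only the schedule differs.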

\subsection{Programming with {\tt FUTURE} and {\tt TOUCH}}

\label{programming-with-future-and-touch}

Remember that we are back to a sequential model of Scheme, and the {\em
only\/} place where concurrent execution is spawned is in a {\cf FUTURE}
expression.  We can ``parallelize'' the recursive procedure to count
the atoms in a tree, as follows:
\beginlisp
(define count-atoms-r (lambda (lst)
    (if (null? lst)
        0
        (if (atom? lst)
            1
            (let
                  ((f1 (FUTURE (count-atoms-r (car lst))))
                   (f2 (FUTURE (count-atoms-r (cdr lst)))))
              (+ (TOUCH f1) (TOUCH f2)))))))
\endlisp
 i.e., in the recursive step, we spawn off two processors to count atoms in
the left and right subtrees and, when they are done, add them up.  Note that
each spawned processor, recursively, may spawn off two more, and each of
those may spawn off two more, and so on, so that the parallelism can grow
exponentially. 

There is a very subtle point in the transformation.\footnote{
 I am grateful to Bert Halstead for pointing this out to me.
}
 We started with an
expression of the form:
\beginlisp
(+ $e_1$ $e_2$)
\endlisp
and transformed it into:
\beginlisp
(let
      ((f1 (future $e_1$))
       (f2 (future $e_2$)))
  (+ (touch f1) (touch f2)))

\endlisp
Why did we not convert it into the following?
\beginlisp
(+ (touch (future $e_1$)) (touch (future $e_2$)))
\endlisp
The two, superficially, look equivalent.  However, remember that except for
{\cf FUTUREs}, we have a basically sequential model.  Thus, in the latter
form, the combination:
\beginlisp
(+ <...> <...>)
\endlisp
is evaluated {\em sequentially\/}, say, from left to right.  This means that
we evaluate
\beginlisp
(touch (future e1))
\endlisp
{\em until we have its value\/} before we even look at
\beginlisp
(touch (future e2))
\endlisp
This means, of course, that {\cf e1} and {\cf e2} do not get evaluated in
parallel at all!

The lesson:  parallelizing a sequential program by introducing {\cf FUTURE} and
{\cf TOUCH} must be done with great thought and care.

\subsection{Implementation of {\tt FUTURE} and {\tt TOUCH}}

It is quite easy to extend our original, sequential explicit-control
evaluator code to implement {\cf FUTURE} and {\cf TOUCH}, using I-structure
cells.

First, let us look at {\cf (FUTURE $e$)}.  We begin by adding another clause
in {\cf EVAL-DISPATCH} that recognizes the {\cf FUTURE} special form:
\beginlisp
EVAL-DISPATCH
  ..
  (branch (FUTURE? (fetch exp)) EV-FUTURE)
  ..
\endlisp
Note that, like {\cf DELAY}, {\cf FUTURE} must be a special form and not a
procedure.

Here is the code to implement the {\cf FUTURE}:
\beginlisp
EV-FUTURE
  ;;; Assume EXP: (FUTURE e), ENV: E, CONTINUE: L
  (assign exp (cdr (fetch exp)))   ;  (e)
  (assign exp (car (fetch exp)))   ;  e
  (assign val (make-I-cell))       ;  VAL: I-cell
\null
  (assign unev (cons (fetch val) nil))           ; UNEV: (I-cell)
  (assign unev (cons (fetch env) (fetch unev)))  ; UNEV: (E I-cell)
  (assign unev (cons (fetch exp) (fetch unev)))  ; UNEV: (e E I-cell)
  (assign unev (cons DO-FUTURE (fetch unev)))    ; UNEV: (DO-FUTURE e E I-cell)
  (spawn (fetch unev))
\null
  (goto (fetch continue))          ; VAL: I-cell
\endlisp
i.e., we simply return a new I-structure cell, after spawning off a new
processor with initial stack containing the label {\cf DO-FUTURE}, the
expression $e$, the environment and a reference to the I-structure cell.
Thus, a ``future'' or ``promise'' is nothing more than a reference to an
I-structure cell. 

The spawned processor executes the following code:
\beginlisp
DO-FUTURE
  ;;; Assume STACK: (e E I-cell)
  (restore exp)                         ; EXP: e
  (restore env)                         ; ENV: E
  (assign continue DO-FUTURE-STORE)
  (goto EVAL-DISPATCH)
\null
DO-FUTURE-STORE
  ;;; Assume VAL: v, STACK: (I-cell)
  (restore exp)                                ; EXP: I-cell
  (set-I-cell! (fetch exp) (fetch val))
  (goto STOP)
\endlisp
i.e., we evaluate the expression, store it in the I-cell, and stop.

To implement {\cf (TOUCH $f$)}, we assume that {\cf TOUCH} evaluates to a
primitive procedure, i.e., like {\cf FORCE}, {\cf TOUCH} is a procedure, not a
special form.  We arrive at the following code, via {\cf APPLY-DISPATCH} and
{\cf PRIMITIVE-APPLY}:
\beginlisp
APPLY-TOUCH
  ;;; assume: ARGL: (I-cell)
  (assign exp (car (fetch argl)))    ; the I-cell
  (get-I-cell val (fetch exp))
  (restore continue)
  (goto (fetch continue))
\endlisp
Of course, any processor executing this code will simply suspend at the {\cf
get-I-cell} instruction, if necessary, using the usual mechanism for
I-structures.

\subsection{Connection with Implicit Parallelism}

So far, we have seen three parallel explicit-control evaluators (ECEs) for Scheme:
\begin{tightlist}

\item An implicitly parallel, but strict evaluator; call this ECE-I-S.

\item An implicitly parallel, but non-strict evaluator; call this ECE-I-NS.

\item An explicitly parallel evaluator where the programmer specifies what to
do in parallel using {\cf FUTURE},  and waits for results using {\cf TOUCH};
call this ECE-F/T.

\end{tightlist}
What is the relationship between the implicit and explicit evaluators?

A program run under ECE-I-S can be transformed into a program for ECE-F/T
that will have the same parallel behavior,  as follows.  For every
combination of the form:
\beginlisp
(e1 e2 ... eN)
\endlisp
we convert it into:
\beginlisp
(let
      ((f1 (FUTURE e1))
       (f2 (FUTURE e2))
       ...
       (fN (FUTURE eN)))
  ((touch f1) (touch f2) ... (touch fN)))
\endlisp
Notice that this has the effect of evaluating all components of the
combination in parallel, waiting till all of them are values, and then doing
the application.

A program run under ECE-I-NS can be transformed into a program for ECE-F/T
that will have the same parallel behavior,  but the transformation is a
little more involved.  First, for every combination of the form:
\beginlisp
(e1 e2 ... eN)
\endlisp
we convert it into:
\beginlisp
(let
      ((f2 (FUTURE e2))
       ...
       (fN (FUTURE eN)))
  (e1 f2 ... fN))
\endlisp
i.e., we spawn off the evaluation of all the operands in parallel, evaluate
the operator and do the application, even though the operands may still be
evaluating.  But, notice that the arguments that we are passing in to {\cf
v1}, the value of {\cf e1}, are no longer the values of the expressions {\cf
eJ}--- they are futures that will contain those values.  Thus, for every
strict primitive operation such as the ``plus'' procedure, we will have to
{\cf TOUCH} the arguments before actually performing the operation.
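For example, under this transformation the ``plus'' procedure must behave as
if it were defined as follows (a sketch; {\cf touching-+} is our name for it,
not part of the language):
\beginlisp
(define touching-+ (lambda (f1 f2)
    (+ (TOUCH f1) (TOUCH f2))))
\endlisp
Since every operand of a transformed combination is a future, each strict
primitive can safely {\cf TOUCH} all of its arguments in this way.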

\section{Explicit Parallelism: {\tt FUTURE} and implicit {\tt TOUCH}}

\subsection{The problem with explicit {\tt TOUCH}es}

When we studied streams and {\cf DELAY} and {\cf FORCE}, we encountered
some unpleasant consequences having to do with the modularity of programs.
For example, even though lists and streams are conceptually so similar, we
had to manipulate them with similar, but distinct functions.  Thus, we had a
{\cf mapcar} for lists, and a {\cf map-stream} for streams, a {\cf filter}
for lists and a {\cf filter-stream} for streams, etc.  Section 3.4.5 of the
textbook also has an example of a program on streams that did not work
correctly until we inserted an extra {\cf DELAY} for an argument for the
{\cf integral} function, which required changing the function to expect a
delayed argument, which required changing all other calls to it.

With {\cf FUTURE} and {\cf TOUCH}, we have the same problem.  Consider the
following attempt at a ``{\cf mapcar}''-like procedure that does things in
parallel:
\beginlisp
(define map-flist (lambda (proc lst)
    (if (null? lst)
        nil
        (cons (FUTURE (proc (car lst)))
              (FUTURE (map-flist proc (cdr lst)))))))
\endlisp
Unfortunately, when we evaluate
\beginlisp
(map-flist square '(1 2 3))
\endlisp
we do not get the list \mbox{\cf (1 4 9)}--- we get a pair whose car is a
future for 1, and whose cdr is a future for another pair, whose car is a
future for 4, and ... and so on.  So, for example, we cannot write:
\beginlisp
(map-flist square (map-flist square '(1 2 3)))
\endlisp
 This will result in several errors:  the outer {\cf square} function will
try to multiply two futures together instead of numbers.  In a recursive
call, we'll try to take the {\cf car} and {\cf cdr} of a future.

We have no option but to define a new kind of sequence (let us call them
``flists'') just as we had to introduce streams, and we have to redo all the
attendant functions---  {\cf cons-flist}, {\cf car-flist}, ...,
{\cf map-flist}, {\cf filter-flist} etc.
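A sketch of what such definitions might look like (the names are ours):
\beginlisp
(define cons-flist cons)
\null
(define car-flist (lambda (fl)
    (TOUCH (car fl))))      ; wait for the element's value
\null
(define cdr-flist (lambda (fl)
    (TOUCH (cdr fl))))      ; wait for the rest of the flist
\endlisp
With these, {\cf map-flist}, {\cf filter-flist}, etc.\ would be written in
terms of {\cf car-flist} and {\cf cdr-flist} instead of {\cf car} and {\cf
cdr}.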

\subsection{Implicit {\tt TOUCHes}}

When we introduced {\cf FUTURE} and {\cf TOUCH},  we did not change any other
part of the language.  Thus, a primitive operator like {\cf +} expects
numbers as arguments and, so, if we are given two futures $f_1$ and $f_2$, we
had to explicitly touch them, like so:
\beginlisp
(+ (touch $f_1$) (touch $f_2$))
\endlisp

An alternative would be to change the semantics of all the strict primitive
operators, such as {\cf +}, so that their arguments can optionally be passed
in as futures.\footnote{
 This is the approach adopted by Halstead in Multilisp.
}
 The machine code for each such primitive operator must now be
changed to test if an argument is a future or not, and touch it if it is.
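In Scheme terms, the effect is as if every argument reference in a strict
primitive were filtered through a test like the following (a sketch; {\cf
future?} is a hypothetical predicate that recognizes futures):
\beginlisp
(define touch-if-future (lambda (x)
    (if (future? x)
        (TOUCH x)     ; a future: wait for its value
        x)))          ; an ordinary value: use it directly
\endlisp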

The code for our previous example is now:
\beginlisp
(define count-atoms-r (lambda (lst)
    (if (null? lst)
        0
        (if (atom? lst)
            1
            (let
                  ((f1 (FUTURE (count-atoms-r (car lst))))
                   (f2 (FUTURE (count-atoms-r (cdr lst)))))
              (+ f1 f2))))))
\endlisp
i.e., we have got rid of the {\cf TOUCH} forms in the last line.

With this change, we find that programs are modular, once more, i.e., we can
treat objects and ``futured'' objects uniformly.  For example, the form
\beginlisp
(let
      ((f1 (FUTURE e1))
       (f2 (FUTURE e2)))
  (+ f1 f2))
\endlisp
is again equivalent to:
\beginlisp
(+ (FUTURE e1) (FUTURE e2))
\endlisp
and {\cf mapcar}, {\cf filter}, ... will work on futured and unfutured
objects, alike.

\subsection{The cost of implicit {\tt TOUCH}es}

Of course, there is no free lunch.  Using implicit {\cf TOUCH}es, we have
restored modularity and elegance into the language for the programmer, but we
have introduced a performance penalty.  Now, every strict primitive operator
(such as {\cf +}), every time it refers to an argument, must execute
additional instructions that test if the argument is a future or not (and do
the appropriate synchronization if it is).  This can be a {\em major\/}
overhead.

% ----------------------------------------------------------------

\section{Side-effects, revisited}

\label{side-effects-redux}

\subsection{More insight into the problem}

In Part I, Section 5, of these notes, we pointed out that side-effects may be
especially tricky under parallel evaluation, because they lead to ``race
conditions''.  We illustrate the problem here again using an example based on
the simple bank balance from Section 3.1.1 of the textbook:
\beginlisp
(define balance 100)
\null
(define withdraw (lambda (amount)
    (set! balance (- balance amount))
    (list amount balance)))
\endlisp
(To simplify the presentation, we are not checking for errors, etc.)

Now, suppose we had a parallel evaluator, and two customers were executing
the form at the same time:
\beginlisp
{\rm Joe:}   ... (withdraw 10) ...
{\rm Moe:}   ... (withdraw 15) ...
\endlisp
A problem arises because the {\cf set!} form actually consists of two
separate activities--- first, computing \mbox{\cf (- balance amount)} and
then updating the binding for {\cf balance}.  In a parallel evaluator, the
following sequence of events is possible:
\begin{center}
\begin{tabular}{l|l}
Joe & Moe \\
\hline
... & ... \\
{\cf (- 100 10)}         & ... \\
...                      & {\cf (- 100 15)} \\
{\cf (set! balance 90)}  & ... \\
...                      & {\cf (set! balance 85)} \\
... & ...
\end{tabular}
\end{center}
Thus, the final balance is 85, an outcome that the bank is likely to be very
unhappy about, since it has just handed out \$25 out of the original \$100.

Thus, we need to ensure that the entire {\cf set!} activity is an {\em
atomic\/} action, i.e., indivisible, so that either Joe's entire transaction
precedes Moe's, or vice versa, but they never interleave.

Some more common terminology:  The {\cf set!} part of the program is also
known as a {\em critical section\/}, and the requirement that no more than
one processor can execute it at a time is known as a {\em mutual exclusion\/}
requirement.

It is not enough for our evaluator somehow to arrange for {\cf set!} alone to
be performed atomically.  Critical sections can encompass larger program
regions.  For example, suppose we had two accounts, a money-transfer
procedure, and a balance-inquiry procedure:
\beginlisp
(define balance2 100)
(define balance3 200)
\null
(define transfer-2-to-3 (lambda (amount)
  (set! balance2 (- balance2 amount))
  (set! balance3 (+ balance3 amount))
  (list amount balance2 balance3)))
\null
(define tot-balance (lambda ()
  (+ balance2 balance3)))
\endlisp
Now, suppose we are executing the following in parallel:
\beginlisp
{\rm Joe:}   ... (transfer-2-to-3 50) ...
{\rm Moe:}   ... (tot-balance) ...
\endlisp
Again, the following sequence of events can occur:
\begin{center}
\begin{tabular}{l|l}
Joe & Moe \\
\hline
... & ... \\
{\cf (set! balance2 (- 100 50))}  & ... \\
...                               & {\cf (+ 50 200)} \\
{\cf (set! balance3 (+ 200 50))}  & ... \\
... & ...
\end{tabular}
\end{center}
Moe will find that the total balance is \$250 instead of \$300. In this
example, the two {\cf set!} forms should be executed atomically, and together
constitute a critical section.

Actually, this problem arises at another level in both programs.  In the {\cf
withdraw} program, it is possible for both processors to execute their {\cf
set!}s before the first processor can evaluate \mbox{\cf (list amount
balance)}, so that the first processor reports the wrong balance.  A similar
problem occurs in the second program.

\subsection{Locks}

To address this issue, we introduce a new kind of an object called a {\em
lock\/}, and a special form:
\beginlisp
(HOLDING <lock>
    <expression>
    ...
    <expression>)
\endlisp
 When a processor P evaluates this form, it first tries to ``acquire'' the
lock. Only one processor at a time can hold a lock.  If some other processor
Q currently holds the lock, P must wait until Q releases it.  When P finally
acquires the lock, it proceeds to execute the {\cf <expression>}s in
sequence, returning the value of the last one.  When the last expression has
returned a value, P releases the lock.

In general, there can be many processors waiting for a lock, but only one
processor may hold it at a time.  When a lock is released and there are
several waiting processors, only one of them gets it, and the remaining
processors continue to wait.

We assume a procedure:
\beginlisp
(define make-lock (lambda () ...))
\endlisp
that creates and returns a new lock.

We can now solve our first problem as follows.
\beginlisp
(define balance 100)
(define lock (make-lock))
\null
(define withdraw (lambda (amount)
    (HOLDING lock
       (set! balance (- balance amount))
       (list amount balance))))
\endlisp
 Now, when Joe and Moe try to withdraw money at approximately the same time,
one of them acquires {\cf lock}, performs the transaction,  and releases
the lock, at which time the other can acquire it and perform his entire
transaction.

We can solve the second problem as follows:
\beginlisp
(define balance2 100)
(define balance3 200)
(define lock2 (make-lock))
\null
(define transfer-2-to-3 (lambda (amount)
  (HOLDING lock2
    (set! balance2 (- balance2 amount))
    (set! balance3 (+ balance3 amount))
    (list amount balance2 balance3))))
\null
(define tot-balance (lambda ()
  (HOLDING lock2
    (+ balance2 balance3))))
\endlisp

\subsection{Implementation of locks}

Locks can be implemented using a mechanism similar to that for I-structure
cells.  First:
\beginlisp
(define make-lock (lambda ()
    (cons 'free nil)))
\endlisp
i.e., a lock is simply a pair with a flag initialized to {\cf FREE} and an
empty waiting list of processors.

In the evaluator, we assume two new instructions for dealing with locks.  The
instruction
\beginlisp
(acquire-lock <lock>)
\endlisp
when executed in processor P, advances the program counter, and tests if the
lock is in the {\cf FREE} state.  If it is free, P simply continues (it is at
the next instruction), after setting the lock flag to {\cf BUSY}.  If the lock is
already {\cf BUSY},  P is added to the waiting list in the lock, and P is
taken off the ready list of processors.

The instruction:
\beginlisp
(release-lock <lock>)
\endlisp
 when executed in processor P always succeeds and continues at the next
instruction.  The lock must be in the {\cf BUSY} state, since P must have
previously acquired it.  If the waiting list on the lock is empty, then the
lock flag is set to {\cf FREE}. Otherwise, a processor (say, Q) on the
waiting list is put back on the ready list of processors, and the waiting
list updated to omit Q.  Note that Q will be at the instruction just
following the {\cf acquire-lock} instruction that put it on the waiting list.
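Given these two instructions, the {\cf HOLDING} special form itself can be
implemented along the following lines (a sketch only; the evaluation of the
lock expression and of the body sequence follows the usual patterns in the
evaluator):
\beginlisp
EV-HOLDING
  ;;; Assume EXP: (HOLDING <lock> e1 ... eN)
  ...                              ; evaluate <lock>; VAL: the lock
  (acquire-lock (fetch val))
  (save val)                       ; remember the lock
  ...                              ; evaluate e1 ... eN in sequence
  (restore exp)                    ; EXP: the lock
  (release-lock (fetch exp))
  (goto (fetch continue))          ; VAL: value of eN
\endlisp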

\subsection{Other issues}

With locks, we have a first step towards dealing with side-effects in a
parallel evaluator.  However, we have barely scratched the surface of this
issue here.

First, raw locks are too primitive, too unstructured a mechanism.  It is
still up to the programmer to introduce locks and use them correctly.  It is
easy to make mistakes: we could have forgotten to use the lock in the {\cf
tot-balance} procedure, or we could have used the lock only for the {\cf
set!}s and forgotten to enclose the third, {\cf (list ...)} expression,
again leading to a consistency problem.  In general, we would like more
powerful abstractions than raw locks.
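The {\cf HOLDING} form used above is itself a small step in this direction.
One plausible treatment (a sketch only) is to regard
{\cf (HOLDING <lock> <e1> ... <en>)}
as shorthand for code that brackets the body with the two lock instructions:
\beginlisp
(begin
  (acquire-lock <lock>)
  (let ((result (begin <e1> ... <en>)))
    (release-lock <lock>)
    result))
\endlisp
so that the release, at least, cannot be forgotten---though the programmer
must still choose which expressions to enclose.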

Second, we have introduced another {\em deadlock\/} problem.  Suppose, instead
of a single lock guarding both accounts, we had two locks, one for each
account:
\beginlisp
(define balance2 100)
(define lock2 (make-lock))
(define balance3 200)
(define lock3 (make-lock))
\endlisp
We might redo our transfer and inquiry procedures as follows:
\beginlisp
(define transfer-2-to-3 (lambda (amount)
  (HOLDING lock2
    (HOLDING lock3
      (set! balance2 (- balance2 amount))
      (set! balance3 (+ balance3 amount))
      (list amount balance2 balance3)))))
\null
(define tot-balance (lambda ()
  (HOLDING lock3
    (HOLDING lock2
      (+ balance2 balance3)))))
\endlisp
Note that the two procedures happen to acquire the locks in the opposite
order.  Now, it is possible that processor P1, executing the transfer
procedure, acquires {\cf lock2} and then tries to acquire {\cf lock3}.
Meanwhile, processor P2, executing the inquiry procedure, may have acquired
{\cf lock3} and is trying to acquire {\cf lock2}.  Again, we will have a
situation where both procedures are holding one lock, and neither can make
progress because they need a lock that the other one holds.
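One standard discipline that avoids this particular deadlock is to fix a
global order on locks and always acquire them in that order.  Here, if we
agree that {\cf lock2} is always acquired before {\cf lock3}, the inquiry
procedure would be rewritten:
\beginlisp
(define tot-balance (lambda ()
  (HOLDING lock2                ; same order as in transfer-2-to-3
    (HOLDING lock3
      (+ balance2 balance3)))))
\endlisp
Now no processor can hold {\cf lock3} while waiting for {\cf lock2}, so the
circular wait described above cannot arise.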

Or, consider the situation where, after acquiring a lock, a processor goes
off into an infinite loop, never releasing it, so that other processors wait
indefinitely for the lock.

Another problem is that of {\em fairness\/}.  Suppose processor P0 has
acquired a lock, and P1 and P2 are waiting for it.  When P0 releases it, P1
gets it, and P2 remains waiting.  A little later, P0 again joins the waiting
list with P2.  When P1 releases it, P0 gets it.  In this way, it is possible
for a processor like P2 to wait unreasonably long, or even forever (this is
called {\em starvation\/}).
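Starvation among processors already on the waiting list can be mitigated by
treating that list as a first-in-first-out queue: waiters are added at the
tail, and {\cf release-lock} wakes the processor at the head.  A sketch,
using the pair representation of locks above ({\cf enqueue-waiter!} and
{\cf dequeue-waiter!} are assumed names):
\beginlisp
(define enqueue-waiter! (lambda (lock P)        ; add P at the tail
  (set-cdr! lock (append (cdr lock) (list P)))))
\null
(define dequeue-waiter! (lambda (lock)          ; remove and return the head
  (let ((Q (car (cdr lock))))
    (set-cdr! lock (cdr (cdr lock)))
    Q)))
\endlisp
This guarantees first-come-first-served service among waiters, though it
does not prevent a processor like P0 from re-acquiring the lock repeatedly.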

We hope it is clear that dealing with side effects in a parallel evaluator is
an enterprise not to be taken lightly.  Unfortunately, it is beyond the scope
of this course to go into any greater depth on these issues.

% ----------------------------------------------------------------

\section{Real parallel machines: finite resources}

Our parallel computation model, so far, has been one of repeated {\em
sweeps\/} across all ``ready'' processors.  We had a concept of a ``time
step'' and a ``ready list'' of all processors that were ready to execute an
instruction.  At each time step, we executed one instruction from each of the
ready processors, and constructed a new ready list.

This model is highly idealized, and is only useful in that it gives us some
intuition about the time-independent, resource-independent aspects of
parallel programs, mechanisms and processes.  The characteristics and
behavior of a real machine are likely to be very different.  It is beyond the
scope of these notes to explore all these issues in any level of detail; we
mention some of them here just to give the reader a flavor of what is
involved.

First of all, it is unlikely that all instructions take the same time to
execute, so processors will not advance in lock-step, instruction by
instruction.

Second, a real machine will not have an infinite supply of processors.  Thus,
the entities that we have called ``processors'' are, in reality, ``logical
processors'', or ``processes'', and we must distinguish these from the
``physical'' processors that actually exist in a real machine.  In a real
machine, when we have more logical than physical processors, we must somehow
arrange to {\em multiplex\/} the available physical processors amongst the
logical processors, i.e., $m$ physical processors must do the work of $n$
logical processors.  Apart from the fact that this will lengthen the
computation because some things that were done in parallel are now done
sequentially, it will also lengthen the computation because it will introduce
some management overhead, i.e., extra instructions to do the book-keeping
and scheduling activity of the multiplexing.
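A crude form of such multiplexing is round-robin scheduling: each physical
processor repeatedly takes the next logical processor off a shared ready
queue, runs it for one step, and requeues it if it is still runnable.  In
sketch form (all of {\cf ready-queue}, {\cf dequeue!}, {\cf enqueue!},
{\cf run-one-instruction!} and {\cf runnable?} are assumed names):
\beginlisp
(define pp-loop (lambda ()
  (let ((lp (dequeue! ready-queue)))    ; next logical processor
    (run-one-instruction! lp)           ; run it for one step
    (if (runnable? lp)                  ; requeue unless blocked or finished
        (enqueue! ready-queue lp))
    (pp-loop))))
\endlisp
The extra queue operations per instruction are precisely the management
overhead mentioned above.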

Multiplexing $m$ physical processors PP amongst $n$ logical processors LP,
where $m<n$, is almost always a challenging task, and introduces new problems
in its own right.  For example, at each instant in time, how do we choose
which $m$-subset of the LPs gets to use the PPs?  Typically, some LPs are
more important, or more crucial to the computation, than others---how do we
recognize them?  A bad choice, for example, can schedule an LP that reads
from an empty cell while the LP that writes into that cell is not yet
scheduled.  Thus, this kind of multiplexing can introduce deadlock into an
otherwise deadlock-free program.

Similarly, we need to make decisions like, ``LP45 will run on PP33''.  How do
we avoid the situation where all of the LPs are assigned to run on a few PPs
while the other PPs sit idle?  This is called the ``load balancing'' problem.

Third, we have assumed that spawning a new processor is instantaneous, i.e.,
it happens within one instruction, and that access to data is uniformly fast.
However, in a real machine, spawning a logical processor would involve, among
other things, transporting the processor state to the physical processor that
is to execute it, which may be on the other side of the machine.  Similarly,
how do we ensure that a logical processor runs on a physical processor that
is ``near'' the data that it will access?

These kinds of ``resource management'' issues are major topics for research.
In fact, one might say that they are the {\em central\/} issues in parallel
computation.

% ----------------------------------------------------------------

\mbox{}\hrulefill\mbox{}

\vspace{2cm}

{\bf References}

...

\mbox{}\hrulefill\mbox{}

\vspace{2cm}

{\bf A note to the reader}

These are a first cut at notes on parallelism for 6.001.  If you have any
comments, criticisms, complaints, suggestions, etc., I would appreciate it if
you would communicate them to me.  Thank you.

R.S.Nikhil, \\
MIT Laboratory for Computer Science, \\
545 Technology Square, Cambridge, MA 02139, USA

(617)-253-0237 \\
{\tt nikhil@xx.lcs.mit.edu}

\end{document}
