% Part I Notes on Parallelism for 6.001, Fall 1988
% Nikhil, November, 1988

\documentstyle[12pt]{article}

% HORIZONTAL MARGINS
% Left margin 1 inch (0 + 1)
\setlength{\oddsidemargin}{0in}
% Text width 6.5 inch (so right margin 1 inch).
\setlength{\textwidth}{6.5in}
% ----------------
% VERTICAL MARGINS
% Top margin 0.5 inch (-0.5 + 1)
\setlength{\topmargin}{-0.5in}
% Head height 0.25 inch (where page headers go)
\setlength{\headheight}{0.25in}
% Head separation 0.25 inch (between header and top line of text)
\setlength{\headsep}{0.25in}
% Text height 9 inch (so bottom margin 1 in)
\setlength{\textheight}{9in}
% ----------------
% PARAGRAPH INDENTATION
\setlength{\parindent}{0in}
% SPACE BETWEEN PARAGRAPHS
\setlength{\parskip}{\medskipamount}
% ----------------
% STRUTS
% HORIZONTAL STRUT.  One argument (width).
\newcommand{\hstrut}[1]{\hspace*{#1}}
% VERTICAL STRUT. Two arguments (offset from baseline, height).
\newcommand{\vstrut}[2]{\rule[#1]{0in}{#2}}
% ----------------
% EMPTY BOXES OF VARIOUS WIDTHS, FOR INDENTATION
\newcommand{\hm}{\hspace*{1em}}
\newcommand{\hmm}{\hspace*{2em}}
\newcommand{\hmmm}{\hspace*{3em}}
\newcommand{\hmmmm}{\hspace*{4em}}
% ----------------
% VARIOUS CONVENIENT WIDTHS RELATIVE TO THE TEXT WIDTH, FOR BOXES.
\newlength{\hlessmm}
\setlength{\hlessmm}{\textwidth}
\addtolength{\hlessmm}{-2em}

\newlength{\hlessmmmm}
\setlength{\hlessmmmm}{\textwidth}
\addtolength{\hlessmmmm}{-4em}
% ----------------
% ``TIGHTLIST'' ENVIRONMENT (no para space between items, small indent)
\newenvironment{tightlist}%
{\begin{list}{$\bullet$}{%
    \setlength{\topsep}{0in}
    \setlength{\partopsep}{0in}
    \setlength{\itemsep}{0in}
    \setlength{\parsep}{0in}
    \setlength{\leftmargin}{1.5em}
    \setlength{\rightmargin}{0in}
    \setlength{\itemindent}{0in}
}
}%
{\end{list}
}
% ----------------
% CODE FONT (e.g. {\cf x := 0}).
\newcommand{\cf}{\footnotesize\tt}
% ----------------
% INSTRUCTION POINTER
\newcommand{\IP}{$\bullet$}
\newcommand{\goesto}{$\longrightarrow$}
% ----------------------------------------------------------------
% LISP CODE DISPLAYS.
% Lisp code displays are enclosed between \bid and \eid.
% Most characters are taken verbatim, in typewriter font,
% Except:
%  Commands are still available (beginning with \)
%  Math mode is still available (beginning with $)

\outer\def\beginlisp{%
  \begin{list}{$\bullet$}{%
    \setlength{\topsep}{0in}
    \setlength{\partopsep}{0in}
    \setlength{\itemsep}{0in}
    \setlength{\parsep}{0in}
    \setlength{\leftmargin}{1.5em}
    \setlength{\rightmargin}{0in}
    \setlength{\itemindent}{0in}
  }\item[]
  \obeyspaces
  \obeylines \footnotesize\tt}

\outer\def\endlisp{%
  \end{list}
  }

{\obeyspaces\gdef {\ }}

% ----------------
% ILLUSTRATIONS
% This command should specify a NEWT directory for ps files for illustrations.
\def\psfileprefix{/usr/nikhil/parle/}
\def\illustration#1#2{
\vbox to #2{\vfill\special{psfile=\psfileprefix#1.ps hoffset=-72 voffset=-45}}} 

% \illuswidth is used to set up boxes around illustrations.
\newlength{\illuswidth}
\setlength{\illuswidth}{\textwidth}
\addtolength{\illuswidth}{-7pt}

% ----------------------------------------------------------------
% SCHEME CLOSURES AND PROCEDURES

% CLOSURES: TWO CIRCLES BESIDE EACH OTHER; LEFT ONE POINTS DOWN TO CODE (arg 1)
% RIGHT ONE POINTS RIGHT TO ENVIRONMENT (arg 2)
\newcommand{\closure}[2]{%
\begin{tabular}[t]{l}
\raisebox{-1.5ex}{%
  \setlength{\unitlength}{0.2ex}
  \begin{picture}(25,15)(0,-7)
   \put( 5,5){\circle{10}}
   \put( 5,5){\circle*{1}}
   \put( 5,5){\vector(0,-1){10}}
   \put(15,5){\circle{10}}
   \put(15,5){\circle*{1}}
   \put(15,5){\vector(1,0){12}}
  \end{picture}}
  \fbox{\footnotesize #2} \\
%
\hspace*{0.8ex} \fbox{\footnotesize #1}
\end{tabular}
}

% PROCEDURES: BOX CONTAINING PARAMETERS (arg 1) AND BODY (arg 2)
\newcommand{\proc}[2]{%
\begin{tabular}{l}
params: #1 \\
body: #2 \\
\end{tabular}
}

% ----------------------------------------------------------------
% HERE BEGINS THE DOCUMENT

\begin{document}

\begin{center}
MASSACHUSETTS INSTITUTE OF TECHNOLOGY \\
Department of Electrical Engineering and Computer Science \\
6.001 Structure and Interpretation of Computer Programs \\
Fall Semester, 1988

{\Large\bf Parallel Programs and Machines}

R.S.Nikhil

November 30, 1988
\end{center}

% ----------------------------------------------------------------

\section{Introduction}

Parallelism is a new topic in 6.001, introduced this term (Fall 1988) for the
first time.  Parallelism is not covered in the textbook (i.e., ``Structure
and Interpretation of Computer Programs'', Harold Abelson and Gerald Jay
Sussman, with Julie Sussman, MIT Press, 1985), so these notes are meant to
fill the gap.


There is almost universal agreement that future computers will be parallel
computers and will be programmed using parallel programming languages.
Unfortunately, there is very {\em little\/} consensus beyond that on the exact
nature of those parallel programming languages and machines.
This is a rich area of research, and many approaches are currently being
pursued.  In 6.001, we will study the topic at a very general and abstract
level.  In particular, we will study some parallel versions of Scheme and
some parallel evaluators that are based on the ``Explicit Control Evaluator''
of the textbook (Section 5.2).

% ----------------------------------------------------------------

\section{Why is parallelism important?}

There are countless problems in science, engineering, business, etc., that
are infeasible to solve on even the fastest computers available today,
because those computers are still far too slow.  Examples include:  weather prediction,
simulating the aerodynamics of new automobile or airplane bodies without wind
tunnels, simulating planetary motion, very high-speed bank-card transaction
processing, etc.

It is clear that we need to improve the performance of our computers by
several orders of magnitude.  In the past four decades, the main way we have
been speeding up our computers is by improving circuit technology (raw speed
and miniaturization).  Architectural and compiler-based innovations
(pipelining, vector processing, RISC) have also helped.  These advances are
exciting and continue to happen, but they cannot realistically be expected to
produce the required {\em orders of magnitude\/} speed improvements in any
cost-effective manner.

The solution that most people look to is parallelism, i.e., instead of
speeding up a single machine, replicate it, and get the ensemble to {\em
cooperate\/} in solving a problem by solving pieces of it.  The goal is that
to increase the speed of the computer, we need only increase the {\em
number\/} of processors and memories, since, for any widget that we
manufacture, it is generally ``easier'' to produce twice as many than it is
to improve any one of them by a factor of two.  Of course, increasing their
individual speeds will continue to be beneficial.

% ----------------------------------------------------------------

\section{Algorithmic sources of parallelism in programs}

Consider the following two algorithms for computing the factorial of a
number $n$:
\begin{description}

\item[{\em Algorithm 1:\/}] \mbox{}
\begin{tightlist}
\item If $n=1$, the answer is 1.
\item If $n>1$, the answer is $n \times {\it factorial}(n-1)$.
\end{tightlist}

\item[{\em Algorithm 2:\/}] \mbox{} \\
The answer is ${\it product}(1,n)$, where the algorithm to compute ${\it
product}(l,m)$ is:
\begin{tightlist}
\item If $l=m$, the answer is $m$.
\item If $l+1=m$, the answer is $l \times m$.
\item If $l+1 < m$, the answer is ${\it product}(l,j) \times {\it
product}(j+1,m)$, where $j = \left\lfloor (l+m)/2 \right\rfloor$.
\end{tightlist}

\end{description}
In Algorithm 1, we basically set up a linear chain of $n$ multiplications that
must be done one after the other, so that it will take $O(n)$ time.

In Algorithm 2, we use a divide-and-conquer strategy.  To find the factorial
of 8, for example, we independently find the product of 1 through 4 and the
product of 5 through 8, and then multiply them together.  And, recursively,
to find the product of 1 through 4, we independently find the product of 1
through 2 and the product of 3 through 4, and then multiply them together.
Thus, we set up a {\em tree\/} of multiplications instead of a linear chain,
and it is possible to compute the factorial in $O(\log n)$ time.
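Algorithm 2 transcribes directly into Scheme.  The following is only a
sketch; we assume {\cf quotient}, which here computes
$j = \left\lfloor (l+m)/2 \right\rfloor$:
\beginlisp
(define product (lambda (l m)
  (cond
    ((= l m) m)
    ((= (+ l 1) m) (* l m))
    (else (let ((j (quotient (+ l m) 2)))
            (* (product l j)
               (product (+ j 1) m)))))))
\null
(define factorial (lambda (n) (product 1 n)))
\endlisp
The two recursive calls \mbox{\cf (product l j)} and \mbox{\cf (product (+ j 1) m)}
are exactly the independent sub-computations that a parallel evaluator could
perform simultaneously.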

The lesson: for the same mathematical function, the choice of algorithm can
have a dramatic effect on what can be done in parallel.  The courses 6.046,
6.848, and 6.849 explore issues in parallel algorithms much further.

Another issue is the size of the data on which we operate.  Clearly, even in
Algorithm 2 above, the amount of parallelism available depends on $n$.

The holy grail of parallel processing is ``linear speedup'', i.e., by
doubling the size of the machine (numbers of processors and memories), we
should get double the speed.  Unfortunately, this is, in general,
unachievable.  First, as described above, the structure of the program itself
will always place a theoretical limit on how much parallelism is available to
be exploited.  Second, a larger machine must be physically larger, and it
will always take finite time to transport data from one part of a machine to
another, and this time spent is an overhead that we can try to minimize, but
never eliminate.  Finally, there are serious difficulties in marshalling and
managing many parallel activities--- as anyone who has ever tried to manage a
large and unruly organization will appreciate, it is very difficult to keep
all processors busy doing useful work all the time!

% ----------------------------------------------------------------

\section{Operational sources of parallelism in programs}

Once we have chosen a particular algorithm, there are still many sources of
parallelism in our choice of {\em execution mechanisms\/}.

\subsection{Parallel evaluation of sub-expressions in a combination}

Consider the following Scheme expression (or its lambda-equivalent, shown to
the right):
\beginlisp
(let                           ((lambda (x y) (* x y)) (+ 3 4) (- 12 7))
    ((x (+ 3 4))
     (y (- 12 7)))
  (* x y))
\endlisp
 In sequential implementations of Scheme, the addition and subtraction
expressions are evaluated, one at a time, in some {\em unspecified, but
sequential\/}, order.   However, in a parallel machine, we could conceive of
the two arguments being evaluated simultaneously.

This kind of parallelism can have a multiplicative effect, due to recursion.
Consider the following program to square all the leaves in a general list
structure:
\beginlisp
(define map-square (lambda (lst)
  (cond
    ((null? lst) nil)
    ((atom? lst) (square lst))
    (else (cons (map-square (car lst))
                (map-square (cdr lst)))))))
\endlisp
Suppose we give it the balanced binary tree
\beginlisp
(((2 . 5) . (13 . 11)) . ((6 . 3) . (4 . 7)))
\endlisp
We soon encounter the form \mbox{\cf (cons (...) (...))}, which calls for the
simultaneous evaluation of these two forms:
\beginlisp
(map-square \fbox{((2 . 5) . (13 . 11))})
(map-square \fbox{((6 . 3) . (4 . 7))})
\endlisp
This, in turn, would call for the simultaneous evaluation of
\beginlisp
(map-square \fbox{(2 . 5)})
(map-square \fbox{(13 . 11)})
(map-square \fbox{(6 . 3)})
(map-square \fbox{(4 . 7)})
\endlisp
which, in turn, would call for the simultaneous evaluation of
\beginlisp
(square \fbox{2})
(square \fbox{5})
(square \fbox{13})
(square \fbox{11})
(square \fbox{6})
(square \fbox{3})
(square \fbox{4})
(square \fbox{7})
\endlisp
Thus, all eight invocations of {\cf square} could execute in parallel. When
they are finished, the four {\cf cons}'es awaiting these results could
execute in parallel.  When they are finished, the two {\cf cons}'es awaiting
these results could execute in parallel.  And finally, the last {\cf cons}
could execute, and we return the resulting tree.

Note that, in principle, this computation could be finished in $O(\log n)$
time, where $n$ is the number of leaves in the tree.  Compare this with a
sequential implementation, which would take at least $O(n)$ time.
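The $O(\log n)$ claim can be checked by computing the depth of the tree of
calls, since each level of the tree can execute in parallel but must wait for
the level below it.  Here is a sketch, counting one time step for a {\cf
square} and one for each {\cf cons}, and assuming {\cf atom?} and {\cf max}
are available:
\beginlisp
(define parallel-time (lambda (lst)
  (if (atom? lst)
      1                                      ; the call to square
      (+ 1                                   ; the cons must wait for
         (max (parallel-time (car lst))      ; the two halves, which
              (parallel-time (cdr lst))))))) ; run in parallel
\endlisp
For the balanced 8-leaf tree above, this yields $1 + \log_2 8 = 4$ time
steps, whereas a sequential implementation performs all 8 squares and 7
conses one after the other.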

% ----------------------------------------------------------------

\section{Parallelism and side effects: a dangerous mixture!}

Suppose we wanted to write an ``instrumented'' version of a function that, in
addition to doing its usual thing, also recorded every argument it was
called with.  Here is an instrumented ``identity'' function, and some uses of
it:
\beginlisp
==> (define L nil)
{\em ()}
\null
==> (define id (lambda (x)
       (set! L (cons x L))
       x))
{\em F}
\null
==> (+ (id 10) (id 20))
{\em 30}
\null
==> L
{\em ?}
\endlisp
What answer should we get on the last line?

Even in sequential Scheme, since the order of evaluation of arguments in a
combination is not specified,  we do not know which of the expressions
\mbox{\cf (id 10)} and \mbox{\cf (id 20)} will be evaluated first.  Thus, we
could get either of these two answers:
\beginlisp
{\em (10 20)}        {\rm or}        {\em (20 10)}
\endlisp
 When we run the same program twice on the same sequential implementation of
Scheme, we generally expect the answer to be repeatable, i.e., two runs of
the same program will produce the same result.  However, different sequential
implementations of Scheme may choose different evaluation orders.  Thus,
such programs are said to be {\em indeterminate\/}, and it is not good style
to write programs in which answers depend on the order of evaluation of
arguments.\footnote{
 Some of this is in the eye of the beholder.  If we want to treat {\cf L} as
the ``{\em unordered set\/} of all arguments to {\cf id}'', then \mbox{\em
(10 20)} and {\em (20 10)} represent the same set, and, in this sense, the
answer ``does not depend'' on the order of the calls.
 }

In a parallel model of evaluation, we will have the same problem of
indeterminacy.  In fact, it is even worse, because two runs of the same
program on the {\em same\/} parallel implementation of Scheme may produce
different results, because factors such as the load on the machine (e.g., how
many other users are running their programs at the same time?) may result in
different parallel schedules on different runs.

But wait!  The problem gets much worse!  In a sequential implementation of
Scheme, either \mbox{\cf (id 10)} precedes \mbox{\cf (id 20)} or vice versa.
In a parallel implementation,  they can execute concurrently.  Let's focus on
the expression:
\beginlisp
       ...
       (set! L (cons x L))
       ...
\endlisp
 Recall the evaluation rules for the {\cf set!} special form.  First, we
evaluate the expression part, {\em i.e.\/}, \mbox{\cf (cons x L)}.  Then, we
update the binding for the identifier, {\em i.e.\/}, {\cf L}.  If we imagine
``time'' going downward, we can depict the schedule of events as follows:
\beginlisp
    ...
    {\em Evaluate\/} (cons x L)
    {\em Update binding for\/} {\cf L}
    ...
\endlisp
Now, since there are two concurrent evaluations of {\cf id}, it is possible that
their activities get interleaved as follows:
\begin{center}\footnotesize
\begin{tabular}{l|l}
Activity for {\cf (id 10)}          &    Activity for {\cf (id 20)} \\
\hline
...                                &    \\
{\em Evaluate\/} {\cf (cons 10 L)}  &    ... \\
                                   &    {\em Evaluate\/} {\cf (cons 20 L)} \\
{\em Update binding for\/} {\cf L} &    ... \\
...                                &    {\em Update binding for\/} {\cf L}
\end{tabular}
\end{center}
Here, in each of the ``{\em evaluate\/}'' steps, the value of {\cf L} is
still {\cf nil}.  The {\cf cons}'es thus produce {\cf (10)} and {\cf (20)},
respectively.  Assuming the {\em update\/} activities go in the order shown,
the final result is
\beginlisp
{\em (20)}
\endlisp
Similarly, we could also have got the result:
\beginlisp
{\em (10)}
\endlisp
Neither result could have occurred in the sequential implementation!

This situation is called a {\em race condition\/}, i.e., one activity races
to perform an update before some other activity tries to read it.  For
correct execution of the above program, we want the {\em evaluate\/} and {\em
update binding\/} parts of a {\cf set!} to execute {\em atomically\/}, i.e.,
as one, indivisible event.  More about this later.

% ----------------------------------------------------------------

\section{The language we will consider for parallel execution}

Since side-effects are so dangerous in a parallel language, we will
temporarily banish them from our lexicon--- no {\cf set!}, {\cf set-car!} or
{\cf set-cdr!}.   We will return to this issue in Part II of these notes.

Similarly, we are also going to banish internal {\cf DEFINE}s from our
language.  {\cf DEFINE}s are only allowed at the top-level.

To summarize, the language we will deal with has only top-level {\cf
DEFINE}s of the form:
\beginlisp
(define {\em symbol\/} {\em expression\/})
\endlisp
and expressions have the following forms:
\begin{center}
\begin{tabular}{ll}
{\em Constants\/}    & {\cf 0, 23, 3.14, ..., \#t, "Apres Scheme", ...} \\
{\em Symbols\/}      & {\cf x, fact, mapcar, car, nil, cons, +, ...} \\
{\em Quoted\/}       & {\cf '{\em expr\/}} \\
{\em Lambdas\/}      & {\cf (lambda ($x_1$ ... $x_N$) $b_1$ ... $b_M$)} \\
{\em Lets\/}         & {\cf (let (($x_1$  $e_1$) ... ($x_N$ $e_N$)) $b_1$ ... $b_M$)} \\
{\em Ifs\/}          & {\cf (if $e_1$ $e_2$ $e_3$)} \\
{\em Applications\/} & {\cf ($e_1$ ... $e_N$)}
\end{tabular}
\end{center}
Note that we have {\cf if} but not {\cf cond}.

\subsection{Preprocessing}

For every expression entered at the top-level, we assume three pre-processing
steps.  The code for this pre-processing may be found in Appendix
\ref{pre-processor}.

\subsubsection{Moving quotes inward to symbols only}

A quoted list-structured expression, such as:
\beginlisp
'(10 (A 20) 30)
\endlisp
is expanded out into:
\beginlisp
(cons 10
      (cons (cons 'A (cons 20 nil))
            (cons 30
                  nil)))
\endlisp

{\em Rationale\/}: Our representation for ``cons cells'' is going to change
to include ``synchronization flags'', etc.  A quoted list structure still
uses ordinary cons cells.  Thus, we need to transform it into explicit calls
to {\cf cons} in order to build the right structures.

\subsubsection{{\tt LET}s to {\tt LAMBDA}s}

This is the standard transformation:

\begin{minipage}[t]{2in}
\beginlisp
(let (($x_1$ $e_1$)
      ...
      ($x_N$ $e_N$))
  $b_1$
  ...
  $b_M$)
\endlisp
\end{minipage}\hfill
$\Longrightarrow$\hfill
\begin{minipage}[t]{3in}
\beginlisp
((lambda ($x_1$ ... $x_N$) $b_1$ ... $b_M$)
 $e_1$ ... $e_N$)
\endlisp
\end{minipage}

{\em Rationale\/}: One fewer special form to deal with in the evaluator.

\subsubsection{Lexical Addressing}

(This is also covered in Section 5.3.7 of the textbook.  Please refer to that
section for details.)

Recall that in the environment model of computation, an {\em environment\/}
is a sequence of {\em frames}; each frame is a sequence of {\em bindings\/},
and each binding is a \mbox{({\em symbol\/} . {\em value\/})} pair.  To
look up a symbol in an environment, we look for its nearest binding in the
sequence of frames.

It turns out that, for lambda-bound variables (i.e., ignoring {\cf DEFINE}s),
we do not need the symbol at all in order to look up its value.  This is
because, with Scheme's lexical scoping rules, we can predict the exact
position in the environment structure where we will find the binding for a
given symbol.  A ``position'' is specified by two numbers $i$ and $j$,
meaning, go up to the $i^{\em th}$ frame, and go over to the $j^{\em th}$
binding. 
Thus, every lambda-bound variable can be replaced by the new special form:
\beginlisp
(lookup $i$ $j$)
\endlisp

Here is an example of the transformation:

\begin{minipage}[t]{1.75in}
\beginlisp
((lambda (x y)
  (lambda (a b c d e)
     ((lambda (y z)
         (* x y z))
      (* a b x)
      (+ c d x))))
3
4)
\endlisp
\end{minipage}\hfill%
$\Longrightarrow$\hfill%
\begin{minipage}[t]{4in}
\beginlisp
((lambda
  (lambda
     ((lambda
         (* (lookup 2 0) (lookup 0 0) (lookup 0 1)))
      (* (lookup 0 0) (lookup 0 1) (lookup 1 0))
      (+ (lookup 0 2) (lookup 0 3) (lookup 1 0)))))
3
4)
\endlisp
\end{minipage}

Note that we don't need the formal parameter lists in lambdas any more.

{\em Rationale\/}: Lexical addressing is really independent of parallelism.
In fact, it is a standard technique in any reasonable sequential
implementation.  The reason we introduce it here will become apparent in Part
II of these notes, where we discuss a ``non-strict'' evaluator.  There, we
would like to build a frame for the arguments of an application {\em
before\/} we know the value of the procedure part, i.e., before we know the
names of the formal parameters of the procedure.  The lexical-addressing
transform essentially eliminates the need for formal parameter names.

% ----------------------------------------------------------------

\section{From individual processors to multiprocessors}

We are going to extend the sequential, single-processor explicit-control
evaluator described in the textbook.  In this section, we will look at what
it means to have multiple processors, what it means to ``run'' a
multiprocessor, how new processors come into existence, and how they
terminate, or die.

\subsection{Processors}

First, let us clarify our model of the register machine used by the
sequential explicit-control evaluator.  We're going to think of it as
consisting of two separate modules:
\begin{tightlist}

\item A {\em processor\/}, which is a set of nine registers, and

\item a {\em memory\/} which contains the controller code, the stack and the
heap.

\end{tightlist}
The nine registers are the usual seven ({\cf exp}, {\cf env}, {\cf val}, {\cf
fun}, {\cf unev}, {\cf argl}, and {\cf continue}), plus two more: a {\cf
program-counter} and a {\cf stack}.

The {\cf stack} register contains the list of things currently saved in the
stack. Thus, one can think of a {\cf save} instruction:
\beginlisp
(save <reg>)
\endlisp
as equivalent to:
\beginlisp
(assign stack (cons (fetch <reg>) (fetch stack)))
\endlisp
and the restore instruction:
\beginlisp
(restore <reg>)
\endlisp
as equivalent to:
\beginlisp
(assign <reg> (car (fetch stack)))
(assign stack (cdr (fetch stack)))
\endlisp
Now that the stack is simply a list held in a register, we can do things such
as holding on to the current stack contents from one of the other registers.

The {\cf program-counter} register contains the list of instructions from the
controller code that begins at the instruction to be executed next.  At each
time-step, the machine executes one instruction--- the instruction at the car
of the list in the {\cf program-counter} register--- and stores a new list of
instructions into the {\cf program-counter}.  Most of the time, this new list
is the cdr of the previous list.  For \mbox{\cf (goto $L$)} and successful
\mbox{\cf (branch <test> $L$)} instructions, the new list is the list of
instructions beginning at the label $L$.
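In the same spirit as the {\cf save}/{\cf restore} equivalences above, one
can think of ordinary instruction sequencing as:
\beginlisp
(assign program-counter (cdr (fetch program-counter)))
\endlisp
and of \mbox{\cf (goto $L$)} as assigning to the {\cf program-counter} the
list of instructions beginning at the label $L$.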

All these lists--- the controller code, the stack contents, and the actual
lists manipulated by the program being executed--- reside in the {\em
memory\/} of the machine.

Thus, a {\em processor\/}, or a {\em task}, is just a set of nine registers.
Creating a new processor means creating a new set of nine registers.

\subsection{Multiple processors}

To generalize our model to a parallel machine, we assume that we have many
processors, i.e., many sets of nine registers.  In fact, we will assume that
we can have as many processors as we like--- we can snap our fingers and
conjure up a new processor at will.  Thus, it no longer makes sense to say
``the {\cf exp} register'';  we have to say, ``the {\cf exp} register in
such-and-such processor.''  Similarly, we can no longer say: ``execute the
instruction \mbox{\cf (assign val (fetch exp))}''; we have to say, ``execute
it in such-and-such processor.''

All our processors share the same memory, i.e., the lists that their
registers refer to are all in the single, common memory.   Thus, it is
possible for more than one processor to be referring to the same object in
memory.  In fact, this is exactly how processors will communicate with each
other--- one processor will write a value into some cell that is read by
another processor.\footnote{
 Thus, our model of the machine is a so-called ``shared memory'' model.  We
emphasize that this is only an abstract view--- do {\em not\/} extrapolate
this into a physical view of all memory being located in a single physical
module,  because this raises fearful visions of
``bottlenecks'' and other nasties.
}

\subsection{Sweeps, critical paths, parallelism profiles, etc.}

What does it mean to execute an instruction in a particular processor?  As
usual, we examine its {\cf program-counter} register.  If it has an empty
list of instructions, i.e., it has reached the end of the the controller
code, then this processor simply disappears, goes poof!, dies.  Otherwise, we
execute the first instruction specified in the {\cf program-counter}.  For
most instructions, this modifies one or more of the other registers in the
processor.  Finally, we advance the program counter appropriately.  For one
particular instruction ({\cf spawn}, see below), we will also create a new
processor.

Thus, executing an instruction in a processor may result in zero, one or two
processors.

The overall view of execution of the multiprocessor machine is the following.
We initialize the machine to have one processor containing the main,
top-level expression that we want to evaluate.  Then, we repeatedly conduct a
{\em sweep\/} on the machine.  We consider each sweep as taking one ``time
step''.

A sweep at time $j$ is described as follows.  Let $S_j$ be the set of
processors that are alive.  We will execute {\em one\/} instruction in each
of the processors in $S_j$.  For each such instruction executed, we will put
the resulting zero, one or two processors into a new set $S_{j+1}$.

As we repeatedly perform sweeps, the number of processors will grow and
shrink--- some terminate, some do ordinary instructions, some spawn new
processors.  At some time $n$, there will be zero processors alive, i.e.,
all processors will have terminated (unless, of course, the program has an
infinite loop).  At this time, we say that the entire program (or
multiprocessor machine) has terminated.  We call $n$ the {\em critical
path\/} of the program, i.e., it is the shortest time necessary to execute
this program.

We can draw a graph that plots $\left| S_j \right|$ (the number of processors
alive at time step $j$) versus $j$.  This is called the {\em parallelism
profile\/} of the program, i.e., it tells us, for each time step $j$, how
many things could be done in parallel for that program.
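This sweep discipline can itself be summarized in Scheme.  This is a sketch
only: {\cf execute-one-instruction} is a hypothetical procedure that executes
one instruction in a given processor and returns the list of zero, one or two
resulting processors:
\beginlisp
(define run (lambda (S j)            ; S: set of live processors at time j
  (if (null? S)
      j                              ; all dead: j is the critical path
      (run (sweep S) (+ j 1)))))
\null
(define sweep (lambda (S)
  (if (null? S)
      nil
      (append (execute-one-instruction (car S)) ; 0, 1 or 2 processors
              (sweep (cdr S))))))
\endlisp
The length of {\cf S} at each step is exactly the parallelism profile
described above.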

\subsection{``Cold-starting'' a processor}

We need a convention by which a new processor that has just been created
knows what to do, i.e., knows what it is supposed to accomplish.
The very first two instructions in the controller code are these:
\beginlisp
  (restore continue)
  (goto (fetch continue))
\endlisp
 Thus, we assume that, somehow, the creator of this processor has initialized
the {\cf stack} of this new processor with suitable contents.  The top
element on the stack must be a label, so that the effect of these first two
instructions is to jump to that label.  Subsequent instructions at that label
will deal with the rest of the objects in the stack.

\subsection{The {\tt SPAWN} instruction to create a new processor}

We introduce one new instruction in our controller code:
\beginlisp
  (spawn <{\em stack-contents}>)
\endlisp
Its effect is to create a new processor with the given initial stack
contents.  We often refer to the processor executing the {\cf spawn}
instruction as the ``parent'' and the newly created processor as
the ``child''.

For example, suppose a processor P$_1$ wants to spawn a new processor P$_2$
that should evaluate the expression \mbox{\cf '(f x)} in the environment
which is currently in P$_1$'s {\cf env} register, print the result and stop.
P$_1$ may execute the following code:
\beginlisp
  (assign val nil)
  (assign val (cons (fetch env) (fetch val)))
  (assign val (cons '(f x) (fetch val)))
  (assign val (cons EVAL-PRINT-AND-DIE (fetch val)))
  (spawn (fetch val))
\endlisp
Basically, P$_1$ builds up the value of P$_2$'s stack in its own {\cf val}
register, and spawns a new processor (P$_2$) using that value.  Meanwhile,
P$_1$ goes on its merry way, doing other things.

When P$_2$ starts executing, its stack, therefore, contains the label
{\cf EVAL-PRINT-AND-DIE}, the expression \mbox{\cf '(f x)}, and an
environment, in that order.  The effect of the first two instructions in the
controller code is to pop the label and jump to it.  So, P$_2$ arrives at
the label {\cf
EVAL-PRINT-AND-DIE}, with the stack containing the expression and the
environment.  The code at that label looks like this:
\beginlisp
EVAL-PRINT-AND-DIE
  (restore exp)
  (restore env)
  (assign continue PRINT-AND-DIE)
  (goto EVAL-DISPATCH)
\endlisp
i.e., it sets up the contract for {\cf EVAL-DISPATCH} by setting up the {\cf
exp}, {\cf env} and {\cf continue} registers and going there.  Thus, when
{\cf EVAL-DISPATCH} has done its thing, P$_2$ ends up at the label {\cf
PRINT-AND-DIE} with the value sitting in the {\cf val} register.

The code at {\cf PRINT-AND-DIE} looks like this:
\beginlisp
PRINT-AND-DIE
  (perform (print (fetch val)))
  (goto STOP)
\endlisp
where the label {\cf STOP} is the {\em last\/} item in the controller code.
When P$_2$ gets there, therefore, it dies.

% ----------------------------------------------------------------

\section{A ``Strict'' Parallel Explicit-Control Evaluator}

We now have enough mechanism to specify a parallel evaluator for Scheme.  The
objective in this section is to design the controller code so that it has
the following behavior.  It will have the usual {\cf EVAL-DISPATCH} code,
performing the usual evaluation of expressions, {\em except\/} when it
encounters a combination:
\beginlisp
(e1 ... eN)
\endlisp
 At this point, it will spawn $N-1$ new processors to evaluate {\cf e2}
through {\cf eN} in parallel, and, concurrently, it will continue to evaluate
{\cf e1}.  The first $N-1$ processors that finish their evaluations will just
die. When the $N^{\em th}$ processor finishes its
evaluation, we know that we have all the necessary values
\beginlisp
v1 ... vN
\endlisp
where {\cf v1} must be a procedure object.  This processor continues to {\cf
APPLY-DISPATCH} to perform the application of the procedure object to the
argument values.

The crux of the change from the sequential evaluator is in the section of
controller code beginning at label {\cf EV-APPLICATION} and ending at {\cf
APPLY-DISPATCH}.

The main questions are:  How do we spawn off these processors?  Where do they
store their result values?  How do they know whether or not they are the last
ones to finish their evaluations?

\subsection{The frame, and the process structure}

The central data structure we use to manage this parallel activity is a {\em
frame\/}.  Please note that this frame is {\em different\/} from the frame
used in the sequential evaluator.

To handle an application
\beginlisp
(e1 e2 ... eN)
\endlisp
we will build a frame that looks like this:
\beginlisp
(N (FULL . <v1 target>) (FULL . <vN target>) ... (FULL . <v2 target>))
\endlisp
 i.e., it is a list of length N+1.  The first component is a counter, i.e., a
number initialized to N, the length of the application form.  The remaining
components are cons cells.  Each of these cells has the symbol {\cf FULL} in
the car, and the cdr is the slot where an evaluated component of the
application will go.

As indicated earlier, there are going to be N processors, each evaluating one
component of the application {\cf e1} through {\cf eN}.  Since these
component expressions may vary widely in their complexity, the order in
which these processors finish their respective evaluations is {\em
unpredictable\/}.  Thus, we need some synchronization mechanism that
tells us when all the N evaluations are done.  This is the role of the
counter.

The meaning of the counter is this.  When it has the value $i$, it means that
N$-i$ components have been evaluated (with their values placed in their
designated target slots), and $i$ components are still in the process of
being evaluated.  Thus, the counter is initialized to N; it is decremented by
1 each time a component value becomes ready; so, when it reaches 0, all
components must be ready, and we can then proceed safely to do the {\em
apply\/}.

Consider the processor P that is responsible for {\cf eJ}.  Its mission in
life is this.   After evaluating {\cf eJ} to value {\cf vJ}, it will write
that value into the appropriate target cell.  Then, it will decrement the
counter, and test its value.  If it is non-zero, then the processor P simply
dies.  If it is 0, it continues, going on to perform the {\em apply\/}.

To summarize the big picture for the process structure:  A processor
encounters the application form. It spawns off N$-1$ processors to evaluate
the argument expressions, and itself continues, evaluating the operator
expression.  The processor has thus become N processors.  The first N$-1$
that finish their evaluation will die.  The last one continues, to perform
the {\em apply\/}.
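
This process structure can be mimicked with ordinary threads.  The following Python sketch is only an analogy (threads standing in for processors; all the names are invented): each spawned evaluator writes its value into its slot, atomically decrements the counter, and only the last one to finish goes on to the apply.

```python
import threading

def parallel_apply(exprs, evaluate, apply_proc):
    # exprs = [e1, e2, ..., eN]; evaluate(e) -> value.
    n = len(exprs)
    counter = [n]               # stands in for the frame's counter
    lock = threading.Lock()     # makes decrement-and-fetch atomic
    slots = [None] * n          # stand in for the frame cells
    result = []

    def worker(j):
        slots[j] = evaluate(exprs[j])  # write value into target cell
        with lock:                     # atomic decrement-and-fetch
            counter[0] -= 1
            remaining = counter[0]
        if remaining == 0:
            # last one to finish performs the apply
            result.append(apply_proc(slots[0], slots[1:]))
        # otherwise this "processor" simply dies

    # spawn N-1 processors for e2 .. eN ...
    threads = [threading.Thread(target=worker, args=(j,))
               for j in range(1, n)]
    for t in threads:
        t.start()
    worker(0)                   # ... while this one evaluates e1
    for t in threads:
        t.join()
    return result[0]
```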

What are the {\cf FULL} symbols in the frame meant for?  Ignore them for now; 
they have no purpose at all in this evaluator.  They are used in the
``non-strict'' evaluator, which we will discuss in Part II of these notes.
They are here only for compatibility between the two evaluators.

Note the reversed order of the cells for {\cf vN} through {\cf v2}.  There
is no deep reason for this.  Recall that, in the sequential evaluator, the
argument list was built in reverse order into the {\cf argl} register.  The
only reason was to avoid extra consing--- had we wanted to keep the arguments
in the right order, we would have had to use {\cf append}, which does a lot
of consing.  The reason for the reversed order here, in this evaluator, is
the same.

\subsection{Information needed to spawn an argument evaluation}

What information does each spawned processor need?  Of course, it needs the
argument expression {\cf eJ} that it is responsible for.  For the evaluation
environment, all the processors need the environment E which was in the {\cf
env} register when we came into {\cf EV-APPLICATION}.

Each task needs a reference to the frame cell into which it should store the
value that it produces.

Each task needs a reference to the frame itself, for two reasons.  First, it
needs to access the counter, in order to decrement and test it, and the
counter is the first thing in the frame.  Second, if the counter goes to
zero, then that task is responsible for going on to do the {\em apply\/},
for which it needs access to all the component values of the application.

Finally, since any of the N$-1$ spawned tasks may end up doing the {\em
apply\/}, all of them need the information needed to do it.  Thus, all of
them need to share the original stack as it was when we came into {\cf
EV-APPLICATION}, and all of them need to know the label $L$ that was in the {\cf
continue} register at that time.

Thus, to spawn an argument processor, we need to initialize it with a stack
containing:
\beginlisp
EVAL-ARG  ({\em Label at which to start work on the arg\/})
eJ        ({\em Arg expression\/})
E         ({\em Environment\/})
<{\em reference to frame cell for vJ\/}>
<{\em reference to entire frame\/}>
L         ({\em continuation to go to after application done\/})
...
<{\em original stack\/}>
...
\endlisp

\subsection{Spawning the argument evaluations}

Here is the code for spawning the arguments.  There is a prelude that builds
the common part of the stack used for all the spawned arguments:

\beginlisp
EV-APPLICATION
  (save continue)

;;; --- Spawn arguments; VAL, FUN and CONTINUE are used as temporaries
  (assign val (length (fetch exp)))   ; N, length of application
  (assign val (cons (fetch val) nil)) ; first cell of frame (with counter)
  (save val)
  (assign fun (fetch stack))          ; keep common part of stack in FUN

  (assign unev (cdr (fetch exp)))
  (assign exp  (car (fetch exp)))
  (assign argl nil)                   ; the argument list
\endlisp

Each time around {\cf SPAWN-ARGS-LOOP}, we build up the stack to have the
form we have discussed above, and spawn a processor, after which we reset the
stack to the common part.

\beginlisp
SPAWN-ARGS-LOOP
  (branch (null? (fetch unev)) EVAL-OP)

;;; --- Spawn an argument
  (assign val  (make-frame-cell))                  ; a frame cell
  (assign argl (cons (fetch val) (fetch argl)))    ; cons it into arg list
  (save val)                                       ; save frame cell
  (save env)                                       ; save environment
  (assign val (car (fetch unev)))                  ; an argument
  (save val)                                       ; save it
  (assign continue EVAL-ARG)                       ; save
  (save continue)                                  ;   EVAL-ARG label
  (spawn (fetch stack))
;;; --- Spawned.

  (assign stack (fetch fun))                       ; restore stack to common part
  (assign unev (cdr (fetch unev)))                 ; remaining args
  (goto SPAWN-ARGS-LOOP)
\endlisp

Now, the parent task goes on to evaluate the operator expression.
\beginlisp
EVAL-OP
  (restore unev)                                 ; counter cell
  (save unev)                                    ;
  (assign val (make-frame-cell))                 ; frame cell for e1
  (assign argl (cons (fetch val) (fetch argl)))  ; cons into arg list
  (perform (set-cdr! (fetch unev) (fetch argl))) ; link counter cell to rest of frame
  (save val)
  (assign continue SET-VALUE)
  (goto EVAL-DISPATCH)
\endlisp

Each spawned argument-evaluation begins here.
\beginlisp
;;; Assume: stack: (exp env frame-cell frame-handle ...)
;;; Effect: eval(exp, env) into val, at SET-VALUE

EVAL-ARG
  (restore exp)
  (restore env)
  (assign continue SET-VALUE)
  (goto EVAL-DISPATCH)
\endlisp

All N processors come to {\cf SET-VALUE} after they have evaluated their
component into the {\cf val} register.
\beginlisp
;;; Assume: VAL: v  STACK: (frame-cell frame ...)
;;; Effect: Store v in frame-cell, pop stack.
;;;         Decrement arg-count in frame. If not 0, stop.
;;;         Else goto READY-TO-APPLY
SET-VALUE
  (restore exp)                                             ; frame cell
  (perform (set-frame-cell-value! (fetch exp) (fetch val)))
  (restore exp)
  (assign val (decr-arg-count-and-fetch! (fetch exp)))
  (branch (zero? (fetch val)) READY-TO-APPLY)
  (goto STOP)
\endlisp
The {\cf decr-arg-count-and-fetch!} function decrements the counter, and
also returns the new value, as one atomic action.  Note that we could not
have done it as two separate instructions, like this:
\beginlisp
  (perform (decr-arg-count (fetch exp)))
  (assign val (fetch-arg-count (fetch exp)))
\endlisp
The reason is that there would be a race condition.  Suppose the counter had
the value 2, and the last two processors arrived at these instructions at exactly the
same time.  Both decrements occur together, so the counter goes to 0. Then,
both processors fetch the counter into their {\cf val} registers.  Unfortunately, this
means that {\em both\/} processors will think that they are the last ones home, and both will go on
to do the {\em apply\/}, and chaos will ensue as they step all over each
other!
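
The required atomicity can be sketched with an explicit lock.  In the following Python analogy (invented names; a real machine would use an atomic primitive), decrement-and-fetch is one indivisible step, so exactly one of the concurrent decrementers sees 0:

```python
import threading

class ArgCounter:
    # Sketch of an atomic decr-arg-count-and-fetch!, using a lock.
    def __init__(self, n):
        self.count = n
        self.lock = threading.Lock()

    def decr_and_fetch(self):
        # Decrement and read back as ONE atomic action.  Without the
        # lock, two "processors" could both decrement and then both
        # read 0: exactly the race described above.
        with self.lock:
            self.count -= 1
            return self.count

counter = ArgCounter(100)
winners = []

def worker():
    if counter.decr_and_fetch() == 0:
        winners.append(True)   # exactly one thread gets here

threads = [threading.Thread(target=worker) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# len(winners) is exactly 1
```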

The last surviving participant in the application arrives at {\cf
READY-TO-APPLY}, knowing that all the components have been evaluated.  This
processor simply sets up for {\cf APPLY-DISPATCH}, pulling the operator out
into {\cf fun} and the argument list into {\cf argl}, and goes there.
\beginlisp
;;; Assume: EXP: frame   STACK: continuation ...
READY-TO-APPLY
  (assign exp (cdr (fetch exp)))
  (assign fun (car (fetch exp)))
  (assign fun (frame-cell-value (fetch fun)))
  (assign argl (cdr (fetch exp)))
  (goto APPLY-DISPATCH)
\endlisp

\subsection{Conclusion}

In moving from the sequential to a parallel evaluator, all we needed was one
new instruction--- {\cf spawn}--- and one new, atomic function--- {\cf
decr-arg-count-and-fetch!}.  The only part of the code that we had to
change was the {\cf EV-APPLICATION} section, up to {\cf APPLY-DISPATCH}.

However, there is still more parallelism that we should be able to squeeze
out of our Scheme programs.  Consider the following program:
\beginlisp
(define f (lambda (x y) (+ (sqrt x) y)))
\null
(f 23 (sqrt 34))
\endlisp
In our current evaluator, we would evaluate {\cf f}, {\cf 23} and \mbox{\cf
(sqrt 34)} in parallel.  Then, we invoke the procedure, whose body evaluates
{\cf +}, \mbox{\cf (sqrt x)} (i.e., \mbox{\cf (sqrt 23)}) and {\cf y} in
parallel, and then returns the sum.  The total time for the program will be
at least the sum of the times for the two {\cf sqrt} computations, since
\mbox{\cf (sqrt 34)} must be finished before we can enter the body and begin
\mbox{\cf (sqrt 23)}.

If, however, we could enter the function body of {\cf f} without waiting for
all argument values to be ready, we could begin work on \mbox{\cf (sqrt 23)}
inside the function while \mbox{\cf (sqrt 34)} was still computing outside
the function.  The total time for the program then could be much shorter,
because the {\cf sqrt} computations are overlapped in time.
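
A back-of-the-envelope calculation makes the point, assuming (hypothetically) that the two {\cf sqrt} calls cost the same:

```python
# Timing sketch; the costs below are assumptions, not measurements.
t_sqrt23 = 5.0   # assumed cost of (sqrt 23), in seconds
t_sqrt34 = 5.0   # assumed cost of (sqrt 34), in seconds

# Strict: (sqrt 34) must finish before we enter f, and only then does
# (sqrt x) begin inside the body, so the costs add up.
strict_time = t_sqrt34 + t_sqrt23

# Non-strict: (sqrt x) can start inside f while (sqrt 34) is still
# computing outside, so the two costs overlap.
non_strict_time = max(t_sqrt34, t_sqrt23)
```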

We refer to these two behaviors as ``strict'' and ``non-strict'' evaluation.
We will look at the latter in more detail in Part II of these notes.

% ----------------------------------------------------------------

\section{To be continued ...}

In Part II of these notes, we will explore the following topics:
\begin{itemize}

\item
 The parallelism available from ``non-strict'' evaluation, i.e., allowing the
processor that evaluates the operator to charge on ahead to {\cf
APPLY-DISPATCH} even though the arguments may not have been evaluated fully
yet.

\item
 The programming implications of non-strict evaluation, and
 the connections with {\cf FORCE} and {\cf DELAY}.

\item
 Another version of the evaluator that implements non-strictness.

\item
 Orders of growth, revisited, under parallel models of computation.

\item
 Real parallel machines, where we have finite resources and can't conjure up
processors at will.

\item
 Languages with explicit, as opposed to implicit, parallelism, i.e., YOU tell
the machine exactly what you want it to do in parallel.

\item
 Side-effects, revisited: how to tame them (or at least live with them) in
parallel models of computation.

\end{itemize}

% ****************************************************************

\newpage

\appendix

\section{Pre-processor}

\label{pre-processor}

This pre-processor does three things:
\begin{tightlist}

\item Move quotes inward onto atoms

\item Convert {\cf LET}s to {\cf LAMBDA}s

\item Convert lambda-bound variables into \mbox{\cf (LOOKUP $i$ $j$)} forms

\end{tightlist}
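
To make the third transformation concrete, here is a hypothetical Python sketch mirroring the {\cf find-variable} and {\cf frame-lookup} procedures below:

```python
def find_variable(symbol, env):
    # env is a list of parameter-name frames, innermost first.
    # Return ("lookup", i, j), where i is how many frames out the
    # binding lives and j is its position within that frame; an
    # unbound symbol is returned unchanged (a free/global variable).
    for i, frame in enumerate(env):
        if symbol in frame:
            return ("lookup", i, frame.index(symbol))
    return symbol

# Inside (lambda (z) (lambda (x y) ...)), the inner body sees:
env = [["x", "y"], ["z"]]
find_variable("y", env)    # -> ("lookup", 0, 1)
find_variable("z", env)    # -> ("lookup", 1, 0)
find_variable("car", env)  # -> "car" (free variable, left alone)
```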

\beginlisp
(define pre-process (lambda (exp)
  (if (define? exp)
      (list (car exp)
	    (cadr exp)
	    (pre-process-expression (caddr exp) nil))
      (pre-process-expression exp nil))))
\null
(define pre-process-expression (lambda (exp env)
  (cond
    ((self-evaluating? exp) exp)
    ((symbol? exp) (find-variable exp env 0))
    ((quoted? exp) (convert-quoted (cadr exp)))
    ((lambda? exp) (cons (car exp)
			 (pre-process-sequence (cddr exp)
					       (cons (cadr exp) env))))
    ((let? exp) (pre-process-expression (convert-let-to-lambda exp)
					env))
    ((if? exp) (list (car exp)
		     (pre-process-expression (cadr exp) env)
		     (pre-process-expression (caddr exp) env)
		     (pre-process-expression (cadddr exp) env)))
    ((application? exp) (mapcar (lambda (e) (pre-process-expression e env))
				exp))
    (else (error "PRE-PROCESS-EXPRESSION: Unknown expression type" exp)))))
\null
(define pre-process-sequence (lambda (seq env)
    (if (null? seq)
	nil
        (cons (pre-process-expression (car seq) env)
	      (pre-process-sequence (cdr seq) env)))))
\null
(define find-variable (lambda (symbol env i)
  (if (null? env)
      symbol
      (let
            ((j (frame-lookup symbol (car env) 0)))
	(if (null? j)
	    (find-variable symbol (cdr env) (1+ i))
	    (list 'lookup i j))))))
\null
(define frame-lookup (lambda (symbol frame j)
  (cond
    ((null? frame) nil)
    ((eq? symbol (car frame)) j)
    (else (frame-lookup symbol (cdr frame) (1+ j))))))
\null
(define let? (lambda (exp)
  (if (atom? exp)
      nil
      (eq? (car exp) 'let))))
\null
(define convert-let-to-lambda (lambda (exp)
    (cons (cons 'lambda
		(cons (mapcar car (cadr exp))
		      (cddr exp)))
	  (mapcar cadr (cadr exp)))))
\null
(define convert-quoted (lambda (exp)
    (cond
      ((null? exp) (list 'quote nil))
      ((number? exp) exp)
      ((atom? exp) (list 'quote exp))
      (else (list 'cons (convert-quoted (car exp))
		        (convert-quoted (cdr exp)))))))
\endlisp

\end{document}
