\documentstyle[psfig]{article}
\setlength{\textheight}{9in}
%\setlength{\columnsep}{1.8pc}
\setlength{\textwidth}{6.5in}
\setlength{\footheight}{0.0in}
\setlength{\topmargin}{0.0in}
\setlength{\headheight}{0.0in}
\setlength{\headsep}{0.0in}
\setlength{\oddsidemargin}{0in}
\setlength{\parindent}{1pc}

\title{SCANMACS --- Scan Macros for Regularly Distributed Arrays}

\author{Peter A. Dinda}

\begin{document}

\bibliographystyle{acm}

\maketitle

\begin{abstract}
SCANMACS is a portable set of C macros for instantiating families of
high performance scan (parallel prefix) functions for use on
regularly distributed arrays.  An implementation of the scan intrinsics
in the HPF standard library is described, including modifications to
the compiler.
\end{abstract}

\section{Introduction}

This paper describes SCANMACS, a portable set of C macros for
instantiating families of high performance scan (parallel
prefix~\cite{BLELLOCH-THESIS}) functions for parallel computers.
These functions can perform segmented and masked scans along any
dimension of regularly distributed arrays.  SCANMACS was used to build
an HPF~\cite{HPFF93} scan intrinsics library for the Fx~\cite{fx-pdt}
compiler.  Because of SCANMACS relation to HPF, we first describe HPF
regular distributions and scan intrinsics.  Next, the operation of
SCANMACS generated scan functions, the implementation of SCANMACS, and
its limitations are discussed.  Finally, we comment on testing
procedures and the performance of a stereo vision step coded with
scans.

\section{HPF}

The High Performance Fortran specification~\cite{HPFF93} defines a set
of parallel prefix and suffix intrinsic functions as part of the HPF
standard library.  These functions can operate on any regularly
distributed array.

\subsection{Regular distributions}
\label{sec:regdist}

Each dimension of an HPF array can be distributed across an arbitrary
number of processors such that the indices a processor owns can be
described by a four-tuple $(f,l,b,s)$, where $f$ and $l$ are the first
and last indices owned by the processor, $b$ is the size of each block
of indices owned, and $s$ is the stride between blocks.  At the
extremes are pure-block distributions, where each processor owns a
single block, and pure-cyclic distributions, where each block contains
only one index.  Block-cyclic distributions are also
possible. Multiple dimensions may distributed, in which case the total
number of processors the array is distributed over is the product of
the number of processors each dimension is distributed over.  For
example, suppose a $512 \times 512$ array is cyclic-distributed over
four processors in its first dimension and block-distributed over four
processors in its second dimension.  Then processors 0-3 collectively
own the first 128 columns of the array, and processors 4-7 own the
next 128 columns.  The actual indices processor 1 owns are $(1:509:4,
1:128:1)$, meaning the indices $\{(i,j) : i=1,5,9,...,509;
j=1,2,3,...128\}$

\subsection{Scan intrinsics}
\label{sec:scanintr}

The HPF specification defines a number of library procedures,
including a set of scan functions for arrays with an arbitrary number
of dimensions, any number of which may be distributed in any regular,
block--cyclic manner.  These have the form
\\
\centerline{$op$\_\{pre,suf\}fix($array, dimension, mask, segment, exclusive$)}
\\
where $op$ is one of
\begin{itemize}
\item sum: plus scan of numeric array
\item product: multiply scan of numeric array
\item maxval: max scan of numeric array
\item minval: min scan of numeric array
\item copy: copy scan of any type array
\item all: AND scan of LOGICAL array
\item any: OR scan of LOGICAL array
\item count: count number of .TRUE.s
\item parity: parity scan of LOGICAL array
\item iall: Bit-wise AND scan of INTEGER array
\item iany: Bit-wise OR scan of INTEGER array
\item iparity: Bit-wise parity scan of INTEGER array
\end{itemize}
Each scan be be either a prefix or suffix operation.  If $exclusive$
is true, then the scan is a prescan (an element is not included in the
computation of its corresponding output element.)  $Mask$ and
$segment$ are LOGICAL arrays of the same size and shape as $array$.
The scan occurs along the specified dimension, including those
elements for which $mask$ is true, and restarting wherever there is a
transition in the values of $segment$.  All the arguments, with the
exception of $array$ are optional (for the LOGICAL scans, $mask$ takes
the place of $array$.)  If $mask$ does not appear, then all elements
are included.  If $segment$ does not appear, then the scan is
unsegmented.  A missing $exclusive$ means that the scan output
elements include their corresponding input elements.  Notice that
these default behaviors are quite reasonable.  However, a missing
$dimension$ argument requires surprising behavior --- the array must
be treated as a one-dimensional vector, with the scan order following
the storage order.  

\section{SCANMACS}

SCANMACS provides scan functions that have an interface similar to
that of the HPF scan intrinsic functions.  SCANMACS scan functions can
scan an array which has a regular distribution along each of its
dimensions.  Segmented and masked scans are provided as are prescans.
In an effort to improve efficiency, the SCANMACS calling convention
differs from the HPF scan intrinsics in several ways.  The most
important difference is that the target, segment and mask arrays must
all have the same distribution as the source array.  Further, SCANMACS
does not support scanning a multidimensional array in storage order.

\subsection{Scheme}

SCANMACS does scans ``in-place'' --- the array distribution never
changes.  This means the amount of parallelism depends on the
distribution of the array and on the dimension scanned.  It is assumed
that the local portion of the array is ``packed.''  Conceptually, the
local portion of the array consists of some number of vectors along
the scan dimension.  

A SCANMACS scan has five phases.  First, there is a local up scan
phase, where each processor scans the blocks of the scan dimension
that it owns, resulting in a vector of scan-out values.  For a
block-distributed scan dimension, this vector consists of one value
for each of the scan dimension vectors the processor owns.  In
general, the scan-out vector consists of contiguous subvectors, one
subvector for each of the blocks the processor owns along each of the
scan dimension vectors.

The second phase is to combine the scan-out vectors using a combining
tree.  At each node of the tree, the scan operator is applied to
corresponding elements of the two incoming scan-out vectors.  A stack
is used to collect the left-hand scan-out vectors for the trip down
the tree.  It is important to note that it is the assignment of
processors to dimensions of the array that determines the number of
trees and the number of leaf nodes.  For example, consider a two
dimensional array with the first dimension distributed over four
processors and the second over eight.  Then a scan along the first
dimension will result in eight combining trees, each with four leaf
nodes, and a scan along the second will result in four combining
trees, each with eight leaf nodes.

At the top of the tree, the third phase kicks in.  Because each
processor may own multiple blocks along the scan dimension, it is
necessary to scan the subvectors of the final vector value in find the
initial scan-in values for each of the blocks.  

The fourth phase begins with each processor initializing a scan-in
(which has the same size and structure as the scan-out vectors) to the
identity.  Next is a trip down the tree, element-wise updating the
scan-in vectors with the incoming scanout vectors.  Like the second
phase, this step is identical to that of a conventional description
of a scan, except that the operator is applied element-wise to
vectors.

The final phase is to locally element-wise update each block with the
block's associated scan-in element.

Masks are implemented locally entirely in the first phase.  Segments
are handled in each phase, with each scan-out element having a
corresponding segment element in a segment vector that is also passed
up and down the tree.  One complexity is that the semantics of the
segment argument are that a segment begins on each {\em transition} of
the segment array along the scan dimension.  This means that given any
block of the segment array along the scan dimension, it is impossible
to determine whether the scan should restart at the first element of
the block without looking at the previous element, which may be on
another processor.  The way this is handled in SCANMACS is that every
block has associated with it the segment array values corresponding to
the leftmost and rightmost elements of the block, and a flag
indicating whether the scan restarts in the block.  These values are
updated in the obvious way when blocks are combined.


\subsection{Example}

Consider the two dimensional $512 \times 512$ array of
section~\ref{sec:regdist}.  Recall that the array's first dimension is
cyclic-distributed over four processors and its second dimension is
block-distributed over four processors.  Suppose we scan along the
second dimension.  On each processor, the result of the first phase
will be a vector of 128 scan-out values, one for each row index the
processor owns.  Each element corresponds to the single block the
processor owns in the corresponding row.  The second phase will result
in four trees, each with four leaf nodes.  The vectors passed up the
tree are each 128 elements long.  At the top of a tree, we have the
scan-outs of each of the 128 rows associated with the tree.  The third
phase effectively does nothing since each processor owns only a single
block along the scan dimension.  At the end of the fourth phase, we
have the scan-ins for each of the 128 blocks each processor owns, and
the fifth phase distributes this scan-in element-wise over the block,
completing the scan.

Scanning along the first dimension changes the numbers.  On each
processor, the result of the first phase is $128^2$ values, scan-outs
of each of the 128 single-element blocks the processor owns along each
of the 128 columns it owns.  The second and fourth phases are as
before: there are four trees, each with four leaf nodes, but the
vector exchanged is $128^2$ values long.  The third phase locally
scans each of the 128 elements of each of the 128 subvectors. In the
final phase, the $128^2$ scan-ins are distributed over each
corresponding single-element block.

\subsection{Implementation}

SCANMACS is a set of C macros for instantiating families of scan
functions.  It is easiest to explain with an example.  Suppose a scan
function family for double precision floating point arrays was needed.
The programmer would simply write
\begin{verbatim}
#include <scan.h>

#define SEND(processor,buffer,size) /* Appropriate send function */
#define RECV(processor,buffer,size) /* Appropriate receive function */

#define plus(x,y) ((x)+(y))

MAKE_SCAN_FAMILY(double,plus,0.0)   /* datatype, operator, identity */
\end{verbatim}
which would instantiate the scan functions
\begin{verbatim}
int plus_scan_double();
int plus_scan_double_seg();
int plus_scan_double_mask();
int plus_scan_double_seg_mask();
int plus_prescan_double();
int plus_prescan_double_seg();
int plus_prescan_double_mask();
int plus_prescan_double_seg_mask();
\end{verbatim}
which are scan and prescan functions for all combinations of segment
and mask.  Also created are Fortran compatible wrappers for each of
the functions, which add the prefix \verb.fx_..  To save space, the
prototypes of the functions are not given here, but can be generated
by the including the file \verb_scaninterface.h_ and using the macros
\verb.MAKE_PROTO. and \verb.MAKE_FX_PROTO..  There is a hierarchy of
macros, so it is possible to create a specific function or even purely
local versions of the functions.  For example \verb.MAKE_SCAN.  would
build only the first of the functions, while \verb.MAKE_LOCAL_UP_SCAN.
would build a local-only version of the first function,
\verb.local_plus_scan_double_up().  Segment and mask arrays should be
of type \verb.LOGICAL., and should only take the values
\verb.LOGICAL_TRUE.  and \verb.LOGICAL_FALSE. to facilitate use
with Fortran, which has two distinct values for false and true,
instead of C's zero and everything else. The simpler scan functions
are faster than the more complex ones, so it behooves one to use the
simplest function possible.

All scan functions (including local scans) have an array distribution
descriptor parameter, the format of which is given in the scan header
file. In addition to requiring the programmer to define send and
receive functions (not necessary for local-only scans), the stack
memory allocator \verb.alloca. must be available.  SCANMACS does all
memory allocation on stack for better performance.  With these two
caveats, SCANMACS is portable.  

\subsection{Limits}

SCANMACS requires that the target, mask, and segment arrays have the
same distribution as the source array.  The current implementation
also requires that, along each dimension, that the product of the
block size and the number of processors divides the size of the
dimension.  Loosening the restriction to only requiring that each
processor own the same number of blocks should be reasonably easy.
Currently, SCANMACS does not support parallel suffix operations,
although adding the support will require only a marginal amount of
work, but much copying and pasting.  Finally, SCANMACS does not
support scanning an array in storage order.  This can be accomplished
with a simple redistribution so that the final (outermost) dimension
is block-distributed, then using SCANMACS on the array as if it had a
single dimension.  In general, I feel that redistribution is a job
better left to a compiler or to a companion redistribution library.

\section{Fx interface}

To implement HPF intrinsics for Fx, SCANMACS was used to create 
function families for the HPF parallel prefix operators.  The Fx compiler
was modified to compile assignment expressions of the form
\begin{verbatim}
distributed_array = *_PREFIX(...)
\end{verbatim}
into an initialization of an array descriptor, and call to the
appropriate scan function.  The prefix call must have all the
parameters specified in \ref{sec:scanintr}, with missing segment and
mask arguments being represented with the value \verb:.TRUE.:.  The
dimension and exclusive arguments must be supplied.  

Currently, this is the only idiom supported, and even it is limited by
the distribution requirements specified above.  However, this is mostly
due to time constraints.  Ultimately, the scan intrinsics will be better
incorporated in the compiler to ease the limitations and support other
idioms.  The updated compiler was tested on the Paragon as well as on
DEC Alpha workstations.  However, the scan library was only built and
tested on the Paragon.

\section{Testing} 

The HPF scan library built using SCANMACS was tested on the Paragon
with \verb.REAL. plus scans for a variety of one and three dimensional
arrays, with distributions including block, cyclic, and general
block-cyclic.  All combination of segments and masks were tested and found
to operate well.

\section{A simple application}

As promised in my proposal, I implemented the windowed error sum
calculation step of Okutomi-Kanade stereo vision~\cite{okkanstereo}
using scans and tested it on 64 processors of the Paragon.  I compared
it with a naive parallel loop implementation with the following
results:
\\
\centerline{
\begin{tabular}{|l|l|l|} 
\hline
Images/iteration & Naive PDO & Scan \\
\hline
1   & 4.7 images/sec & 40.4 images/sec \\
16  & 4.3 images/sec & 73.6 images/sec \\
\hline
\end{tabular}
}
\\
The first column indicates the number of images processed simultaneously.
For each image, the scan versions performs the following steps, inspired
by \cite{algvision}:
\begin{enumerate}
\item plus-prescan along each row
\item shift the rows right by the window's horizontal size
\item compute the point-wise difference between the shifted row 
      and the unshifted row
\item plus-scan along each column of the result
\item shift the columns up by the window's vertical size
\item compute the point-wise difference between the shifted and
      unshifted columns
\end{enumerate}
For the 16 image case, the scans run along the second and third
dimensions of a three dimensional array.  Since the PDO implementation
is especially naive due to time constraints, the measurements should
not be taken too seriously.  However, it is interesting to note how
the scan version scales with the number of images.

\section{Conclusion and future work}

SCANMACS, a set of C macros for instantiating scan functions, was
described.  SCANMACS was used to build an HPF scan library for the Fx
compiler.  In addition to filling out the features of SCANMACS and the
compiler interface to the library, an obvious move is to a C++
implementation.  The current low level implementation is due to a
desire for high performance, especially to be sure that the compiler
could recognize opportunities for superscalar execution in the local
vector operations.  It would be interesting to see if similar
performance could be achieved by a C++ template class.


\bibliography{/afs/cs/project/iwarp/member/fx/bib/texbib,/afs/cs/project/iwarp/member/jass/bib/all,local}

\end{document}


