\documentstyle[epsf]{article}

\def\N{n}
\def\D{d}
\def\W{w}
\def\CIN{c_{in}}
\def\COUT{c_{out}}
\def\FU{FU}

\oddsidemargin 0in
\evensidemargin 0in
\textwidth 6.5in 
\topmargin 0in
\headsep 0in
\headheight 0in
\textheight 8.75in
\footskip 1in

\title{
Parameterized Arithmetic Functional Units for Clockwise-skewed Integer 
Streams
}

\author{
Peter A. Dinda \footnote{pdinda@cs.cmu.edu} \\
School of Computer Science \\
Carnegie-Mellon University \\
Pittsburgh, PA 15213 
}

\begin{document}

\maketitle

%\input{psfig}


\begin{abstract}
%
% Fix
%
 Clockwise-skewed streams deliver integers in a least 
significant digit to most significant digit order. The paper describes add, 
accumulate, left-shift, and multiply functional units with single
cycle throughput for such streams.   These functional units are parameterized 
by the width of their input streams and the width of the digits of their
input streams.  
\end{abstract}

\section{Introduction}

A clockwise-skewed integer stream presents values in a least
significant digit to most significant digit order.  Functional units
for adding, accumulating, multiplying, left-shifting, and multiplexing
such streams are highly regular and their designs are easily
parameterized by stream width and digit width.  Further, they acheive
single cycle throughput with less combinational logic (although more
registers) than conventional non-skewed integer stream functional units.

These attributes make them a good match for FPGAs, which often have
significant flip-flop and RAM resources, but minimal available
combinational logic.  Finally, in the custom vector unit programming
model for reconfigurable machines \ref{reconfig}, where the cost of
skewing input streams and deskewing output streams can be amortized by
cascading many functional units.

\subsection{Organization}

Section \ref{skewedstreams} describes clockwise-skewed streams.
Section \ref{addacc} describes add and accumulate functional units for
skewed streams.  Section \ref{lshift} describes left shift functional
units, and section \ref{mpy} describes unsigned and signed multiply
functional units. Section \ref{reconfig} is a short summary of
reconfigurable machines and the custom vector unit programming model.
Finally, section \ref{conc} concludes with some discussion.

%
% If I can think of anything good by then...
% 


\section{Clockwise-skewed Streams}
\label{skewedstreams}

\begin{figure}
\label{unskewedandskewed}
\centerline{
\( 
  \begin{array}{ll}
     \begin{array}[t]{llll}
       d_{t+3,3} & d_{t+3,2} & d_{t+3,1} & d_{t+3,0} \\
       d_{t+2,3} & d_{t+2,2} & d_{t+2,1} & d_{t+2,0} \\
       d_{t+1,3} & d_{t+1,2} & d_{t+1,1} & d_{t+1,0} \\
       d_{t,3} & d_{t,2} & d_{t,1} & d_{t,0} 
     \end{array} \hspace{1in}
     \begin{array}[t]{llll}
        d_{t+3,3} & - & - & - \\
        d_{t+2,3} & d_{t+3,2} & - & - \\
        d_{t+1,3} & d_{t+2,2} & d_{t+3,1} & - \\    
        d_{t,3} & d_{t+1,2} & d_{t+2,1} & d_{t+3,0} \\
        - & d_{t,2} & d_{t+1,1} & d_{t+2,0} \\
        - & - & d_{t,1} & d_{t+1,0} \\
        - & - & - & d_{t,0} \\
     \end{array}
   \end{array}
\)
}
\caption{A conventional 4 digit stream (left) and a clockwise-skewed 
stream.  Time increases towards the top of the page.  }
\end{figure}

In a clockwise-skewed integer stream, the digits of a particular
integer arrive in least significant to most significant order.  If
digit $0$ arrives at time $t$, digit $1$ arrives at time $t+1$.  A
clockwise-skewed stream is parameterized by its width in bits, $N$,
and the width in bits of its digits, $D$, where $D$ evenly divides
$N$.  $ W=N / D $ is the width in digits of the stream.  $D$ digits
from $W$ different integers arrive every cycle.  Figure
\ref{unskewedandskewed} shows a conventional stream and its
corresponding clockwise-skewed stream.

\subsection{Skewing and Deskewing Streams}

\begin{figure}
\label{skewdeskew}
\centerline{\epsfxsize=5in \epsfbox{skewdeskew.eps}}
\caption{Logic for skewing (left) and deskewing streams.}
\end{figure}

Skewing and deskewing are very simple operations that each require
$\frac{1}{2} W (W - 1)$ $D$ bit wide registers.  For skewing,
digit $k$ flows through $k$ registers.  For deskewing, digit $k$ flows
through $W - k -1$ registers.  Figure \ref{skewdeskew} presents skewing
and deskewing units for a 4 digit stream.


\section{Add and Accumulate Functional Units}
\label{addacc}

The basic component of a clockwise-skewed stream adder is a $D$ bit
adder cell which adds two $D$ bit conventional streams $X$ and $Y$,
and a carry, \( c_{in} \), producing a $D$ bit conventional stream $S$
and a single carry bit \( c_{out} \) .  The adder cell also contains a
single bit register to capture \( c_{out} \).  An adder cell can be
optimized for a particular device architecture.  For example, each
combinational logic block of a Xilinx 4000 series FPGA can be
configured as a fast two-bit adder. \cite{Xilinx} 

The complete adder is a chain of $W$ adder cells, connected as in figure 
\ref{addacc}. The critical path $\delta$ in the design is the critical path 
in a single adder cell. Latency is $W \delta$. The gate cost is
$W$ times the cost of an adder cell plus $W$ single bit registers.
For $D=1$, each cell is a full adder (five gates) for a total cost of
$5N$ gates + $N$ registers.

An accumulator is simply a complete adder where each adder cell's output 
stream is captured with a $D$ bit wide register and redirected to the 
cell's $Y$ input.  The additional gate cost for the accumulator is $W$ 
$D$ bit registers.  Figure \ref{addacc} shows an accumulator for $W=4$.

\begin{figure}
\label{addacc}
\centerline{
    \begin{array}[t]{cc}
       \epsfxsize=3in \epsfbox{adder.eps} & \
       \epsfxsize=3in \epsfbox{acc.eps}
    \end{array}
}
\caption{Adder (left) and accumulator for $W=4$}
\end{figure}





\section{Multiplexors}

\begin{figure}
\label{mux}
\centerline{
         \epsfxsize=3in \epsfbox{mux.eps}
}
\caption{2:1 Multiplexor for $W=4$}
\end{figure}

For a $W$ digit-wide clockwise-skewed stream, a multiplexor consists
of $W$ cells, each of which is a conventional $D$-bit wide
multiplexor.  The selection stream $S$ is an unskewed stream that is
pipelined through the multiplexor from least significant to most
significant digit.  The design can be easily extended for skewed
selection streams by deskewing such a selection stream.  However,
since the multiplexor is primarily intended to support ``if-else''
constructs, this may not be necessary.

Figure \ref{mux} presents a 2:1 multiplexor for $W=4$.  Note that the
gate cost is $W$ times the cost of a conventional $D$ bit multiplexor
plus $W-1$ registers.

\section{Left Shift Functional Units}

\begin{figure}
\label{lshift}
\centerline{
         \epsfxsize=3in \epsfbox{lshift.eps}
}
\caption{Left shift for $W=4$ with shifts in units of $D$ bits}
\end{figure}

A left shifter for clockwise skewed streams shifts in units of the
digit width $D$ as determined by a shift stream $S$, which is itself a
skewed stream whose digit width is 1.  A barrel shifter that meets
these constraints has $log_2 W$ stages, each of which is $W$ $D$-bit
multiplexors wide.  Each mulitplexor feeds a $D$-bit wide register.  A
``shift by $2^i$ stage'' is preceded by $2^i(W-2^i)$ registers
configured to assure that the appropriate digits arive at that stage.
The digit of the shift stream $S$ associated with a stage propagates
from least significant to most significant digit through $W-1$
registers. Figure \ref{lshift} presents a barrel left shifter for
$W=4$.  

The cost of a left shifter is $\lfloor log_{2} W \rfloor$ $D$-digit
conventional multiplexors,
$$
\lfloor log_{2} W \rfloor + \sum_{i=0}^{\lfloor log_{2} W \rfloor}
2^i(W-2^i)
$$
$D$-bit registers, and $(W-1)\lfloor log_{2} W \rfloor $ single bit
registers.

\section{Multiply Functional Units}

A pipelined array multiplier \cite{hamacher} can be easily extended
for clockwise-skewed streams. 


\begin{figure}
\label{umpy}
\centerline{
         \epsfxsize=5.5in \epsfbox{unsignedmpy.eps}
}
\caption{Unsigned multiplier for $N/D=4$}
\end{figure}

\begin{figure}
\label{mpycell}
\centerline{
         \epsfxsize=1.5in \epsfbox{mpyblock.eps}
}
\caption{Multiplier cell}
\end{figure}



\section{Reconfigurable Machines}
\label{reconfig}

Reconfigurable machines \cite{some} are commonly used as coprocessors for 
conventional machines. One programming model for such a hybrid is to find 
heavily utilized kernels and convert their data flow graphs into custom 
vector units which are implemented on the reconfigurable logic. When such a 
kernel is called, data is streamed from memory to the coprocessor and 
results are streamed back to memory. \cite{somemachien} 

Although FPGAs \cite{xilinx}, the most commonly used reconfigurable 
elements, have lower densities and longer cycle times than microprocessors, 
speedup can be acheived for some applications for three reasons. First, 
some operations, such as the evaluation of Boolean functions, do not map 
efficiently to the fixed functional units of a microprocessor. Second, the 
data path width of a microprocessor can result in inefficiencies. For 
example, a 4 bit add and a 64 bit add have the same cost on a DEC Alpha. 
Finally, because custom hardware is constructed, very fine grain 
parallelism is exposed via dependence analysis on the data flow graph and 
generated via deep pipelining. 

\section{Conclusion}



\end{document} 













