\documentstyle{article}

\oddsidemargin 0in
\evensidemargin 0in
\textwidth 6.5in 

\title{
Parameterized Arithmetic Functional Units for Clockwise-skewed Integer 
Streams
}

\author{
Peter A. Dinda \footnote{pdinda@cs.cmu.edu} \\
School of Computer Science \\
Carnegie-Mellon University \\
Pittsburgh, PA 15213 
}

\begin{document}

\maketitle


\begin{abstract}
%
% Fix
%
 Clockwise-skewed streams deliver integers in a least 
significant digit to most significant digit order. The paper describes add, 
accumulate, left-shift, and multiply functional units with single
cycle throughput for such streams.   These functional units are parameterized 
by the width of their input streams and the width of the digits of their
input streams.  
\end{abstract}

\section{Introduction}

A clockwise-skewed integer stream presents values in a least significant 
digit to most significant digit order. Add, accumulate, left shift, and 
multiply functional units (FUs) for such streams are highly regular and 
easily parameterized by input stream width and digit width or input stream 
width and latency. Further, they have single cycle throughput and require 
approximately the same amount of logic as conventional non-skewed integer 
stream FUs. 

These attributes make clockwise-skewed integer stream FUs interesting for 
the custom vector unit programming model for reconfigurable machines 
\ref{reconfig}, where the cost of skewing input streams and deskewing 
output streams can be amortized by cascading many FUs.  

\subsection{Organization}

Section \ref{reconfig} is a short summary of reconfigurable machines and 
the custom vector unit programming model.  Section \ref{skewedstreams} 
describes clockwise-skewed streams.  Section \ref{addacc} describes add 
and accumulate FUs for skewed streams.  Section \ref{lshift} describes 
left shift FUs, and section \ref{mpy} describes unsigned and signed 
multiply FUs.  Finally, section \ref{conc} concludes with some discussion.
%
% If I can think of anything good by then...
% 

\section{Reconfigurable Machines}
\label{reconfig}

Reconfigurable machines \cite{some} are commonly used as coprocessors for 
conventional machines. One programming model for such a hybrid is to find 
heavily utilized kernels and convert their data flow graphs into custom 
vector units which are implemented on the reconfigurable logic. When such a 
kernel is called, data is streamed from memory to the coprocessor and 
results are streamed back to memory. \cite{somemachien} 

Although FPGAs \cite{xilinx}, the most commonly used reconfigurable 
elements, have lower densities and longer cycle times than microprocessors, 
speedup can be acheived for some applications for three reasons. First, 
some operations, such as the evaluation of Boolean functions, do not map 
efficiently to the fixed functional units of a microprocessor. Second, the 
data path width of a microprocessor can result in inefficiencies. For 
example, a 4 bit add and a 64 bit add have the same cost on a DEC Alpha. 
Finally, because custom hardware is constructed, very fine grain 
parallelism is exposed via dependence analysis on the data flow graph and 
generated via deep pipelining. 


\section{Clockwise-skewed Streams}
\label{skewedstreams}

\begin{figure}
\label{unskewedandskewed}
\( 
  \begin{array}{ll}
     \begin{array}[t]{llll}
       d_{t+3,3} & d_{t+3,2} & d_{t+3,1} & d_{t+3,0} \\
       d_{t+2,3} & d_{t+2,2} & d_{t+2,1} & d_{t+2,0} \\
       d_{t+1,3} & d_{t+1,2} & d_{t+1,1} & d_{t+1,0} \\
       d_{t,3} & d_{t,2} & d_{t,1} & d_{t,0} 
     \end{array}
     \begin{array}[t]{llll}
        d_{t+3,3} & - & - & - \\
        d_{t+2,3} & d_{t+3,2} & - & - \\
        d_{t+1,3} & d_{t+2,2} & d_{t+3,1} & - \\    
        d_{t,3} & d_{t+1,2} & d_{t+2,1} & d_{t+3,0} \\
        - & d_{t,2} & d_{t+1,1} & d_{t+2,0} \\
        - & - & d_{t,1} & d_{t+1,0} \\
        - & - & - & d_{t,0} \\
     \end{array}
   \end{array}
\)
\caption{A conventional 4 digit stream (left) and a clockwise-skewed 
stream.  Time increases towards the top of the page.  }
\end{figure}


A clockwise-skewed stream is parameterized by its width in bits, $N$, and the 
width in bits of its digits, $D$, where $D$ evenly divides $N$.  The 
digits of a particular integer arrive in least significant to most 
significant order.  If the digits of an integer in a conventional stream 
form a horizontal line, then that line has been rotated clockwise in a 
clockwise skewed stream.  Figure \ref{unskewedandskewed} shows a 
conventional stream and its corresponding clockwise-skewed stream.

\subsection{Skewing and Deskewing Streams}

\begin{figure}
\label{skewdeskew}
\caption{Logic for skewing (left) and deskewing streams.}
\end{figure}


Skewing and deskewing are very simple operations that each require 
$\frac{1}{2} \frac{N}{D} (\frac{N}{D} - 1)$ $D$ bit wide registers.  For 
skewing, digit $k$ flows through $k$ registers.  For deskewing, digit $k$ 
flows through $N/D - k -1$ registers.  Figure \ref{skewdeskew} presents 
an example.

\section{Add and Accumulate Functional Units}
\label{addacc}

\begin{figure}
\label{adder}
\caption{Adder (left) and accumulator for $N/D=4$}
\end{figure}

The basic component of a clockwise-skewed stream adder is a $D$ bit adder 
cell which adds two $D$ bit conventional streams $X$ and $Y$, and a carry, 
$c_in$, producing a $D$ bit conventional stream $S$ and a single carry bit 
$c_out$. The adder cell also contains a single bit register to capture 
$c_out$.
An adder cell can be optimized for a particular device architecture.  
For example, the Xilinx 4000 series' Combinational Logic Blocks have 
built in 4 bit carry look ahead generators. \cite{xilinx}

The complete adder is a chain of $N/D$ adder cells, connected as in figure 
\ref{adder}. The critical path $\sigma$ in the design is the critical path 
in a single adder cell. Latency is $N\sigma/D$. The gate cost is $N/D$ 
times the cost of an adder cell plus $N/D$ single bit registers. For $D=1$, 
each cell is a full adder (five gates) for a total cost of $5N$ gates + $N$ 
registers. 

An accumulator is simply a complete adder where each adder cell's output 
stream is captured with a $D$ bit wide register and redirected to the 
cell's $Y$ input.  The additional gate cost for the accumulator is $N/D$ 
$D$ bit registers.  Figure \ref{adder} shows an accumulator for $N/D=4$.


\section{Left Shift Functional Units}
%
% Actually build one of these
%

\section{Multiply Functional Units}

\section{Conclusion}

\end{document} 
