Preliminary Results

Loop Language

Existing C vector libraries such as CVL [CVL] support high-performance, aggregate data manipulation and is intended as a intermediate language for [NESL]. This section outlines a loop language for uniprocessors that could run faster yet provide greater flexibility.

The bulk of CVL's procedures may be classified according to four axes (listed with their possible values):

plus, times, sqrt, etc.
bit, int, double, etc.
element-wise (elwise), scan, reduce.
segmented, flat.
These near-symmetries are also present: vector/scalar arguments, in-place operations, permutations/scatter/gather.

Because each procedure is written separately, the number of procedures in the library is the product of these options. The implementation is not (and could not be, due to the interface) blocked (see buffering), thus performance is limited by memory rather than cache bandwidth.

The idea behind the loop language is to use fewer, more general loops and specialize them. The compilers are loop generators where the operation (op) is passed in as a scalar procedure.

Here is a simple example:

CVL's primop and pattern axes are now actual parameters of an `interpreter'. Hand analysis indicates that applying cogen would produce an efficient compiler that produces basically efficient code.

Higher-order operations (even for scans and reduces) are handled by the loop since op need not be a primop. There's no way to handle higher-order scans and reduces without RTCG. [cite?!]

In addition, general multidimensional data is handled very simply; without a real metalanguage it presents a substantial challenge.

It's not hard to modify this general loop so that the generated compiler would unroll (say four times) the inner loop over the data (the loop over the dimensions is completely removed). This depends on the linearity of the loop to remove the tests; we're not just inlining to remove jumps. Also straightforward to handle are scalar/vector arguments, higher-order scatter/gather/permutations, and perhaps also sorting the dimensions for better memory access patterns.

Handling packed datatypes of sub-word sizes and segmented representations is much more difficult since code must be reorganized so data is transferred in fixed sized (see graphics).

Now, once one has this loop language, how does one build NESL on top of it? One might hope that writing a NESL interpreter using the loop language and then applying cogen would work, but there's no way avoid implementing something like [ChaBleFi91] if you want good performance. How can we compromise?