Programming languages using functions on collections of values, such as map,
reduce, scan, and filter, have been used for over fifty years. Such functions
have proven particularly useful in the context of parallelism because
they are naturally parallel. However, if implemented naively, they
generate temporary intermediate collections that can
significantly increase memory usage and runtime. To avoid this pitfall, many
approaches use "fusion" to combine operations and avoid temporary results.
Most of these approaches, however, involve significant changes to a compiler
and are limited to a small set of functions, such as maps and reduces.
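
To make the cost of temporaries concrete, the following C++ sketch contrasts a
naive map-then-reduce with a hand-fused version. The function names
sum_squares_unfused and sum_squares_fused are hypothetical and chosen only for
illustration; this is not the library presented in the paper.

```cpp
#include <cassert>   // only needed for the checks in the usage note
#include <cstddef>
#include <numeric>
#include <vector>

// Unfused: the map step materializes a temporary vector of n squares,
// which the reduce step then reads back.
long long sum_squares_unfused(const std::vector<int>& xs) {
    std::vector<long long> squares(xs.size());  // temporary intermediate collection
    for (std::size_t i = 0; i < xs.size(); ++i) {
        squares[i] = static_cast<long long>(xs[i]) * xs[i];
    }
    return std::accumulate(squares.begin(), squares.end(), 0LL);
}

// Fused: the map is applied on the fly inside the reduce, so no
// temporary collection is allocated.
long long sum_squares_fused(const std::vector<int>& xs) {
    long long total = 0;
    for (int x : xs) {
        total += static_cast<long long>(x) * x;
    }
    return total;
}
```

Both functions return the same value for any input, for example
assert(sum_squares_fused({1, 2, 3}) == sum_squares_unfused({1, 2, 3})); only
the unfused version allocates memory proportional to the input size.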

In this paper we present a library-based approach that fuses widely used
operations such as scans, filters, and flattens. In conjunction with existing
techniques, this covers most of the common operations on collections. Our
approach is based on a novel technique that parallelizes over blocks, with
streams within each block. We demonstrate the approach by implementing
libraries targeting multicore parallelism in two languages: Parallel ML and
C++, which have very different semantics and compilers. To help users
understand when to use the approach, we define a cost semantics that indicates
when fusion occurs and how it reduces memory allocations. We present
experimental results for a dozen benchmarks that demonstrate significant
reductions in both time and space. In most cases, the approach generates code
that is near-optimal for the machines on which it runs.
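
As a rough illustration of parallelizing over blocks while streaming within
each block, the C++ sketch below fuses a scan (prefix sums) with a max-reduce.
It is a minimal sketch under our own assumptions, not the paper's
implementation: the name max_prefix_sum_blocked and the block_size parameter
are hypothetical, and the loops over blocks are written sequentially, although
their iterations are independent and could be run in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: maximum over all inclusive prefix sums of xs,
// computed without storing the n prefix sums. Returns the lowest
// long long value for empty input.
long long max_prefix_sum_blocked(const std::vector<int>& xs, std::size_t block_size) {
    assert(block_size > 0);
    const long long lowest = std::numeric_limits<long long>::lowest();
    const std::size_t n = xs.size();
    const std::size_t num_blocks = (n + block_size - 1) / block_size;

    // Phase 1: sum each block. Iterations are independent of each other.
    std::vector<long long> offsets(num_blocks, 0);
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::size_t lo = b * block_size;
        const std::size_t hi = std::min(n, lo + block_size);
        long long sum = 0;
        for (std::size_t i = lo; i < hi; ++i) sum += xs[i];
        offsets[b] = sum;
    }

    // Phase 2: exclusive scan over the block sums (num_blocks values only),
    // turning each entry into the prefix sum that precedes its block.
    long long running = 0;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const long long block_sum = offsets[b];
        offsets[b] = running;
        running += block_sum;
    }

    // Phase 3: stream through each block starting from its offset and feed
    // every prefix sum straight into the max, never writing it to memory.
    std::vector<long long> block_max(num_blocks, lowest);
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::size_t lo = b * block_size;
        const std::size_t hi = std::min(n, lo + block_size);
        long long acc = offsets[b];
        long long best = lowest;
        for (std::size_t i = lo; i < hi; ++i) {
            acc += xs[i];                 // next element of the scan
            best = std::max(best, acc);   // consumed immediately by the reduce
        }
        block_max[b] = best;
    }

    // Combine the per-block results.
    long long result = lowest;
    for (long long m : block_max) result = std::max(result, m);
    return result;
}
```

For the input {3, -1, 4, -10, 2, 6} the prefix sums are 3, 2, 6, -4, -2, 4, so
the function returns 6 for any positive block size. In this sketch the only
temporaries are two arrays with one entry per block, so extra memory grows
with the number of blocks rather than with the input length.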