To surpass the overhead reduction offered by block prefetches, the final step would be eliminating the ``per iteration'' instruction overhead altogether. The two previous proposals which accomplish this are (i) issuing large block prefetches for the entire data structure outside the loop (as illustrated in Figure , or (ii) using strictly hardware-based prefetching. However, both of these approaches have drawbacks, which we will briefly discuss.
The large block prefetch approach was studied by Gornish et al.  in the context of a software-coherent shared-memory multiprocessor with binding prefetches. The potential problem of large block prefetches is that data may arrive in the cache too early, thereby exposing it to possible replacement before it can be used. In addition, the bursts in network traffic caused by large block requests can lead to queueing delays. We observed these negative effects when using large block prefetches in LU during an earlier study . While hardware-based prefetching schemes can stream data into the cache more evenly, they suffer from a number of disadvantages which we will describe in detail later in Section .
To improve upon both of these techniques, we introduce the notion of programmable streams. With programmable streams, the software decides what to prefetch and when the prefetches should be issued. Similar to the ``large block prefetch'' approach, this information is provided only once outside the loop, so there is no per-iteration instruction overhead. However, in contrast to the ``large block prefetch'' approach, where all data is fetched at once, programmable streams provide a mechanism for ``flow control'' so that data can be streamed into the cache at a rate matching the computation. This flow control is accomplished by associating a ``trigger instruction'' address with each stream. A trigger instruction can be any normal instruction in the loop, provided that it is executed once per loop iteration. Once a programmable stream has been initialized by the software, the hardware issues the next prefetch in the stream each time the trigger instruction is executed. Therefore the data arrives in the cache in the same manner as with software pipelining, but with essentially no instruction overhead.
To initialize a programmable stream, the software must specify at least the following: (i) a starting prefetch data address, (ii) the difference between consecutive prefetch data addresses, and (iii) a trigger instruction address. The values are stored in a hardware structure representing the stream. The only value which changes is the ``current prefetch data address'', which is initially the starting prefetch address, but is updated after each new prefetch is issued upon a matching trigger instruction. Once the loop completes, the software may choose to deallocate the stream to free the corresponding hardware entry for later use.
Further refinements of this basic approach may be achieved by having the software provide more information to the hardware. For example, the prefetch block size may be specified to make the software more portable across different cache configurations. Also, to optimize for cases with spatial locality, the software could specify the number of iterations between prefetches, and the hardware could account for this by maintaining a modulo counter with each stream. Finally, given that the hardware knows the size of memory latency, the hardware could compute the number of iterations to pipeline ahead and automatically issue the prolog prefetches if the software specifies the number of instructions in the loop body.