The prefetching algorithm has a few compile-time parameters, which we consistently set as follows: cache line size = 32 bytes, effective cache size = 500 bytes, prefetch latency = 300 cycles, and policy on unknown loop bounds = assume a small number of iterations. The cache line size matches the architecture exactly, while the other parameters are more heuristic in nature. As discussed in Section , we choose the effective cache size to be a fraction of the actual size (8 Kbytes) as a first approximation to the effects of cache conflicts. The prefetch latency tells the compiler how many cycles in advance it should try to prefetch a reference (i.e., parameter in equation()). It is set larger than 75 cycles, the minimum miss-to-memory penalty, to account for bandwidth-related delays. For cases where loop bounds cannot be resolved at compile time, we assume the number of iterations to be small, which tends to overestimate what remains in the cache. Later, in Section , we will consider the effects of varying these parameters.
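
To make the role of these parameters concrete, the following is a minimal sketch of how a compiler pass might consume them. Only the four parameter values come from the text. The names `PrefetchParams`, `iterations_ahead`, and `assumed_to_stay_cached` are hypothetical, as is the trip count of 8 used to stand in for "a small number of iterations". The two rules shown are assumptions made for illustration and are not taken from the algorithm's actual equations: the first covers the latency by dividing it by the estimated cycles per loop iteration and rounding up, and the second compares the data touched by a loop against the effective cache size.

```python
from dataclasses import dataclass
from math import ceil
from typing import Optional


@dataclass(frozen=True)
class PrefetchParams:
    """Compile-time parameters, using the values quoted in the text."""
    cache_line_bytes: int = 32
    effective_cache_bytes: int = 500
    prefetch_latency_cycles: int = 300
    # Hypothetical value: the text only says "a small number of iterations".
    assumed_unknown_trip_count: int = 8


def iterations_ahead(params: PrefetchParams, cycles_per_iteration: int) -> int:
    """Assumed rule: prefetch enough iterations ahead to cover the latency."""
    if cycles_per_iteration <= 0:
        raise ValueError("cycles_per_iteration must be positive")
    return ceil(params.prefetch_latency_cycles / cycles_per_iteration)


def assumed_to_stay_cached(
    params: PrefetchParams,
    bytes_per_iteration: int,
    trip_count: Optional[int] = None,
) -> bool:
    """Assumed rule: data is predicted to survive in the cache if the volume
    touched by the loop fits in the effective cache size. An unknown trip
    count (None) falls back to the small assumed value."""
    n = params.assumed_unknown_trip_count if trip_count is None else trip_count
    return n * bytes_per_iteration <= params.effective_cache_bytes


if __name__ == "__main__":
    p = PrefetchParams()
    print(iterations_ahead(p, 40))
    print(assumed_to_stay_cached(p, 32))
    print(assumed_to_stay_cached(p, 32, trip_count=100))
```

Under these assumptions, a loop touching one 32-byte line per iteration is predicted to keep its data cached when its bound is unknown, but not when the bound is known to be 100. This is the sense in which assuming few iterations tends to overestimate what remains in the cache.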