Locality optimizations attempt to make caches more effective by restructuring computation to enhance data locality. One important example of a locality-improving transformation is blocking (also known as tiling), which works as follows: rather than operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks, so that data loaded into the faster levels of the memory hierarchy are reused before being evicted. Other useful transformations include unimodular loop transforms such as interchange, skewing, and reversal. Since these optimizations improve the code's data locality, they not only reduce the effective memory access time but also reduce the memory bandwidth requirement.
For multiprocessors, the concept of data locality can be extended to minimize not only accesses to main memory, but also communication between processors. This involves both the placement of data in processors' local memories (known as data decomposition) and the mapping of the computation onto the processors of the parallel machine (known as computation decomposition). The most popular approach to this complex optimization problem is for the programmer to explicitly specify the data decomposition through directives in the programming language, while the compiler is responsible for decomposing the computation. Another approach is for the compiler to determine both the data and computation decompositions.
While locality optimizations are quite useful when they work, their applicability is somewhat limited: not only must there be a better way to structure the code (which is not always the case), but it must also be legal to restructure it. Whenever dependence analysis is inexact, the compiler usually has to be conservative and assume that dependences could be violated if the code were restructured. In practice, this means that these types of optimizations are not frequently applicable (as we will observe later in Section ). Therefore, to cope with whatever latency cannot be reduced through caching and locality optimizations, we consider techniques for tolerating memory latency.