We implemented our prefetching algorithm in the SUIF (Stanford University Intermediate Form) compiler, which includes many of the standard optimizations and generates code competitive with the MIPS 2.10 compiler. The experience of actually implementing our algorithm raised some interesting issues.
The first issue is when prefetching should occur relative to the other optimizations performed by the compiler. In particular, should prefetch insertion occur before or after the bulk of scalar optimization? Inserting prefetches before scalar optimization has the disadvantage that the code seen by the prefetching pass may change significantly once optimization takes place; these changes may make it difficult to schedule prefetches the proper amount of time in advance. For example, if the optimizer eliminates a significant fraction of instructions in a loop body, the software pipelining algorithm may underestimate the number of loop iterations necessary to provide enough computation to hide memory latency (i.e. the parameter in equation () may be overestimated). On the other hand, inserting prefetches after scalar optimization has the following two disadvantages: (i) much of the high-level representation of arrays and loops may be unrecognizable, making it difficult to perform locality analysis; and (ii) the compiler must be extremely careful only to insert highly efficient prefetching code, since scalar optimization will not be run again.
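To make the scheduling concern concrete, the prefetch distance can be sketched as the memory latency divided by the cycle count of one loop iteration, rounded up (this matches the usual ceiling formulation of such a parameter, though the exact parameterization of the equation referenced above may differ; the function name and arguments below are our own illustration):

```c
#include <stddef.h>

/* Hedged sketch: the number of iterations d by which a prefetch must
   precede its load is presumably ceil(latency / cycles_per_iter).  If
   scalar optimization later shrinks the loop body, cycles_per_iter
   drops and d should rise; computing d from the pre-optimization cycle
   count therefore underestimates the required distance. */
unsigned prefetch_distance(unsigned latency_cycles, unsigned cycles_per_iter) {
    return (latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
}
```

For example, with a 100-cycle memory latency, a 30-cycle loop body needs 4 iterations of lead time; if optimization later shrinks that body to 15 cycles, 7 iterations are actually needed.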
In our experience, the advantages of inserting prefetches before scalar optimization far outweigh the disadvantages. First, the ability to recognize high-level structures such as arrays and ``for'' loops is essential for our analysis, and becomes prohibitively difficult once scalar optimization has radically altered the code. Second, the fact that scalar optimization is performed on the prefetching code itself has two strong advantages: (i) it greatly simplifies the task of inserting prefetches, since code transformations such as loop unrolling and software pipelining can be done in a straightforward manner without undue concern for minimizing overhead; and (ii) scalar optimization appears to do a very good job of minimizing prefetching instruction overhead. The overhead is often less than 2 or 3 instructions per prefetch (as we will see later in Chapter ), and we believe these results are much better than if the entire burden of minimizing overhead were placed on the prefetch code-generating pass itself. Regarding the disadvantage of inserting prefetches before scalar optimization, we found that in practice the impact of reduced loop body sizes could be taken into account by simply increasing the target memory latency parameter (i.e. in equation ()) by a small factor (less than two). Although this may result in prefetches being issued unnecessarily early (e.g., if scalar optimization does not eliminate any instructions from the loop body), it does not appear to have a noticeable negative impact on performance.
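The kind of code the prefetch insertion pass emits can be sketched as a software-pipelined loop with a prologue, a steady state, and an epilogue. The following is a minimal illustration, not the compiler's actual output: it uses the GCC-style `__builtin_prefetch` intrinsic as a stand-in for the MIPS prefetch instruction, and the prefetch distance of 8 iterations is an arbitrary assumed value rather than a computed one.

```c
#include <stddef.h>

/* Hedged sketch of software-pipelined prefetching for a reduction loop.
   Prologue: fetch the first d iterations' data.
   Steady state: each iteration prefetches d iterations ahead.
   Epilogue: the last d iterations need no further prefetches. */
void sum_with_prefetch(const double *a, size_t n, double *result) {
    const size_t d = 8;                       /* assumed prefetch distance */
    double s = 0.0;
    size_t i = 0;
    for (size_t j = 0; j < d && j < n; j++)   /* prologue */
        __builtin_prefetch(&a[j]);
    for (; i + d < n; i++) {                  /* steady state */
        __builtin_prefetch(&a[i + d]);
        s += a[i];
    }
    for (; i < n; i++)                        /* epilogue */
        s += a[i];
    *result = s;
}
```

Written naively like this, the prologue and epilogue duplicate the loop body; leaving the cleanup of such duplication to the later scalar optimization passes is exactly the simplification described above.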
There are some optimizations, however, that should be performed before prefetching, since they greatly improve the analysis. One such example is interprocedural constant propagation, which helps eliminate symbolic loop bounds and therefore makes it easier to compute the localized iteration spaces. Table shows the order in which our compiler performs its optimizations. The optimizations performed before prefetching in Table tend to improve the information content of the loop and array constructs. After prefetches are inserted, these high-level SUIF constructs (such as hierarchical abstract representations of ``for'' loops) are deconstructed into simpler low-level constructs (such as branch instructions) which more closely match traditional RISC instruction sets. The bulk of scalar optimization (e.g., common subexpression elimination, etc.) is then performed on this low-level representation. In general, we were quite happy with this approach.
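To illustrate why constant loop bounds matter for computing localized iteration spaces: whether the data touched by a loop remains resident in the cache reduces to a simple comparison once the trip count is a compile-time constant, but is undecidable for a symbolic bound. The helper below and the cache size used in the usage note are illustrative assumptions, not the compiler's actual locality test.

```c
/* Hedged sketch: with a symbolic bound n, the compiler cannot tell
   whether the section a[0..n-1] stays in the cache and must assume the
   worst.  After interprocedural constant propagation turns n into a
   known constant, the question becomes a constant-foldable comparison. */
int section_fits_in_cache(long trip_count, long stride_bytes, long cache_bytes) {
    return trip_count * stride_bytes <= cache_bytes;
}
```

For instance, assuming an 8KB cache, a loop over 512 doubles (4KB) would be classified as localized, while a loop over 2048 doubles (16KB) would not.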
We did experience one surprising problem with the scalar optimizer, however, during our early experiments with the compiler. Our initial code with prefetching had very large instruction overheads, often more than 6 instructions per prefetch. We discovered that the problem was that our scalar optimizer did not take full advantage of the ``base-plus-offset'' addressing mode in the MIPS instruction set: when two addresses differed by a constant amount, rather than reusing the same base register with different constant offsets, the optimizer would use a separate base register to compute each address. While this optimization may not have been a high priority for normal code, it was crucial for our prefetching code, since each prefetch tends to differ by a constant amount from a load or store, and loop unrolling creates many copies of array references that differ by constant amounts. Without this optimization, the compiler quickly ran out of registers and suffered the large overheads of spilling values to memory. Once this optimization was implemented, we saw dramatic reductions in prefetch instruction overhead.
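The address pattern that makes this optimization so important can be sketched as follows. After unrolling by 4, every load and every prefetch in the body differs from a single base address by a compile-time constant, so one base register plus immediate offsets suffices under MIPS base-plus-offset addressing, rather than one address register per reference. The unroll factor, the prefetch offset of 16 elements, and the use of `__builtin_prefetch` are all illustrative assumptions.

```c
#include <stddef.h>

/* Hedged sketch of an unrolled loop whose references all share one base.
   A good optimizer emits MIPS code of the form
       lw/ldc1  reg, 0(base), 8(base), 16(base), 24(base)
   reusing one base register; the buggy behavior described above instead
   materialized a separate base register for each address, causing spills. */
double sum4(const double *a, size_t n) {      /* n assumed a multiple of 4 */
    double s = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        const double *base = &a[i];           /* the single base register */
        __builtin_prefetch(base + 16);        /* constant offset from base */
        s += base[0] + base[1] + base[2] + base[3];
    }
    return s;
}
```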