The localized iteration space is simply the set of innermost loops whose volume
of data accessed in a single iteration does not exceed the cache size.
The accompanying figure shows our algorithm for computing the localized
iteration space, which consists of two major substeps. First, we use our reuse
vector information to estimate the amount of data accessed by each loop (i.e.,
Tally_Loop_Traffic in the figure). This involves
visiting each array reference and projecting the amount of data it references
onto the surrounding loops. We associate a running total of data accessed with
each array reference; this running total is potentially multiplied by the
number of iterations of each surrounding loop, depending on the type of reuse
the reference has with respect to that loop. This first substep is a simplified
version of algorithms proposed previously [64, 24, 21]. Once the amount of
data accessed by each loop has been estimated, the second substep compares
these amounts with the expected cache capacity to decide which loops are
localized (i.e., Is_Localized in the figure). This
second substep is a separate top-down and then bottom-up pass over the loops,
which allows us to decide that (i) a loop is definitely localized if any
surrounding loops are definitely localized (perhaps because they have been
actively ``blocked'' or ``tiled'' by the locality optimizer), and (ii) a loop
is definitely not localized if any internal loops are definitely not
localized (perhaps because they reference too much data). Such information is
particularly helpful when the amount of data referenced by a loop is a symbolic
range, and therefore cannot be compared directly against a constant cache size.
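To make these two substeps concrete, here is a minimal Python sketch of the tallying and classification passes. The names tally_loop_traffic and is_localized follow the description above, but the data layout (per-loop reuse kinds, trip counts) and everything else in the sketch are illustrative assumptions rather than the actual implementation.

```python
LINE = 16  # cache line size in bytes, as in the running example

def tally_loop_traffic(loops, refs):
    """Estimate bytes accessed during a single iteration of each loop.

    `loops` is a list of (name, trip_count) pairs, innermost first;
    each ref maps a loop name to its reuse kind along that loop:
    'temporal', 'spatial', or 'none'.
    """
    traffic = {name: 0 for name, _ in loops}
    for reuse in refs:
        running = LINE                   # one cache line per innermost iteration
        for name, trips in loops:        # project the ref's traffic outward
            traffic[name] += running     # this ref's share of one `name` iteration
            if reuse[name] == 'spatial':
                running = running * trips // 2   # divide by spatial savings factor
            elif reuse[name] == 'none':
                running *= trips
            # 'temporal': same data re-touched, running total unchanged
    return traffic

def is_localized(loops, traffic, cache):
    """Classify each loop (True / False / None = unknown), innermost first."""
    verdict = [None if traffic[name] is None else traffic[name] <= cache
               for name, _ in loops]
    # Top-down: a loop is definitely localized if a surrounding loop is.
    for k in range(len(loops) - 2, -1, -1):
        if verdict[k + 1] is True:
            verdict[k] = True
    # Bottom-up: a loop is definitely not localized if an inner loop is not.
    for k in range(1, len(loops)):
        if verdict[k - 1] is False:
            verdict[k] = False
    return verdict
```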
Two complications that arise in this algorithm are (i) estimating the
amount of data accessed in the presence of symbolic loop bounds, and (ii)
modeling cache capacity given the potential for mapping conflicts in
direct-mapped or set-associative caches. To address the symbolic loop bound
problem, we do three things. First, we use interprocedural constant
propagation to eliminate as many symbolic loop bounds as possible. Second,
we represent the loop iterations and subsequent data traffic as symbolic
ranges, which can include simple linear expressions of symbolic variables.
In some cases we can reason about these symbolic values, such as when they
are indices of surrounding loops (an example of this is shown later).
Third, when all else fails, we use a default policy of assuming loop
iteration counts to be either large or small. (A fourth option, which we
explore later, is using control-flow feedback
in such cases.) To address cache mapping conflicts, we approximate this effect
simply by setting the ``effective'' cache size to be a fraction of the
actual cache size. We will discuss the robustness of these
approximations, and suggest further improvements, in later sections.
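These two approximations can be sketched in the same style; the conflict fraction and the ``large''/``small'' defaults below are illustrative values chosen for this sketch, not tuned parameters from the thesis.

```python
CACHE_SIZE = 8 * 1024
CONFLICT_FRACTION = 0.5   # assumed derating for mapping conflicts
EFFECTIVE_CACHE = int(CACHE_SIZE * CONFLICT_FRACTION)

def resolve_trip_count(trips, assume_large=True):
    """Fallback when a loop bound remains symbolic after constant
    propagation and symbolic-range reasoning."""
    if isinstance(trips, int):
        return trips                       # bound resolved to a constant
    return 1000 if assume_large else 10    # hypothetical default iteration counts
```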
The accompanying figure illustrates how our algorithm computes the
localized iteration space for the example code introduced earlier.
We begin with the A[i][j]
reference, which has spatial reuse along the inner loop and no reuse along
the outer loop. During a single j iteration, A[i][j] will bring
(at most) a single line into the cache (16 bytes). To compute the total
data accessed by a single pass through the j loop, we multiply the
current ``running total'' (16 bytes) by the total number of j
iterations (100), and divide by the savings factor due to spatial reuse
(2), yielding a total of 800 bytes. This is also the amount of data
accessed by the A[i][j] reference during a single iteration of the
i loop. Since A[i][j] has no reuse along the i
loop, we simply multiply the running total (800 bytes) by the number of
i iterations (3) to get a total of 2400 bytes accessed by A[i][j] in the entire loop nest. Next, we consider the B[j+1][0]
reference, which has no reuse along the j loop and temporal reuse
along the i loop. The data accessed during a single j iteration
is again a single cache line (16 bytes), and because there is no reuse
along j, we multiply by the total j iterations (100)
to get 1600 bytes accessed during a single i iteration. Since B[j+1][0] has temporal reuse along i, the running total is not
increased any further, and the total data accessed by B[j+1][0]
during the loop nest is therefore 1600 bytes. We can ignore the B[j][0] reference, since it has group reuse with B[j+1][0], and B[j+1][0] is the leading reference. After tallying each reference's
contribution to total data traffic, we see that a single iteration of j accesses only 32 bytes (and therefore is definitely localized), and a
single iteration of i accesses 2400 bytes, which is also likely to
fit in an 8 Kbyte cache. Therefore our algorithm would conclude that both
loops are within the localized iteration space.
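As a check, the arithmetic of this walkthrough can be reproduced directly, assuming (as above) 16-byte cache lines and a spatial savings factor of 2:

```python
a_per_j = 16                    # A[i][j]: one cache line per j iteration
a_per_i = a_per_j * 100 // 2    # spatial reuse along j: 800 bytes
a_total = a_per_i * 3           # no reuse along i: 2400 bytes in the nest
b_per_j = 16                    # B[j+1][0]: one cache line per j iteration
b_per_i = b_per_j * 100         # no reuse along j: 1600 bytes
b_total = b_per_i               # temporal reuse along i: still 1600 bytes
assert (a_per_j + b_per_j, a_per_i + b_per_i) == (32, 2400)
```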
A second figure presents a more complex example, where
the upper bound of the inner loop is a symbolic variable (i). In this
case, the amount of data accessed during a single i iteration is a
symbolic expression (8i bytes), which cannot readily be compared
against a fixed cache size. However, since the compiler understands how the
i variable behaves, it recognizes that a triangular region within the
A matrix is being accessed, and that i can be substituted with
its average value over the outer loop's iterations to compute a total of 840
bytes being accessed by the A[i][j] reference in this loop nest.
Since 840 bytes is likely to fit in the 8 Kbyte cache, the compiler would
again conclude that both loops are within the localized iteration space.
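The 840-byte total can be checked by summing the per-iteration traffic over the outer loop. The bound i = 0..14 below is an assumption consistent with that total, since the excerpt does not show the outer loop's limits:

```python
# 8i bytes per i iteration; an outer loop of i = 0..14 yields the stated total
total = sum(8 * i for i in range(15))
assert total == 840
```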
We represent the localized iteration space as a vector space, so that it
can be directly compared with our vector space representation of data
reuse. For example, if both loops in a two-level loop nest are within the
localized iteration space (as in the examples above), the
corresponding localized vector space would be represented as
$\mathrm{span}\{(1,0),(0,1)\}$. If only the innermost loop is localized,
then the localized vector space
would be $\mathrm{span}\{(0,1)\}$. We will now describe how the
intrinsic data reuse and the localized iteration space can be combined to
compute data locality.
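As a minimal sketch of this representation (and a preview of the comparison described next), the spans can be held as basis-row matrices and membership of a reuse vector tested numerically. The coordinate order (outer, inner) and the helper below are assumptions for illustration:

```python
import numpy as np

both_localized  = np.array([[1, 0], [0, 1]])  # span{(1,0),(0,1)}: both loops
inner_localized = np.array([[0, 1]])          # span{(0,1)}: innermost loop only

def in_span(space, vector):
    """True if `vector` lies in the span of `space`'s rows."""
    coeffs, *_ = np.linalg.lstsq(space.T, vector, rcond=None)
    return np.allclose(space.T @ coeffs, vector)

print(in_span(inner_localized, np.array([0, 1])))  # True: reuse along j is localized
print(in_span(inner_localized, np.array([1, 0])))  # False: reuse along i is not
```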