| Overview of Optimizations |
|
Here we discuss in detail the optimizations we have devised and
are currently implementing. For parallelizing a group of
instructions to boost performance, the group needs to be executed
fairly often in a typical run of the program. This is because
there is an extra cost incurred by setting up long registers for
parallelization (and by extracting the components of the
registers afterwards). To offset this cost, we need to identify
situations in which a single parallel instruction gives a
substantial benefit over the usual instructions. The particular
situations we investigate are loops.
Operations on arrays seem quite susceptible to parallelization.
For example, the initialization of an array could perhaps be done
faster if we initialized several of the entries at once using a
parallel instruction. A linear transformation, where a series of
data-independent dot products are performed on the rows of a
matrix, is similarly susceptible. We wish to capture these
situations as generally as possible, while keeping our
optimization feasible.

The intuition behind our optimization is to discover variables in
a loop where loop unrolling allows us to parallelize their
definition. We say that a variable or array v is movable in a
loop if
(a) all definition points of v are executed in every possible
execution of the loop,
(b) v is not used anywhere in the loop, except possibly in its
own definition points, and
(c) for each definition point of v, every operand o used in
defining v is such that either o = v, or o does not have a
definition point inside the loop.(1)
Note that condition (a) just means that v is defined
independently of any branches; another way to say it is that the
blocks containing definition points of v dominate the tail of the
loop.

The idea behind this definition is that the definition points of
movable variables may be moved "freely" within a loop. This
allows us to group the definitions of several of these together
in one parallel operation, either in the head block or the tail
block of the loop.(2) Even if a loop body contains only a few
movable variables, unrolling the loop a few times will create
several definitions of each one, and these may be parallelized
together, provided the operation defining the variables is
well-behaved; more precisely, it is associative, commutative, and
has an identity.
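To make the definition concrete, here is a small sketch in C (our
own example, with hypothetical names f, a, and b): count is
movable, the array b is not, and after unrolling the four copies
of count's definition can be grouped.

int f(int x);   /* assumed loop-invariant function */

void example(int *a, int *b, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        count = count + 1;  /* movable: satisfies (a), (b), (c) */
        b[i] = f(a[i]);     /* not movable: operand i is defined in the loop */
    }

    /* After unrolling by 4 (assume 4 divides n), the four copies
     * of count's definition may be moved together and grouped
     * (folded here into one scalar op for illustration): */
    count = 0;
    for (int i = 0; i < n; i += 4) {
        b[i]   = f(a[i]);
        b[i+1] = f(a[i+1]);
        b[i+2] = f(a[i+2]);
        b[i+3] = f(a[i+3]);
        count = count + 4;  /* the four grouped definitions, merged */
    }
}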
|
| Optimization Examples |
|
|
| Simple Example |
This first example illustrates why we chose our definition of
being movable.
|
| Unrolling Examples |
|
This optimization (of unrolling a loop in order to parallelize
some computations) may be applied in several contexts. Here, we
show how it works for an array, using simple pseudocode for a
linear transformation.
|
| Example 1 |
for i = 1 to n
    for j = 1 to n
        c[j] = c[j] + a[i,j] * b[i]
Assume for simplicity that n is a multiple of 4. Then the inner
loop may be unrolled into:
for i = 1 to n
    for j = 1 to n step 4
        c[j]   = c[j]   + a[i,j]   * b[i]
        c[j+1] = c[j+1] + a[i,j+1] * b[i]
        c[j+2] = c[j+2] + a[i,j+2] * b[i]
        c[j+3] = c[j+3] + a[i,j+3] * b[i]
We can then parallelize the c[j] assignments via the following
pseudocode:
for i = 1 to n
    for j = 1 to n step 4
        (c[j], c[j+1], c[j+2], c[j+3]) = (c[j], c[j+1], c[j+2], c[j+3])
            + (a[i,j], a[i,j+1], a[i,j+2], a[i,j+3]) * b[i]
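The same loop, sketched in C with SSE intrinsics (the target ISA
and the name lintrans are our assumptions; the pseudocode above
is target-neutral). The matrix a is row-major with rows of length
n, 4 divides n, and c is zeroed by the caller:

#include <xmmintrin.h>

void lintrans(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++) {
        __m128 bi = _mm_set1_ps(b[i]);      /* (b[i], b[i], b[i], b[i]) */
        for (int j = 0; j < n; j += 4) {
            __m128 cj  = _mm_loadu_ps(&c[j]);        /* (c[j], ..., c[j+3])     */
            __m128 aij = _mm_loadu_ps(&a[i*n + j]);  /* (a[i,j], ..., a[i,j+3]) */
            _mm_storeu_ps(&c[j], _mm_add_ps(cj, _mm_mul_ps(aij, bi)));
        }
    }
}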
|
| Example 2 |
Here is an example of the unrolling optimization on a variable.
The operation defining the variable must be associative,
commutative, and have an identity. (Many arithmetic and logical
ops, such as addition, multiplication, and, or, and xor, have
this property.)
for i = 1 to n
    sum = sum + a[i]
Assume again for simplicity that n is a multiple of 4. Unrolling
gives us:
for i = 1 to n step 4
    sum = sum + a[i]
    sum = sum + a[i+1]
    sum = sum + a[i+2]
    sum = sum + a[i+3]
The optimization first renumbers variables in order to parallelize:
sum1 = sum; sum2 = sum3 = sum4 = 0
for i = 1 to n step 4
    sum1 = sum1 + a[i]
    sum2 = sum2 + a[i+1]
    sum3 = sum3 + a[i+2]
    sum4 = sum4 + a[i+3]
sum = sum1 + sum2 + sum3 + sum4
Then parallel instructions are inserted:
v = (sum, 0, 0, 0)
for i = 1 to n step 4
    v = v + (a[i], a[i+1], a[i+2], a[i+3])
sum = v[1] + v[2] + v[3] + v[4]
Observe that the above transformation may be applied to any
variable defined with an operation that is associative,
commutative and has an identity. (For multiplication, we would
have the extra variables initialized to 1 instead of 0.)
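As a concrete sketch, here is the same transformation written in
C with SSE2 integer intrinsics (the ISA and the name vec_sum are
our assumptions; integers are used since their addition is
genuinely associative and commutative). Assume 4 divides n:

#include <emmintrin.h>

int vec_sum(const int *a, int n, int sum)
{
    __m128i v = _mm_set_epi32(0, 0, 0, sum);   /* v = (sum, 0, 0, 0) */
    for (int i = 0; i < n; i += 4)
        v = _mm_add_epi32(v, _mm_loadu_si128((const __m128i *)&a[i]));
    int lane[4];
    _mm_storeu_si128((__m128i *)lane, v);
    return lane[0] + lane[1] + lane[2] + lane[3];  /* sum1+sum2+sum3+sum4 */
}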
|
| A More General Optimization |
|
Suppose every instruction has a window within which it can move.
It can move earlier in the program provided that it does not kill
a value that is needed earlier. It can move later in the program
provided that it still executes before its result is needed and
none of its operands are killed in the meantime. A sketch of the
window-overlap test appears below.
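In C, the test might look like this (a minimal sketch under our
own formulation of windows as ranges of schedule positions):

/* Each instruction gets a window [earliest, latest] of positions
 * it may legally move to. Two instructions are candidates for
 * vectorization only if their windows overlap, i.e. they can be
 * scheduled side by side. */
struct window { int earliest, latest; };

static int windows_overlap(struct window a, struct window b)
{
    return a.earliest <= b.latest && b.earliest <= a.latest;
}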
Let's define vectorizable instructions as instructions that meet
the following criteria:
1. Their windows overlap. This implies that they could be
rearranged so that they execute right next to each other. It also
implies that they do not depend on each other (otherwise their
windows would not overlap).
2. They are similar instructions (i.e. add, multiply, load,
etc.).
3. Their operands are compatible. This may be
instruction-specific:
* A floating point add cannot be vectorized with an integer add.
* A load can only be vectorized with another load so long as
certain criteria are met, including data locality and alignment.
Consider these loads:
x=a[0]
y=a[1]
These are vectorizable, since our vectors can store up to 4
elements; these two are right next to each other.
x=a[0]
y=a[3]
These are also vectorizable, provided we meet the alignment
constraints (loads often need to be aligned on certain
boundaries). We can just ignore the two middle elements, a[1] and
a[2]: if we add something to them, add 0, and if we multiply
them, multiply by 1.
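A sketch of this trick in C with SSE intrinsics (the ISA and the
name gap_load_add are our assumptions): the middle lanes ride
along and receive the identity element.

#include <xmmintrin.h>

void gap_load_add(const float *a, float dx, float dy, float *x, float *y)
{
    __m128 v = _mm_loadu_ps(a);  /* (a[0], a[1], a[2], a[3]) */
    /* add dx to lane 0, dy to lane 3, and the identity 0 to the
     * two middle lanes so they stay harmless */
    v = _mm_add_ps(v, _mm_set_ps(dy, 0.0f, 0.0f, dx));
    float lane[4];
    _mm_storeu_ps(lane, v);
    *x = lane[0];
    *y = lane[3];
}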
x=a[0]
y=a[5]
This pair is not vectorizable: we would need a vector that could
store up to 6 elements. Now, many instructions are going to be
vectorizable with many other instructions. Look at this code:
u=a[0]
v=a[1]
w=a[2]
x=a[3]
y=a[4]
z=a[5]
Here u is vectorizable with v, w, and x; v with u, w, x, and y
(again, alignment would most likely eliminate y, but let's ignore
that for now); w with everything; x with everything; y with v, w,
x, and z; and finally z with w, x, and y. Some of these
vectorizations can even be combined. For instance, if we combine
u and v, the pair is still vectorizable with w and x. For each
instruction, at most n instructions can be vectorized together,
where n is the width of the register in elements. Once we've
established these potential vectorization partners, the next
question is: which ones do we actually combine? Obviously, some
combinations may break other combinations; one simple greedy way
of choosing groups is sketched below.
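This is a minimal sketch (our own simplification, with the
hypothetical names group_loads and WIDTH) that packs sorted load
offsets greedily into 4-element vectors:

#include <stdio.h>

#define WIDTH 4   /* register width, in elements */

/* Given load offsets sorted ascending, pack up to WIDTH loads
 * into one vector load as long as they span at most WIDTH
 * consecutive elements. groups[g][0] holds the group size. */
static int group_loads(const int *off, int n, int groups[][WIDTH + 1])
{
    int g = 0;
    for (int i = 0; i < n; ) {
        int count = 1;
        while (i + count < n && count < WIDTH &&
               off[i + count] - off[i] < WIDTH)
            count++;
        groups[g][0] = count;
        for (int k = 0; k < count; k++)
            groups[g][k + 1] = off[i + k];
        g++;
        i += count;
    }
    return g;
}

int main(void)
{
    int off[] = {0, 1, 2, 3, 4, 5};   /* the loads u..z above */
    int groups[6][WIDTH + 1];
    int g = group_loads(off, 6, groups);
    for (int j = 0; j < g; j++) {     /* prints {0,1,2,3} and {4,5} */
        printf("group %d:", j);
        for (int k = 1; k <= groups[j][0]; k++)
            printf(" a[%d]", groups[j][k]);
        printf("\n");
    }
    return 0;
}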
Occasionally, a combination will break others in trivial ways:
x=a[0]
y=a[1]
x=x+1
y=y+1
The loads are vectorizable with each other, and the increments
with each other, but we can't merge a load with an increment.
Sometimes it will be a little less intuitive:
x=a[0]
y=a[1]
z=a[2]
j=b[0]
k=b[1]
l=b[2]
x=x+j
y=y+l
z=z+k
Notice that if we set up v1 = (x, y, z) and v2 = (j, k, l), we
can't simply do v1 + v2, because that would give us (x+j, y+k,
z+l), but we really want (x+j, y+l, z+k). Sometimes that's not so
bad. The PlayStation 2, for example, has a parallel exchange
instruction, so we can do:
lq &a, v1        ; v1 = (x, y, z, -)
lq &b, v2        ; v2 = (j, k, l, -)
pexch v2, v2     ; exchange the middle elements: v2 = (j, l, k, -)
padd v1, v2      ; v1 = (x+j, y+l, z+k, -)
and we've solved that problem (the exchange and add mnemonics
above are abbreviated). The additional cost of vectorization is
one exchange instruction. In other cases, the cost may be a few
more or a few fewer instructions.
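On x86, an analogous repair can be made with a shuffle; a minimal
sketch in C with SSE intrinsics (the ISA and the name crossed_add
are our assumptions):

#include <xmmintrin.h>

/* Swap the middle two lanes of v2 before the add so the lanes
 * line up: (j, k, l, -) becomes (j, l, k, -). */
void crossed_add(const float *a, const float *b, float *out)
{
    __m128 v1 = _mm_loadu_ps(a);   /* (x, y, z, -) */
    __m128 v2 = _mm_loadu_ps(b);   /* (j, k, l, -) */
    v2 = _mm_shuffle_ps(v2, v2, _MM_SHUFFLE(3, 1, 2, 0)); /* (j, l, k, -) */
    _mm_storeu_ps(out, _mm_add_ps(v1, v2)); /* (x+j, y+l, z+k, -) */
}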
|
| Footnotes |
|
(1) A problem with (c) is that, for some matrix computations, the
operand may have a definition inside of the loop, but only
because it is "set up" code: for example, a machine language
version of the first unrolling example will probably have a
register defined in the loop that contains a[i,j] * b[i]. So,
more generally, (c) should only require that o is loop-invariant.
I use the more restrictive version of (c) for now because I don't
know how hard it would be to compute loop invariants, whether we
have other tools to do that, etc.
(2) We do not consider loops in which branches out of the loop may occur in the loop body. |