Overview of Optimizations
Here, we discuss in detail the optimizations we have devised and are currently implementing. To parallelize a group of instructions in a way that boosts performance, the group needs to be executed fairly often in a typical execution: there is an extra cost incurred in setting up long registers for parallelization (and in extracting the components of the registers afterward). To offset this cost, we need to identify situations in which a single parallel instruction gives a substantial benefit over the usual instructions. The particular situations we investigate are loops.

Operations on arrays seem quite susceptible to parallelization. For example, the initialization of an array could perhaps be done faster if we initialized several of the entries at once using a parallel instruction. A linear transformation, where a series of data-independent dot products is performed on the rows of a matrix, is similarly susceptible.

We wish to capture these situations as generally as possible, while keeping our optimization feasible. The intuition behind our optimization is to discover variables in a loop whose definitions loop unrolling allows us to parallelize. We say that a variable or array v is movable in a loop if (a) all definition points of v are executed in every possible execution of the loop, (b) v is not used anywhere in the loop, except possibly in its own definition points, and (c) for each definition point of v, every operand o used in defining v is such that either o = v, or o has no definition point inside the loop.(1) Note that condition (a) just means that v is defined independently of any branches; another way to say it is that the blocks containing definition points of v dominate the tail of the loop.

The idea behind this definition is that the definition points of movable variables may be moved "freely" within a loop. This allows us to group several of these definitions together in one parallel operation, either in the head block or the tail block of the loop.(2) If there are not many movable variables in a loop body, then unrolling the loop a few times will create several definitions of the same variable which may be parallelized together, provided the operation defining the variable is well-behaved; more precisely, it is associative, commutative, and has an identity.

Optimization Examples

Simple Example
This first example illustrates why we chose our definition of being movable. Assume b and c are movable in the loop.

for i = 1 to n
blah_1
b += a[i]
blah_2
c += d[i]
blah_3
end for

This code is easily parallelized, via the (rough) pseudocode:

v = (b,c)
for i = 1 to n
blah_1
w = (a[i], d[i])
v = padd(v,w)
blah_2
blah_3
end for
b = fst(v)
c = snd(v)


We are allowed to move the definition of c because, by assumption, c is defined outside of any branches in the loop, and the operands it uses have no definition points inside the loop.
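As a sanity check, the transformation above can be simulated in plain Python. This is a sketch under the assumption that the two-lane packed register is modeled as a tuple and that padd, fst, and snd (names taken from the pseudocode) behave as element-wise add and component extraction:

```python
def padd(v, w):
    # models the parallel add: element-wise addition on a packed register
    return tuple(x + y for x, y in zip(v, w))

def scalar_version(b, c, a, d):
    # the original loop, with the blah_k statements omitted
    for i in range(len(a)):
        b += a[i]
        c += d[i]
    return b, c

def parallel_version(b, c, a, d):
    v = (b, c)                      # pack both accumulators into one "register"
    for i in range(len(a)):
        v = padd(v, (a[i], d[i]))   # one parallel add replaces two scalar adds
    return v[0], v[1]               # b = fst(v); c = snd(v)

a = [1, 2, 3, 4]
d = [10, 20, 30, 40]
print(scalar_version(5, 7, a, d))   # (15, 107)
print(parallel_version(5, 7, a, d)) # (15, 107)
```

Both versions agree, which is exactly what movability of b and c guarantees: their updates can be regrouped without changing the result.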

Unrolling Examples
This optimization (of unrolling a loop in order to parallelize some computations) may be applied in several contexts. Here, we show how it works for an array, in simple pseudocode of a linear transformation.

Example 1
for i = 1 to n
for j = 1 to n
c[i] += A[i,j]*b[j]
end for
end for

Assume for simplicity that n is a multiple of 4. Then the outer loop may be unrolled into:

for i = 1 to n step 4
for j = 1 to n
c[i] += A[i,j]*b[j]
c[i+1] += A[i+1,j]*b[j]
c[i+2] += A[i+2,j]*b[j]
c[i+3] += A[i+3,j]*b[j]
end for
end for

We can then parallelize the c[i] assignments via the following pseudocode:

for i = 1 to n step 4
c' = (c[i], c[i+1], c[i+2], c[i+3])
for j = 1 to n
d' = (A[i,j]*b[j], A[i+1,j]*b[j], A[i+2,j]*b[j], A[i+3,j]*b[j])
c' = padd(c',d')
end for
c[i] = fst(c')
c[i+1] = snd(c')
c[i+2] = thrd(c')
c[i+3] = frth(c')
end for
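The same transformation can be checked in Python. This sketch assumes the 4-lane register is modeled as a tuple and padd as element-wise addition, and compares the unrolled version against the naive double loop (n a multiple of 4, as above):

```python
def padd(v, w):
    # element-wise addition models the 4-lane parallel add
    return tuple(x + y for x, y in zip(v, w))

def matvec_naive(A, b):
    n = len(b)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[i] += A[i][j] * b[j]
    return c

def matvec_unrolled(A, b):
    # outer loop unrolled by 4: the four accumulators c[i..i+3]
    # live in one simulated 4-lane register for the whole inner loop
    n = len(b)
    c = [0] * n
    for i in range(0, n, 4):
        cp = (c[i], c[i+1], c[i+2], c[i+3])
        for j in range(n):
            dp = (A[i][j]*b[j], A[i+1][j]*b[j], A[i+2][j]*b[j], A[i+3][j]*b[j])
            cp = padd(cp, dp)
        c[i], c[i+1], c[i+2], c[i+3] = cp   # fst .. frth
    return c

A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
b = [1, 1, 1, 1]
print(matvec_naive(A, b))     # [10, 26, 42, 58]
print(matvec_unrolled(A, b))  # [10, 26, 42, 58]
```

Note the packing and unpacking happen once per group of four rows, not once per inner iteration, which is where the setup cost gets amortized.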

Example 2
Here is an example of the unrolling optimization on a variable. The operation defining the variable must be associative, commutative, and have an identity. (Notice that many arithmetic and logical ops, such as addition, multiplication, and, or, and xor, have this property.)

for i = 1 to n

sum += a[i]

end for

Assume again for simplicity that n is a multiple of 4. Unrolling gives us:

for i = 1 to n step 4
sum += a[i]
sum += a[i+1]
sum += a[i+2]
sum += a[i+3]
end for

The optimization first renumbers variables in order to parallelize:

sum1 = sum; sum2 = sum3 = sum4 = 0
for i = 1 to n step 4
sum1 = sum1 + a[i]
sum2 = sum2 + a[i+1]
sum3 = sum3 + a[i+2]
sum4 = sum4 + a[i+3]
end for
sum = sum1 + sum2
sum += sum3 + sum4

Then parallel instructions are inserted:

v = (sum, 0, 0, 0)
for i = 1 to n step 4
w = (a[i], a[i+1], a[i+2], a[i+3])
v = padd(v,w)
end for
sum = fst(v)+snd(v)
sum += thrd(v)+frth(v)

Observe that the above transformation may be applied to any variable defined with an operation that is associative, commutative and has an identity. (For multiplication, we would have the extra variables initialized to 1 instead of 0.)
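The whole reduction can be sketched in Python, again modeling the 4-lane register as a tuple and padd as element-wise addition (n a multiple of 4, as assumed in the text):

```python
def padd(v, w):
    return tuple(x + y for x, y in zip(v, w))

def sum_parallel(a, start=0):
    # the three extra lanes start at 0, the identity of +
    # (for a product reduction they would start at 1 instead)
    v = (start, 0, 0, 0)
    for i in range(0, len(a), 4):
        v = padd(v, (a[i], a[i+1], a[i+2], a[i+3]))
    # combine the four partial sums: sum = fst+snd, then += thrd+frth
    return (v[0] + v[1]) + (v[2] + v[3])

print(sum_parallel([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
print(sum_parallel([2, 4, 6, 8], start=10))    # 30
```

Associativity and commutativity are what justify reordering the additions into four interleaved partial sums; the identity element is what lets the unused lanes start "empty".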

A more general optimization
Suppose every instruction has a window within which it can move. It can move earlier in the program provided that it does not kill a value that is still needed. It can move later in the program provided that it executes before its result is needed, and none of its operands are killed in the meantime.

Let's define vectorizable instructions as instructions that meet the following criteria:

1. Their windows overlap. This implies that they could be rearranged so that they execute right next to each other. It also implies that they do not depend on each other (otherwise their windows would not overlap).
2. They are similar instructions (i.e. add, multiply, load, etc.).
3. Their operands are compatible. This may be instruction-specific:
* A floating point add cannot be vectorized with an integer add.
* A load can only be vectorized with another load so long as certain criteria are met, including data locality and alignment. Consider this:

x=a[0]
y=a[1]
These are vectorizable, since our vectors can store up to 4 elements and these two loads are side by side.

x=a[0]
y=a[3]

These are also vectorizable, provided we meet the alignment constraints (loads often need to be aligned on certain boundaries). We can just ignore the two middle elements: if we add to them, we add 0, and if we multiply them, we multiply by 1.

x=a[0]
y=a[5]

These loads are not vectorizable, because they would need a vector that could span 6 elements.

Now, many instructions are going to be vectorizable with many other instructions. Look at this code:

u=a[0]
v=a[1]
w=a[2]
x=a[3]
y=a[4]
z=a[5]

Here, u is vectorizable with v, w, and x; v with u, w, x, and y (again, alignment would most likely eliminate y, but let's ignore that for now); w with everything; x with everything; y with v, w, x, and z; and finally, z with w, x, and y.

Now, some of these vectorizations can even be combined. For instance, if we were to combine u and v, the pair is still vectorizable with w and x. For each instruction, at most n instructions (including itself) can be packed into one vectorization, where n is the width of the register in elements.
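The partner sets listed above can be computed mechanically. The sketch below models only the width constraint on loads (two loads can share one n-element vector iff their array indices fit inside a window of n elements); window overlap and alignment, which the text also requires, are deliberately ignored here, and the function name is ours:

```python
def load_partners(indices, width=4):
    # indices: map from variable name to the array index it loads.
    # Two loads are partners iff their indices are less than `width` apart,
    # i.e. both elements fit in one width-element vector register.
    partners = {}
    for name, i in indices.items():
        partners[name] = sorted(
            other for other, j in indices.items()
            if other != name and abs(i - j) < width
        )
    return partners

loads = {'u': 0, 'v': 1, 'w': 2, 'x': 3, 'y': 4, 'z': 5}
print(load_partners(loads))
# u: [v, w, x]; v: [u, w, x, y]; w and x: everything;
# y: [v, w, x, z]; z: [w, x, y] -- matching the partner sets above
```

A real implementation would intersect this with the window-overlap and alignment checks before proposing a combination.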

Once we've established these potential vectorization partners, the next question is, "who do we combine with?" Obviously, some combinations may break others. Sometimes this happens in trivial ways:

x=a[0]
y=a[1]
y=y+1
z=z+1

The loads are vectorizable, and the increments are vectorizable, but you can't apply both vectorizations at once: after the vector load, y sits in one register, while the vector increment needs y packed together with z.

Sometimes it will be a little less intuitive.

x=a[0]
y=a[1]
z=a[2]
j=b[0]
k=b[1]
l=b[2]
x+=j
y+=l
z+=k

Notice if we set up:

v1=(x,y,z) and v2=(j,k,l)

We can't do v1+v2 because that would give us (x+j, y+k, z+l) but we really want (x+j, y+l, z+k).

Sometimes that's not so bad. For example, the PlayStation 2 has a parallel exchange instruction that swaps the middle elements of a vector register.

So we can do

lq &a, v1
lq &b, v2
pexcw v2, v3
padd v1, v3, v1

And we've solved that problem.
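The fix can be simulated in Python. Here pexcw is modeled as swapping the two middle lanes of a 4-lane register (on the PS2, pexcw permutes the center words of a 128-bit register; the 4-lane tuple model and the zero-padded fourth lane are our simplifications):

```python
def padd(v, w):
    return tuple(a + b for a, b in zip(v, w))

def pexcw(v):
    # models the "parallel exchange center word": swaps the two
    # middle lanes of a 4-lane register (w0, w1, w2, w3) -> (w0, w2, w1, w3)
    return (v[0], v[2], v[1], v[3])

x, y, z = 1, 2, 3
j, k, l = 10, 20, 30

v1 = (x, y, z, 0)   # lq &a, v1  (fourth lane padded with the identity, 0)
v2 = (j, k, l, 0)   # lq &b, v2
v3 = pexcw(v2)      # (j, l, k, 0): lanes now line up with the adds we want
v1 = padd(v1, v3)   # (x+j, y+l, z+k, 0)
print(v1[:3])       # (11, 32, 23)
```

The shuffle reorders the operands once so that a single parallel add then produces exactly (x+j, y+l, z+k).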

Here, the additional cost of vectorization is one exchange instruction. In other cases, the cost may be a few instructions more or fewer.

Footnotes
(1) A problem with (c) is that, for some matrix computations, an operand may have a definition inside of the loop, but only because it is "set-up" code; for example, a machine-language version of the first unrolling example will probably have a register defined in the loop that contains A[i,j] * b[j]. So, more generally, we could require that o be loop-invariant. I use the weaker version of (c) for now because I don't know how hard it would be to compute loop invariants, whether we have other tools to do that, etc.

(2) We do not consider loops in which branches out of the loop may occur in the loop body.