15-740 Fall '97
Sparse Matrix Multiplication

When I gave the lecture in class Monday 9/22, showing the code generated for different machines to do sparse matrix * dense vector, I realized that I only had relatively old processors on my list. So, I went and ran the code on some higher performance machines. Here's the complete list:

Machine         clock nanosecs/ele     Cycles/ele      Compiler

VAX 25? 2448 122? (GCC) MIPS 25 365 18 (GCC) SPARC 50 388 39 (GCC) PPC 601 62 63 8 (IBM) PPC 603 75 125 19 (CodeWarrior) Pentium 90 78 14 (GCC) PPC 603e 91 60 11 (CodeWarrior) HP Precision 100 50 10 (GCC) UltraSparc 160 38 12.5 (GCC) MIPS R5000 180 37 13.5 (SGI) MIPS R10000 185 13.4 5.0 (SGI) PPC 604e 200 14 5.6 (CodeWarrior) Pentium II 265 17 9.0 (GCC) Alpha 21064 267 26 13.7 (DEC)

The entry "604e" is interesting. It's my desktop MAC. The code generated by the compiler is nothing special. It makes little use of the fancy PPC instructions (just FP MULT-ADD.) There are 9 instructions in the inner loop, and they aren't ordered very well.

  while (...)
        temp += *(valp++) * x[*cip++];

Using the following register assignment:

        r4      x
        r3      valp
        r8      cip
        fp2     temp

The loop code is:

        lwz     r0,0(r8)        # r0 = *cip
        addi    r8,r8,4         # cip++
        lfd     fp1,0(r3)       # fp1 = *valp
        addi    r3,r3,8         # valp++
        slwi    r0,r0,3         # r0 *= 8
        lfdx    fp0,r4,r0       # fp0 = x[]
        fmadd   fp2,fp1,fp0,fp2 # temp += () * ()
        cmplw   r8,r10          # Compare r8 : r0?
        blt     *-32            # Loop if <

But, the processor's got lots of horsepower.

If you have an interesting processor that's not on this list, I'd like to hear what kind of performance you get. Here's how to run the benchmark: