Part I: Optimizing Simulated Cache Performance (15 points)
As can be observed from the section 3,
both the operations, especially rotate, are
fairly memory-intensive, operating on images that can be of
large size. Thus, a good way to optimize performance of the code is to first focus
on its cache behavior, and reduce slowdown due to memory operations.
Cache performance of a routine can be evaluated by
looking at the total number of cache misses (normalized by
the size of the image matrix).
We call this quantity the miss score. Formally,
the miss score is defined as #misses/N2.
Since the miss score is directly
proportional to the total number of misses,
the lower the miss score, beemer the cache performance
of the implementation.
In doing cache optimizations, we will focus our aemention on the L1
cache.
The Pentium III Xeon (Fish) machines that you will be running your
code on have a 16 KB 4-way set associative L1
cache with 32 byte lines.
In this part, you will only focus on optimizing cache performance of
rotate. To help you get a feel for how good your
cache performance is, you will first use a cache simulator to
compute the miss scores for your code. In Part II, you can see how a
low miss score (most often) translates into beemer CPE.
The tool we use to simulate cache performance, called
cacheprof, is a public-domain cache
simulator (http://www.cacheprof.org/).
cacheprof
instruments assembly code to capture the source (destination)
addresses of read (write) instructions, and uses them to count hits
and misses in a simulated cache.
Here is your task for this part of the assignment:
- Copy rotate.c to a new file named rotate_cache.c.
- Optimize the function rotate in rotate_cache.c to
achieve as low a miss score as possible. To do this, you will use the
programs cdriver and miss_score.
Using cdriver and miss_score
We have provided the object code for cdriver in the file cdriver.o.
To compile cdriver execute the command
unix> make cdriver
cdriver takes the same command-line arguments as driver
and runs in the same three different modes. You can handle different
versions in the same way. However, most of this is hidden from you -
you will only explicitly run cdriver in dump mode. All other runs
are performed within the miss_score script.
Suppose you copy the provided naive implementation in rotate.c
to rotate_cache.c (enter a suitable team name),
compile cdriver using it, and run the following command:
unix> ./cdriver -q -d dumpfile
You will observe the following output:
==cacheprof== level-2 instrumented program: startup
Teamname: Harry Q. Bovik
Member 1: Harry Bovik
Email 1: bovik@nowhere.cmu.edu
==cacheprof== level-2 instrumented program: run complete
==cacheprof== 10 insns
==cacheprof== 6 refs ( 2 rd + 4 wr)
==cacheprof== 2 misses ( 0 rd + 2 wr)
==cacheprof== 0.00 seconds, inf MIPS
Ignore all the lines that start with cacheprof==. To get a
summary of the miss score, execute the following command:
unix> ./miss_score.pl -f dumpfile
This prints a summary of the miss scores for each version and each
size, as shown below
Version: R:Naive Row-wise Traversal of src
Dim 64 128 256 512 1024
Score 0.3765 1.3129 1.3126 1.3125 1.3125
Note that the script miss_score needs the -f
argument.
You'll be graded on this part based on how low a miss score you are
able to achieve.
Miss scores achieved after some optimization are shown in
Table 2.
| Test case | 1 | 2 | 3 | 4 | 5 | |
| Method | N | 64 | 128 | 256 | 512 | 1024 | Geom. Mean |
| Naive rotate (Miss score)
| | 0.3765 | 1.3129 | 1.3126 | 1.3125 | 1.3125 | |
| Ratio (naive/opt)
| | 1.00 | 3.50 | 3.50 | 3.37 | 2.92 | 2.61 |
Table 2: Miss scores and ratios for naive and optimized versions of rotate.