In this section, we present experimental results to evaluate the performance of our multiprocessor prefetching algorithm. We begin with a high-level overview of the performance improvements. We then focus in greater detail on several key aspects of the algorithm: locality analysis, prefetch scheduling, prefetching indirect references, and using exclusive-mode prefetches, in Sections , , , and , respectively. We examined several of these same aspects of the algorithm earlier in Chapter , but only within the context of a uniprocessor environment; many of these issues change or deserve a closer look in a multiprocessor environment. We start now with a brief performance overview.
The performance of the original code (i.e. without prefetching) for the eight multiprocessor applications is shown in Figure . Notice that there is a new component of execution time: time spent stalled for synchronization. These synchronization operations include events such as acquiring locks, waiting for other processors to reach barriers (i.e. poor load balancing), and time spent spinning on empty task queues. The prefetching compiler algorithm was applicable to five of the eight multiprocessor applications (OCEAN, LU, MP3D, CHOLESKY, and LOCUS). We examine those five applications in this section, and will discuss the other three applications (WATER, PTHOR, and BARNES) later in Section . The overall performance of the five cases that did improve is shown in Figure .
The speedups of the applications in Figure range from 6% to 113%, with three of the five speeding up by 45% or more. The memory stall times have been reduced by 31% to 88% through a combination of lower miss rates and lower average miss penalties, as shown in Table . In two cases (OCEAN and LU), more than half of the synchronization latency was eliminated. This was due to improved load balancing, since prefetching helps to reduce the variability in task execution times. However, prefetching did not reduce the synchronization latency of CHOLESKY, where most of the synchronization time is spent in explicit spin loops waiting for more tasks to be produced. Figure also shows that the overheads of prefetching (both instruction and memory stall overheads) are relatively small, therefore translating much of the memory stall reduction into actual performance benefit. In the following subsections, we examine each aspect of the multiprocessor prefetching algorithm in greater detail.
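To make the relationship between these metrics concrete, the following sketch shows how a speedup and a memory stall reduction are computed from normalized execution-time components. The numbers here are purely hypothetical, chosen only to illustrate the arithmetic; they are not measurements from any of the applications above.

```python
# Illustrative only: execution time decomposed into busy (compute),
# memory stall, and synchronization stall components, normalized so
# the original total is 100. The prefetched version adds a small
# prefetch overhead component (instruction + memory stall overhead).

def speedup(orig, prefetched):
    """Speedup = original execution time / prefetched execution time."""
    return sum(orig.values()) / sum(prefetched.values())

# Hypothetical breakdowns (not real data from the experiments).
orig = {"busy": 50, "memory": 40, "sync": 10}               # total = 100
pf   = {"busy": 50, "memory": 12, "sync": 6, "overhead": 4}  # total = 72

s = speedup(orig, pf)                  # 100 / 72, i.e. about a 39% speedup
mem_reduction = 1 - pf["memory"] / orig["memory"]  # 0.70, i.e. 70% reduction
```

Note that the overhead component directly erodes the benefit: the memory stall time drops by 28 units, but 4 of those units are given back as prefetch overhead, which is why keeping overheads small is essential to translating stall reduction into actual speedup.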