As in the uniprocessor experiments, the SUIF compiler is used to generate fully-functional object code with prefetching. The performance of this object code is simulated using an event-driven simulator that models the architecture at the behavioral level: the caches, the coherence protocol, contention, and arbitration for buses are all modeled in detail. The simulations are based on a 16-processor configuration. The architecture simulator is tightly coupled to the Tango-Lite reference generator (the threads-based successor to the process-based Tango reference generator) to ensure a correct interleaving of accesses. For example, a process performing a read is blocked until that read completes, where the latency of the read is determined by the architecture simulator. Operating system references are not modeled. Unless an application gives specific directives, main memory is distributed uniformly across all nodes using a round-robin page allocation scheme.
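The round-robin page allocation described above can be sketched as follows. The page size and the exact page-to-node mapping are illustrative assumptions (the text does not specify them); the point is simply that consecutive pages rotate across the nodes' memories.

```python
PAGE_SIZE = 4096   # bytes; assumed page size, not stated in the text
NUM_NODES = 16     # one memory module per node in the 16-processor configuration

def home_node(address: int) -> int:
    """Map an address to its home node under round-robin page
    allocation: page p lives on node p mod NUM_NODES, so memory
    is spread uniformly across all nodes."""
    page = address // PAGE_SIZE
    return page % NUM_NODES

# Consecutive pages land on consecutive nodes, wrapping around:
# pages 0..15 map to nodes 0..15, and page 16 wraps back to node 0.
```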
We now arrive at a difficult methodological problem that arises when simulating large multiprocessors. Because detailed simulators are enormously slower than the real machines they model, one can only afford to simulate much smaller problems than those that would be run on the real machine. However, running small problems on a full-sized machine may result in unrealistic caching behavior, since, for example, the entire working set may fit in the cache. The question, therefore, is how to scale the machine parameters so as to obtain realistic performance estimates.
A thorough examination of this question has been presented by Weber . Weber uses variational analysis (i.e., observing the effects of varying cache size and problem size on performance) together with application-specific knowledge to choose appropriate cache sizes given ``smaller-than-real'' problem sizes for the SPLASH applications. This analysis provides the basis for our own decisions on how to scale the caches. Taking into account the differences between the two studies (e.g., 16 vs. 64 processors, two-level vs. single-level cache hierarchies, and the fact that we generally run larger problem sizes), we scale down the original DASH cache hierarchy from a 64 Kbyte primary and 256 Kbyte secondary cache to an 8 Kbyte primary and 64 Kbyte secondary cache for seven of the eight applications. The exception is PTHOR, where, given its small problem size, we use a 2 Kbyte primary cache and 4 Kbyte secondary cache. (However, since prefetches can only be added to PTHOR by hand, performance results for PTHOR are presented only in Section .) To evaluate the impact of the scaled caches, we present results with varied cache sizes in Section .