To: Distribution
From: David Kahaner ONRFE [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Article by H. Yoshihara on Japanese supercomputer performance on a specific computational fluid dynamics benchmark program and comparison with Cray.
1 May 1990

Hideo Yoshihara served as a liaison scientist for the Office of Naval Research Far East from April 1988 until May 1990. His assignment was to follow the progress of advanced supercomputers and to review and assess the viscous flow simulation research in the Far East. Dr. Yoshihara formerly was with the Boeing Company, where he was Engineering Manager for Applied Computational Aerodynamics. He was also an affiliate professor in the Department of Aeronautics and Astronautics of the University of Washington, an AIAA Fellow, and a former member of the Fluid Dynamics Panel of AGARD/NATO.

The article below is one portion of a longer report prepared by Yoshihara, and he has given me permission to circulate it. The complete article will appear in the Office of Naval Research Scientific Information Bulletin, Vol. 15, No. 3.

PERFORMANCE OF JAPANESE SUPERCOMPUTERS VIS-A-VIS CRAY COMPUTERS

Hideo Yoshihara

This article is part of a final assessment report for the Office of Naval Research Far East titled "Supercomputers and Computational Fluid Dynamics in Japan."

EXPANDED SUMMARY

1. The performance of the single-CPU Fujitsu VP-400E, Hitachi S-820/80, and NEC SX-2A and the 8-CPU Cray YMP/832 was measured on an alternating direction implicit (ADI) Navier/Stokes code on a mesh approximately 102 x 102 x 102.

2. Actual speeds attained were 0.395 GFLOPS for the VP-400E, 0.602 GFLOPS for the S-820/80, 0.414 GFLOPS for the SX-2A, and 1.5 GFLOPS for the YMP/832. This corresponded to actual/peak speed ratios of about 30 percent for the single-CPU Japanese computers and 56 percent for the 8-CPU Cray YMP. Because of the idealized definition of peak speed, a performance of 56 percent is outstanding but 30 percent is poor.

3.
Reasons for the poor performance of Japanese computers were as follows:

o High startup (pipe fill-in) overhead in the arithmetic processor pipes due to reduction of vector length (production run) by its spread over many pipes.

o Large memory latency in the VP-400E and SX-2A, primarily due to use of slow MOS chips. (Memory latency is the delay in retrieving numbers from memory for bussing to the CPU.)

o Inadequate memory/register bandwidth for the VP-400E.

4. The performance of the autovectorizer of Japanese computers was outstanding, vectorizing 99 percent of the benchmark code without directives.

5. Reasons for outstanding performance of the Cray YMP/832 were as follows:

o Well-balanced architecture. Very low startup overhead in the processor pipes (no dilution of vector length). Low memory latency and adequate memory/register bandwidth.

o Outstanding performance of the Autotasker, which identified independent parts of the code, assigning their calculation to different CPUs with minimum idle time.

6. Benchmarks on the recently offered Fujitsu VP-2600 (1 CPU, 4 GFLOPS) and NEC SX-3/44 (4 CPU, 22 GFLOPS) are scheduled for April and September 1990, respectively. Much-awaited results will be reported in the fall issue of the ONRFE Scientific Information Bulletin. Expected results of the benchmark are as follows:

o Fujitsu VP-2600: Pipe startup overhead should be similar to the VP-400E, but the number of independent paths between memory and register has been doubled, and memory latency has been essentially eliminated by overlapping fetch/store instructions. An actual speed of approximately 1 GFLOPS can be expected with an actual/peak speed ratio of 50 percent, taking the peak speed as a more realistic 2 GFLOPS rather than the advertised 4.

o NEC SX-3/44: With 16 pipes per CPU, pipe startup overhead will be significantly increased. Memory/register bandwidth should be adequate for the expected reduced actual speeds with four vector fetch and two vector store paths.
Memory latency has been mostly eliminated by overlapped fetch/store instructions. Assuming increased pipe startup overhead to be counterbalanced by reduction of memory latency, an actual speed of about 7 GFLOPS can be expected, still an incredible speed. This estimate additionally assumes a parallelization efficiency of the order of that of the Cray Autotasker, obtained, however, with heavier assistance from compiler directives.

7. In future supercomputers, much needed increased peak speeds must be obtained by increasing the number of processor pipes per CPU and increasing the number of CPUs. With the ADI Navier/Stokes algorithm, an increased number of pipes/CPU will decrease effective vector length, thereby increasing pipe startup overhead. Similarly, increasing the number of CPUs will decrease granularity and thus increase parallelization overhead. Both trends will decrease computer speed and hence degrade performance in ADI algorithms. Other Navier/Stokes algorithms must be explored, such as explicit methods with greatly increased vector length, or column relaxation methods with increased code granularity. Such benefits are accompanied by corresponding decreases of convergence rate, so that a balancing of the opposing factors will be necessary. A benchmark, not comparing different computers for a given algorithm but comparing different Navier/Stokes algorithms on a given single- or multiple-CPU computer, is thus recommended.

Finally, remember that the present computer assessment is based solely on the speed performance on a specific, albeit a mainline, CFD algorithm. Relative performance will undoubtedly change for other programs including other Navier/Stokes algorithms.

INTRODUCTION

Japanese supercomputers were reviewed and assessed through two activities. First was a benchmark on a Navier/Stokes code, carried out together with Professor K.
Fujii of the Institute of Space and Astronautical Science (Sagamihara) (Ref 1), that would exercise key components of the supercomputers in a real computational fluid dynamic (CFD) environment. Computers benchmarked included all operational Japanese supercomputers and the 8-CPU Cray YMP/832. Computer company participation was enthusiastic. The benchmark provided an opportunity to meet and know key working level supervisors and senior programmers.

Second was a review of the architecture of the new supercomputers, the 1-CPU Fujitsu VP-2600 and the 4-CPU NEC SX-3/44, with senior company architects. This review was made together with Dr. K. Neves of Boeing Computer Services, who was provided ONR invitational orders. Results are reported in Reference 2.

BENCHMARK PERFORMANCE OF SINGLE-CPU JAPANESE COMPUTERS

Japanese supercomputer companies have concentrated on single-processor computers with the determined goal of producing the fastest single-CPU computer. Aside from reducing cycle time, large peak speeds were produced using many parallel arithmetic pipes: 12 pipes for the 1.7-GFLOPS Fujitsu VP-400E, 8 pipes for the 1.3-GFLOPS NEC SX-2A, and 12 pipes for the 3-GFLOPS Hitachi S-820/80. Here the peak speed is the number of floating point operations (FLOP), such as an add or multiply, that an ideal computer can perform per second assuming one clock cycle per FLOP. Peak speed in GFLOPS (billions of floating point operations per second) is thus obtained by dividing the number of independent CPU add and multiply pipes by the clock cycle time in nanoseconds (ns).

For computers with hard-chained add/multiply pipes, such as the VP-400E and S-820/80, each add or multiply segment of the chained pipe was considered as independent in the determination of the brochure peak speed.
When the operations are predominantly dyadic (simple add or multiply) as in the benchmark code, only one of the segments in the chained pipes can be in operation at a given time, so that it is more meaningful to count the chained pipe combination as one pipe instead of two. Accordingly a more realistic peak speed for the VP-400E for the benchmark problem is 1.14 GFLOPS rather than the advertised 1.7 GFLOPS. Similar considerations would apply to the Hitachi S-820/80, resulting in a relevant peak speed of 2 GFLOPS rather than 3. NEC computers do not have such chained combination pipes, so that the applicable peak speed would not change from the brochure value.

Actual speeds measured in the benchmark were 0.395 GFLOPS for the Fujitsu VP-400E, 0.602 GFLOPS for the Hitachi S-820/80, 0.414 GFLOPS for the NEC SX-2A, and 1.50 GFLOPS for the Cray YMP/832. Actual/peak speed ratios corresponded to approximately 30 percent for the single-CPU Japanese computers and 56 percent for the 8-CPU Cray YMP/832. With such a highly idealized definition of peak speed, a performance of 56 percent is outstanding but 30 percent is poor.

Causes for the low performance of the single-CPU Japanese computers are clear. The most significant cause in the present benchmark was the high processor startup (pipe fill-in) overhead. In an arithmetic processor, a floating point (FLOP) operation is not achieved until the pipe (assembly line) has been filled. This startup time is an overhead, and it will reduce the computing speed unless it can be amortized over a sufficient vector length (production run). To characterize the pipeline performance for a given arithmetic operation (as a computing kernel), a number n1/2 is defined as the vector length required to achieve half the peak speed. Neves (Ref 2) suggested that a vector length three times n1/2 is necessary for an acceptable pipe startup overhead.
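
The actual/peak ratios and the role of n1/2 described above can be checked with a short calculation. The following is a minimal sketch, not part of the benchmark: it takes the measured speeds and the realistic (dyadic) peak speeds quoted in this section, and it assumes a simple linear pipeline timing model, often attributed to Hockney, in which the rate achieved at vector length n is r_peak * n / (n + n1/2). That model, the function names, and the sample value of n1/2 are assumptions introduced here for illustration.

```python
# Illustrative check of the figures quoted in the text (speeds in GFLOPS).
# Each entry is (measured speed, realistic peak speed for dyadic operations).
MACHINES = {
    "Fujitsu VP-400E": (0.395, 1.14),
    "Hitachi S-820/80": (0.602, 2.0),
    "NEC SX-2A": (0.414, 1.3),
    "Cray YMP/832": (1.50, 2.67),
}


def actual_to_peak(actual, peak):
    """Ratio of measured speed to peak speed."""
    return actual / peak


def pipeline_efficiency(n, n_half):
    """Fraction of peak speed reached at vector length n.

    Assumed model (not from the report): r(n) = r_peak * n / (n + n_half),
    which gives exactly half of peak when n == n_half.
    """
    return n / (n + n_half)


if __name__ == "__main__":
    for name, (actual, peak) in MACHINES.items():
        print(f"{name:18s} actual/peak = {actual_to_peak(actual, peak):.2f}")

    n_half = 50  # hypothetical value, chosen only for illustration
    for multiple in (1, 3, 10):
        eff = pipeline_efficiency(multiple * n_half, n_half)
        print(f"n = {multiple:2d} x n1/2 -> {eff:.0%} of peak")
```

Under this assumed model a vector length of three times n1/2 reaches 75 percent of peak, which is in line with the rule of thumb attributed to Neves; shorter vectors lose a rapidly growing share of peak speed to pipe startup.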
In a multi-piped processor, increasing the number of pipes increases the peak computing speed, but more importantly it increases n1/2 since the original vector length must be partitioned over many pipes. (This is akin to splitting the production run in the case of many parallel assembly lines.) An increase of n1/2 leads to increased pipe fill-in overhead in the multi-piped computers and hence to a low actual/peak speed ratio.

There is a further overhead (memory latency) arising in the process of fetching numbers from memory for deposit in the vector registers. Memory latency, for example, depends on the type of memory chips used, how skillfully irregular retrievals such as gather/scatter are carried out, and whether bank or line conflicts in memory are avoided. The Fujitsu VP-400E and the NEC SX-2A use slow, but inexpensive, MOS chips and have accordingly large memory access times (55 ns and 40 ns, respectively). The Hitachi S-820/80 employs fast, but expensive, bipolar chips and thus has a shorter 20 ns access time. Unquestionably, large memory latency in the VP-400E and SX-2A contributed to their reduced performance in the benchmark. It will be seen later that in the new Japanese supercomputers, the Fujitsu VP-2600 and NEC SX-3, memory latency is avoided (after the first fetch) by overlapped fetch/store instructions.

Finally, numbers must be supplied from memory to the CPU via the vector registers at a sufficient rate to prevent stoppage of the CPUs; that is, memory/register bandwidth must match the realizable speed of the CPUs, for example, by providing a sufficient number of fetch and store paths. Minimally, two fetch paths and one store path should be in place for dyadic operations. Memory/register bandwidths for the Japanese computers are as follows: 0.70 GW/s for the Fujitsu VP-400E, 2 GW/s for the Hitachi S-820/80, and 1.38 GW/s for the NEC SX-2A. Here GW/s is billion words per second.
More meaningful is the number of words that can be transmitted from memory to CPU for each floating point operation (FLOP) at half peak speed, that is, W/FLOP*. Thus one has 1.2 W/FLOP* for the Fujitsu VP-400E, 2 W/FLOP* for the Hitachi S-820/80, and 2 W/FLOP* for the NEC SX-2A. For dyadic operations memory/register bandwidth is clearly inadequate for the Fujitsu VP-400E but adequate for the Hitachi S-820/80 and NEC SX-2A.

In summary, the low performance (low actual/peak speed ratio) of the single-CPU VP-400E, S-820/80, and SX-2A in the benchmark problem was due to large processor pipe startup overhead, large memory latency (VP-400E and SX-2A), and inadequate memory/register bandwidth (VP-400E). The autovectorizers for all three computers were highly effective, attaining a vectorization ratio of 99 percent.

MULTI-PROCESSOR PERFORMANCE

Multi-processor computers have a clear advantage over single-processor computers for codes having large granularity, that is, independent portions of the code containing significant numbers of operations. In these cases, independent parts of the code can be calculated concurrently on different CPUs, thereby reducing the elapsed computational time. To be effective, the autoparallelizer must not only determine which parts of the code are independent but must skillfully schedule these parts on different CPUs to avoid CPU idleness. Because of the overhead only the large granularity outer loops are parallelized. At this level it is usually a straightforward task for the programmer to implement directives to supplement the autoparallelizer if the latter fails its tasks. Inner loops, where the granularity is small, are vectorized.

In practice, most computing centers with multi-processor computing systems strongly discourage use of all CPUs by a single user to maintain high throughput, that is, many jobs per day.
This is usually accomplished by charging disproportionately large occupancy charges for multiple-CPU usage and assigning a very low priority that usually leads to unacceptable turnaround. The practical outcome is that users today cannot count on the benefits of concurrent computations with multiple CPUs.

CONTRASTING PERFORMANCE OF THE MULTI-PROCESSOR CRAY YMP/832

The approach of Cray Research to supercomputer design has been different from that of the Japanese companies. Peak speeds have been achieved by multiple CPUs with few pipes in each CPU. The peak speed of each CPU of the YMP is 0.33 GFLOPS, with a total peak speed of 2.67 GFLOPS for the 8 CPUs. Each CPU processor has one add and one multiply pipe so that the multi-pipe dilution effect on vector length is absent. For each CPU there are two memory/register fetch paths and one store path, resulting in 3 Word/FLOP at half peak speed. With bipolar chips, memory access time is 30 ns. Unquestionably the Cray YMP/832 has a well-balanced architecture.

Performance of the YMP/832 in the benchmark was outstanding. It attained an actual/peak speed ratio of 0.56 based on the 6.41-ns cycle time of the benchmark YMP. CPU pipe overhead and memory latency were insignificant in the YMP for the vector lengths of the benchmark code. With two fetch paths and one fetch/store path connecting memory and register for each CPU, there was adequate bandwidth.

Since widely differing algorithms are available for the Navier/Stokes problem, the degree of parallelization possible in a given algorithm is of interest to the computational fluid dynamicist. For the benchmarked Fujii/Obayashi Navier/Stokes code, elapsed time with 1 CPU was reduced by a factor of 7.22 with 8 CPUs, that is, a multiple-CPU efficiency of 90 percent. The difference of 7.22 from 8 is then a measure of the nonparallel content of the code and imperfections of the parallelization including its overhead.
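
The multiple-CPU figures above follow from simple arithmetic, sketched below. The efficiency is just the measured speedup divided by the number of CPUs. As an added assumption that the report itself does not make, the sketch also inverts the ideal form of Amdahl's law, S = 1 / ((1 - f) + f/N), to estimate the fraction f of single-CPU run time that would have to run perfectly in parallel to give the measured speedup. This lumps all parallelization overhead into the serial fraction, so the result is only a rough indicator. The function names are hypothetical.

```python
def parallel_efficiency(speedup, n_cpus):
    """Measured speedup as a fraction of the ideal speedup n_cpus."""
    return speedup / n_cpus


def amdahl_speedup(parallel_fraction, n_cpus):
    """Ideal Amdahl's-law speedup for a given parallel fraction of run time."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cpus)


def implied_parallel_fraction(speedup, n_cpus):
    """Invert Amdahl's law: the parallel fraction that yields this speedup.

    Assumption: overhead and idle time are treated as if they were serial
    work, so this is an effective fraction, not a property of the code.
    """
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_cpus)


if __name__ == "__main__":
    speedup, n_cpus = 7.22, 8  # values quoted in the text
    eff = parallel_efficiency(speedup, n_cpus)
    frac = implied_parallel_fraction(speedup, n_cpus)
    print(f"efficiency on {n_cpus} CPUs: {eff:.1%}")
    print(f"effective parallel fraction: {frac:.2%}")

    # Same effective fraction on a hypothetical 16-CPU machine.
    s16 = amdahl_speedup(frac, 16)
    print(f"projected speedup on 16 CPUs: {s16:.1f} "
          f"({parallel_efficiency(s16, 16):.0%} efficiency)")
```

Even with an effective parallel fraction above 98 percent, the projected efficiency falls as CPUs are added, which illustrates why the small remaining nonparallel content matters more on machines with many processors.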
Of more importance to the computer manufacturer is the percentage of the "parallelizable" part of the code (Amdahl number) that was parallelized. In the present benchmark the Autotasker (Cray's automatic parallelizer) with the help of four directives achieved a parallelization ratio of 98.8 percent of the Amdahl number. This remarkable performance of the Autotasker is the result of many years of multitasking experience starting with the XMP series.

SUPERCOMPUTER COMPETITION IN THE NEAR TERM

Figure 1 shows existing and planned supercomputers. Near-term competition will primarily involve the Fujitsu VP-2600, NEC SX-3/44, and Cray 3. Here it is anticipated that the VP-2600 will form the basis for an expected multiple-processor computer from Fujitsu. In the following, some of the features of these computers are given together with their expected performance on the Navier/Stokes benchmark.

Features of the Fujitsu VP-2600 and the NEC SX-3/44

Two new Japanese computers, the single-CPU Fujitsu VP-2600 and the 4-CPU NEC SX-3/44, will be available during the summer and fall of 1990, respectively, and will be benchmarked under the same guidelines as the previous benchmark. The benchmark on the VP-2600 is scheduled for the spring of 1990, with the single-CPU measurement for the SX-3 to follow several months later. The 4-CPU SX-3/44 benchmark is scheduled for the fall of 1990. Slippages in the above benchmark dates may be expected due to the inevitable tuning required for the new compiler and higher priority benchmarking for potential customers. The above Navier/Stokes benchmark results and their analysis will be given in a future ONRFE Scientific Information Bulletin article.

System schematics for the VP-2600 and the SX-3/44 computers are shown in Figure 2. In the VP-2600 the CPU is composed of four floating point units each with two chained multiply/add pipes. Counting the total number of pipes as 16, one obtains the brochure peak speed of 4 GFLOPS with the 4-ns clock.
Since the benchmark code consists predominantly of dyadic operations, the number of pipes in the VP-2600 is effectively eight, half the advertised number. This, then, results in a halving of the peak speed to 2 GFLOPS.

For the SX-3/44, there are 4 sets of floating point units per CPU, each unit containing 2 add and 2 multiply pipes for a total of 16 independent pipes per CPU. With a clock of 2.9 ns, a peak speed of 5.5 GFLOPS per CPU results for a total of 22 GFLOPS with 4 CPUs. Operationally the SX-3/44 is basically two SX-2s tied in parallel.

Expected Performance of the VP-2600 and SX-3/44

The actual/peak speed ratio of the VP-2600, assuming a reduced peak speed of 2 GFLOPS, should improve over that of the VP-400E. Pipe fill-in overhead should be similar, since the effective number of pipes is unchanged, but memory/register bandwidth with two vector fetch paths should suffice, and the memory latency should be largely eliminated by the overlapped fetch/store instructions. An actual speed of about 1 GFLOPS might be anticipated for the VP-2600 in the benchmark.

For the SX-3/44 with 16 processor pipes per CPU, greatly worsened pipe overhead must be expected, resulting in a doubling of the n1/2s of the SX-2A. Memory/register bandwidth of the SX-3/44 should sustain the reduced actual speeds despite the reduced cycle time of 2.9 ns, half that of the SX-2A. Overlapped fetch/store instructions will greatly reduce memory latency. The NEC autoparallelizer cannot be expected to perform as effectively as the Cray Autotasker at this early stage, but with straightforward compiler directives a reasonable parallelization efficiency can be expected. Assuming the increased pipe startup overhead to be countered by the reduced memory latency, the NEC SX-3/44 should perform at the still phenomenal speed of about 7 GFLOPS.

Performance of the Cray 3

The Cray 3 is a 16-CPU multi-processor with a peak speed of 16 GFLOPS. Its 2-ns cycle time is achieved with gallium arsenide semiconductors.
With each CPU having one add and one multiply pipe, the pipe fill-in overhead for the benchmark problem should be small and comparable to the YMP. With the balanced architecture of the Cray 3, it is not unreasonable to expect an actual/peak speed ratio comparable to that of the YMP for each CPU. Increasing the number of CPUs to 16 will, however, decrease the granularity and hence increase the parallelization overhead. Thus a speed somewhat under 8 GFLOPS is not unreasonable for the benchmark. The actual speed of the Cray 3 will therefore be in the neighborhood of that of the NEC SX-3/44 for the benchmark.

CONCLUSIONS

There is strong contrast in the architectural approach in the supercomputer design between Japanese companies and Cray. The emphasis in Japan has been on designing a single-CPU computer with the highest peak speed, sometimes at the expense of a balanced architecture. High speeds were achieved using many parallel processor pipes. A large number of pipes decreases the effective vector length, leading to reduced computing speeds through increased pipe fill-in overhead. Adequate memory/register bandwidth can be obtained by providing a sufficient number of fetch paths, but this has been difficult for the Japanese supercomputers, which have extremely high single-CPU peak speeds.

Cray supercomputers, in strong contrast, have achieved high speeds by combining many CPUs with each CPU having few pipes in a well-balanced architecture. In the Cray 3 each of the 16 processors has only one add and one multiply pipe, producing 1 GFLOPS speed in each CPU using gallium arsenide semiconductors with 2-ns cycle time. Use of many CPUs leads to increased parallelization overhead by the reduction of granularity.

It is clear that the ADI Navier/Stokes code of the benchmark is unsuited for the NEC SX-3/44 because the 16 pipes in each CPU dilute the already marginal vector length.
It is also unsuited for the 16-CPU Cray 3 because of the reduction of granularity by the large number of CPUs. In applications requiring high peak speeds of these computers, other Navier/Stokes codes with greater vector length or larger granularity must be considered.

In the present benchmark problem, with a mesh of approximately 102 points in each coordinate direction, there will be a factor of 102 increase in the vector length if the coordinate sweeps are carried out explicitly rather than implicitly. More dramatically, if an explicit difference method, for example one using a high-order Runge-Kutta scheme, is used in place of the ADI algorithm, there will be a further 102-fold increase in vector length. There is, however, a penalty accompanying these increases of vector length; namely, the rate of convergence of the iterative procedure is correspondingly decreased. Lowered pipe fill-in overhead must then be weighed against the increased number of iterations needed for convergence.

Similarly, increased granularity is achieved by using a Jacobi column relaxation instead of the benchmark ADI procedure. Here subsequent relaxations can be initiated on successive CPUs as soon as needed updated inputs have been calculated in the prior relaxation. Again such a relaxation process will have a lowered convergence rate as compared to the ADI procedure.

A benchmark, not comparing computers but measuring the performance of different Navier/Stokes algorithms on a representative computer such as the SX-3/44 or Cray 3, would therefore be of significant interest.

Finally, one is reminded that the present computer assessment was based solely on the speed performance on a specific, but a mainline, CFD algorithm, namely, an ADI Navier/Stokes code. Relative performance of the benchmarked supercomputers will undoubtedly change for other programs including other Navier/Stokes algorithms.

In the future, U.S. dominance in supercomputers is in jeopardy.
Future Cray supercomputers must share performance leadership with Japanese supercomputers such as the NEC SX-3/44. The creative genius of U.S. computer architects such as Seymour Cray will have difficulty overcoming the corporate might of Fujitsu Ltd., Hitachi Ltd., and NEC Corporation.

REFERENCES

1. K. Fujii and H. Yoshihara, "A Navier/Stokes benchmark for Japanese and U.S. supercomputers," Scientific Information Bulletin 14(2), 69-74 (1989).

2. K. Neves, "Supercomputers: The next generation," Scientific Information Bulletin 14(4), 77-95 (1989).

------------END OF REPORT---------------------------------------------