To: Distribution
From: David Kahaner ONRFE [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Article by H. Yoshihara on Japanese Supercomputer Performance on
    specific computational fluid dynamics benchmark program and 
    comparison with Cray.
1 May 1990

     Hideo Yoshihara served as a liaison scientist for the Office of Naval
Research Far East from April 1988 until May 1990.  His assignment was to
follow the progress of advanced supercomputers and to review and assess the
viscous flow simulation research in the Far East.  Dr. Yoshihara formerly was
with the Boeing Company, where he was Engineering Manager for Applied
Computational Aerodynamics.  He was also an affiliate professor in the
Department of Aeronautics and Astronautics of the University of Washington, an
AIAA Fellow, and a former member of the Fluid Dynamics Panel of AGARD/NATO.

        The article below is one portion of a longer report prepared by
Yoshihara, and he has given me permission to circulate it. The complete 
article will appear in the Office of Naval Research Scientific 
Information Bulletin, Vol. 15, No. 3.


        PERFORMANCE OF JAPANESE SUPERCOMPUTERS VIS-A-VIS CRAY COMPUTERS

                                Hideo Yoshihara

     This article is part of a final assessment report for the Office of
     Naval Research Far East titled "Supercomputers and Computational
     Fluid Dynamics in Japan."


EXPANDED SUMMARY

1.   The performance of the single-CPU Fujitsu VP-400E, Hitachi S-820/80, and
NEC SX-2A and the 8-CPU Cray YMP/832 was measured on an alternating direction
implicit (ADI) Navier/Stokes code on a mesh approximately 102 x 102 x 102.

2.   Actual speeds attained were 0.395 GFLOPS for the VP-400E, 0.602 GFLOPS
for the S-820/80, 0.414 GFLOPS for the SX-2A, and 1.5 GFLOPS for the YMP/832.
This corresponded to actual/peak speed ratios of about 30 percent for the
single-CPU Japanese computers and 56 percent for the 8-CPU Cray YMP.  Because
of the idealized definition of peak speed, a performance of 56 percent is
outstanding but 30 percent is poor.

3.   Reasons for the poor performance of Japanese computers were as follows:

o    High startup (pipe fill-in) overhead in the arithmetic processor pipes
     due to reduction of vector length (production run) by its spread over
     many pipes.

o    Large memory latency in the VP-400E and SX-2A primarily due to use of
     slow MOS chips.  (Delay in retrieving numbers from memory for bussing to
     CPU.)

o    Inadequate memory/register bandwidth for the VP-400E.

4.   The performance of the autovectorizer of Japanese computers was
outstanding, vectorizing 99 percent of the benchmark code without directives.

5.   Reasons for outstanding performance of the Cray YMP/832 were as follows:

o    Well-balanced architecture.  Very low startup overhead in the processor
     pipes (no dilution of vector length).  Low memory latency and adequate
     memory/register bandwidth.

o    Outstanding performance of the Autotasker, which identified independent
     parts of the code, assigning their calculation to different CPUs with
     minimum idle time.

6.   The benchmark on the recently offered Fujitsu VP-2600 (1 CPU, 4 GFLOPS)
and NEC SX-3/44 (4 CPU, 22 GFLOPS) is scheduled for April and September 1990,
respectively.  Much-awaited results will be reported in the fall issue of the
ONRFE Scientific Information Bulletin.  Expected results of the benchmark are
as follows:

o    Fujitsu VP-2600: Pipe startup overhead should be similar to the VP-400E,
     but the number of independent paths between memory and register has been
     doubled, and memory latency has been essentially eliminated by
     overlapping fetch/store instructions.  An actual speed of approximately
     1 GFLOPS can be expected with an actual/peak speed ratio of 50 percent,
     taking the peak speed as a more realistic 2 GFLOPS rather than the
     advertised 4.

o    NEC SX-3/44:  With 16 pipes per CPU, pipe startup overhead will be
     significantly increased.  Memory/register bandwidth should be adequate
     for the expected reduced actual speeds with four vector fetch and two
     vector store paths.  Memory latency has been mostly eliminated by
     overlapped fetch/store instructions.  Assuming increased pipe startup
     overhead to be counterbalanced by reduction of memory latency, an actual
     speed of about 7 GFLOPS can be expected, still an incredible speed.  A
     parallelization efficiency on the order of that of the Cray Autotasker
     is additionally assumed here, although obtained with heavier assistance
     from compiler directives.

7.   In future supercomputers, much needed increased peak speeds must be
obtained by increasing the number of processor pipes per CPU and increasing
the number of CPUs.  With the ADI Navier/Stokes algorithm, an increased number
of pipes/CPU will decrease effective vector length, thereby increasing pipe
startup overhead.  Similarly, increasing the number of CPUs will decrease
granularity and thus increase parallelization overhead.  Both trends will
decrease computer speed and hence deteriorate performance in ADI algorithms.

     Other Navier/Stokes algorithms must be explored such as explicit methods
with greatly increased vector length, or column relaxation methods with
increased code granularity.  Such benefits are accompanied by corresponding
decreases of convergence rate so that a balancing of the opposing factors will
be necessary.
     A benchmark, not comparing different computers for a given algorithm but
comparing different Navier/Stokes algorithms on a given single- or multiple-
CPU computer, is thus recommended.
     Finally, remember that the present computer assessment is based solely on
the speed performance on a specific, albeit a mainline, CFD algorithm. 
Relative performance will undoubtedly change for other programs including
other Navier/Stokes algorithms.

INTRODUCTION

     Japanese supercomputers were reviewed and assessed through two
activities.  First was a benchmark on a Navier/Stokes code, carried out
together with Professor K. Fujii of the Institute of Space and Astronautical
Sciences (Sagamihara) (Ref 1), that would exercise key components of the
supercomputers in a real computational fluid dynamic (CFD) environment. 
Computers benchmarked included all operational Japanese supercomputers and the
8-CPU Cray YMP/832.  Computer company participation was enthusiastic.  The
benchmark provided an opportunity to meet and know key working level
supervisors and senior programmers.  Second was a review of the architecture
of the new supercomputers, the 1-CPU Fujitsu VP-2600 and the 4-CPU NEC
SX-3/44, with senior company architects.  This review was made together with
Dr. K. Neves of Boeing Computer Services who was provided ONR invitational
orders.  Results are reported in Reference 2.

BENCHMARK PERFORMANCE OF SINGLE-CPU JAPANESE COMPUTERS

     Japanese supercomputer companies have concentrated on single-processor
computers with the determined goal of producing the fastest single-CPU
computer.  Aside from reducing cycle time, large peak speeds were produced
using many parallel arithmetic pipes: 12 pipes for the 1.7-GFLOPS Fujitsu
VP-400E, 8 pipes for the 1.3-GFLOPS NEC SX-2A, and 12 pipes for the 3-GFLOPS
Hitachi S-820/80.  Here the peak speed is the number of floating point
operations (FLOP), such as an add or multiply, that an ideal computer can
perform per second assuming one clock cycle per FLOP.  Peak speed in GFLOPS
(billions of floating point operations per second) is thus obtained by
dividing the number of independent CPU add and multiply pipes by the clock
cycle time in nanoseconds (ns).
     For computers with hard-chained add/multiply pipes such as the VP-400E and
S-820/80, each add or multiply segment of the chained pipe was considered as
independent in the determination of the brochure peak speed.  When the
operations are predominantly dyadic (simple add or multiply) as in the
benchmark code, only one of the segments in the chained pipes can be in
operation at a given time, so that it is more meaningful to count the chained
pipe combination as one pipe instead of two.  Accordingly a more realistic
peak speed for the VP-400E for the benchmark problem is 1.14 GFLOPS rather
than the advertised 1.7 GFLOPS.  Similar considerations would apply to the
Hitachi S-820/80, resulting in a relevant peak speed of 2 GFLOPS rather than
3.  NEC computers do not have such chained combination pipes, so that the
applicable peak speed would not change from the brochure value.
     Actual speeds measured in the benchmark were 0.395 GFLOPS for the Fujitsu
VP-400E, 0.602 GFLOPS for the Hitachi S-820/80, 0.414 GFLOPS for the NEC
SX-2A, and 1.50 GFLOPS for the Cray YMP/832.  Actual/peak speed ratios
corresponded to approximately 30 percent for the single-CPU Japanese computers
and 56 percent for the 8-CPU Cray YMP/832.  With such a highly idealized
definition of peak speed, a performance of 56 percent is outstanding but
30 percent is poor.
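
     The quoted ratios can be checked with a short calculation.  The sketch
below is only illustrative: it uses the measured speeds given above and, for
the VP-400E and S-820/80, the reduced (dyadic) peak speeds of 1.14 and
2 GFLOPS rather than the brochure values.  The dictionary and function names
are hypothetical.

```python
# Measured and peak speeds in GFLOPS, as quoted in the text.
# For the VP-400E and S-820/80 the reduced "dyadic" peak is used.
MEASURED = {"VP-400E": 0.395, "S-820/80": 0.602, "SX-2A": 0.414, "YMP/832": 1.50}
PEAK = {"VP-400E": 1.14, "S-820/80": 2.0, "SX-2A": 1.3, "YMP/832": 2.67}


def actual_to_peak(machine):
    """Return the actual/peak speed ratio for one machine."""
    return MEASURED[machine] / PEAK[machine]


RATIOS = {machine: actual_to_peak(machine) for machine in MEASURED}

if __name__ == "__main__":
    for machine, ratio in RATIOS.items():
        print(f"{machine:9s} {100 * ratio:5.1f} percent of peak")
```

     With these inputs the three Japanese machines fall between roughly 30
and 35 percent, and the YMP/832 at about 56 percent.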
     Causes for the low performance of the single-CPU Japanese computers are
clear.  The most significant cause in the present benchmark was the high
processor startup (pipe fill-in) overhead.  In an arithmetic processor, a
floating point operation (FLOP) is not completed until the pipe (assembly line)
has been filled.  This startup time is an overhead, and it will reduce the
computing speed unless it can be amortized over a sufficient vector length
(production run).  To characterize the pipeline performance for a given
arithmetic operation (as a computing kernel), a number n1/2 is defined as the
vector length required to achieve half the peak speed.  Neves (Ref 2)
suggested that a vector length three times n1/2 is necessary for an acceptable
pipe startup overhead.  In a multi-piped processor, increasing the number of
pipes increases the peak computing speed, but more importantly it increases
n1/2 since the original vector length must be partitioned over many pipes. 
(This is akin to splitting the production run in the case of many parallel
assembly lines.)  An increase of n1/2 leads to increased pipe fill-in overhead
in the multi-piped computers and hence to a low actual/peak speed ratio.
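
     The effect of n1/2 can be illustrated with a simple timing model.  The
sketch below assumes, as an idealization not taken from the benchmark, that
the time for a vector of length n is proportional to (n + n1/2), so that
n = n1/2 gives exactly half the peak speed.  The function names, the argument
`pipes` (meaning the number of pipes sharing one vector), and the sample value
of n1/2 are hypothetical.

```python
def pipe_efficiency(n, n_half):
    """Fraction of peak speed reached on a vector of length n.

    Assumed model: time is proportional to (n + n_half), so n = n_half
    gives exactly half the peak speed.
    """
    return n / (n + n_half)


def efficiency_with_pipes(n, n_half_single, pipes):
    """Same model when the vector is split evenly over `pipes` pipes.

    Each pipe sees a vector of length n / pipes, which is equivalent to
    multiplying n_half by the number of pipes.
    """
    return pipe_efficiency(n / pipes, n_half_single)


if __name__ == "__main__":
    n_half_single = 10  # hypothetical n1/2 of a single pipe
    for pipes in (1, 2, 4, 8, 16):
        eff = efficiency_with_pipes(102, n_half_single, pipes)
        print(f"{pipes:2d} pipes: {100 * eff:5.1f} percent of peak")
```

     Under this model a vector length of three times n1/2 corresponds to
75 percent of peak speed, and each doubling of the number of pipes doubles
the effective n1/2 seen by a fixed vector length.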
     There is a further overhead (memory latency) arising in the process of
fetching numbers from memory for deposit in the vector registers.  Memory
latency, for example, depends on the type of memory chips used, how skillfully
irregular retrievals such as gather/scatter are carried out, and whether bank
or line conflicts in memory are avoided.  The Fujitsu VP-400E and the NEC SX-2A
use slow, but inexpensive, MOS chips and have accordingly large memory access
times (55 ns and 40 ns, respectively).  The Hitachi S-820/80 employs fast, but
expensive, bipolar chips and thus has a shorter 20 ns access time. 
Unquestionably large memory latency in the VP-400E and SX-2A contributed to
their reduced performance in the benchmark.  It will be seen later that in the
new Japanese supercomputers, the Fujitsu VP-2600 and NEC SX-3, memory latency
is avoided (after the first fetch) by overlapped fetch/store instructions.
     Finally, numbers must be supplied from memory to the CPU via the vector
registers at a sufficient rate to prevent stoppage of the CPUs; that is,
memory/register bandwidth must match the realizable speed of the CPUs, for
example, by providing a sufficient number of fetch and store paths. 
Minimally, two fetch paths and one store path, for example, should be in place
for dyadic operations.
     Memory/register bandwidths for the Japanese computers are as follows:
0.70 GW/s for the Fujitsu VP-400E, 2 GW/s for the Hitachi S-820/80, and
1.38 GW/s for the NEC SX-2A.  Here GW/s is billion words per second.  More
meaningful is the number of words that can be transmitted from memory to CPU
for each floating point operation (FLOP) at half peak speed, that is, W/FLOP*. 
Thus one has 1.2 W/FLOP* for the Fujitsu VP-400E, 2 W/FLOP* for the Hitachi
S-820/80, and 2 W/FLOP* for the NEC SX-2A.  For dyadic operations
memory/register bandwidth is clearly inadequate for the Fujitsu VP-400E but
adequate for the Hitachi S-820/80 and NEC SX-2A.
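
     The W/FLOP* figures follow from dividing the memory/register bandwidth
by half the peak speed.  The sketch below reproduces them, assuming that the
reduced (dyadic) peak speeds of 1.14 and 2 GFLOPS are the ones intended for
the VP-400E and S-820/80; the function name is hypothetical.

```python
def words_per_flop_at_half_peak(bandwidth_gw_s, peak_gflops):
    """Words deliverable per FLOP when the CPU runs at half its peak speed."""
    return bandwidth_gw_s / (0.5 * peak_gflops)


# (bandwidth in GW/s, peak speed in GFLOPS), as quoted in the text.
MACHINES = {
    "VP-400E": (0.70, 1.14),
    "S-820/80": (2.0, 2.0),
    "SX-2A": (1.38, 1.3),
}

W_PER_FLOP = {
    name: words_per_flop_at_half_peak(bw, peak)
    for name, (bw, peak) in MACHINES.items()
}

if __name__ == "__main__":
    for name, value in W_PER_FLOP.items():
        print(f"{name:9s} {value:4.1f} W/FLOP*")
```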
     In summary, the low performance (low actual/peak speed ratio) of the
single-CPU VP-400E, S-820/80, and SX-2A in the benchmark problem was due to
large processor pipe startup overhead, large memory latency (VP-400E and
SX-2A), and inadequate memory/register bandwidth (VP-400E).  The
autovectorizers for all three computers were highly effective, attaining a
vectorization ratio of 99 percent.

MULTI-PROCESSOR PERFORMANCE

     Multi-processor computers have a clear advantage over single-processor
computers for codes having large granularity, that is, independent portions of
the code containing significant numbers of operations.  In these cases,
independent parts of the code can be calculated concurrently on different
CPUs, thereby reducing the elapsed computational time.  To be effective, the
autoparallelizer must not only determine which parts of the code are
independent but must skillfully schedule these parts on different CPUs to
avoid CPU idleness.  Because of the overhead only the large granularity outer
loops are parallelized.  At this level it is usually a straightforward task
for the programmer to implement directives to supplement the autoparallelizer
if the latter fails its tasks.  Inner loops, where the granularity is small,
are vectorized.
     In practice, most computing centers with multi-processor computing
systems strongly discourage use of all CPUs by a single user to maintain high
throughput, that is, many jobs per day.  This is usually accomplished by
charging disproportionately large occupancy charges for multiple-CPU usage and
assigning a very low priority that usually leads to unacceptable turnaround. 
The practical outcome is that users today cannot count on the benefits of
concurrent computations with multiple CPUs.

CONTRASTING PERFORMANCE OF THE MULTI-PROCESSOR CRAY YMP/832

     The approach of Cray Research to supercomputer design has been different
from that of the Japanese companies.  Peak speeds have been achieved by
multiple CPUs with few pipes in each CPU.  The peak speed of each CPU of the
YMP is 0.33 GFLOPS, with a total peak speed of 2.67 GFLOPS for the 8 CPUs. 
Each CPU processor has one add and one multiply pipe so that the multi-pipe
dilution effect on vector length is absent.  For each CPU there are two
memory/register fetch paths and one store path, resulting in 3 Word/FLOP at
half peak speed.  With bipolar chips, memory access time is 30 ns. 
Unquestionably the Cray YMP/832 has a well-balanced architecture.
     Performance of the YMP/832 in the benchmark was outstanding.  It attained
an actual/peak speed ratio of 0.56 based on the 6.41-ns cycle time of the
benchmark YMP.  CPU pipe overhead and memory latency were insignificant in the
YMP for the vector lengths of the benchmark code.  With two fetch paths and
one fetch/store path connecting memory and register for each CPU, there was
adequate bandwidth.
     Since widely differing algorithms are available for the Navier/Stokes
problem, the degree of parallelization possible in a given algorithm is of
interest to the computational fluid dynamicist.  For the benchmarked
Fujii/Obayashi Navier/Stokes code, elapsed time with 1 CPU was reduced by a
factor 7.22 with 8 CPUs, that is, a multiple-CPU efficiency of 90 percent. 
The difference of 7.22 from 8 is then a measure of the nonparallel content of
the code and imperfections of the parallelization including its overhead.  Of
more importance to the computer manufacturer is the percentage of the
"parallelizable" part of the code (Amdahl number) that was parallelized.  In
the present benchmark the Autotasker (Cray's automatic parallelizer) with the
help of four directives achieved a parallelization ratio of 98.8 percent of
the Amdahl number.  This remarkable performance of the Autotasker is the
result of many years of multitasking experience starting with the XMP series.
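
     The 90 percent efficiency, and the parallel fraction that such a
speedup implies under the textbook form of Amdahl's law, can be sketched as
follows.  The function names are hypothetical.  The implied fraction lumps
all parallelization overhead into the serial part, so it is not the same
quantity as the 98.8 percent figure quoted above, which is measured relative
to the parallelizable part of the code.

```python
def multi_cpu_efficiency(speedup, cpus):
    """Speedup divided by the number of CPUs."""
    return speedup / cpus


def amdahl_speedup(parallel_fraction, cpus):
    """Ideal speedup for a code whose given fraction runs in parallel."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cpus)


def implied_parallel_fraction(speedup, cpus):
    """Invert Amdahl's law: the parallel fraction that yields this speedup."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cpus)


if __name__ == "__main__":
    speedup, cpus = 7.22, 8  # values quoted in the text
    print(f"efficiency: {100 * multi_cpu_efficiency(speedup, cpus):.1f} percent")
    fraction = implied_parallel_fraction(speedup, cpus)
    print(f"implied parallel fraction: {100 * fraction:.1f} percent")
```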

SUPERCOMPUTER COMPETITION IN THE NEAR TERM

     Figure 1 shows existing and planned supercomputers.  Near-term
competition will primarily involve the Fujitsu VP-2600, NEC SX-3/44, and
Cray 3.  Here it is anticipated that the VP-2600 will form the basis for an
expected multiple-processor computer from Fujitsu.  In the following, some of
the features of these computers are given together with their expected
performance on the Navier/Stokes benchmark.

Features of the Fujitsu VP-2600 and the NEC SX-3/44

     Two new Japanese computers, the single-CPU Fujitsu VP-2600 and the 4-CPU
NEC SX-3/44, will be available during the summer and fall of 1990,
respectively, and will be benchmarked under the same guidelines as the
previous benchmark.  The benchmark on the VP-2600 is scheduled for the spring
of 1990, with the single-CPU measurement for the SX-3 to follow several months
later.  The 4-CPU SX-3/44 benchmark is scheduled for the fall of 1990. 
Slippages in the above benchmark dates may be expected due to the inevitable
tuning required for the new compiler and higher priority benchmarking for
potential customers.  The above Navier/Stokes benchmark results and their
analysis will be given in a future ONRFE Scientific Information Bulletin
article.
     System schematics for the VP-2600 and the SX-3/44 computers are shown in
Figure 2.  In the VP-2600 the CPU is composed of four floating point units
each with two chained multiply/add pipes.  Counting the total number of pipes
as 16, one obtains the brochure peak speed of 4 GFLOPS with the 4-ns clock. 
Since the benchmark code consists predominantly of dyadic operation, the
number of pipes in the VP-2600 is effectively eight, half the advertised
number.  This, then, results in a halving of the peak speed to 2 GFLOPS.  For
the SX-3/44, there are 4 sets of floating point units per CPU, each unit
containing 2 add and 2 multiply pipes for a total of 16 independent pipes per
CPU.  With a clock of 2.9 ns, a peak speed of 5.5 GFLOPS per CPU results for a
total of 22 GFLOPS with 4 CPUs.  Operationally the SX-3/44 is basically two
SX-2s tied in parallel.
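
     The peak speeds quoted here follow from the rule of dividing the number
of pipes by the clock cycle time in nanoseconds.  A minimal sketch, with a
hypothetical function name and the pipe counts and clock times taken from the
paragraph above:

```python
def peak_gflops(pipes, cycle_ns, cpus=1):
    """Peak speed in GFLOPS: pipes per CPU divided by the clock cycle in ns."""
    return cpus * pipes / cycle_ns


# VP-2600: chained segments counted separately (brochure) or as one (dyadic).
VP2600_BROCHURE = peak_gflops(pipes=16, cycle_ns=4.0)
VP2600_DYADIC = peak_gflops(pipes=8, cycle_ns=4.0)

# SX-3/44: 16 independent pipes per CPU, 2.9-ns clock, 4 CPUs.
SX3_PER_CPU = peak_gflops(pipes=16, cycle_ns=2.9)
SX3_44 = peak_gflops(pipes=16, cycle_ns=2.9, cpus=4)

if __name__ == "__main__":
    print(f"VP-2600 brochure peak: {VP2600_BROCHURE:.1f} GFLOPS")
    print(f"VP-2600 dyadic peak:   {VP2600_DYADIC:.1f} GFLOPS")
    print(f"SX-3 per CPU:          {SX3_PER_CPU:.1f} GFLOPS")
    print(f"SX-3/44 total:         {SX3_44:.1f} GFLOPS")
```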

Expected Performance of the VP-2600 and SX-3/44

     The actual/peak speed ratio of the VP-2600, assuming a reduced peak speed
of 2 GFLOPS, should improve over that of the VP-400E.  Pipe fill-in overhead
should be less, memory/register bandwidth with two vector fetch paths should
suffice, and the memory latency should be largely eliminated by the overlapped
fetch/store instructions.  An actual speed of about 1 GFLOPS might be
anticipated for the VP-2600 in the benchmark.
     For the SX-3/44 with 16 processor pipes per CPU, greatly worsened pipe
overhead must be expected, resulting in a doubling of the n1/2s of the SX-2A. 
Memory/register bandwidth of the SX-3/44 should sustain the reduced actual
speeds despite the reduced cycle time of 2.9 ns, half that of the SX-2A. 
Overlapped fetch/store instructions will greatly reduce memory latency.  The
NEC autoparallelizer cannot be expected to perform as effectively as the Cray
Autotasker at this early stage, but with straightforward compiler directives a
reasonable parallelization efficiency can be expected.  Assuming the increased
pipe startup overhead to be countered by the reduced memory latency, the NEC
SX-3/44 should perform at the still phenomenal speed of about 7 GFLOPS.

Performance of the Cray 3

     The Cray 3 is a 16-CPU multi-processor with a peak speed of 16 GFLOPS. 
Its 2-ns cycle time is achieved with gallium arsenide semiconductors.  With
each CPU having one add and one multiply pipe, the pipe fill-in overhead for
the benchmark problem should be small and comparable to the YMP.  With the
balanced architecture of the Cray 3, it is not unreasonable to expect an
actual/peak speed ratio comparable to that of the YMP for each CPU. 
Increasing the number of CPUs to 16 will, however, decrease the granularity
and hence increase the parallelization overhead.  Thus a speed somewhat under
8 GFLOPS is not unreasonable for the benchmark.  The actual speed of the
Cray 3 will therefore be in the neighborhood of that of the NEC SX-3/44 for
the benchmark.

CONCLUSIONS

     There is strong contrast in the architectural approach in the
supercomputer design between Japanese companies and Cray.  The emphasis in
Japan has been on designing a single-CPU computer with the highest peak speed,
sometimes at the expense of a balanced architecture.  High speeds were
achieved using many parallel processor pipes.  A large number of pipes
decreases the effective vector length, leading to reduced computing speeds
through increased pipe fill-in overhead.  Adequate memory/register bandwidth
can be obtained by providing a sufficient number of fetch paths, but this has
been difficult for the Japanese supercomputers, which have extremely high
single-CPU peak speeds.
     Cray supercomputers, in strong contrast, have achieved high speeds by
combining many CPUs with each CPU having few pipes in a well-balanced
architecture.  In the Cray 3 each of the 16 processors has only one add and
one multiply pipe, producing 1 GFLOPS speed in each CPU using gallium arsenide
semiconductors with 2-ns cycle time.  Use of many CPUs leads to increased
parallelization overhead by the reduction of granularity.
     It is clear that the ADI Navier/Stokes code of the benchmark is unsuited
for the NEC SX-3/44 because the 16 pipes in each CPU dilute the already
marginal vector length.  It is also unsuited for the 16-CPU Cray 3 because of
the reduction of granularity by the large number of CPUs.  In applications
requiring high peak speeds of these computers, other Navier/Stokes codes with
greater vector length or larger granularity must be considered.  In the
present benchmark problem with a mesh of approximately 102 points in each
coordinate direction, there will be a factor 102 increase in the vector length
if the coordinate sweeps are carried out explicitly rather than implicitly.
Or, more dramatically, if an explicit difference method, for example, using a
high-order Runge-Kutta scheme, is used in place of the ADI algorithm, there
will be a further 102-fold increase in vector length.  There is, however, a
penalty accompanying these increases of vector length; namely, the rate of
convergence of the iterative procedure is correspondingly decreased.  Lowered
pipe fill-in overhead must then be weighed against the increased number of
iterations needed for convergence.  Similarly, increased granularity is
achieved by using a Jacobi column relaxation instead of the benchmark ADI
procedure.  Here subsequent relaxations can be initiated on successive CPUs as
soon as needed updated inputs have been calculated in the prior relaxation. 
Again such a relaxation process will have a lowered convergence rate as
compared to the ADI procedure.
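
     The tradeoff between vector length and convergence rate can be made
concrete with a rough calculation.  The sketch below rests on several
assumptions that are not from the benchmark: a baseline ADI vector length
equal to the 102 mesh points, an idealized pipe model in which time is
proportional to (n + n1/2), a hypothetical n1/2 of 100, and equal arithmetic
per iteration for all schemes.  It estimates how many times more iterations a
longer-vector scheme could take before its gain in pipe efficiency is lost.

```python
MESH = 102      # points per coordinate direction, from the text
N_HALF = 100    # hypothetical n1/2 of a multi-piped CPU (assumption)


def pipe_efficiency(n, n_half):
    """Assumed model: fraction of peak speed reached at vector length n."""
    return n / (n + n_half)


# Assumed vector lengths: each step multiplies the length by the mesh size.
VECTOR_LENGTHS = {
    "ADI, implicit sweeps": MESH,
    "explicit sweeps": MESH ** 2,
    "fully explicit scheme": MESH ** 3,
}


def break_even_iterations(n_new, n_old, n_half=N_HALF):
    """Factor by which iterations may grow before longer vectors stop paying."""
    return pipe_efficiency(n_new, n_half) / pipe_efficiency(n_old, n_half)


if __name__ == "__main__":
    base = VECTOR_LENGTHS["ADI, implicit sweeps"]
    for name, length in VECTOR_LENGTHS.items():
        factor = break_even_iterations(length, base)
        print(f"{name:22s} n = {length:8d}  break-even factor = {factor:.2f}")
```

     In this model the gain from longer vectors saturates once n is large
compared with n1/2, so the allowable growth in iteration count is bounded by
the reciprocal of the baseline pipe efficiency.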
     A benchmark, not comparing computers but measuring the performance of
different Navier/Stokes algorithms on a representative computer such as the
SX-3/44 or Cray 3, would therefore be of significant interest.
     Finally, one is reminded that the present computer assessment was based
solely on the speed performance on a specific, but a mainline, CFD algorithm,
namely, an ADI Navier/Stokes code.  Relative performance of the benchmarked
supercomputers will undoubtedly change for other programs including other
Navier/Stokes algorithms.
     In the future, U.S. dominance in supercomputers is in jeopardy.  Future
Cray supercomputers must share performance leadership with Japanese
supercomputers such as the NEC SX-3/44.  The creative genius of U.S. computer
architects such as Seymour Cray will have difficulty overcoming the corporate
might of Fujitsu Ltd., Hitachi Ltd., and NEC Corporation.

References.

1. K. Fujii and H. Yoshihara, "A Navier/Stokes benchmark for Japanese and 
 U.S.  supercomputers," Scientific Information Bulletin 14(2), 69-74 
 (1989).  
2. K. Neves, "Supercomputers:  The next generation," Scientific 
 Information Bulletin 14(4), 77-95 (1989).  


------------END OF REPORT---------------------------------------------
