
%%%%%%%%		TITLE PAGE		%%%%%%%%

\title{
Heterogeneous Distributed \\
Environmental Modeling
}

\author{
\makebox[3in]{Edward Segall}\\
\makebox[3in]{Peter Steenkiste}\makebox[3in]{Armistead Russell}\\
\makebox[3in]{Bernd Bruegge}\\
\makebox[3in]{Erik Riedel}\\
\makebox[3in]{School of Computer Science}\makebox[3in]{Department of Mechanical Engineering}\\
\makebox[3in]{Carnegie Mellon University}\makebox[3in]{Carnegie Mellon University}\\
\makebox[3in]{Pittsburgh, PA 15213}\makebox[3in]{Pittsburgh, PA 15213}\\
\\
Contact: Edward Segall - edward.segall@cs.cmu.edu - (412) 268 1590
}
\maketitle
\unmarkedfootnote{
This research was supported in part by the 
National Science Foundation under Contract xxx
and in part by the Defense Advanced Research Projects
Agency (DOD) monitored by DARPA/CMO under Contract
MDA972-90-C-0035.

The authors acknowledge the use of computational
resources provided by the Pittsburgh Supercomputing Center (PSC).
}

%%%%%%%%		ABSTRACT		%%%%%%%%

\begin{abstract}
\normalsize
Many large applications today make use of parallel and distributed
computer systems to obtain the cycles they need.  Such systems can
have very different architectures, ranging from shared-memory systems
to loosely-coupled distributed-memory systems such as workstation
clusters.  Scientists would like to use whatever platform is available
that provides sufficient performance, regardless of its architecture,
so long as moving to it does not require a substantial incremental
programming effort. The preferred platform at any given time will
depend on convenience, availability, and application-specific factors
such as data sizes and the expected length of the desired run.  This
raises a significant portability issue, since users want to maintain a
single copy of their application so that maintenance remains
manageable.  In this paper we examine the portability issue using an
environmental modeling application as an example.  We analyze this
large-scale, distributed application with regard to three requirements
for portability: architecture-independent application code, good
performance on different architectures, and a uniform runtime
environment. Finally, we present performance numbers for the
application on different parallel and distributed systems.
\end{abstract}

\newpage

%%%%%%%%		PAPER BODY		%%%%%%%%

\section{Introduction}

Photochemical smog, containing ozone (a strong oxidant), acids, toxics
and aerosols (which are responsible for visibility degradation and
health effects), impacts virtually every major city. In spite of very
stringent controls, Los Angeles still experiences ozone levels over
the national standard of 0.12 ppm up to about 100 days per
year. Mexico City exceeds its standard up to about 300 days per
year. One reason for the lack of progress is deficiencies in our
understanding of the processes that can lead to moderately high levels
of ozone. To attack this problem, comprehensive, three-dimensional air
quality models that can follow pollutant dynamics have been developed
and applied to identify the best methods to improve air quality. These
models are used to predict how changing pollutant emissions through
emissions controls will affect future air quality. As such, they are
central to air quality management.

Computationally, there are a variety of interesting aspects to
comprehensive air quality models. In essence, air quality models solve
a system of 35 to 100 time-dependent, coupled, stiff, highly
non-linear PDEs, following pollutant evolution over a number of days
and thousands of kilometers. As such, they are computationally
demanding. In addition, their structure indicates that they are highly
parallelizable, and that portions should also vectorize well. This
indicates that a distributed computational environment is well suited
for model execution. On the other hand, all-to-all data dependences on
critical paths lead to high burst communication rates, which in turn
lead to bottlenecks in some parallel and distributed
environments. In addition, the input data, which may be many gigabytes
in size, may be distributed among various sites. Finally, the models
are generally complex, requiring a good deal of operator
expertise. This inhibits widespread use which, in turn, presents an
obstacle to applying the scientifically most sound approaches to
environmental management. We aim to develop a system that will allow
the models, their inputs, and their results to be managed and analyzed
more easily by a broader community, including scientists, government
agencies and industry. In an effort that is part of NSF's Grand
Challenge program, these issues are being attacked in an
interdisciplinary fashion. The product of this research is an approach
that will facilitate bringing the best science to bear in air quality
management.


Air quality models are based on numerical solution of the atmospheric
advection-diffusion-reaction equation (ADRE), which describes the
dynamics of compounds undergoing turbulent transport, diffusion and
chemical reaction:

$$\frac{\partial c_i}{\partial t}~+~\nabla \cdot ({\bf{u}}c_i)~=~\nabla
\cdot ({\bf{K}} \nabla c_i)~+~f_i~+~S_i \eqno(1)$$ Here, $c_i$ is the
concentration of the {\it i}th pollutant among {\it p} species, i.e.,
{\it i} = 1, ..., p, {\bf{u}} describes the velocity field, {\bf{K}}
is the diffusivity tensor, $f_i(c_1, ..., c_p)$ is the chemical
reaction term, and $S_i$ is the emission rate of species $i$. The
major components of the model that have been parallelized are
described here. A detailed description of the model can be found
in \cite{kumar:euler:geophysical}.

The resulting set of PDEs is coupled through the chemical reaction
term. In general, the species lifetimes span a wide range, from
milliseconds to months. This makes the set of equations not only
highly non-linear, but also stiff. Traditionally, the set of PDEs is
transformed into ODEs using operator splitting:

The solution is advanced in time as $$c^{n+1}~=~L_{xy}~(\Delta
t/2)~L_{cz}~(\Delta t)~L_{xy}~(\Delta t/2)~c^{n} \eqno(2)$$ $L_{xy}$
is the two dimensional horizontal transport operator and $L_{cz}$ is
the chemistry and vertical transport operator.  Since the
diffusion-dominated vertical transport has timescales very similar to
the chemistry, and since the solution of a diffusive process
involves an exponential structure similar to chemical decay, chemistry
and vertical transport are combined in a single operator, $L_{cz}$.
The Streamline Upwind Petrov-Galerkin (SUPG) finite element method is
used for the solution of horizontal transport
\cite{odman:multiscale:atmospheric}.  For chemistry and the vertical
transport equations, the hybrid scheme of
\cite{young:equations:reactive} for stiff systems of ordinary
differential equations is used.

One reason for splitting the operators is that specialized algorithms
can be used to treat each process. Ultimately, the resulting set of
ODEs may include millions of equations to be solved for time periods
of days. On a single processor machine, this can take days or weeks
depending on the application. However, the structure of the models
makes them well suited to parallelization and, of particular interest,
to parallel execution in distributed environments.
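The operator-split time stepping of Equation (2) can be sketched as
follows. This is an illustrative Python sketch only, not the URM code:
a one-dimensional upwind advection step stands in for the SUPG
horizontal transport operator, and a simple exponential decay stands
in for the combined chemistry/vertical transport operator; all rates
and grid sizes are hypothetical.

```python
import numpy as np

def l_xy(c, dt):
    """Stand-in for the horizontal transport operator L_xy:
    first-order upwind advection with constant velocity u = 1."""
    u, dx = 1.0, 1.0
    return c - u * dt / dx * (c - np.roll(c, 1))

def l_cz(c, dt):
    """Stand-in for the combined chemistry/vertical transport
    operator L_cz: first-order chemical decay, solved exactly."""
    k = 0.1  # hypothetical decay rate
    return c * np.exp(-k * dt)

def strang_step(c, dt):
    """One operator-split step, Eq. (2):
    c^{n+1} = L_xy(dt/2) L_cz(dt) L_xy(dt/2) c^n."""
    c = l_xy(c, dt / 2)
    c = l_cz(c, dt)
    c = l_xy(c, dt / 2)
    return c

c = np.ones(100)        # toy initial concentration field
for _ in range(10):     # advance ten split time steps
    c = strang_step(c, 0.5)
```

The symmetric ($\Delta t/2$, $\Delta t$, $\Delta t/2$) arrangement is
what keeps the split scheme second-order accurate in time, which is
one reason this particular sequencing of the operators is used.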

As an example, the air quality model used in the project, called the
Urban-to-Regional Multiscale (URM) Model, is being applied to develop
control strategies for improving air quality in the Northeastern
United States. The modeling domain is about 1200 km (East-West) by
1200 km (North-South) by 3 km (vertically). Horizontal grid resolution
ranges from 4.2 km to 72 km. Vertical grid resolution ranges from 30 m
to 1500 m. Approximately 10,000 computational nodes are used, and 50
species are being followed. On a single-processor DEC Alpha
workstation, a simulation of this size may take 5 days. Given that one
may need to test hundreds of strategies, roughly in succession (to use
knowledge gained from previous tests), the computational delay to do
so is burdensome and often prohibitive. In approaching this problem, a
variety of distributed environments were tested and utilized to
address a specific issue: whether regulations should specify a
maximum emissions level over the whole domain, or whether emissions
controls in specific locations would be more effective. As is usual
when dealing with such problems, the answers were needed in a matter
of days, thus exemplifying one need for high performance
computing. Another is that these models are central to computational
experimentation, so slow turnaround, or reducing the comprehensiveness
of the model, can lead to inferior results. The performance needed to
achieve these goals has been obtained by creating a distributed,
heterogeneous version of the URM model.

\section{Heterogeneous Computing}

Environmental modeling is a computationally expensive process, and
similar to many grand challenge and national challenge applications,
these models can make ready use of parallel and distributed computer
systems to obtain the necessary cycles. A wide variety of such systems
are available, including shared-memory systems (e.g. the Cray C90),
tightly-coupled distributed-memory systems (the Intel
Paragon\cite{paragon:xps} and the Cray T3D \cite{t3d:overview}), and
loosely-coupled distributed-memory systems (e.g. workstation
clusters).

An interesting option is heterogeneous computing: an application is
distributed over a heterogeneous collection of machines connected by a
high-speed local-area or wide-area network. Heterogeneous computing
not only combines the resources of multiple computer systems, but also
has the potential of performing the computation very efficiently by
mapping each component (task) of the application onto the architecture
that is most appropriate for that task. The choice of architecture can
be driven by the nature of the computation (scalar versus vectorizable
versus parallel code) or by specific features of the system such as large
memory size.

An issue that is at the core of both distributed and heterogeneous
computing is that of {\em portability}. Users want to execute the
application on a variety of different platforms. The preferred
platform at any given time will depend on convenience, availability,
and application-specific factors such as input sizes and running time
estimates. This issue is even more critical for heterogeneous
computing applications, since the set of platforms used for a specific
execution can change dramatically from user to user. While portability
of sequential programs is relatively well understood, developing
portable distributed applications is much harder.  Thus, distributed
applications often must be substantially reworked when moved between
platforms in order to achieve correct execution, acceptable
performance, or both.

In this project, we investigate how applications can be distributed
over heterogeneous sets of systems connected by high-speed
networks. We are using the Urban and Regional Multiscale (URM) Airshed
model\cite{odman:multiscale:atmospheric}\cite{kumar:euler:geophysical}
to drive the research. We approach the problem from two different
angles. We first distributed the model by hand using PVM (Parallel
Virtual Machine), and showed that exploiting a combination of task and data
parallelism can create a parallel application that runs efficiently on
a variety of parallel
architectures\cite{pvm:dmcc6}\cite{pvm:user}. Parallelization by hand
provides implementation flexibility and exposes the problems
associated with heterogeneous computing to user control. A second part
of our work is to develop tools that automate some of the tasks
associated with heterogeneous computing. This research is done in
cooperation with the Fx parallelizing compiler group at CMU.


\section{Parallelism in URM}

The URM model is a three dimensional, time-dependent Eulerian
mathematical model that models the transport, deposition and chemical
evolution of various pollutants in the atmosphere. The program uses a
three-dimensional array to model the atmosphere. This array represents
a three-dimensional multiscale grid of chemical
concentrations. Typical dimensions are 5-20 horizontal layers,
5000-10000 grid cells inside each layer, and 50-100 chemical species.

The Airshed program spends most of its time in two main computational
phases: a horizontal transport phase, and a combined chemistry and
vertical transport phase. Both phases can be solved using data
parallelism, although they distribute the concentration array along
different dimensions (see Figure phases.gif) - the horizontal and
vertical dimensions, respectively. Since the two phases access the
data structure in different directions, the two phases in the
distributed implementation are separated by a global transpose.
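The two data distributions, and the redistribution between them, can
be illustrated with a small sketch. This is a toy NumPy illustration
(array sizes and names are hypothetical); in the distributed code the
redistribution is an all-to-all message exchange among processors, not
a local operation.

```python
import numpy as np

# Toy concentration array: (columns, layers, species).
ncols, nlayers, nspecies = 8, 4, 6
conc = np.zeros((ncols, nlayers, nspecies))

nprocs = 4

# Horizontal transport phase: each processor owns a block of
# layers and sees every horizontal cell within its layers.
horiz_blocks = np.array_split(conc, nprocs, axis=1)

# Chemistry/vertical transport phase: each processor owns a block
# of columns and sees the full vertical extent of its columns.
chem_blocks = np.array_split(conc, nprocs, axis=0)

# Moving from horiz_blocks to chem_blocks requires every processor
# to exchange a piece of its data with every other processor --
# the global transpose on the critical path.
```

Because every processor must send a sub-block to every other
processor, this transpose is the all-to-all dependence responsible for
the burst communication rates mentioned in the introduction.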

Data parallelism in the Horiz and ChemVert routines is the main source
of parallelism. However, for a typical execution, a non-trivial amount
of time (10\%) is spent in the rest of the computation, and these tasks
become a bottleneck once the easy data parallelism has been
exploited. The types of parallelism available in those tasks fall into
two classes: parallelization of secondary tasks such as communication
and I/O, and pipelining of several data preprocessing tasks with the
main computation. These optimizations applied in isolation would
result in negligible speedup because each optimization applies to only
a relatively small fraction of the sequential computation. However,
once the main computation has been parallelized, their combined effect
can be significant.
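The effect described above is Amdahl's law in action. A quick
illustrative calculation (assuming the roughly 10\% unparallelized
fraction mentioned above; the numbers are not measurements from URM)
shows why the secondary tasks matter once the main phases scale:

```python
def amdahl_speedup(serial_fraction, nprocs):
    """Ideal speedup when a fraction of the work stays serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nprocs)

# With 10% of the time in unparallelized tasks, ten processors
# give at most a 5.26x speedup:
print(round(amdahl_speedup(0.10, 10), 2))   # 5.26
# and even arbitrarily many processors cannot exceed 1/0.10 = 10x.
```

This is why parallelizing I/O, communication, and preprocessing, each
individually negligible, becomes worthwhile once the data-parallel
phases no longer dominate the run time.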

The wide range of parallelism present in URM is by no means
unique. Many other applications (especially other physical simulations
based on multiple time scales or geometries) exhibit a wide range of
types of parallelism.

\section{Performance Using PVM}

After parallelization, the URM model was executed on a number of systems,
including the Pittsburgh Supercomputing Center (PSC) Cray C90, the
PSC DEC Alpha Supercluster, the Cray C90-T3D heterogeneous system, and
a group of Alpha workstations distributed in the Carnegie Mellon
Computer Science Department.

The first distributed Airshed implementation parallelized only the
Horiz and ChemVert data-parallel components. We observed respectable
but lower than desired speedups on a number of different platforms,
e.g. 6.5 on a 10-node DEC Alpha cluster. A careful analysis of
execution traces confirms our earlier observation: We get almost
linear speedup in the data-parallel computational phases, but the
lack of task parallelism reduces the overall application speedup. In
other words, in this application, focusing on the inner loops is not
sufficient. Note that the mixture of types of parallelism raises the
difficult problem of how tasks should be mapped onto the distributed
system. In general, the optimal mapping will depend on the
characteristics of both the system and the application.

The hand parallelization effort exposed a number of challenges
associated with writing distributed applications using explicit
message passing. Although the same message passing library (PVM) was
used on each system, moving the application across the different
systems involved significant effort. Besides the obvious problems
(e.g. incompatible versions of PVM on different systems),
differences in the compilation environment, execution environment and
file system have forced the development of a system called ARO (A
Restructuring Organizer) that supports compilation and execution on
these different environments in a consistent way. This environment is
designed to support different performance optimizations on each system
as well, via conditional compilation. 

\section{Fx}

Our experience with the hand-parallelized URM clearly shows the need
for tools that support heterogeneous distributed computing. These
tools should not only hide the compilation and execution details of
each system, as ARO does, but should also simplify the development of
the distributed program by raising the level of abstraction
(i.e. above message passing) and should automate performance
optimization, which typically requires a detailed understanding of the
system.

We are working with the Fx parallelizing compiler group at CMU to
achieve this goal. Fx is a FORTRAN dialect that combines data
parallelism as it is found in HPF with task
parallelism\cite{FORTRAN:TASK}. Fx is quite expressive - using it, we
were able to express all the task and data parallelism present in
Airshed. Our current activities focus on restructuring Airshed and
tuning the Fx compiler so that we can realize this potential with the
actual Airshed program.
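The combination of the two kinds of parallelism can be sketched
schematically. This is a toy Python sketch, not Fx syntax: each chunk
of data is processed independently (data parallelism), and each chunk
flows through a preprocessing task chained with the main phase, so the
stages of different chunks overlap in time (task parallelism). All
names and numbers are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(chunk):
    """Secondary task: prepare one chunk of input data."""
    return [x * 2.0 for x in chunk]

def chemvert(chunk):
    """Main data-parallel phase: operate on one prepared chunk."""
    return sum(chunk)

def run_chunk(chunk):
    """Both stages for one chunk; different chunks run concurrently."""
    return chemvert(preprocess(chunk))

chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(run_chunk, c) for c in chunks]
    total = sum(f.result() for f in futures)
print(total)  # 42.0
```

The point of a compiler like Fx is that this mapping of stages and
chunks onto processors is expressed declaratively in one source
program, rather than hand-coded per platform as with message passing.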

At this point we have successfully distributed the ChemVert phase of
the computation across a number of platforms using Fx (see Figure
graph.gif; note that the T3D numbers in this figure are based on PVM,
not Fx). We observe that Fx is able to achieve almost linear speedup,
similar to the speedups observed using PVM. In contrast to the PVM
approach, Fx allows us to achieve this performance using a single
source code, i.e. the code is very portable. The next step is to
exploit the other types of parallelism across both homogeneous and
heterogeneous systems.

\section{Software Engineering Issues}

While their high computational requirements are one reason advanced
environmental models are not more widely used, a second, often more
severe, bottleneck is their ``unfriendliness''. Generally, the more
scientifically advanced the model, the harder it is to use. This is a
significant problem: a recent National Academy report underlined that
one reason greater progress has not been made in improving air quality
is that the best science has not been used to address the problem.

To address this issue, the Geographic and Environmental Modeling System
(GEMS) has been developed by our group. The most important goal of
GEMS from the end user's perspective is ease of use. The intent is to
provide an environmental spreadsheet for the exploration of ``What
if...?'' scenarios by a scientist or regulator. To do this, a system
must provide a consistent interface to a set of physically distributed
and heterogeneous computing resources that cooperate to produce a
given analysis, and must also facilitate the necessary communication
among these resources.

GEMS is intended not only to help the regulatory end users, but also
help the model developers incorporate new science more rapidly into
their models.  Our long-term goal is to provide a general
application-specific framework for developing scientific applications
structured like the URM model, while our short-term goal is to develop
a prototype system specifically to solve the air quality modeling
problem.  Several enabling technologies in user interfaces, process
control, database management, and visualization are now available
within GEMS to provide a usable, extensible and flexible framework for
environmental modeling. In order to make the best use of these
technologies, we need a continuous dialog between developers and users.

\section{Use of GEMS}

There are several distinct types of data that the GEMS system works
with:

\begin{description}

\item[Map data] Extracted from the U.S. Census Department TIGER files.
This data provides a vector map down to street level, broken down by
state, county, civil district, census tract, and census block.

\item[Imagery data] Provided by the LANDSAT satellite system. This
data provides a raster representation of a picture of the earth's
surface taken from space at 30 meter resolution.

\item[Elevation data] Provided by the LANDSAT satellite system. This
data provides elevation figures at the same 30 meter resolution as the
imagery data. This allows the display of physical features such as
mountain ranges and provides a representation of the topography of the
area being viewed.

\item[Population data] Provided by the U.S. Census Department PL94-171
and STF1A data sets. This data provides population figures and other
statistics as collected by the Census Department. Data in these sets
are registered to the divisions provided by the TIGER data, again down
to the detail of census blocks.

\item[Gridded model data] Provided by the U.S. EPA, the National
Weather Service, and local environmental protection agencies. This
data provides information on the meteorology and initial pollutant
concentrations which serve as the inputs for the air quality model.
The output from the air quality model consists of pollutant
concentrations and fluxes gridded in the same manner.

\item[Emission source data] Provided by the U.S. EPA and local
environmental agencies. This data provides information on the sources
of pollutants in a region under study. This information, together with
the meteorological conditions, constitutes the major input parameters
to the air quality model.

\end{description}

In order to perform a model calculation, the user determines the
appropriate parameters for a run, including the choice of which code
is to be executed and the machine on which the code should run. When
the user initiates a remote execution, a process is started on the
remote machine and the parameters chosen are passed to the remote
program.
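A remote run of this kind can be sketched as follows. The script, host
name, and parameter names are hypothetical, and ssh stands in here for
the GEMS process-control mechanism, which is not shown.

```python
import subprocess

# Hypothetical run parameters chosen through the GEMS interface.
params = {
    "model": "urm",           # which code to execute (hypothetical name)
    "host": "compute1",       # machine chosen for the run (hypothetical)
    "scenario": "northeast",  # input scenario (hypothetical)
}

# Start the model process on the remote machine and pass the chosen
# parameters on its command line.
cmd = ["ssh", params["host"], params["model"],
       "--scenario", params["scenario"]]
# subprocess.run(cmd, check=True)  # not executed in this sketch
```

The essential point is that the user's choices in the interface become
the arguments of a process launched on whichever machine was selected,
so the same interface can drive runs on any of the available platforms.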

Each execution produces a considerable amount of data. In order
for this data to provide useful insight into the workings of the
atmosphere it must be visualized in a form that users can readily
interpret. We have made use of an existing visualization system -
PV-Wave from Precision Numerics, Inc.\cite{PVWAVE} - to produce a
visual representation of the model outputs.

Figure XXX shows the analysis that can be performed using the GEMS
modeling system. The picture depicts the GEMS user interface as it
would be seen by a scientist or regulator. The view is of the Los
Angeles basin and data is based on measurements made during the SCAQS
period in 1987 with calculations based on the CIT Airshed Model. Data
from U.S. Census Department TIGER maps forms the geographic base for
the main map. Overlaid on the geographic data is a surface shading
depicting the ozone concentrations over the region at 2:00 PM on
August 27, 1987. The color scale in the upper left shows values going
from very low (blue) to very high (red) shading. The Toolbox at the
left of the picture is the point of user interaction with the system;
the selection of date, output format, and species can be seen in the
Chemical Lab tool depicted there. Other available tools control
details of the map and choice of scenarios, the display of population
data, and the display and editing of emissions sources. The bottom
left shows a plot of the average ozone concentration in the entire
modeling region as it varies over a 24 hour period. The interface code
is written in C++ and uses the X/Motif user interface toolkit. The
system currently runs on Digital DECstation, DEC Alpha AXP, and Sun
SPARCstation platforms. 

\section{Frameworks}

The two key aspects of GEMS that are crucial in achieving our goals
are: An underlying object-oriented system model, and a set of
protocols for the interaction between the GEMS components that allow
new components to be easily incorporated into the framework.

The GEMS system is decomposed into five software subsystems: User
Interface, Execution, Data Management, Visualization, and
Monitoring. One goal of the GEMS system is to connect these five
subsystems together in a single coherent object-oriented framework

Frameworks provide a higher level of reuse than abstract classes and
allow considerable leverage in the rapid development of software
systems. The eventual goal of GEMS is to provide an
application-specific framework to support software development in the
environmental modeling community.  The GEMS framework provides a
plug-and-play architecture that allows a developer to rapidly build
software systems to support environmental modeling.

One of the main problems for such an architecture is defining the
protocols for the interaction between the components of one subsystem
and components from one of the other subsystems. This task is
especially difficult if the individual components were not developed
with this specific interaction in mind. These issues are described in
more detail in \cite{GEMS:ECOOP94}.

\section{User Centered Analysis}

In the traditional model of software development, the end user and
domain expert interact with the development team only at the beginning
and end of the life cycle.  In the development of the GEMS system, we
have chosen to take the more realistic view that this type of
interaction, where the user can come up with a well-defined set of
requirements which the development team can transform into an
acceptable system, is the exception rather than the rule. There is
significant evidence that limited interaction with users leads to the
undesirable situation, present in so many software projects, in which
70-80 percent of a project's time and resources are spent in the
maintenance phase. The user may feel perfectly satisfied with the
requirements set out in the problem statement, but when the system is
finally ready for delivery its functionality rarely matches the
client's expectations.

We felt that an organizational approach based on mutual negotiation
between the developers and users throughout the entire process is
required in order to produce high-quality
software\cite{GEMS:JCSE}. Close interaction between users and
developers is beneficial to both sides. The software developer needs
to gain a significant understanding of the problem domain under
consideration in order to be able to provide the best solution to the
problem. It is also very useful for clients to gain some insight into
the field of available software tools and technologies, since this
gives them a better idea of how a new software system will interact
with existing capabilities and allows them to make better decisions
about the requirements for such a system. Such an approach requires
users who are willing to put a good deal of their time into working
with the developers and giving them feedback on the progress of the
development. Although it may not always be easy to have software
developers interact with their users, we strongly believe that it is
well worth the extra effort on both sides: it can avoid many of the
problems that arise in the more traditional approaches and will
produce a higher quality software product as a result. We describe
this approach in more detail in \cite{GEMS:JCSE}.


