Summary of basic ideas to be conveyed with the HP-MP work.

Note on past experience:
------------------------
The write-up summarizes our experiences with a message passing system for
iWarp. All observations are due to implementation experience.

Q: How do we make the point that our architecture is well backed up
   by experience with a previous architecture where we did most of
   the work ourselves?

Environment: 
------------ 
The target system is an advanced private memory distributed computer
based on nodes with all the characteristic properties of a high
performance computer, including high speed network access, a memory
hierarchy and virtual memory.

The system architecture is optimized for the execution of Parallel
Fortran code heavily influenced by the CMU Fortran FX.

A few design principles:
------------------------
Parallelizing compilers emit parallel code with communication
statements that move data out of a distributed array data structure in
user space into another distributed array data structure at the
destination processors.  In our view these message transfers
include all the gathering of data at the sender and all the storing
and scattering of data at the receiver.
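
A minimal sketch of what "gathering at the sender" and "scattering at
the receiver" mean for one transfer. The function names and the index
lists are hypothetical and only illustrate the idea; in the real
system the compiler would generate the index patterns and the work
would not be done in interpreted code.

```python
def gather(local_array, indices):
    """Sender side: collect the selected elements into one contiguous message."""
    return [local_array[i] for i in indices]


def scatter(local_array, indices, message):
    """Receiver side: store each message element at its destination index."""
    if len(indices) != len(message):
        raise ValueError("index list and message length differ")
    for i, value in zip(indices, message):
        local_array[i] = value


# Hypothetical example: a strided section of the sender's part of the
# array is stored at arbitrary positions in the receiver's part.
src = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
dst = [0.0] * 4

msg = gather(src, [0, 2, 4])      # elements 0, 2 and 4 form the message
scatter(dst, [3, 1, 0], msg)      # receiver places them at 3, 1 and 0
```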

The data transfers should be carried out in blocks that are as large
as the parallel program allows. There is a maximal number of elements
to be communicated between any two processors in every communication
phase. In the regular, optimized case, data should never be copied and
should be passed through the protocol layers by reference.
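
A small sketch of passing a block through protocol layers by
reference. It uses Python's memoryview as a stand-in for a pointer
and length pair; the two layer functions and the block size are
assumptions made for illustration, not part of the design.

```python
from array import array


def user_layer(buf, start, count):
    """Select a block of the user's array without copying it."""
    return memoryview(buf)[start:start + count]


def transport_layer(view, max_block):
    """Split the block into pieces of at most max_block elements.

    Each piece is again a view onto the same storage, not a copy.
    """
    return [view[i:i + max_block] for i in range(0, len(view), max_block)]


data = array("d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
block = user_layer(data, 1, 4)        # refers to data[1:5]
pieces = transport_layer(block, 3)    # two pieces, 3 and 1 elements long

# The pieces alias the user's storage: a write to the array after the
# layers have run is still visible through them.
data[1] = 99.0
```

Because every layer only narrows the view, the largest possible block
reaches the network interface without an intermediate buffer.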

For the class of routers used in high performance parallel computers,
immediate extraction of the data is essential to performance. These
systems do have very fast links and low latency, but they do not
provide much local buffering in each node and have no per-connection
end-to-end flow control or throttling. Therefore blocked, waiting
messages reduce the throughput significantly.

Message passing allows arbitrary connectivity and is applied when
global communication schedules cannot be easily derived. So message
passing requires that each cell participates as a sender and as a
receiver equally well, simultaneously or at least with no latency for
switching between the two activities. Even on an architecture with very
favorable latency parameters, maintaining the proper flow of control
proved to be very hard.  Servicing the interaction of both the
sender and the receiver with the network interface simultaneously is a
challenge.

Proposal:
---------
The architectural proposal includes:
- long messages that travel from user space to user space. A security 
  concept is established for handling of exceptions and task swapping.
- a receiver end with direct deposit based on address-data pairs or
  address-data blocks
- separate computing resources for sender and receiver. In the interest
  of flexibility and exception message handling there will be two
  processors and a two-headed network interface.
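
A rough sketch of the two direct deposit forms on the receiver side,
with a plain list standing in for the receiver's user-space memory.
The function names and the bounds check are assumptions made for
illustration; the proposal itself does not fix a message format.

```python
def check_range(memory, base, count):
    """Reject deposits that would fall outside the receiver's memory."""
    if base < 0 or base + count > len(memory):
        raise IndexError("deposit outside receiver memory")


def deposit_pairs(memory, pairs):
    """Address-data pairs: every value carries its own destination address."""
    for address, value in pairs:
        check_range(memory, address, 1)
        memory[address] = value


def deposit_block(memory, base, values):
    """Address-data block: one base address followed by contiguous values."""
    check_range(memory, base, len(values))
    memory[base:base + len(values)] = values


memory = [0] * 8
deposit_pairs(memory, [(5, 7), (1, 3)])   # scattered single elements
deposit_block(memory, 2, [8, 9])          # contiguous run starting at 2
```

Pairs suit irregular scatter patterns, while blocks carry less address
overhead per element for contiguous data.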

Possible concepts for design verification:
------------------------------------------
Q: How can we verify such a design?
   - taking iWarp processors in pairs?
   - writing simulators?
   - building hardware?
   - trying to make a few feasibility implementations on the Paragon?

