/*
 * PCN Abstract Machine Emulator
 * Authors:     Steve Tuecke and Ian Foster
 *              Argonne National Laboratory
 *
 * Please see the DISCLAIMER file in the top level directory of the
 * distribution regarding the provisions under which this software
 * is distributed.
 *
 * sr_doc.h  -  Documentation for all of the sr_*.c Send/Receive routines
 */

/*

Each SR (send/receive) module consists of two files:

	sr_*.c	- All of the code that implements initialization,
			sends, and receives.
	sr_*.h	- Any header information that is needed by other 
			parts of the emulator.

The functions that the SR module should implement for the emulator
are:

	_p_sr_get_argdesc()
	_p_sr_init_node()
	_p_sr_node_initialized()
	_p_destroy_nodes()
	_p_abort_nodes()
	_p_alloc_msg_buffer()
	_p_msg_send()
	_p_msg_receive()


This file contains general documentation describing the use of the SR
module as a whole, as well as descriptions of each procedure that the
SR module must implement.

Argument parsing
================

_p_sr_get_argdesc() is called immediately before command line
arguments are parsed.  It is passed argv and argc, in case something
is needed directly out of them -- for example, sr_bsdipc.c saves a
pointer to argv[0] (the program name) so that it can use it later.

It should fill in its argument argdescp with a pointer to an argument
description array that contains the arguments needed by this SR
module. And n_argdescp should be set to the number of arguments held
in this array.
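
A minimal sketch of this hand-off is given below.  The layout of
argdesc_t is not described in this file, so demo_argdesc_t, the
"-nodes" option, and every other demo_ name are assumptions made up
for illustration only.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-in for argdesc_t; the real layout is defined
// elsewhere in the emulator.
typedef struct {
    const char *flag;    // option name as typed on the command line
    int        *target;  // where the parsed integer value is stored
} demo_argdesc_t;

static int demo_n_nodes = 1;              // hypothetical SR option
static const char *demo_progname = NULL;  // saved argv[0]

static demo_argdesc_t demo_argdesc[] = {
    { "-nodes", &demo_n_nodes },
};

// Same shape as _p_sr_get_argdesc(): hand back the table and its length.
void demo_sr_get_argdesc(int argc, char **argv,
                         demo_argdesc_t **argdescp, int *n_argdescp)
{
    assert(argdescp != NULL && n_argdescp != NULL);
    if (argc > 0)
        demo_progname = argv[0];  // keep the program name for later
    *argdescp = demo_argdesc;
    *n_argdescp = (int)(sizeof(demo_argdesc) / sizeof(demo_argdesc[0]));
}
```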


Initialization
==============

The parallel emulator is started up by calling _p_sr_init_node(),
which should in turn call _p_init_node(), which will in turn call
_p_sr_node_initialized().

It is the responsibility of _p_sr_init_node() to make sure all the
nodes are created and their send/receive primitives are initialized.

The last thing _p_sr_init_node() should do on all nodes is call
_p_init_node(), which will take care of all of the general emulator
initialization.  When _p_sr_init_node() makes this call, all
send/receive operations should be fully functional.

At the end of initialization, _p_init_node() will call
_p_sr_node_initialized().  This is basically just a debugging hook,
though it can also be used to verify initialization.  It need not do
anything.  However, it is very useful when debugging a new SR module,
because it is called after all initialization, immediately before the
main emulator loop is entered.  It provides a good place to check out
initialization and test out the SR primitives.  It can also be used to
verify that all the nodes have actually initialized correctly, and if
not then it can shut things down.
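
The call order described above can be sketched with stubs that only
record when they run.  All demo_ names are hypothetical stand-ins for
the real routines, and the trace array exists only so that the order
can be checked.

```c
#include <assert.h>

enum { DEMO_SR_INIT = 1, DEMO_INIT = 2, DEMO_SR_DONE = 3 };

static int demo_trace[3];      // order in which the steps ran
static int demo_ntrace = 0;
static int demo_sr_ready = 0;  // set once send/receive is usable

static void demo_record(int step)
{
    assert(demo_ntrace < 3);
    demo_trace[demo_ntrace++] = step;
}

// Stands in for _p_sr_node_initialized(): a debugging hook.
static void demo_sr_node_initialized(void)
{
    demo_record(DEMO_SR_DONE);
}

// Stands in for _p_init_node(): general emulator initialization.
static void demo_init_node(void)
{
    assert(demo_sr_ready);       // SR primitives must already work
    demo_record(DEMO_INIT);
    demo_sr_node_initialized();  // last step before the main loop
}

// Stands in for _p_sr_init_node().
void demo_sr_init_node(void)
{
    demo_record(DEMO_SR_INIT);
    // ... create the nodes and set up the transport here ...
    demo_sr_ready = 1;
    demo_init_node();            // must be the last thing done
}
```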

There is one other function, sr_fatal_error(), that is used by
_p_sr_init_node(), but that should not be exported to the rest of the
emulator.  Once the emulator has been completely initialized,
_p_fatal_error() (in boot.c) should be used to kill the emulator in
the case of a fatal error.  But _p_fatal_error() should not be called
until all of the SR routines are initialized and functional.  If there
is an error in _p_init_node(), then _p_fatal_error() cannot be used.
Therefore, sr_fatal_error() should be used during SR initialization to
kill everything in the case of an initialization error.  It should try
to kill off all nodes by whatever method possible.

Global variables
================

_p_sr_init_node() is responsible for setting the following global
variables:

	_p_my_id
	_p_host_id
	_p_nodes
	_p_host
	_p_default_msg_buffer_size

Each node in a parallel emulator run is given a unique integer.  If
there are N nodes in the system, they must be numbered 0..N-1, where
the host is always node 0.

The first four variables listed above must be set to reflect the
parallel architecture:

_p_nodes :	The number of nodes (N) in the emulator on this run.
_p_my_id :	The node number (from 0..N-1) for my node.
_p_host_id :	The node number for the host (always 0).
_p_host :	A boolean variable that should be set to TRUE if 
		this is the host (_p_my_id == _p_host_id), otherwise
		it should be set to FALSE.

_p_default_msg_buffer_size : The default message buffer size (in
cells) for message buffers.  This size should not include any header
information that the SR code might tack onto the message.  Thus, if
4096 bytes is a good default message size, cells are 4 bytes each,
and 4 cells are needed for header information, then
_p_default_msg_buffer_size should be set to 1020 (4096/4 - 4).
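
The arithmetic above, written out as a small helper.  The function
name and its parameters are hypothetical; only the formula comes from
the text.

```c
#include <assert.h>

// Default payload size in cells, given a target message size in bytes,
// the size of one cell in bytes, and the SR header size in cells.
int demo_default_msg_buffer_size(int msg_bytes, int cell_bytes,
                                 int header_cells)
{
    assert(cell_bytes > 0);
    assert(msg_bytes / cell_bytes >= header_cells);
    return msg_bytes / cell_bytes - header_cells;
}
```

With the figures from the example, demo_default_msg_buffer_size(4096,
4, 4) gives 1020.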

So what is a good value for _p_default_msg_buffer_size?  That's a good
question -- and one that doesn't have a pat answer.  It is used when
the emulator does not know exactly what size buffer should be
allocated before it starts packing stuff into that buffer.

For example, if a tuple needs to be sent in the message, how much
space should be allocated?  Just enough to allow the first level of
the tuple to be copied?  Or do you allow additional space in case the
tuple contains other tuples (for example, it is a list), so that you
can pack more of the contents of the tuple into the message?

The emulator will always allocate enough space for the top level of
the tuple.  But, if it requires less than the
_p_default_msg_buffer_size to hold the top level, then it allocates a
space of size _p_default_msg_buffer_size, so that it can pack additional
levels of the tuple into the message, if those additional levels
exist.
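
In other words, the request is the larger of the two sizes.  A
one-function sketch of that rule (the function name is hypothetical):

```c
#include <assert.h>

// Cells to request for a message whose top level needs top_level_cells,
// when the default buffer size is default_cells.
int demo_msg_alloc_cells(int top_level_cells, int default_cells)
{
    assert(top_level_cells >= 0 && default_cells >= 0);
    return top_level_cells > default_cells ? top_level_cells
                                           : default_cells;
}
```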

Finally, one last factor in determining a value for this variable.  As
mentioned, the emulator does not know how much space it needs to
allocate for the message before packing the message into the buffer.
However, after the message is packed into the buffer, it knows exactly
how many cells from the buffer it actually used.  And it is this value
(the number of cells actually used) that is passed to the
_p_msg_send() routine.

Therefore, the _p_msg_send() routine need not send the entire allocated
buffer.  It only needs to send the part that is used.  So it is ok to
allocate considerably more space than you actually send.

So, in general, this value should probably be at least 100-200 cells.
That way, at least a few levels of a tuple (such as a list) can be
packed into a single message.  But if memory is available, and your
send/receive routines allow allocation of buffers that are larger than
what is actually sent, then the _p_default_msg_buffer_size should be
made considerably bigger.  What is "considerably bigger"?  At least
1000 cells, and perhaps even more.


Sending messages
================

Messages are sent using the code:

	_p_alloc_msg_buffer(...);
	<Fill in the message buffer>
	_p_msg_send(...);

_p_alloc_msg_buffer() allocates a message buffer of the appropriate
size.  _p_msg_send() sends the message in that message buffer to a node
and frees the message buffer.
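
The sequence can be sketched with a loopback "transport" that is just
an array.  The cell type, the two-cell header (size, then type), and
all demo_ names are assumptions; a real SR module would hand the cells
to its own transport instead of copying them into demo_wire.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef long demo_cell_t;      // assumed cell type
#define DEMO_HEADER_CELLS 2    // assumed header: size, then type
#define DEMO_WIRE_CELLS 64

static demo_cell_t demo_wire[DEMO_WIRE_CELLS];  // stands in for the wire
static int demo_wire_cells = 0;

// Like _p_alloc_msg_buffer(): room for 'size' payload cells, with the
// header hidden in front of the returned pointer.
demo_cell_t *demo_alloc_msg_buffer(int size)
{
    demo_cell_t *raw;
    assert(size >= 0);
    raw = (demo_cell_t *)malloc((size_t)(size + DEMO_HEADER_CELLS)
                                * sizeof *raw);
    assert(raw != NULL);  // a real module would call _p_malloc_error()
    return raw + DEMO_HEADER_CELLS;
}

// Like _p_msg_send(): sends only the 'size' cells in use, then frees.
void demo_msg_send(demo_cell_t *buf, int node, int size, int type)
{
    demo_cell_t empty[DEMO_HEADER_CELLS];  // header for a NULL buffer
    demo_cell_t *raw = (buf != NULL) ? buf - DEMO_HEADER_CELLS : empty;

    (void)node;  // only one loopback destination in this sketch
    assert(buf != NULL || size == 0);
    assert(size + DEMO_HEADER_CELLS <= DEMO_WIRE_CELLS);
    raw[0] = size;
    raw[1] = type;
    demo_wire_cells = size + DEMO_HEADER_CELLS;
    memcpy(demo_wire, raw, (size_t)demo_wire_cells * sizeof *raw);
    if (buf != NULL)
        free(raw);
}
```

Note that the buffer may be allocated larger than what is finally
sent: only the header and the cells actually used reach the wire.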


Receiving a message
===================

The function _p_msg_receive() is used to receive messages.  It places
the received message onto the heap starting at _p_heap_ptr.  (It will
check to make sure there is enough space left on the heap for the
message first, and if not it will call the garbage collector.)

_p_msg_receive() has several different modes of operation, depending
on the receive type:

    RCV_BLOCK	: Blocking receive of any type
    RCV_NOBLOCK	: Non-blocking receive of any type
    RCV_PARAMS	: Only receive a MSG_PARAMS (parameter) message, or
		  a MSG_EXIT or MSG_INITIATE_EXIT.  Queue up messages
		  of other types.  This is a blocking receive.
    RCV_GAUGE	: Only receive a MSG_GAUGE (gauge) message, or
		  a MSG_EXIT or MSG_INITIATE_EXIT.  Queue up messages
		  of other types.  This is a blocking receive.
    RCV_COLLECT	: Only receive a MSG_READ, MSG_CANCEL, MSG_EXIT, or
		  MSG_INITIATE_EXIT message.  This type will be used
		  if the heap space fills up in a parallel run, and
		  we're waiting for space to be freed up.
		  This is a blocking receive.
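
The table above amounts to a filter that decides, for each incoming
message, whether it is handed back to the caller or queued for a
later receive.  A sketch of that filter follows; the enum names and
numeric values are placeholders, since the real constants come from
the emulator headers.

```c
#include <assert.h>

enum demo_rcv { DEMO_RCV_BLOCK, DEMO_RCV_NOBLOCK, DEMO_RCV_PARAMS,
                DEMO_RCV_GAUGE, DEMO_RCV_COLLECT };
enum demo_msg { DEMO_MSG_PARAMS, DEMO_MSG_GAUGE, DEMO_MSG_READ,
                DEMO_MSG_CANCEL, DEMO_MSG_EXIT, DEMO_MSG_INITIATE_EXIT,
                DEMO_MSG_DEFINE, DEMO_MSG_VALUE };

// Returns 1 if a message of msg_type may be returned to the caller
// under rcv_type, or 0 if it has to be queued.
int demo_deliver_now(enum demo_rcv rcv_type, enum demo_msg msg_type)
{
    if (rcv_type == DEMO_RCV_BLOCK || rcv_type == DEMO_RCV_NOBLOCK)
        return 1;  // these modes accept any type
    if (msg_type == DEMO_MSG_EXIT || msg_type == DEMO_MSG_INITIATE_EXIT)
        return 1;  // exit messages pass in every restricted mode
    switch (rcv_type) {
    case DEMO_RCV_PARAMS:
        return msg_type == DEMO_MSG_PARAMS;
    case DEMO_RCV_GAUGE:
        return msg_type == DEMO_MSG_GAUGE;
    case DEMO_RCV_COLLECT:
        return msg_type == DEMO_MSG_READ || msg_type == DEMO_MSG_CANCEL;
    default:
        assert(0);  // unknown receive type
        return 0;
    }
}
```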


Normal termination of the emulator
==================================

A normal termination will occur if any node runs the exit PAM
instruction, or runs one of the exit_from_*() procedures (in utils.c).
The guts of the exit routines are in parallel.c, _p_host_handle_exit()
and _p_node_handle_exit().

If a node initiates the exit, it will send a MSG_INITIATE_EXIT to the
host and then wait for the normal exit protocol to occur.

If the host initiates the exit, or receives a MSG_INITIATE_EXIT
message from a node, then it will run the exit protocol:

  1) The host will sync up with the nodes by sending a MSG_EXIT
message to each node, and then waiting for a return MSG_EXIT message
from each node.  Upon receipt of a MSG_EXIT message, a node will
simply return a MSG_EXIT message to the host.

  2) The host will initiate the Gauge profile dump.  MSG_GAUGE type
messages will be used within this chunk of code to control the dump.

  3) All nodes will dump their Upshot logs to files.

  4) The host will sync up with the nodes again, as described in step #1.

  5) _p_destroy_nodes() will be called on the host and all nodes

  6) a) The host will send a MSG_EXIT to each node and then call
_p_shutdown_pcn().

     b) The nodes will wait for a MSG_EXIT from the host and then call
_p_shutdown_pcn().

  7) The host and all nodes will call exit().

The _p_destroy_nodes() that is called during step 5 on the host and
the nodes need not do anything.  If it does not do anything, then
everyone will proceed to call exit() normally.

However, if normal shutdown must be done in some manner other than
having all nodes call exit(), this can be implemented in
_p_destroy_nodes().
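
The host side of the sync used in steps 1 and 4 can be sketched as
follows.  This is a simulation only: the send and receive stubs are
hypothetical, and each "node" echoes its MSG_EXIT immediately instead
of going through a real transport.

```c
#include <assert.h>

#define DEMO_MAX_NODES 8

static int demo_pending[DEMO_MAX_NODES];  // replies waiting at the host
static int demo_n_pending = 0;

// Loopback stand-in for sending MSG_EXIT: the node answers at once.
static void demo_send_exit(int node)
{
    assert(demo_n_pending < DEMO_MAX_NODES);
    demo_pending[demo_n_pending++] = node;
}

// Stand-in for a blocking receive of MSG_EXIT; returns the sender.
static int demo_receive_exit(void)
{
    assert(demo_n_pending > 0);
    return demo_pending[--demo_n_pending];
}

// One MSG_EXIT out to, and one back from, every non-host node.
// Returns the number of replies collected.
int demo_host_sync(int n_nodes)
{
    int replied[DEMO_MAX_NODES] = { 0 };
    int node, count = 0;

    assert(n_nodes >= 1 && n_nodes <= DEMO_MAX_NODES);
    for (node = 1; node < n_nodes; node++)
        demo_send_exit(node);
    while (count < n_nodes - 1) {
        node = demo_receive_exit();
        assert(!replied[node]);  // each node answers exactly once
        replied[node] = 1;
        count++;
    }
    return count;
}
```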


Aborting the emulator
=====================

If the emulator encounters a fatal error during its execution (a
signal, corrupt heap, etc), it will call the _p_fatal_error()
function.  That function will try to cleanly shut all nodes of the
emulator down.  (As opposed to leaving stray processes hanging around,
etc.)  But it will not go through the normal exit protocol described
above.

Along the way it will call _p_abort_nodes().  If a method exists for
killing all nodes of the emulator, then _p_abort_nodes() should use
it.  For example, the Sequent Symmetry version uses a killpg() to kill
the entire process group which consists of all the nodes.  In other SR
modules (sr_machipc), _p_abort_nodes() sends a special abort message
to all other nodes before exiting.  In that case, the _p_msg_receive()
routine watches for an abort message and calls _p_fatal_error() if it
receives one.

In general, the goal of _p_abort_nodes() is to do everything possible to
kill all nodes of the emulator, so that under abortive circumstances
some nodes aren't left hanging around while others have terminated.

If a fatal error occurs in the emulator after it has been completely
initialized, there are two procedures (in boot.c) that should be used:

	_p_fatal_error("Error string");

and

	_p_malloc_error();

Neither of these procedures returns.  They will kill the node and
hopefully all other nodes as well.

Note: There is a separate _p_malloc_error() procedure because on some
machines the fprintf's used by _p_fatal_error() will call malloc and
fail and cause a real mess.


sr_*.h
======

At a minimum, the following needs to be defined in sr_*.h:

#undef PARALLEL
#define PARALLEL

#undef ASYNC_MSG
#define ASYNC_MSG 0

The PARALLEL definition causes all of the parallel emulator code to be
compiled into the emulator.  Without this definition, the emulator
only has the code to run a single-node emulator.

The ASYNC_MSG definition causes the proper message handling code to
get linked into the emulator.  It signals whether this SR module uses
synchronous (polled) message handling (ASYNC_MSG==0) or asynchronous
message handling (ASYNC_MSG==1).


Asynchronous message handling
=============================

When ASYNC_MSG is set to 0 (synchronous message handling), the
emulator will occasionally poll for new messages.  It does this by
calling _p_msg_receive(...,RCV_NOBLOCK) -- a non-blocking receive.
Unfortunately, this can be a relatively expensive operation.

However, some systems can be set up so that when a message arrives,
the emulator can be asynchronously notified of this fact.  In this
situation, the emulator need not call _p_msg_receive(...,RCV_NOBLOCK) in
order to find out if there are messages.  Rather, the asynchronous
notification can set a variable that the emulator can check, instead
of having to call _p_msg_receive() each time.

When ASYNC_MSG is set to 1, this asynchronous notification is enabled.
Instead of calling _p_msg_receive(...,RCV_NOBLOCK) to check for new
messages, the emulator just checks the _p_msg_avail variable.

Thus, if an SR module uses asynchronous messaging, then it must set
_p_msg_avail to TRUE when a message arrives.  When the emulator finds
that _p_msg_avail has been set to TRUE, only then will it call
_p_msg_receive(...,RCV_NOBLOCK).  So, once _p_msg_receive() handles all
available messages, it should reset _p_msg_avail to FALSE.
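
A sketch of this flag protocol follows.  The demo_ names are
hypothetical; demo_msg_avail plays the role of _p_msg_avail, and the
"transport" is just a counter.  The flag is declared volatile
sig_atomic_t on the assumption that it may be set from a signal
handler.

```c
#include <assert.h>
#include <signal.h>

static volatile sig_atomic_t demo_msg_avail = 0;
static int demo_queued = 0;         // messages waiting in the transport
static int demo_receive_calls = 0;  // counts the expensive calls

// Would be invoked by the transport when a message arrives.
void demo_on_arrival(void)
{
    demo_queued++;
    demo_msg_avail = 1;
}

// Stand-in for _p_msg_receive(..., RCV_NOBLOCK): handles everything
// that is available, then resets the flag.
static int demo_msg_receive_noblock(void)
{
    int handled = demo_queued;
    demo_receive_calls++;
    demo_queued = 0;
    demo_msg_avail = 0;
    return handled;
}

// One poll from the emulator loop when ASYNC_MSG == 1.
int demo_poll(void)
{
    if (!demo_msg_avail)
        return 0;  // cheap path: no receive call at all
    return demo_msg_receive_noblock();
}
```

A real module also has to make sure that a message arriving between
the drain and the flag reset is not lost; the sketch ignores that
race.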


*/




/*****************************************************************
******************************************************************
**								**
**		PROCEDURE DESCRIPTIONS				**
**								**
******************************************************************
*****************************************************************/




/***********************************************************
void _p_sr_get_argdesc(int argc, char **argv,
		       argdesc_t **argdescp, int *n_argdescp)

Called by boot.c to get a pointer to the argument description table.  If
the sr code needs something from argc and argv directly (for example,
argv[0] has the name of this executable), it can get this.

We can also initialize sr variables here if they might be modified
during argument handling.

********************* _p_sr_get_argdesc() ******************/


/***********************************************************
void sr_fatal_error(char *msg)

Used by _p_sr_init_node() to deal with fatal errors during the
worker creation process.  _p_fatal_error() cannot be called until
everything is up and running.  So sr_fatal_error() fills in until
then. 

********************* sr_fatal_error() *********************/



/***********************************************************
void _p_sr_init_node()

This procedure is responsible for setting up and initializing the SR
module on all nodes.  It is the first thing called.

The last thing it should do is call _p_init_node() on all nodes
(including the host).  When it makes this call, the SR module should
be fully functional.

This procedure usually works in one of two ways:

1) The host process must spawn off all the nodes (using fork, or rsh,
or some such means), initialize itself, and then call _p_init_node().
Then, when the node processes hit this routine (by way of the fork, or
the rsh, or whatever), they initialize themselves and call
_p_init_node().

2) On some parallel machines, the OS takes care of loading the
executable onto all nodes simultaneously.  In this case, the procedure
must figure out how to initialize the SR module for the node it is
running on, get everything set up so that it can communicate with
other nodes, and then call _p_init_node().

********************* _p_sr_init_node() ********************/



/***********************************************************
void _p_sr_node_initialized()

This function is called after the node has been completely
initialized.  It need not do anything.  However, it can be useful for
two things:

1) SR module debugging code can be put here.  For example, I often put
a simple ring test in here, just to see if the proper connections are
being made.

2) It can make a final check to make sure all the other nodes came up
ok.  If they did not, then it can shut things down.
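
A ring test of the kind mentioned in (1) could look roughly like the
sketch below.  It only simulates the hops in memory; in a real module
each hop would be a _p_msg_send() to the next node followed by a
blocking receive.  The function names are hypothetical.

```c
#include <assert.h>

// Node that follows my_id on a ring of n_nodes nodes numbered 0..N-1.
int demo_ring_next(int my_id, int n_nodes)
{
    assert(n_nodes > 0 && my_id >= 0 && my_id < n_nodes);
    return (my_id + 1) % n_nodes;
}

// Simulated ring test: a token leaves the host (node 0) and is
// forwarded until it comes back.  Returns the number of hops taken.
int demo_ring_test(int n_nodes)
{
    int at = 0, hops = 0;

    do {
        at = demo_ring_next(at, n_nodes);
        hops++;
    } while (at != 0);
    return hops;
}
```

If every connection works, the token returns to the host after
exactly N hops.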

********************* _p_sr_node_initialized() *************/



/***********************************************************
void	_p_destroy_nodes()

This procedure is described above under the "Normal termination of
the emulator" section.

To recap, it is called on every node during normal termination of the
emulator.  It can kill all of the nodes.  Or it can do nothing, in
which case all nodes will proceed to execute an exit(0).

********************* _p_destroy_nodes() *******************/



/***********************************************************
void _p_abort_nodes()

This procedure is called from _p_fatal_error() -- when we encounter a
fatal error situation.  It should do what it can to kill off all of
the nodes.

Some typical ways in which this is done:

1) A special (machine specific) procedure is called which will kill
off all the nodes.  For example, on the Sequent Symmetry, killpg() is
called to kill all the nodes.

2) An abort message is sent to the host.  When the _p_msg_receive()
routine on the host receives this abort message, it calls some special
procedure to kill all the nodes.

3) An abort message is sent to all other nodes.  When those other
nodes do a _p_msg_receive() and see the abort message, they will
shut down using _p_fatal_error().
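
Method 3 can be sketched as a loop over every node except the caller.
The send stub below is hypothetical and only counts what it is asked
to send; a real module would use its own low-level send primitive
here.

```c
#include <assert.h>

#define DEMO_MAX_NODES 8

static int demo_abort_sent[DEMO_MAX_NODES];  // abort messages per node

// Hypothetical stand-in for sending one abort message to 'node'.
static void demo_send_abort(int node)
{
    assert(node >= 0 && node < DEMO_MAX_NODES);
    demo_abort_sent[node]++;
}

// Notify every node other than the caller that the run is aborting.
void demo_abort_nodes(int my_id, int n_nodes)
{
    int node;

    assert(n_nodes >= 1 && n_nodes <= DEMO_MAX_NODES);
    for (node = 0; node < n_nodes; node++)
        if (node != my_id)
            demo_send_abort(node);
}
```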

********************* _p_abort_nodes() *********************/



/***********************************************************
cell_t *_p_alloc_msg_buffer(int size)

Allocate a message buffer that will later be used by _p_msg_send().
The 'size' argument specifies how many cells (NOT bytes) the message
buffer should contain.

Note: If this procedure uses malloc(), and the malloc fails, then it
should call _p_malloc_error(), not _p_fatal_error().  The difference
is that _p_malloc_error() does not use fprintf().  On many machines,
once a malloc fails once, it will fail from then on.  Unfortunately,
fprintf() usually uses malloc() for temporary space, so it fails after
a malloc error.  Therefore, _p_malloc_error() does not use fprintf().
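
A sketch of an allocator that follows this rule is shown below.  It
assumes a POSIX system, where write() can report the error without
touching the heap.  The cell type and the demo_ names are
hypothetical, and the real _p_malloc_error() presumably also tries to
bring down the other nodes, which this stand-in does not.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

typedef long demo_cell_t;  // assumed cell type

// Stand-in for _p_malloc_error(): reports through write(), which needs
// no temporary heap space, and never returns.
static void demo_malloc_error(void)
{
    static const char msg[] = "Fatal error: malloc failed\n";
    ssize_t n = write(STDERR_FILENO, msg, sizeof msg - 1);

    (void)n;  // nothing more can be done if the write fails
    _exit(1);
}

// Allocate 'size' cells, aborting through demo_malloc_error() on failure.
demo_cell_t *demo_alloc_cells(int size)
{
    demo_cell_t *buf;

    assert(size >= 0);
    // Ask for at least one cell so that NULL always means failure.
    buf = (demo_cell_t *)malloc((size_t)(size > 0 ? size : 1)
                                * sizeof *buf);
    if (buf == NULL)
        demo_malloc_error();
    return buf;
}
```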

Return:	A pointer to a message buffer with 'size' cells.

********************* _p_alloc_msg_buffer() ****************/


/***********************************************************
void _p_msg_send(cell_t *buf, int node, int size, int type)

Sends the message that is pointed to by 'buf' to 'node'.  Only the
first 'size' cells of the buffer need to be sent.  The message has the
specified 'type'.  If buf==NULL (and size==0), then an empty message
of the specified type is sent.

After the send is completed, free the message buffer.

This send will block until the message can be delivered (though not
necessarily until it has been received, if there is buffering in
transit).

If an error occurs, _p_fatal_error() or _p_malloc_error() should be
called to abort the program.

********************* _p_msg_send() ************************/


/***********************************************************
bool_t _p_msg_receive(int *node, int *size, int *type, int rcv_type)

Receive a message from ANY node. Place the message onto the heap.
(And make sure there is room for it on the heap.)

Valid 'rcv_type' arguments are:
	RCV_NOBLOCK	Do not block if no messages are waiting.
	RCV_BLOCK	Block until a message is received.
	RCV_PARAMS	Block until a MSG_PARAMS message arrives.
	RCV_GAUGE	Block until a MSG_GAUGE message arrives.
	RCV_COLLECT	When called from _p_garbage_collect().  Ignore
				MSG_COLLECT messages, and queue up
				MSG_DEFINE and MSG_VALUE.  Block until
				we get a MSG_CANCEL or MSG_READ.

See above for more detailed info on receiving messages.

Return:	TRUE if we read a message, otherwise FALSE.
		The node, size and type arguments are return values.

********************* _p_msg_receive() *********************/
