		Using NANNY and RUNCONSOLE
		**************************

OVERVIEW
========

  TCA provides many of the facilities necessary for fault tolerance. 
  However, TCA has no concept of the process control mechanisms or file 
  system organization required in order to handle failures at the Unix level. 

  In complex multi-person robot projects, the standard run-time setup
  nonetheless relies too heavily on people being coached in the setup and
  debugging of subsystem processes. NANNY is a system for extracting
  much of this knowledge, codifying it, and then managing the inevitable
  pitfalls of multiple processes running in the real world (code failures,
  etc.).

  The old organization typically used in robot projects was to start up
  multiple xterms, connected to the appropriate machines. For a typical
  runtime Xavier situation, several such windows were necessary. Most people
  neither know nor wish to know what each of these processes does. Instead,
  you only care about these processes when you wish to control them, or want
  to be notified that something has gone wrong. Monitoring the extraneous
  windows proves to be distracting and confusing in practice.

  To address these problems, we've created two programs:
    nanny  
    runConsole

  The idea is to have a single process on each machine (a nanny) that
  starts up and monitors (baby sits) the processes needed to run the
  robot.  The nanny has all the information needed to give the programs
  the correct arguments and knows which modules depend on other programs
  being run first.  Of course, since the nanny starts these programs and
  manages all the IO, you can't directly see what is going on.  That is
  where runConsole comes in.  The runConsole program connects to the
  nanny server and allows you to interact with any of the running
  programs (see the output, type new input), on any of the computers.

  For example, on one of the machines with nanny, run
	 % nanny &

  Assuming everything goes OK, you'll get some messages about processes
  running and eventually you'll be told that nanny is ready. Then do
	 % runConsole &

  A window will pop up. Select "csh" from the "run" menu. Shortly,
  a csh button will appear. Clicking on this button will allow you to interact
  with a csh shell. You can use the standard gnu-emacs commands for editing in
  the shell.

  We'll now go into more detail about what NANNY is actually doing, how to
  tell NANNY about any new modules you write, and tips for using
  RUNCONSOLE. I'm afraid there are a few bugs and some outstanding work to
  warn you about as well.

Using NANNY
===========

  Location:
  ---------
	 % tca/bin/nanny

  Command Line Options:
  ---------------------
    -help   

    -load resource-file
   
       If this argument isn't present, NANNY first tries to load 
       nanny.rc from the current directory, then
       /afs/cs/project/robocomp/robot/src/robust/Simulator.rc, then
       /usr/xavier/xavier/src/robust/Simulator.rc.  These last two are
       for compatibility with the original release of nanny.

       The resource file lists the processes that can be run, together with
       the run-time information each one needs: where to run from, what
       environment variables to use, what machine to run on, what
       dependencies it has, and what conflicts to avoid. The next section
       covers how to adapt or create your own resource file.
	
    -clean  

       When nanny is run, it first checks whether any of the programs
       mentioned in the resource file are still running from a previous run
       of the system. If so, -clean will try to kill them. Don't use -clean
       with the default files, as this will result in processes such as csh
       being killed!

  Usage:
  ------

       When nanny is run, it first checks whether any of the programs
       mentioned in the resource file are still running from a previous run
       of the system. Any such conflicts are reported to the user, or the
       processes are killed off if -clean was specified.

       If the default Simulator.rc files are used, this will set things up for
       single-machine, simulator-based usage. That is, instead of connecting
       to the real robot, a simulator window will be run. Look at
       Simulator2.rc to see how to set up multiple-machine, simulator-based
       usage (that is, running central, xavier, CTR, navigate etc. on
       multiple machines, to reduce load).

       Once NANNY is going, it waits for action commands from RUNCONSOLE or
       from NANNYs on different machines. (All communication between machines
       is NANNY-to-NANNY, with only one NANNY allowed per machine.) Following
       a command, it will check the resource file and perform the appropriate
       action.

       In addition to acting on commands from outside, NANNY will monitor
       modules which it has run. If any module crashes, NANNY will restart it
       up to 10 times. Output from processes is stored, so even if no
       RUNCONSOLEs are being run, one can be started up after the fact to see
       what caused failure.
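
       The restart behaviour described above can be sketched as follows.
       This is an illustrative Python sketch, not nanny's actual code: the
       supervise function, its return value, and the choice to treat any
       non-zero exit status as a crash are assumptions made for the example.

```python
import subprocess
import sys

MAX_RESTARTS = 10  # the default restart limit mentioned in the text


def supervise(argv, max_restarts=MAX_RESTARTS):
    """Run argv, restarting it after each failure, at most max_restarts times.

    Returns (launches, stored_output). Output of every launch is kept so it
    can be inspected after the fact. Assumption: a non-zero exit status
    counts as a crash; a zero exit status ends supervision.
    """
    launches = 0
    stored_output = []
    while True:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep stdout and stderr together
            text=True,
        )
        launches += 1
        stored_output.append(result.stdout)
        if result.returncode == 0 or launches > max_restarts:
            return launches, stored_output


if __name__ == "__main__":
    # A child that always fails: one initial launch plus two restarts.
    failing = [sys.executable, "-c", "import sys; print('boom'); sys.exit(1)"]
    print(supervise(failing, max_restarts=2)[0])
```

       With max_restarts set to 0 the loop launches the program exactly
       once, which matches the "once only" meaning of a zero restart count.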

  XAVIER:
  -------

       On Xavier, the intention is that NANNYs will be running full time on
       wolverine, ariel and gambit. Simply run RUNCONSOLE on any of these
       machines to interact with the system. If you are debugging your new
       process, do the debugging of that process as normal (in an xterm, via
       gdb etc), attaching to the central run by nanny, and checking with
       RUNCONSOLE that the correct processes are running (navigate, CTR, CTS,
       CTV etc). NANNY will then make sure that they stay running while you
       debug. 

Extending Resource Files:
=========================

  A resource file is read each time a nanny is run, and is reread when
  commanded by RUNCONSOLE. Not all of the programs listed will be run, only
  those required by RUNCONSOLE commands and the programs they depend on.
  Each process is specified by a line of the form:
# Format 
#  program_name	Machine_to_run	path(on_that_machine) executable   arguments
process: csh	localhost 	/usr/bin 	      csh          -i

  Note that localhost is an option for Machine_to_run, and means the same
  machine as nanny. Alternatively, if you have hard coded a machine name but
  are running on that same machine, it will be treated as localhost. If
  Machine_to_run is different from the host this nanny is running on, nanny
  will communicate with a remote nanny, i.e. the following will cause a
  csh shell to be executed remotely on lung.
#  program_name	Machine_to_run	path(on_that_machine) executable   arguments
process: cshlung lung 	 	/usr/bin 	      csh          -i

  By default, NANNY considers a program ready as soon as UNIX marks its
  process as runnable. It is possible to be more explicit: nanny can
  search for a token in the program's output. No dependent processes will be
  run until this token appears.

# We now add a ready signal. This string will be searched for. 
ready: csh	 %
ready: xavier 	Opened UNIX & TCP/IP sockets to accept connections.
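
  The token search can be pictured with the following Python sketch. It is
  not taken from nanny; the function name ready_seen and the chunked input
  are assumptions for illustration. The point it shows is that output
  arrives in arbitrary pieces, and that a token such as the csh prompt "%"
  is not followed by a newline, so a line-by-line search would not be enough.

```python
def ready_seen(chunks, token):
    """Return True once token appears anywhere in the concatenated chunks.

    chunks is any iterable of strings, standing in for successive reads
    from a process's output. Only the tail of earlier output that could
    still begin a match is kept between reads.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if token in buffer:
            return True
        # Keep the last len(token) - 1 characters: a match that straddles
        # two reads can start no earlier than that.
        keep = len(token) - 1
        buffer = buffer[-keep:] if keep > 0 else ""
    return False


if __name__ == "__main__":
    reads = ["Opened UNIX & TCP/IP sock", "ets to accept connections.\n"]
    print(ready_seen(reads, "Opened UNIX & TCP/IP sockets to accept connections."))
```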

  We can also specify environment variables. * signifies that an environment
  variable should be applied to all processes. (Note that DISPLAY should be
  provided from runConsole rather than from here, although it can be set
  here.)
env: *	 	CENTRALHOST=heart
env: CTR	LASER_THRESHOLD=75

  When everything has been defined, mention the dependencies. If a program is
  not mentioned, it is assumed to have no dependencies. 
dependencies: navigate 	central CTR xavier
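
  One way to picture how a dependency list turns into a start-up order is the
  Python sketch below. It is an assumption-laden illustration rather than
  nanny's algorithm: the start_order function is hypothetical, and how nanny
  itself treats circular dependencies is not stated in this document.

```python
def start_order(target, dependencies):
    """Return the programs to start, dependencies first, target last.

    dependencies maps a program name to the list of programs it needs.
    Programs that are not mentioned are assumed to have no dependencies.
    """
    order = []
    in_progress = set()

    def visit(name):
        if name in order:
            return  # already scheduled
        if name in in_progress:
            raise ValueError("circular dependency involving " + name)
        in_progress.add(name)
        for needed in dependencies.get(name, []):
            visit(needed)
        in_progress.discard(name)
        order.append(name)

    visit(target)
    return order


if __name__ == "__main__":
    deps = {"navigate": ["central", "CTR", "xavier"]}
    print(start_order("navigate", deps))
```

  For the dependencies line above, the sketch starts central, CTR and xavier
  before navigate.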

  Some processes conflict, and can't be run at the same time. These conflicts
  are not necessarily symmetric (e.g. if we are running CTV we can't run
  ColorVideo, but we can do vice versa). This will not impact simulation runs,
  but we nonetheless present it here to show how it's done.

conflict: CTV 		track ColorVideo
conflict: track 	CTV

# A more useful example (?). Two choices of central - a quiet option, or a
# verbose option. You decide.
conflict: central central.verbose
conflict: central.verbose central

  In addition, you may wish to vary the number of times nanny will rerun a
  dead process. The default is 10, but there are occasions when you would
  not want to restart a process at all. Setting restarts to 0 means the
  process is run once only.

# default is MAX_RESTARTS - can reset that easily.
restarts: *       5
# and there are some we don't want to restart at all...
restarts: panSnap 0

  A comment in a resource file is any text on a line after #.
  Any form of whitespace may separate fields, but an entry cannot be split
  across lines.
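
  To make the format concrete, here is a small Python sketch that reads the
  entry types shown in this section. It is not nanny's parser: the
  parse_resource function and the dictionary layout it returns are
  assumptions, and it collapses runs of whitespace inside a ready string to
  single spaces.

```python
def parse_resource(text):
    """Parse resource-file text into a dictionary keyed by entry type."""
    config = {
        "process": {}, "ready": {}, "env": {},
        "dependencies": {}, "conflict": {}, "restarts": {},
    }
    for raw_line in text.splitlines():
        # Everything after # is a comment (so a ready string cannot hold #).
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        keyword, _, rest = line.partition(":")
        keyword = keyword.strip()
        fields = rest.split()  # any run of spaces or tabs separates fields
        if keyword not in config or not fields:
            continue  # assumption: silently skip lines we do not recognise
        name, args = fields[0], fields[1:]
        if keyword == "process":
            if len(args) < 3:
                raise ValueError("incomplete process entry: " + raw_line)
            machine, path, executable, *arguments = args
            config["process"][name] = {
                "machine": machine, "path": path,
                "executable": executable, "arguments": arguments,
            }
        elif keyword == "ready":
            config["ready"][name] = " ".join(args)
        elif keyword == "env":
            for assignment in args:
                variable, _, value = assignment.partition("=")
                config["env"].setdefault(name, {})[variable] = value
        elif keyword == "restarts":
            config["restarts"][name] = int(args[0])
        else:  # dependencies and conflict both hold a list of program names
            config[keyword].setdefault(name, []).extend(args)
    return config


if __name__ == "__main__":
    sample = "process: csh localhost /usr/bin csh -i\nrestarts: panSnap 0\n"
    print(parse_resource(sample))
```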

Interacting via RUNCONSOLE:
===========================

   This is the visible part of the whole affair. NOTE: if RUNCONSOLE acts up,
   it is trivial to shut it down, using either "Hide Window" or just CTRL-C,
   and start it up again. No data will be lost, as all data is stored on the
   nannys.

  Location:
  ---------
	 % /afs/cs/project/robocomp/robot/bin/runConsole

  Command Line Options:
  ---------------------
    -h | -help                 this message

    -run name  
    -kill name 
    -reload [current | file] 

       These are intended for circumstances in which an X Windows display is
       not available. -run will cause "name" to be run via NANNY, -kill will
       attempt to kill "name" via NANNY, and -reload will change the resource
       file nanny is using. However, they haven't been tested recently, so
       stick to the X Windows display.

  Usage:
  ------

       The RUNCONSOLE window can be divided into 4 sections.

       RUNCONSOLE commands:
        QUIT:        Shut down all executing processes, and close RUNCONSOLE.
        HIDE WINDOW: Close RUNCONSOLE, without affecting running programs.
        RELOAD:      Not quite implemented fully - adjusts the resource file
                     NANNY is utilizing.
        HELP:        You're reading it.

       MESSAGE information:
        FILTER:       Will restrict messages to those containing a string, e.g.
                      those containing "[xavier]", or maybe "malloc".
        CLEAR BUFFER: Clear current messages in this text widget.
        text widget:  This displays error signals from the running processes,
                      from the nannys (identified by machine name) and from
                      RUNCONSOLE itself.

       PROCESS information:
        NEW:          Menu list -> select a program for NANNY to run,
                      automatically running everything it depends on.
        KILL:         Kill the currently selected program.
        RESTART:      Kill the currently selected program, then restart it.
        BREAK:        An INTERRUPT signal will be sent to the current process.
        CLEAR BUFFER: Clear current messages in this text widget.

        The list of currently available processes is presented next. Clicking
        one will make it the currently selected process, and the text widget
        will then allow interaction with it. If an error occurs in a process,
        its button will turn red until the error is examined.

       STATUS indicator:
        This tells how communication with nanny is going. The scroll bar will
        continue moving until a command is acknowledged to have completed
        successfully, or failed. 

Advice for writing modules:
===========================

  This section offers some guidelines to keep in mind when designing
  modules, in order to get the full benefit of NANNY.

  Notifying Errors:
  -----------------
    When writing programs to be run under nanny, try to limit messages written
    to stderr to things the end-user _must_ be notified about, such as a
    critical device failure. Send trivial warnings and system communications
    to stdout instead; RUNCONSOLE can be set to ignore stdout, reducing the
    noise level.

  Control-Characters:
  -------------------
    Special control codes aren't supported. Therefore putting ^G's and the
    like in your printfs won't cause bells to ring, as they would in
    xterms. If you are certain you want to draw someone's attention to your
    module, write a message to stderr.

  Retry:
  ------
    Nanny may restart a module with nobody watching, so have all start-up
    information coded on the command-line, as environment variables, or as a
    runtime parameter file. Interaction with the user should only be required
    as a last resort.

  Starting:
  ---------
    It is useful if a message is output to stdout when your process
    is ready for action. This allows easy sequencing of dependent
    processes. Add this message to the resource file as a ready: entry.
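
    As a minimal sketch of the idea, shown in Python although modules may be
    written in any language: the message text "mymodule: ready" and the
    function name announce_ready are hypothetical. The explicit flush matters
    because output written to a pipe is normally buffered, so without it the
    message might not reach the reader until much later.

```python
import sys

# Hypothetical message text; it should match the module's ready: entry.
READY_MESSAGE = "mymodule: ready"


def announce_ready(stream=sys.stdout):
    """Write the ready message and flush it so it is not held in a buffer."""
    stream.write(READY_MESSAGE + "\n")
    stream.flush()


if __name__ == "__main__":
    # ... module initialisation would happen here ...
    announce_ready()
```

    The matching resource-file entry for this hypothetical module would be
    "ready: mymodule mymodule: ready".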

  TCA:
  ----
    Rather than depending on the number of expected modules to delay the 
    tcaWaitForReady call until all the needed modules are up, use 
    tcaModuleRequires in order to specify which capabilities a module needs. 

In the Event of Problems:
=========================

  The intention is that nanny will provide robustness to Xavier despite buggy
  modules. However, nanny itself is still young and being worked upon. While it
  should still assist you in day to day running, you may nonetheless run into
  trouble.

  First, try exiting RUNCONSOLE and restarting it. This should not alter running
  processes but may clear a bad state. If after restarting, you don't have
  access to any of the running processes, try executing a dummy - I frequently
  use tcsh - to wake up the nannys. 

  If all else fails and NANNY crashes, do a ps on your system, since it may
  have left stray processes running. You will in fact be notified of such
  processes when you next run NANNY.

Future work
===========

  Increasing fault tolerance is possible by providing alternatives/allowing
  for process migration. For example, if a process was running upon a machine
  which suffered hard disk failure, it would be possible to code a list of
  alternative machines to use as backup. More realistically, processes such as
  navigate don't care where they are run, and could be shoved around for load
  balancing purposes. 

  The message communication mechanism isn't all it could be, and has a lot of
  overhead for each message. 

Bugs
====

  Suspending either NANNY or RUNCONSOLE breaks pipes, which don't get
  reattached correctly.

  Quitting while a lot of processes are running on multiple machines can cause
  some of the kill messages to fail. Currently, kill each process individually
  before quitting. Or type q to NANNY.  (Fixed - let me know if any problems
  reappear)

  The screen update in RUNCONSOLE is pretty shoddy. If the current process
  dumps a lot of messages, mouse input will lag far behind.

  The notification bar on the bottom in RUNCONSOLE is only semi-implemented,
  and doesn't act as intended under Linux. 

Conclusion
==========

  Low-level knowledge about individual modules has been removed from the hands
  of the end-users. The responsibility has been placed in the hands of the
  original module author to provide resources with correct dependency
  information to NANNY, which will then detect when the module is required,
  and ensure that it stays running.

  RUNCONSOLE simplifies interaction with a complex system. Instead of having
  to monitor extraneous modules, the end-user can now concentrate on a specific
  process, with nanny handling any failures, and in worst cases highlighting
  them to the end-user.  

Credits
=======

	Joseph O'Sullivan	josullvn@cs.cmu.edu
	Rich Goodwin		rich@cs.cmu.edu	
