Providing Resource Management and
Consistent Checkpointing for PVM
Georg Stellner
Institut für Informatik der Technischen Universität München
Lehrstuhl für Rechnertechnik und Rechnerorganisation
D-80290 München
stellner@informatik.tu-muenchen.de
Jim Pruyne
Dept. of Computer Sciences
University of Wisconsin -- Madison
pruyne@cs.wisc.edu
Abstract
One of the most frequently requested features of a high performance computing
environment is the ability to checkpoint, and later restart, a running
application. This capability provides a degree of fault tolerance, and
gives the system scheduler the freedom to preemptively halt a running
application and restart it later. The Condor batch scheduler has supported
checkpointing for a number of years, but programs running under Condor
that wish to checkpoint are restricted to a single process. Using the
single-process checkpointing mechanisms offered by Condor, the CoCheck
system creates consistent checkpoints of PVM applications.
CoCheck consists of a library of overlays for all PVM functions, and a
Resource Management (RM) process which coordinates the checkpoint protocol.
The overlay library intercepts all calls to PVM functions, ensures that no
checkpoint will be initiated while the function is running, and provides
the task identifier mapping needed after restart. By using this overlay
library there is no need to modify the PVM implementation, so CoCheck
requires few updates in order to remain compatible with new releases of
PVM. The RM process controls the checkpoint protocol running among the
application processes, and provides needed mappings between old task
identifiers and the current task identifiers in use after start-up. CARMI,
which supports PVM applications on an opportunistic pool of workstations
managed by Condor, incorporates CoCheck. Instead of being forced to kill
processes running on a workstation that is reclaimed by its owner, CARMI
is able to checkpoint the application and restart it when sufficient
resources become available. The CoCheck protocol will also be enhanced to
support direct migration of a single task within a CARMI application, which
will allow CARMI to migrate a process off a reclaimed machine rather than
having to checkpoint and restart the entire application.
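
The overlay idea described in the abstract can be illustrated with a minimal
sketch. It is not CoCheck's implementation and does not call PVM: the class
OverlaySketch, its methods, and the transport callable are all hypothetical
names introduced here for illustration. The sketch assumes only what the
abstract states, namely that a wrapped call translates the original task
identifier to the one currently in use, and that a checkpoint requested while
a wrapped call is running is deferred until the call has returned.

```python
class OverlaySketch:
    """Hypothetical overlay layer (illustration only, not CoCheck code)."""

    def __init__(self, transport):
        # transport stands in for the underlying message-passing call;
        # it is an assumption of this sketch, not a PVM function.
        self._transport = transport
        self._tid_map = {}           # original task id -> current task id
        self._in_call = 0            # number of wrapped calls in progress
        self._pending = False        # checkpoint requested during a call
        self.checkpoints_taken = 0   # counter standing in for a real checkpoint

    def update_mapping(self, original_tid, current_tid):
        """Record the task id a restarted task is now known by."""
        self._tid_map[original_tid] = current_tid

    def send(self, original_tid, msgtag, payload):
        """Wrapped call: translate the task id, then forward the message."""
        self._in_call += 1
        try:
            current = self._tid_map.get(original_tid, original_tid)
            return self._transport(current, msgtag, payload)
        finally:
            self._in_call -= 1
            # Run a deferred checkpoint once no wrapped call is active.
            if self._in_call == 0 and self._pending:
                self._pending = False
                self._checkpoint()

    def request_checkpoint(self):
        """Checkpoint now, or defer if a wrapped call is in progress."""
        if self._in_call > 0:
            self._pending = True
        else:
            self._checkpoint()

    def _checkpoint(self):
        # A real system would save process state here; the sketch only counts.
        self.checkpoints_taken += 1
```

In this sketch the application keeps using the task identifiers it saw before
the checkpoint, and only the mapping table changes after a restart, which is
one way to obtain the transparency the abstract describes.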
Georg Stellner
Tue Jun 14 15:34:39 MESZ 1995