Providing Resource Management and Consistent Checkpointing for PVM

Georg Stellner
Institut für Informatik der Technischen Universität München
Lehrstuhl für Rechnertechnik und Rechnerorganisation
D-80290 München

Jim Pruyne
Dept. of Computer Sciences
University of Wisconsin -- Madison


One of the most often desired features of a high performance computing environment is the ability to checkpoint, and later restart, a running application. This capability provides a degree of fault tolerance and gives the system scheduler the freedom to preemptively halt a running application and restart it later. The Condor batch scheduler has supported checkpointing for a number of years, but programs running under Condor that wish to checkpoint are restricted to a single process. Building on the single-process checkpointing mechanism offered by Condor, the CoCheck system creates consistent checkpoints of PVM applications. CoCheck consists of a library of overlays for all PVM functions and a Resource Management (RM) process which coordinates the checkpoint protocol. The overlay library intercepts all calls to PVM functions, ensures that no checkpoint is initiated while a function is running, and provides the task identifier mapping needed after restart. Because of this overlay library there is no need to modify the PVM implementation, so CoCheck requires few updates to remain compatible with new releases of PVM. The RM process controls the checkpoint protocol running among the application processes and provides the needed mappings between old task identifiers and the task identifiers in use after restart. CARMI, which supports PVM applications on an opportunistic pool of workstations managed by Condor, incorporates CoCheck. Instead of being forced to kill processes running on a workstation that is reclaimed by its owner, CARMI is able to checkpoint the application and restart it when sufficient resources become available. The CoCheck protocol will also be enhanced to support direct migration of a single task within a CARMI application, which will allow CARMI to migrate a process away from a reclaimed machine rather than having to checkpoint and restart the entire application.
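
The two jobs of the overlay library described above, keeping a checkpoint from starting while a wrapped call is running and translating old task identifiers into current ones, can be illustrated with a minimal sketch. It is written in Python for brevity and is not the CoCheck implementation or the PVM API: the class names TidMap and Overlay, the methods remap, current, send and checkpoint, and the use of a simple lock as the guard are all assumptions made for illustration.

```python
import threading


class TidMap:
    """Translate task ids seen before a checkpoint into the ids that
    are valid after restart (hypothetical helper, not part of PVM)."""

    def __init__(self):
        self._old_to_new = {}

    def remap(self, old_tid, new_tid):
        # Record that a task known as old_tid now runs as new_tid.
        self._old_to_new[old_tid] = new_tid

    def current(self, old_tid):
        # Ids that were never remapped are passed through unchanged.
        return self._old_to_new.get(old_tid, old_tid)


class Overlay:
    """Wrap a send function that stands in for a real PVM call."""

    def __init__(self, real_send):
        self._real_send = real_send
        self._guard = threading.Lock()
        self.tids = TidMap()

    def send(self, tid, payload):
        # Holding the guard keeps a checkpoint from starting mid-call.
        with self._guard:
            return self._real_send(self.tids.current(tid), payload)

    def checkpoint(self, save_state):
        # Waits until no wrapped call is in progress, then saves.
        with self._guard:
            return save_state()
```

In this sketch the application keeps using the task identifiers it learned before the checkpoint; after a restart, whoever coordinates it (the RM process in CoCheck's design) would call remap so that later sends reach the task's new identifier.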

Georg Stellner
Tue Jun 14 15:34:39 MESZ 1995