Jeremy Casas, Dan Clark, Phil Galbiati, Ravi Konuru,
Steve Otto, Robert Prouty and Jonathan Walpole
Department of Computer Science and Engineering
Oregon Graduate Institute of Science & Technology
PO Box 91000 Portland, OR 97291-1000, USA
We are currently involved in research to enable PVM to take advantage of shared networks of workstations (NOWs) more effectively. In such a computing environment, it is important to utilize workstations unobtrusively and to recover from machine failures. Towards this goal, we have enhanced PVM with transparent task migration, checkpointing, and global scheduling. These enhancements are part of the MIST project, which takes an open systems approach to developing a cohesive, distributed parallel computing environment. This open systems approach promotes plug-and-play integration of independently developed modules, such as Condor, DQS, AVS, Prospero, XPVM, PIOUS, and Ptools.
Transparent task migration, in conjunction with a global scheduler, facilitates the use of shared NOWs by allowing parallel jobs to unobtrusively utilize nodes that are currently unused. PVM tasks can be moved onto nodes that are otherwise idle, and moved off when a node is no longer free. Experiments show that migration performance is limited by the bandwidth of the underlying network. For example, an 8 MB process migrates in 8 seconds on a 10 Mbps Ethernet.
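The migration figure quoted above can be checked with a short calculation. The sketch below is illustrative only: it assumes 1 MB = 10^6 bytes and 10 Mbps = 10^7 bits per second, and the function names are hypothetical, not part of PVM or MIST. Under those assumptions the wire time alone is about 6.4 seconds, so an observed 8 seconds corresponds to roughly 80% of the nominal link rate, which is consistent with the network being the limiting factor.

```python
def ideal_transfer_seconds(size_bytes: int, link_bits_per_second: float) -> float:
    """Lower bound on transfer time: payload bits divided by nominal link rate."""
    return size_bytes * 8 / link_bits_per_second


def effective_bits_per_second(size_bytes: int, observed_seconds: float) -> float:
    """Throughput actually achieved, given a measured migration time."""
    return size_bytes * 8 / observed_seconds


# Assumed units (not stated in the text): decimal megabytes and megabits.
process_size = 8 * 10**6   # 8 MB process image, in bytes
link_rate = 10 * 10**6     # 10 Mbps Ethernet, in bits per second
observed = 8.0             # migration time reported in the text, in seconds

ideal = ideal_transfer_seconds(process_size, link_rate)
effective = effective_bits_per_second(process_size, observed)
utilization = effective / link_rate

print(f"ideal wire time:      {ideal:.1f} s")
print(f"effective throughput: {effective / 1e6:.1f} Mbps")
print(f"link utilization:     {utilization:.0%}")
```

If binary megabytes (2^20 bytes) are assumed instead, the ideal time rises to about 6.7 seconds and the implied utilization to about 84%; the conclusion is the same either way.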
We have implemented a global scheduler as a PVM resource manager that can take advantage of task migration to perform dynamic scheduling of tasks. Some extensions to the resource manager interface were required. The task migration mechanism also serves as the basis for transparent checkpointing, which is a common method for improving a system's fault tolerance. We have developed a PVM prototype that integrates checkpointing and migration. This paper presents an overview of the entire system, describes the issues raised by this work, and discusses future plans.