Dome Technical Reports

Dome: Parallel programming in a heterogeneous multi-user environment.

J. Arabe, A. Beguelin, B. Lowekamp, E. Seligman, M. Starkey, and P. Stephan. Technical Report CMU-CS-95-137, School of Computer Science, Carnegie Mellon University, April 1995. A condensed version of this paper has been published in the Proceedings of the International Parallel Processing Symposium 1996.

Writing parallel programs for distributed multi-user computing environments is a difficult task. The Distributed object migration environment (Dome) addresses three major issues of parallel computing in an architecture independent manner: ease of programming, dynamic load balancing, and fault tolerance. Dome programmers, with modest effort, can write parallel programs that are automatically distributed over a heterogeneous network, dynamically load balanced as the program runs, and able to survive compute node and network failures. This paper provides the motivation for and an overview of Dome, including a preliminary performance evaluation of dynamic load balancing for distributed vectors. Dome programs are shorter and easier to write than the equivalent programs written with message passing primitives. The performance overhead of Dome is characterized, and it is shown that this overhead can be recouped by dynamic load balancing in imbalanced systems. Finally, we show that a parallel program can be made failure resilient through Dome's architecture independent checkpoint and restart mechanism.

Application Level Fault Tolerance in Heterogeneous Networks of Workstations

Adam Beguelin, Erik Seligman, and Peter Stephan. Technical Report CMU-CS-96-157, School of Computer Science, Carnegie Mellon University, August 1996. A condensed version of this paper will appear in a special issue of the Journal of Parallel and Distributed Computing on Workstation Clusters and Network-based Computing, September, 1997.

We have explored methods for checkpointing and restarting processes within the Distributed object migration environment (Dome), a C++ library of data parallel objects that are automatically distributed over heterogeneous networks of workstations (NOWs). System level checkpointing methods, although transparent to the user, were rejected because they lack support for heterogeneity. We have implemented application level checkpointing which places the checkpoint and restart mechanisms within Dome's C++ objects. Application level checkpointing has been implemented with a library-based technique for the programmer and a more transparent preprocessor-based technique. Dome's implementation of checkpointing successfully checkpoints and restarts processes on different numbers of machines and different architectures. Results from executing Dome programs across a NOW with realistic failure rates have been experimentally determined and are compared with theoretical results. The overhead of checkpointing is found to be low, while providing substantial decreases in expected runtime on realistic systems.