Distributed System Fault Tolerance Research

In a distributed system, a long-running distributed application program can fail completely if even one of the processors on which it is executing fails. My work in this area involves developing transparent, low-overhead mechanisms that allow distributed application programs to survive such failures. My Ph.D. thesis studied the theory, implementation, and performance of transparent rollback recovery methods based on message logging and checkpointing, and I continue to be active in distributed system fault tolerance research.

I have primarily investigated optimistic rollback recovery methods based on combinations of message logging and checkpointing. Message logging and checkpointing methods provide very low-overhead fault tolerance support, typically adding less than 1 percent to the execution time of many distributed application programs. My research has included the development of a model for reasoning about and proving the correctness of these systems, the design of several new logging and recovery algorithms and techniques, and the implementation and performance measurement of both pessimistic and optimistic message logging. I have also worked on transparent rollback recovery based on consistent checkpointing, and on recoverable distributed shared memory.
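
As a rough illustration of the general idea behind rollback recovery with message logging and checkpointing, the following minimal Python sketch models a single deterministic process. It is not one of the algorithms from this research. The class LoggedProcess, its methods, and the demo function are hypothetical names chosen for this example, and an in-memory list stands in for stable storage. The sketch logs each message before processing it, which corresponds to the pessimistic style; an optimistic method would instead log asynchronously and accept the possibility of rolling back other processes after a failure.

```python
import copy


class LoggedProcess:
    """Toy deterministic process (hypothetical): state is a running total of received integers."""

    def __init__(self):
        self.state = {"total": 0, "received": 0}
        # In-memory list standing in for a message log on stable storage (assumption).
        self.message_log = []
        # A checkpoint is a copy of the state plus the log position at which it was taken.
        self.checkpoint = (copy.deepcopy(self.state), 0)

    def _apply(self, message):
        # Deterministic state transition: the same messages in the same
        # order always produce the same state.
        self.state["total"] += message
        self.state["received"] += 1

    def receive(self, message):
        # Log the message before processing it (pessimistic style).
        self.message_log.append(message)
        self._apply(message)

    def take_checkpoint(self):
        self.checkpoint = (copy.deepcopy(self.state), len(self.message_log))

    def recover(self):
        # Restore the checkpointed state, then replay every message
        # logged after the checkpoint in its original order.
        saved_state, log_index = self.checkpoint
        self.state = copy.deepcopy(saved_state)
        for message in self.message_log[log_index:]:
            self._apply(message)


def demo():
    process = LoggedProcess()
    for message in [5, 7]:
        process.receive(message)
    process.take_checkpoint()
    for message in [11, 13]:
        process.receive(message)
    state_before_failure = copy.deepcopy(process.state)
    process.state = None  # simulate a failure that loses volatile state
    process.recover()
    return state_before_failure, process.state
```

Calling demo() returns the state just before the simulated failure and the state after recovery. Because the process is deterministic and every message received after the checkpoint was logged, the two states are equal.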

Research Papers:

David B. Johnson, dbj@cs.cmu.edu. Last modified February 20, 1996.