Distributed System Fault Tolerance Research

In a distributed system, a long-running distributed application program can fail completely if even one of the processors on which it is executing fails. My work in this area involves the development of transparent, low-overhead mechanisms that allow distributed application programs to survive such failures. My Ph.D. thesis studied the theory, implementation, and performance of transparent rollback recovery methods based on message logging and checkpointing, and I continue to be active in distributed system fault tolerance research.

I have primarily investigated optimistic rollback recovery methods based on combinations of message logging and checkpointing. Message logging and checkpointing methods allow very low-overhead fault tolerance support, typically adding less than 1 percent overhead to the execution time of many distributed application programs. My research has included the development of a model for reasoning about and proving the correctness of these systems, the design of several new logging and recovery algorithms and techniques, and the implementation and performance measurement of pessimistic and optimistic message logging. I have also worked on transparent rollback recovery based on consistent checkpointing, and on recoverable distributed shared memory.
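
The general idea behind combining message logging with checkpointing can be illustrated with a small single-process sketch. It is only a toy model under stated assumptions, not any of the protocols from the papers listed below: the process is assumed to be deterministic, the checkpoint and the message log are assumed to survive a failure (as if kept on stable storage), and each message is logged synchronously before it is applied, which is closer to pessimistic than to optimistic logging. The class and method names (LoggedProcess, deliver, take_checkpoint, crash, recover) are hypothetical and chosen for illustration.

```python
import copy


class LoggedProcess:
    """Toy deterministic process whose state is a running total of
    the integer messages it has received."""

    def __init__(self):
        self.state = {"total": 0, "received": 0}
        # Assumption: the checkpoint and the log survive a crash,
        # as if they were written to stable storage.
        self.checkpoint = copy.deepcopy(self.state)
        self.log = []  # messages received since the last checkpoint

    def _apply(self, msg):
        # Deterministic state transition driven only by the message.
        self.state["total"] += msg
        self.state["received"] += 1

    def deliver(self, msg):
        # Log first, then apply (synchronous logging; an optimistic
        # scheme would instead log asynchronously).
        self.log.append(msg)
        self._apply(msg)

    def take_checkpoint(self):
        # Save the current state; earlier log entries are no longer needed.
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()

    def crash(self):
        # Simulate a failure: the volatile state is lost.
        self.state = None

    def recover(self):
        # Roll back to the checkpoint, then replay the logged messages
        # in their original receipt order.
        self.state = copy.deepcopy(self.checkpoint)
        for msg in self.log:
            self._apply(msg)


p = LoggedProcess()
p.deliver(3)
p.deliver(4)
p.take_checkpoint()
p.deliver(5)
before_failure = copy.deepcopy(p.state)
p.crash()
p.recover()
print(p.state == before_failure)
```

Because the process is deterministic, replaying the logged messages on top of the checkpoint reproduces the state that existed before the failure. Checkpointing bounds how much of the log has to be kept and replayed.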

Research Papers:

Mootaz Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. A Survey of Rollback-Recovery Protocols in Message-Passing Systems. Technical Report CMU-CS-99-148, School of Computer Science, Carnegie Mellon University, June 1999.
Sean W. Smith and David B. Johnson. Minimizing Timestamp Size for Completely Asynchronous Optimistic Recovery with Minimal Rollback. In Proceedings of the 15th IEEE Symposium on Reliable Distributed Systems, pp. 66-75, IEEE Computer Society, Niagara-on-the-Lake, Ontario, Canada, October 1996.
Sean W. Smith, David B. Johnson, and J. D. Tygar. Completely Asynchronous Optimistic Recovery with Minimal Rollbacks. In The 25th Annual International Symposium on Fault-Tolerant Computing: Digest of Papers, IEEE Computer Society, Pasadena, CA, June 1995.
David B. Johnson. Efficient Transparent Optimistic Rollback Recovery for Distributed Application Programs. In Proceedings of the 12th Symposium on Reliable Distributed Systems, pp. 86-95, IEEE Computer Society, Princeton, NJ, October 1993.
John B. Carter, Alan L. Cox, Sandhya Dwarkadas, Elmootazbellah N. Elnozahy, David B. Johnson, Pete Keleher, Steven Rodrigues, Weimin Yu, and Willy Zwaenepoel. Network Multicomputing Using Recoverable Distributed Shared Memory. In Digest of Papers: COMPCON Spring 1993, The Thirty-Eighth IEEE Computer Society International Conference, pp. 519-527, San Francisco, CA, February 1993.
Elmootazbellah Nabil Elnozahy, David B. Johnson, and Willy Zwaenepoel. The Performance of Consistent Checkpointing. In Proceedings of the 11th Symposium on Reliable Distributed Systems, pp. 39-47, IEEE Computer Society, Houston, TX, October 1992.
David B. Johnson and Willy Zwaenepoel. Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing. Journal of Algorithms, 11(3):462-491, September 1990. Revised version of a paper presented at the Seventh Annual ACM Symposium on Principles of Distributed Computing, Toronto, Ontario, Canada, August 1988.
David B. Johnson and Willy Zwaenepoel. Transparent Optimistic Rollback Recovery. In Proceedings of the Fourth ACM SIGOPS European Workshop: Fault Tolerance Support in Distributed Systems, Bologna, Italy, September 1990. Reprinted in ACM SIGOPS Operating Systems Review, 25(2):99-102, April 1991.
David B. Johnson, Peter J. Keleher, and Willy Zwaenepoel. A Simple Algorithm for Finding the Maximum Recoverable System State in Optimistic Rollback Recovery Methods. Technical Report Rice COMP TR90-125, Department of Computer Science, Rice University, July 1990.
David B. Johnson and Willy Zwaenepoel. Output-Driven Distributed Optimistic Message Logging and Checkpointing. Technical Report Rice COMP TR90-118, Department of Computer Science, Rice University, May 1990.
David B. Johnson and Willy Zwaenepoel. Distributed System Fault Tolerance Using Sender-Based Message Logging. Technical Report Rice COMP TR90-119, Department of Computer Science, Rice University, May 1990.
David B. Johnson. Distributed System Fault Tolerance Using Message Logging and Checkpointing. Ph.D. thesis, Rice University, December 1989. Also Technical Report Rice COMP TR89-101, Department of Computer Science, Rice University, December 1989.
David B. Johnson and Willy Zwaenepoel. Sender-Based Message Logging. In The Seventeenth Annual International Symposium on Fault-Tolerant Computing: Digest of Papers, IEEE Computer Society, pp. 14-19, Pittsburgh, PA, July 1987.

David B. Johnson, dbj@cs.cmu.edu. Last modified February 20, 1996.