The Manetho Project

Welcome to the Manetho WWW page. The Manetho project started at Rice University in 1989 as a prototype system to investigate the issues of providing reliability in network multicomputers. The prototype implementation was completed in 1993. For information, contact Mootaz.

Publications

Description

The Manetho system addresses the problem of providing low-overhead fault tolerance in distributed systems, with emphasis on high performance during failure-free operation. This problem is especially prominent in network multicomputing, where a number of powerful workstations connected by a high-speed network offer the processing capacity to run compute-intensive applications. In such an environment, it is important to provide fault tolerance without affecting the failure-free performance of the system. Manetho provides application-transparent fault tolerance to long-running distributed computations. It is based on maintaining an antecedence graph, a technique that allows rollback-recovery to coexist in the same system with active process replication. Thus, inexpensive rollback-recovery is used for client processes, while active process replication is used for server processes where high availability is required. Combining rollback-recovery and process replication allows the system to accommodate different application requirements. This differs from previous work, where a single method is used to provide fault tolerance for all processes.

Key Contributions

Manetho has been implemented on an Ethernet network that connects 16 workstations running the V system. The main contributions and results of the thesis are:

The Name

The old Egyptian civilization had no exact system of chronology. The priests usually dated events according to the years of a king's reign. For this purpose, several lists of kings were maintained at the various temples throughout Egypt. Some of these lists survived the decline of the old Egyptian empire. The priest Manetho (Ma-Net-Ho) lived during the reign of the Ptolemies, circa 300 B.C., at the center of the Nile Delta in Sebennytus, a place now called Samannud. He collected the lists that were preserved in the various temples and used them to write the history of Egypt in a three-volume book. This book remained the authentic source for Egypt's history for several centuries until it was lost, probably during the fire of the library of Alexandria (circa 390 A.D.).

The operation of the system metaphorically resembles what Manetho did to restore the history of Egypt. Like old Egypt, the system does not have access to an exact, global time service. Each individual process maintains information about its perception of the system's execution history, similar to what the priests of old Egypt did. If a failure occurs, a protocol collects the fragments of the system's execution history from the individual processes, and like Manetho, restores the full history of the system. This history is used to recover from the failure.
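The idea of rebuilding one history from per-process fragments, without a global clock, can be illustrated with a small Python sketch. It is not Manetho's actual recovery protocol or its antecedence graph format. The event naming (process name plus local sequence number), the fragment layout, and the functions merge_fragments and restore_history are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Assumption: an event is (process name, local sequence number), and a
# fragment maps each event to the set of events that one process knows
# to have directly preceded it. No timestamps are involved.


def merge_fragments(fragments):
    """Union the per-process fragments into a single dependency map."""
    merged = {}
    for fragment in fragments:
        for event, predecessors in fragment.items():
            merged.setdefault(event, set()).update(predecessors)
    return merged


def restore_history(fragments):
    """Return one ordering of all known events that respects every
    recorded "happened before" dependency."""
    merged = merge_fragments(fragments)
    return list(TopologicalSorter(merged).static_order())


# Process p performed two events; process q's events depend on them,
# for example because q received messages that p sent.
fragment_p = {("p", 1): set(), ("p", 2): {("p", 1)}}
fragment_q = {("q", 1): {("p", 1)}, ("q", 2): {("q", 1), ("p", 2)}}

history = restore_history([fragment_p, fragment_q])
print(history)
```

Several orderings may satisfy the same dependencies, so the sketch guarantees only that each event appears after every event it depends on.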

Mootaz Elnozahy, mootaz@cs.cmu.edu