Event logs provide an abundance of information about the health of a computing system. Previous studies have shown that definite trends precede many failures and crashes. By designing a system to monitor the event log and detect these trends, it is possible to predict failures and reconfigure systems before catastrophic events occur. The volume of data in an event log, however, makes real-time analysis by hand infeasible for prediction purposes. Automated analysis must be used, and methods of reducing the amount of data must be found. Tupling techniques are used in this report to group related events in the event logs, reducing the information to a manageable size; reductions of one to two orders of magnitude are typical with tupling algorithms.
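The abstract does not spell out the tupling algorithm itself; a common formulation groups events that fall within a fixed time window of one another. The sketch below is a minimal illustration of that idea in Python; the `Event` fields, the five-minute window, and the function name are assumptions chosen for illustration, not the report's actual implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    timestamp: float   # seconds since some epoch
    device: str        # reporting device or subsystem
    code: int          # event type code

def tuple_events(events: List[Event], window: float = 300.0) -> List[List[Event]]:
    """Coalesce a time-ordered event stream into tuples: an event joins the
    current tuple if it occurs within `window` seconds of the previous event,
    otherwise it starts a new tuple.  (Illustrative only; the report's tupling
    algorithms may use different criteria.)"""
    tuples: List[List[Event]] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if tuples and ev.timestamp - tuples[-1][-1].timestamp <= window:
            tuples[-1].append(ev)   # within the window: extend the current tuple
        else:
            tuples.append([ev])     # gap exceeded: start a new tuple
    return tuples
```

A burst of many related entries inside one window collapses to a single tuple, which is how reductions of one to two orders of magnitude can arise on logs where a failure produces a burst of correlated events.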
The present work analyzes event log behavior, comparing uni-processor and multi-processor systems. Data from thirteen VAX-780s were used for the uni-processor analysis, and data from five Tandem-TNS IIs were used for the multi-processor analysis. Individual processors of the same make were found to behave similarly on a processor-by-processor basis, while multiple processors configured into fault-tolerant systems generally exhibited significantly different behaviors. The logs were compared with simulated event logs as well as with a statistical model to evaluate several aspects of the tupling techniques. Reliability models for hypothetical systems incorporating failure prediction are presented to assess potential gains in availability. Significant decreases in downtime were found to be achievable with no hardware expenditure.
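The form of those reliability models is not given in this abstract; as a hedged illustration of why failure prediction can raise availability, the sketch below uses a simple steady-state model in which a fraction of failures is predicted in advance and handled by a quick reconfiguration instead of a full repair. Every parameter name and value here is hypothetical, not taken from the report.

```python
def availability(mttf: float, mttr: float,
                 coverage: float = 0.0, reconfig: float = 0.0) -> float:
    """Steady-state availability when a fraction `coverage` of failures is
    predicted and handled by reconfiguration (taking `reconfig` hours)
    rather than incurring the full repair time `mttr` (hours).
    Illustrative model only."""
    mean_outage = coverage * reconfig + (1.0 - coverage) * mttr
    return mttf / (mttf + mean_outage)

# Hypothetical numbers: MTTF = 1000 h, MTTR = 4 h.  Predicting half of all
# failures and reconfiguring in 0.2 h cuts the mean outage per failure
# from 4 h to 2.1 h, nearly halving downtime with no added hardware.
print(availability(1000, 4))            # ~0.99602
print(availability(1000, 4, 0.5, 0.2))  # ~0.99790
```

Under such a model, the availability gain comes entirely from shortening the effective outage per failure, which matches the abstract's claim that downtime can drop significantly without any hardware expenditure.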