Carnegie Mellon Dependable Systems Laboratory

RESEARCH GOALS. Our goal is to design and build computer systems that are safe and robust against various kinds of faults, including malicious faults from information-warfare attacks. In our high-tech culture of tightly coupled computer systems that comprise much of the national critical infrastructure (e.g., energy, communications, and finance), computer failure, even on a small scale, could be disastrous to the nation's economic and security interests. Understanding how and why computers fail, as well as what can be done to prevent or tolerate failure, is the main thrust of our research. This includes failures due to human operators, as well as failures due to design flaws or to information-warfare operations. The consequences of computer failure can be enormous, including loss of data integrity or confidentiality, loss of life, or loss of revenue that can exceed tens of thousands of dollars an hour; by making systems more dependable, such consequences can be avoided or mitigated.

COMPUTER SYSTEM FAILURE. Computers have always been subject to hardware failures, most of which can be mitigated through the use of fault tolerance. Because the mean time between hardware failures in modern commodity-type computers now exceeds 40,000 hours, software failures appear to dominate undependability statistics. As progress is made in solving the problem of software reliability, the dominant cause of outages is likely to shift to those that are operator-induced, suggesting that the failures reside in the user interface. Sometimes these modes are mixed, such as when a software fault is the result of a programmer's failure to recognize and handle exceptions correctly. More recently, the growth in scale and complexity of computer systems has exacerbated the problem of making systems of systems robust to failure. The Internet, for example, not only is subject to traditional failures but also facilitates malicious attacks such as denial of service, information theft, and intrusion and corruption by unauthorized users. Because corporations and nations have come to depend so heavily on computer-based information and infrastructure, some pundits have claimed that the next great war will be fought in information space, where foreign operatives use the Internet or other means in attempts to steal corporate secrets, disable regional power grids, cripple a nation's telephone system, or destroy a nation's military command and control system.

PROJECTS

FAILURE DETECTION, DIAGNOSIS AND COMPENSATION. Detecting and interpreting unanticipated failures (failures that no one ever thought of during system design), including human interaction failures, is a focus for our research activities. We are building systems that cope with failures by learning about the characteristics of their own environments. One example lies in semiconductor wafer fabrication, possibly the most complex manufacturing process known to mankind; we conduct real-time fault-injection, detection and diagnosis experiments in an operational semiconductor fabrication plant. A second example is the automatic detection of anomalous conditions that are indicative of information warfare activities.

INFORMATION WARFARE. Our work in intrusion detection is closely related to our work in fault detection and diagnosis, and includes the construction of synthetic environments for validating the intrusion detection systems we field. We regard intrusions as examples of unanticipated anomalous conditions, and we treat them as we do anomalies in hardware and software systems. Key in this work is mapping the performance regions of different types of anomaly detectors so that detectors can be composed in ways that cover the entire information space. We are also developing new kinds of anomaly detectors that can be used not only in intrusion detection, but also in other domains such as electrocardiology and seismology.
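
The idea of composing detectors so that their performance regions jointly cover the space of anomalies can be illustrated with a small sketch. It is not the laboratory's method: it assumes each detector's region has already been mapped to a set of anomaly types it detects reliably, and it uses a plain greedy set-cover heuristic as a stand-in for the composition step. The detector names and anomaly types are hypothetical.

```python
def greedy_cover(universe, detectors):
    """Greedily pick detectors until all anomaly types are covered.

    universe  -- set of anomaly types that should be covered (hypothetical labels)
    detectors -- dict mapping detector name -> set of anomaly types it detects
    Returns (chosen detector names, anomaly types left uncovered).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the detector that covers the most still-uncovered types.
        best = max(detectors, key=lambda name: len(detectors[name] & uncovered))
        gain = detectors[best] & uncovered
        if not gain:
            break  # no detector helps; the remaining types are a coverage gap
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Hypothetical coverage map, as might come out of mapping performance regions.
anomaly_types = {"rare_sequence", "foreign_symbol", "volume_spike", "timing_shift"}
coverage_map = {
    "markov": {"rare_sequence", "foreign_symbol"},
    "threshold": {"volume_spike"},
    "spectral": {"timing_shift", "volume_spike"},
}
chosen, gaps = greedy_cover(anomaly_types, coverage_map)
```

Any anomaly type left in `gaps` marks a region of the space that no available detector covers, which is the kind of blind spot the mapping is meant to expose.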

ATTACKER/DEFENDER TESTBED. Many intrusion-detection algorithms rely on patterns embedded in keystroke or system-call data. It is not certain, however, that different attacks actually manifest themselves in these patterns, and so our reliance on pattern detection is questionable. This project will build a hardware/software testbed for attacking computer systems as well as for defending them. Scripted or red-team attacks will be directed against a universal victim. The outcomes, monitored in system-level data, will be mined for patterns that will be matched against a taxonomy of anomaly types. Detection coverage will be assessed based on the results.

MASQUERADE DETECTION. A masquerader is someone who pretends to be another user while invading the target user's accounts, directories, or files. This project is building systems that will detect the activities of a masquerader by determining that a user's activities violate a profile developed for that user. Profiling is based on various machine-learning and classification techniques.
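
A minimal sketch of profile-based masquerade detection follows. It is an illustration only, not the project's classifier: it assumes a user's activity is available as a sequence of shell commands, builds the profile as the set of command pairs seen in that user's history, and flags a session when too many of its command pairs are new. The function names, the pair length, and the 0.5 threshold are all assumptions made for this example.

```python
def build_profile(commands, n=2):
    """Collect the set of command n-grams seen in a user's training history."""
    return {tuple(commands[i:i + n]) for i in range(len(commands) - n + 1)}


def novelty_score(profile, session, n=2):
    """Fraction of the session's n-grams that never occurred in the profile."""
    grams = [tuple(session[i:i + n]) for i in range(len(session) - n + 1)]
    if not grams:
        return 0.0  # session too short to judge
    unseen = sum(1 for g in grams if g not in profile)
    return unseen / len(grams)


def is_masquerade(profile, session, threshold=0.5, n=2):
    """Flag the session if its novelty exceeds the (assumed) threshold."""
    return novelty_score(profile, session, n) > threshold


# Hypothetical command history for one user, and two sessions to check.
history = ["ls", "cd", "ls", "vi", "make", "ls", "cd", "vi"]
profile = build_profile(history)
usual_session = ["ls", "cd", "vi", "make"]
odd_session = ["nmap", "ssh", "scp", "ls", "cd"]
```

Here `usual_session` contains only command pairs already in the profile, while three of the four pairs in `odd_session` are new, so only the latter is flagged. A practical system would replace the fixed threshold with a trained classifier and a richer activity model.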

TESTING AND EVALUATION OF DETECTION SYSTEMS. Determining exactly how well a system (e.g., an intrusion-detection system) works is a difficult undertaking. We are exploring diverse ways of calibrating how good a system is. One way is by using benchmarking techniques that employ both real and synthesized workload data (see SYNTHETIC-DATA ENVIRONMENT below). A second is through the use of dependability cases (see DEPENDABILITY CASES below). Finally, carefully controlled test and measurement technology is essential, including data monitoring and sampling techniques, instrumentation and measurement, and robust experimental designs and methods. This project works in collaboration with government and industry to establish dependable methods of assessing intrusion and other detection systems.

DATA MINING: STRUCTURE DISCOVERY IN LARGE, UNCONTROLLED DATA SETS. Detection or diagnosis of faults usually necessitates recognizing known patterns or structures embedded in monitored data. If the system environment is novel, as it is for most new products, new medical patients, or new information-warfare attacks, then few known patterns will exist. Patterns must first be discovered before they can be employed in detection processes. This project seeks to understand how patterns can be discovered in large data sets monitored from such applications as manufacturing process control, networked communications, medical operating rooms, international banking transactions, intrusion-detection systems and others.

SYNTHETIC-DATA ENVIRONMENT. How do we gain confidence in a system's ability to detect failures, anomalies or performance perturbations? One method is by synthetic fault injection. This project's goal is to build a synthetic environment that can be used for validating algorithms for fault/anomaly/intrusion detection. It will be able to replicate environmental conditions faithfully and repeatably, and will be easy to use for both experts and novices.
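
The role of synthetic fault injection in validating a detector can be sketched as follows. This is a toy illustration under stated assumptions, not the project's environment: the background is bounded uniform noise, faults are spikes added at known positions, and the detector is a simple threshold used as a stand-in for whatever algorithm is under test. All names and parameter values are hypothetical.

```python
import random


def make_stream(length, fault_positions, seed=0, spike=5.0):
    """Generate bounded background noise and inject a spike at each known position."""
    rng = random.Random(seed)  # fixed seed makes every run repeatable
    stream = [rng.uniform(-1.0, 1.0) for _ in range(length)]
    for pos in fault_positions:
        stream[pos] += spike
    return stream


def threshold_detector(stream, limit=3.0):
    """Stand-in detector: report every index whose value exceeds the limit."""
    return {i for i, x in enumerate(stream) if x > limit}


def score(detected, injected, length):
    """Compare detections against the injected ground truth."""
    hits = len(detected & injected)
    false_alarms = len(detected - injected)
    clean = length - len(injected)
    hit_rate = hits / len(injected) if injected else 0.0
    false_alarm_rate = false_alarms / clean if clean else 0.0
    return hit_rate, false_alarm_rate


injected = {10, 250, 777}  # ground truth: where faults were placed
stream = make_stream(1000, injected, seed=42)
detected = threshold_detector(stream)
hit_rate, false_alarm_rate = score(detected, injected, len(stream))
```

Because the injection positions are known and the seed is fixed, the same conditions can be replayed exactly, and hit and false-alarm rates can be computed against ground truth rather than estimated.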

DEPENDABLE SOFTWARE. This work investigates the kinds of software errors committed by human programmers and their causes (e.g., different programmers have different cognitive styles, possibly resulting in different propensities to make certain kinds of errors). We seek to characterize the kinds of mistakes that humans make, and determine ways to overcome the limitations of the human cognitive engine, with the result of improved software. Results to date show a nearly fifty percent reduction in exception-handling errors in program code.

DEPENDABLE USER INTERFACES. Undependable user interfaces are the Achilles' heels of highly dependable systems. Even if a system's hardware and software underpinnings are completely reliable, user-interface errors can cripple or destroy a mission. Our objectives are to mitigate these errors (and corresponding downtime) through careful design of predictably dependable systems, and to provide measurable confidence of dependability in user interfaces. This research is investigating what makes user interfaces (un)dependable, what the sources of human error are, and what the limitations of the human operator are. It seeks an understanding of how such knowledge can be coupled with interfaces that can be depended on to work, as needed, the first time. User and system modeling are major constituents of ongoing efforts. The research incorporates work on human error, robust evaluation, methodologies for requirements analyses, task and user modeling, design and testing of predictably dependable interfaces, fault tolerance and reliability, quantitative metrics and instrumentation, and empirical and experimental methods.

DEPENDABILITY CASES. We are exploring the use of formal argumentation to support claims of system dependability. One example is justifying safety claims for fly-by-wire or drive-by-wire systems; another is justifying claims that a fault-diagnosis system will handle all unanticipated faults. Evidence is gathered to support claims, and formal arguments are constructed on the basis of the evidence and the methods used to acquire it.