As the size and complexity of modern IT systems increase, there is a greater need for automatic recovery from failures. Recently, self-adaptive control loops have started to replace human oversight as a means of ensuring high availability of software systems. Two critical pieces of the self-adaptive loop for high availability are failure identification and fault localization.
Failure identification, that is, determining that something is not working, is challenging because (1) monitoring is not done at the same abstraction level at which failures manifest themselves, and (2) systems perform several activities concurrently, so incorrect behavior appears mixed with correct behavior. Fault localization, that is, pinpointing the source of a failure, is also challenging because (1) there may be multiple explanations for a failure and (2) diagnosis must be performed in a useful time frame.
In this thesis, we propose to improve self-diagnosis through a framework that allows a running system to identify failures and pinpoint the corresponding faulty parts. This framework is based on two key principles: reasoning about the system's behavior at the software architecture level and providing a declarative approach to describing system behavior. The use of architectural models allows the diagnostic infrastructure to scale gracefully, supports efficient run-time execution of common fault localization algorithms, and supports failure diagnosis of system-level properties such as end-to-end performance. The use of a declarative approach to behavior allows one to systematically specify rules for bridging the gap between low-level monitoring and higher-level problem detection. It also supports reuse across systems that share a common architectural style.
David Garlan (Chair)
Mario Zenha-Rela (Co-Chair, Universidade de Coimbra, Portugal)
Antónia Lopes (Universidade de Lisboa, Portugal)
Raul Barbosa (Universidade de Coimbra, Portugal)
Rui Abreu (Universidade do Porto, Portugal)
cherold [atsymbol] cs.cmu.edu