17-811 Self-Healing Systems: Class Discussion Summary

David Garlan
Spring Semester 2003

Summary of Class Discussion for April 14 by Justin Kenlon
 

Graceful Degradation & fitting SHS concepts with work already done
How do we engineer this/what really needs to be figured out?
Taxonomies: from "old school" dependability to "new school" adaptability
                    refrain from gratuitous differences (e.g. vendor taxonomies)
        How do the "old" hardware concepts map to SE?
                fault containment
                   solution space
                phases of fault-tolerance and synchronization
                hardware and software defects (n-version software)
                best practices from an existing related field

***2nd paper***
        Emphasis: Fault Models
             (between papers: Paper I was all about "black and white" views;
                                         Paper II was "shades of gray" with looser notions of correctness)
        Degradation is concerned with a more general model than the binary working/not working
        resource axes included, not just performance or specification conformity
        QoS
            this was a minimal model of degradation; service quality does not explain the system functionality as we are talking about it.  QoS is different from adapting the system's make-up: it changes the functions of the components, but not the relationships, connections, etc. that make up a structure or architecture, nor the stated roles and abilities of a system.
            Assumptions:
                perhaps 1 Major and N minor faults
                never with an error during re-integration or recovery
                does not account for bursty behavior
            FAULT MODELS
                not every healing mechanism heals everything (grayness)
                what can go wrong?
                what does the failure affect?
                what happens to resources, components, communications, data?
             what about design-time incompleteness? upgrades? new components?
                does degradation imply recoverability? should it?  
            what about accounting for the timing of faults vs. the timing of recovery?
            Paper II was a checklist or guideline, not a taxonomy

        David:  identification of patterns
                             external, internal, hierarchical, peer-to-peer, blackboard;
                            new patterns and engineering trade-offs involved
        supervisor-control with master-level child-replication - doesn't scale
        bottom up... we have to account for where the complexity goes (it doesn't go away, it's inherent)
        big big issue: we need evaluation metrics for FT, for adaptability, for and across fault models

        Owen: who watches the watchers, who votes the voters?
        Note: stay away from Airbus 380 aircraft (firewalls and IP based flight-control)

Paper 3
        David: is graceful degradation in the eye of the beholder?
            the way it is now, it is in the eye of the designer; utility metrics are based on designer's view.
            "some people may be fine driving a car without brakes"
            bringing up the point of other quality attributes and other stakeholders in designing for GD
            Usability metrics?
            applying social utility theory to stakeholders (Pareto optimality of degradation solution space for resources AND specification (floor, extended, lofty) compliance AND users?)
           
            What about reliance on the human/end user workarounds?
            Again, environmental contexts and where the boundaries are
                    747 auto-pilot, Osprey "reboot" button, self-leveling fighter jet (upside down)
            can the end-user be considered part of the system?  is that self-healing?
                    or are we looking for autonomous self-healing (mostly autonomous?)
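
The Pareto-optimality idea raised above can be made concrete with a small sketch. It assumes each candidate degraded configuration is scored on three axes (resources, specification compliance, users), all normalized so that higher is better. The candidate names and scores are hypothetical, chosen only to illustrate the filtering step; nothing here comes from the papers discussed.

```python
from typing import Dict, Tuple

Score = Tuple[float, float, float]  # (resources, spec compliance, users), higher is better

def dominates(a: Score, b: Score) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates: Dict[str, Score]) -> Dict[str, Score]:
    """Keep only the candidates that no other candidate dominates."""
    return {
        name: score
        for name, score in candidates.items()
        if not any(dominates(other, score)
                   for other_name, other in candidates.items()
                   if other_name != name)
    }

# Hypothetical degraded configurations and scores (assumptions for illustration).
candidates = {
    "A": (0.9, 0.2, 0.3),
    "B": (0.5, 0.6, 0.7),
    "C": (0.4, 0.5, 0.6),  # worse than B on every axis
    "D": (0.2, 0.9, 0.8),
}

front = pareto_front(candidates)
```

Configurations left on the front are the ones where any gain for one stakeholder axis costs something on another; choosing among them is still a judgment about whose utility counts.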

Aside:
        instead of detecting, assuming always in a crisis state
        assumptions: no downtime; runtime costs are acceptable or negligible compared to simple detection; component I and A are the same, and O and M are the same

Automotive application
        no data is the M signal that signifies malfunction
        specify N system variables
        if only N-i of the expected variables are found, degrade service appropriately
        communicate via data values (out of band signals)
       
        within a component, detection may be complex
        on a global level (inter-component anyway) it's so simple
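
A minimal sketch of the scheme above, assuming missing data is the only malfunction signal at the inter-component level. The variable names and the three service levels are hypothetical; the discussion did not specify how many levels there are or which thresholds apply.

```python
from typing import Dict, Optional

# Hypothetical set of N expected system variables (illustration only).
EXPECTED_VARIABLES = ["wheel_speed", "brake_pressure", "steering_angle", "throttle"]

def missing_count(readings: Dict[str, Optional[float]]) -> int:
    """Count expected variables with no data; no data is the malfunction signal."""
    return sum(1 for name in EXPECTED_VARIABLES if readings.get(name) is None)

def service_level(readings: Dict[str, Optional[float]]) -> str:
    """Map the number of missing variables (i) to a service level.

    The thresholds below are assumptions, not taken from the discussion.
    """
    i = missing_count(readings)
    if i == 0:
        return "full"
    if i == 1:
        return "reduced"
    return "limp-home"

all_present = {"wheel_speed": 12.0, "brake_pressure": 2.1,
               "steering_angle": 0.0, "throttle": 0.3}
one_missing = {"wheel_speed": 12.0, "brake_pressure": 2.1,
               "steering_angle": 0.0}
```

The global check stays a simple count, which matches the point that detection may be complex inside a component but simple between components.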

Trusting higher level authorities -> continue working until you hear? (space shuttle)
                                                    until you don't hear?
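
The two policies in the question can be contrasted in a short sketch. The function names and the timeout parameter are hypothetical, and the sketch makes no claim about which policy the space shuttle or any other real system uses.

```python
def keep_running_until_told(stop_received: bool) -> bool:
    """Policy 1: continue working until the higher-level authority says stop.

    Silence from the authority is treated as permission to continue.
    """
    return not stop_received

def keep_running_while_heard(last_heartbeat: float, now: float, timeout: float) -> bool:
    """Policy 2: continue working only while the authority is still heard from.

    Silence longer than `timeout` seconds is treated as a reason to stop or degrade.
    """
    return (now - last_heartbeat) <= timeout

# With a silent authority the two policies disagree:
silent_policy_1 = keep_running_until_told(stop_received=False)
silent_policy_2 = keep_running_while_heard(last_heartbeat=0.0, now=10.0, timeout=2.0)
```

The difference is which failure each policy tolerates: the first keeps running if the authority is lost, the second stops a component that may have been orphaned.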
Hierarchically, especially with externally visible subsystems (brakes, steering)
            at a lower level, what's their utility?
            what does this utility say to the configuration management?
            what kind of CM is there? (tow truck or FT redundant.. or....)
            orthogonality and linear composition are important assumptions
            Loose coupling is generally applicable
            big challenge: automating utility calculations and utility function evolution
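
Under the orthogonality and linear-composition assumptions noted above, system utility can be sketched as a weighted sum of subsystem utilities. The subsystem names, weights, and candidate configurations below are hypothetical, and the weights stand in for the designer's view of utility.

```python
from typing import Dict

# Hypothetical designer-assigned weights (assumptions for illustration).
WEIGHTS: Dict[str, float] = {"brakes": 0.5, "steering": 0.3, "radio": 0.2}

def system_utility(utilities: Dict[str, float]) -> float:
    """Linear composition: sum of weight * subsystem utility.

    Assumes subsystems are orthogonal, so no interaction terms are needed.
    A subsystem absent from `utilities` contributes zero.
    """
    return sum(w * utilities.get(name, 0.0) for name, w in WEIGHTS.items())

def best_configuration(configs: Dict[str, Dict[str, float]]) -> str:
    """Pick the feasible configuration with the highest composed utility."""
    return max(configs, key=lambda name: system_utility(configs[name]))

# Hypothetical feasible configurations after a fault.
configs = {
    "drop_radio": {"brakes": 1.0, "steering": 1.0, "radio": 0.0},
    "degrade_steering": {"brakes": 1.0, "steering": 0.5, "radio": 1.0},
}

choice = best_configuration(configs)
```

If the subsystems are not orthogonal, for example steering utility depending on whether the brakes work, the weighted sum no longer holds and the utility function has to model the interaction explicitly.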

Main Ideas:
        Fault Models
        Not reinventing the wheel
        Low-level complexity and high-level simplicity (abstraction!)
        Graceful Degradation based on utility