17-811 Self-Healing Systems: Class Discussion Summary
David Garlan
Spring Semester 2003
Summary of Class Discussion for April 14 by Justin Kenlon
Graceful Degradation & fitting SHS concepts with work already done
How do we engineer this/what really needs to be figured out?
Taxonomies: from "old school" dependability to "new school" adaptability
refrain from gratuitous differences (i.e. vendor taxonomies)
How do the "old" hardware concepts map to SE?
fault containment
solution space
phases of fault-tolerance and synchronization
hardware and software defects (n-version software)
best practices from an existing related field
***2nd paper***
Emphasis: Fault Models
(between papers: paper 1 was all about "black and white" views;
paper 2 was "shades of gray", with looser notions of correctness)
Degradation is concerned with a more general model than the binary working/not working
resource axes included, not just performance or specification conformity
QoS
this was a minimal model of degradation; service quality does not explain the system functionality
as we are talking about it. QoS is different from system make-up adaptability:
it changes the functions of the components, but not the relationships, connections,
etc. that make up a structure or architecture, or the stated roles and abilities
of a system.
Assumptions:
perhaps 1 Major and N minor faults
never with an error during re-integration or recovery
does not account for bursty behavior
FAULT MODELS
not every healing mechanism heals everything (grayness)
what can go wrong?
what does the failure affect?
what happens to resources, components, communications, data?
what about design-time incompleteness? upgrades? new components?
does degradation imply recoverability? should it?
what about accounting for the timing of faults vs the timing of recovery?
Paper II was a checklist or guideline, not a taxonomy
David: identification of patterns
external, internal, hierarchical, peer-to-peer, blackboard;
new patterns and engineering trade-offs involved
supervisor-control with master-level child-replication - doesn't scale
bottom up... we have to account for where the complexity goes
(it doesn't go away, it's inherent)
big big issue: we need evaluation metrics
for FT, for adaptability, for and across fault models
Owen: who watches the watchers, who votes the voters?
Note: stay away from Airbus 380 aircraft
(firewalls and IP based flight-control)
Paper 3
David: is graceful degradation in the eye of the beholder?
the way it is now, it is in the eye of the designer; utility metrics are based on the designer's view.
"some people may be fine driving a car without brakes"
bringing up the point of other quality attributes and other stakeholders in designing for GD
Usability metrics?
applying social utility theory to stakeholders (Pareto optimality of degradation solution
space for resources AND specification (floor, extended, lofty) compliance AND users?)
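A minimal sketch of what Pareto optimality over the degradation solution space could look like, assuming each candidate degraded configuration is scored on three axes (resources, specification compliance, user satisfaction) with higher being better. The option names and scores are hypothetical and are not taken from the papers.

```python
from typing import Dict, List, Tuple

Score = Tuple[float, ...]  # assumed order: (resources, spec compliance, users)

def dominates(a: Score, b: Score) -> bool:
    """a dominates b if it is at least as good on every axis and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(options: Dict[str, Score]) -> List[str]:
    """Return the names of options that no other option dominates, sorted by name."""
    return sorted(
        name
        for name, score in options.items()
        if not any(
            dominates(other, score)
            for other_name, other in options.items()
            if other_name != name
        )
    )

# Hypothetical degraded configurations and scores.
options = {
    "A": (3.0, 1.0, 2.0),
    "B": (1.0, 3.0, 2.0),
    "C": (1.0, 1.0, 1.0),
    "D": (3.0, 1.0, 1.0),
}
front = pareto_front(options)
```

Options on the front are the ones where improving one stakeholder's axis requires giving something up on another; choosing among them is still a judgment about whose view counts.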
What about reliance on the human/end user workarounds?
Again, environmental contexts and where the boundaries are
747 auto-pilot, Osprey "reboot" button, self-leveling fighter jet (upside down)
can the end-user be considered part of the system? is that self-healing?
or are we looking for autonomous self-healing (mostly autonomous?)
Aside:
instead of detecting, assume the system is always in a crisis state
assumptions: no downtime, and runtime costs are acceptable or negligible compared to
simple detection, and component I and A are the same and O and M are the same
Automotive application
no data is the M signal that signifies malfunction
specify N system variables
if only N-i of the expected values are found, degrade service appropriately
communicate via data values (out of band signals)
within a component, detection may be complex
on a global level (inter-component anyway) it's so simple
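A minimal sketch of the automotive idea above, assuming missing data (None) is the malfunction signal and that the number of missing variables selects a service level. The variable names, the thresholds, and the three service levels are hypothetical; the discussion only says "N system variables" and "degrade service appropriately".

```python
from typing import Mapping, Optional

# Hypothetical variable names standing in for the N expected system variables.
EXPECTED = ("wheel_speed", "brake_pressure", "steering_angle", "throttle")

def missing_count(readings: Mapping[str, Optional[float]]) -> int:
    """Count expected variables with no data; absence is the malfunction signal."""
    return sum(1 for name in EXPECTED if readings.get(name) is None)

def service_level(readings: Mapping[str, Optional[float]]) -> str:
    """Map the number of missing variables (i) to an illustrative service level."""
    i = missing_count(readings)
    if i == 0:
        return "full"
    if i == 1:
        return "degraded"
    return "limp-home"
```

The global check stays simple because it only looks at whether values arrived; how a component decides to withhold a value can be as complex as it needs to be internally.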
Trusting higher level authorities -> continue working until you hear?
(space shuttle)
until you don't hear?
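The two trust policies can be contrasted in a short sketch, assuming a hypothetical subsystem that receives heartbeat and stop messages from a higher-level authority. The class, its method names, and the timeout value are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Subsystem:
    """Hypothetical subsystem supervised by a higher-level authority."""
    timeout: float                          # assumed heartbeat timeout, in seconds
    last_heartbeat: Optional[float] = None  # time of the most recent heartbeat
    stop_received: bool = False

    def on_heartbeat(self, now: float) -> None:
        self.last_heartbeat = now

    def on_stop(self) -> None:
        self.stop_received = True

    def run_until_told(self, now: float) -> bool:
        """Policy 1: keep working until you hear a stop (now is unused here)."""
        return not self.stop_received

    def run_while_heard(self, now: float) -> bool:
        """Policy 2: keep working until you stop hearing heartbeats."""
        if self.stop_received or self.last_heartbeat is None:
            return False
        return (now - self.last_heartbeat) <= self.timeout
```

Under the first policy a silent (possibly failed) authority leaves the subsystem running; under the second, silence itself causes the subsystem to stop.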
Hierarchically, especially with externally visible subsystems (brakes, steering)
at a lower level, what's their utility?
what does this utility say to the configuration management?
what kind of CM is there? (tow truck or FT redundant.. or....)
orthogonality and linear composition are important assumptions
Loose coupling is generally applicable
big challenge: automating utility calculations and utility function evolution
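A minimal sketch of what the orthogonality and linear composition assumptions buy, assuming system utility is a weighted sum of independent per-feature utilities. The feature names, weights, and configurations are hypothetical, and the weights reflect one designer's view.

```python
from typing import Dict, Mapping

def system_utility(weights: Mapping[str, float], utilities: Mapping[str, float]) -> float:
    """Weighted sum of per-feature utilities; features that are absent contribute 0."""
    return sum(w * utilities.get(name, 0.0) for name, w in weights.items())

def best_configuration(
    weights: Mapping[str, float],
    configurations: Mapping[str, Mapping[str, float]],
) -> str:
    """Return the name of the candidate configuration with the highest utility."""
    return max(configurations, key=lambda name: system_utility(weights, configurations[name]))

# Hypothetical designer-assigned weights and candidate configurations.
weights = {"brakes": 0.5, "steering": 0.25, "radio": 0.25}
configurations: Dict[str, Dict[str, float]] = {
    "full": {"brakes": 1.0, "steering": 1.0, "radio": 1.0},
    "no_radio": {"brakes": 1.0, "steering": 1.0},
    "no_brakes": {"steering": 1.0, "radio": 1.0},
}
```

If features interact (so utilities are not orthogonal), a plain sum like this no longer holds, which is part of why automating the calculation is listed as a challenge.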
Main Ideas:
Fault Models
Not reinventing the wheel
Low-level complexity and high-level simplicity (abstraction!)
Graceful Degradation based on utility