Innovations in Systems and Software Engineering, Volume 9, Issue 4, pp 293–318

Deliberative, search-based mitigation strategies for model-based software health management

  • Nagabhushan Mahadevan
  • Abhishek Dubey
  • Daniel Balasubramanian
  • Gabor Karsai
SI:SwHM

Abstract

Rising software complexity makes it very difficult to analyze aerospace systems and prepare them for all possible fault scenarios at design time; therefore, classical run-time fault tolerance techniques such as self-checking pairs and triple modular redundancy are used. However, several recent incidents have made it clear that existing software fault tolerance techniques alone are not sufficient. Improving system dependability requires simpler, yet formally specified and verified, run-time monitoring, diagnosis, and fault mitigation capabilities. Such capabilities are already in use for managing the health of vehicles and systems; software health management applies the same techniques to software systems. In this paper, we briefly describe the software health management techniques and architecture developed by our research group. The foundation of the architecture is a real-time component framework (built upon ARINC-653 platform services) that defines a model of computation for software components. Two dedicated architectural elements, the Component Level Health Manager (CLHM) and the System Level Health Manager (SLHM), provide the health management services: anomaly detection, fault source isolation, and fault mitigation. The SLHM includes a diagnosis engine that (1) uses a Timed Failure Propagation Graph (TFPG) model derived from the component assembly model, (2) reasons about cascading fault effects in the system, and (3) isolates the fault source component(s). The appropriate system-level mitigation action is then taken. The main focus of this article is the fault mitigation architecture, which uses goal-based deliberative reasoning to determine the best mitigation actions for recovering the system from the identified failure mode.
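
The three diagnosis steps listed above can be illustrated with a minimal sketch. It is not the authors' diagnosis engine: it assumes a simplified propagation graph in which every discrepancy activates as soon as any one parent propagates to it, it ignores missing or false alarms, and all node names, delay bounds, and function names (`EDGES`, `windows`, `isolate`) are hypothetical. Each edge carries a minimum and maximum propagation delay, and a failure mode is kept as a candidate only if a single fault onset time can explain every observed alarm.

```python
import math

# Hypothetical propagation graph (names and delays are illustrative only).
# node -> list of (child, min_delay, max_delay)
EDGES = {
    "FM_sensor_stale": [("D_bad_reading", 0, 2)],
    "FM_filter_crash": [("D_no_estimate", 0, 1)],
    "D_bad_reading": [("D_no_estimate", 1, 3)],
    "D_no_estimate": [("D_display_frozen", 1, 4)],
}
FAILURE_MODES = ["FM_sensor_stale", "FM_filter_crash"]


def windows(source):
    """Earliest/latest activation offsets of nodes reachable from `source`.

    Assumes every node activates when ANY parent propagates to it, so both
    bounds are minima over the incoming paths.
    """
    win = {source: (0, 0)}
    frontier = [source]
    while frontier:
        node = frontier.pop()
        lo, hi = win[node]
        for child, t_min, t_max in EDGES.get(node, []):
            old = win.get(child, (math.inf, math.inf))
            new = (min(old[0], lo + t_min), min(old[1], hi + t_max))
            if new != old:
                win[child] = new
                frontier.append(child)
    return win


def isolate(alarms):
    """Return failure modes that can explain every alarm in {node: time}."""
    if not alarms:
        return []
    candidates = []
    for fm in FAILURE_MODES:
        win = windows(fm)
        if not all(d in win for d in alarms):
            continue  # some alarm is unreachable from this failure mode
        # The fault onset t0 must satisfy t - hi <= t0 <= t - lo for each alarm.
        start = max(t - win[d][1] for d, t in alarms.items())
        end = min(t - win[d][0] for d, t in alarms.items())
        if start <= end:
            candidates.append(fm)
    return candidates
```

With these assumed values, `isolate({"D_bad_reading": 10, "D_no_estimate": 12})` keeps only `FM_sensor_stale`, because the other failure mode cannot reach `D_bad_reading`. A mitigation planner would then take the isolated source as its starting point.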

Keywords

  • Software health management
  • Deliberative reasoning
  • Component-based software development
  • ARINC-653

Notes

Acknowledgments

This paper is based on work supported by NASA under award NNX08AY49A. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Aeronautics and Space Administration. The authors would like to thank Dr. Paul Miner, Eric Cooper, and Suzette Person of NASA LaRC for their help and guidance on the project.

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • Nagabhushan Mahadevan
    • 1
  • Abhishek Dubey
    • 1
  • Daniel Balasubramanian
    • 1
  • Gabor Karsai
    • 1
  1. Department of Electrical Engineering and Computer Science, Institute for Software-Integrated Systems, Vanderbilt University, Nashville, USA
