A comparative analysis of hardware and software fault tolerance: Impact on software reliability engineering

Annals of Software Engineering

Abstract

Today's digital systems are growing increasingly complex and are being used in increasingly critical functions. The first trend makes them more prone to contain faults, and the second makes their failure less tolerable. This widening gap highlights the need for fault-tolerant techniques, which make provisions for reliable operation of digital systems despite the presence and occasional manifestation of faults. In this paper, we present a brief comparative survey of fault tolerance as it arises in hardware systems and software systems. We discuss logical models as well as statistical models of fault tolerance, and use these models to analyze design tradeoffs of fault-tolerant systems.

Cite this article

Ammar, H.H., Cukic, B., Mili, A. et al. A comparative analysis of hardware and software fault tolerance: Impact on software reliability engineering. Annals of Software Engineering 10, 103–150 (2000). https://doi.org/10.1023/A:1018987616443

  • DOI: https://doi.org/10.1023/A:1018987616443
