Software-Level Soft-Error Mitigation Techniques

Part of the book series: Frontiers in Electronic Testing (FRET, volume 41)

Abstract

In several application domains, the effects of soft errors on processor-based systems cannot be addressed by acting on the hardware, whether by changing the technology, the components, or the architecture. In these cases, an attractive solution lies in modifying only the software: the ability to detect, and possibly correct, errors is obtained by introducing redundancy in the code and in the data, without modifying the underlying hardware. This chapter provides an overview of the methods based on this technique, outlining their characteristics and summarizing their advantages and limitations.
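
As an illustration of what redundancy in the data can look like, the following minimal Python sketch keeps two copies of a variable and compares them on every read. It is not taken from the chapter; the names `DuplicatedInt`, `SoftErrorDetected`, and `inject_bit_flip` are hypothetical and chosen only for this example.

```python
class SoftErrorDetected(Exception):
    """Raised when the two redundant copies of a value disagree."""


class DuplicatedInt:
    """Hypothetical holder that stores one integer in two redundant copies."""

    def __init__(self, value: int) -> None:
        self._copy0 = value
        self._copy1 = value

    def write(self, value: int) -> None:
        # Every write updates both copies.
        self._copy0 = value
        self._copy1 = value

    def read(self) -> int:
        # Every read first checks that the copies still agree.
        if self._copy0 != self._copy1:
            raise SoftErrorDetected(
                f"copies differ: {self._copy0} != {self._copy1}"
            )
        return self._copy0


def inject_bit_flip(var: DuplicatedInt, bit: int) -> None:
    """Simulate a soft error by flipping one bit in a single copy."""
    var._copy0 ^= 1 << bit


if __name__ == "__main__":
    counter = DuplicatedInt(41)
    counter.write(counter.read() + 1)
    print(counter.read())

    inject_bit_flip(counter, 3)
    try:
        counter.read()
    except SoftErrorDetected as exc:
        print("detected:", exc)
```

This sketch only detects a mismatch. Correcting the error would need further redundancy, for example a third copy and a majority vote, which is an assumption beyond what the abstract states.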

Notes

  1. The term alternate reflects sequential execution, which is a feature specific to the recovery block approach.

  2. Task duplication [40] was introduced to detect transient faults by duplicating the computation of a task on two processors. If the results of the two executions do not match, the task is executed again on another processor until a pair of processors produces identical results. This scheme does not use checkpoints, so every time a fault is detected the task has to be restarted from its beginning.
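
The task duplication scheme described in note 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from [40]: processors are modelled as plain Python callables that run a task and return its result, and the function name `run_with_task_duplication` is hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

# Assumption: a "processor" is modelled as a callable that executes a task
# (itself a zero-argument callable) and returns the task's result.
Task = Callable[[], T]
Processor = Callable[[Task], T]


def run_with_task_duplication(task: Task, processors: Sequence[Processor]) -> T:
    """Run the task on successive processors until two results agree."""
    if len(processors) < 2:
        raise ValueError("task duplication needs at least two processors")

    results: List[T] = []
    for processor in processors:
        # No checkpoints: each execution restarts the task from its beginning.
        result = processor(task)
        # Accept as soon as this result matches any earlier execution.
        if any(result == earlier for earlier in results):
            return result
        results.append(result)

    raise RuntimeError("no pair of processors produced identical results")


if __name__ == "__main__":
    def healthy(t):
        return t()

    def faulty(t):
        return t() ^ 0x10  # simulated transient fault corrupting one bit

    print(run_with_task_duplication(lambda: 6 * 7, [faulty, healthy, healthy]))
```

In this sketch the first two processors disagree, so the task is run again on a third processor, whose result matches the second. Two executions corrupted in exactly the same way would still agree on a wrong result, which this simple comparison cannot detect.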

References

  1. M. Rebaudengo, M. Sonza Reorda, M. Torchiano, M. Violante, Soft-error detection through software fault-tolerance techniques. Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, 1999, pp. 210–218

  2. M. Rebaudengo, M. Sonza Reorda, M. Torchiano, M. Violante, A source-to-source compiler for generating dependable software. Proceedings of the IEEE International Workshop on Source Code Analysis and Manipulation, 2001, pp. 33–42

  3. P. Cheynet, B. Nicolescu, R. Velazco, M. Rebaudengo, M. Sonza Reorda, M. Violante, Experimentally evaluating an automatic approach for generating safety-critical software with respect to transient errors. IEEE Transactions on Nuclear Science 47(6), 2000, 2231–2236

  4. A. Benso, S. Chiusano, P. Prinetto, L. Tagliaferri, A C/C++ source-to-source compiler for dependable applications. Proceedings of the IEEE International Conference on Dependable Systems and Networks, 2000, pp. 71–78

  5. N. Oh, P.P. Shirvani, E.J. McCluskey, Error detection by duplicated instructions in super-scalar processors. IEEE Transactions on Reliability 51(1), 2002, 63–75

  6. G. Sohi, M. Franklin, K. Saluja, A study of time-redundant fault tolerance techniques for high-performance pipelined computers. 19th International Symposium on Fault Tolerant Computing, 1989, pp. 436–443

  7. C. Bolchini, A software methodology for detecting hardware faults in VLIW data paths. IEEE Transactions on Reliability 52(4), 2003, 458–468

  8. N. Oh, E.J. McCluskey, Error detection by selective procedure call duplication for low energy consumption. IEEE Transactions on Reliability 51(4), 2002, 392–402

  9. K. Echtle, B. Hinz, T. Nikolov, On hardware fault detection by diverse software. Proceedings of the 13th International Conference on Fault-Tolerant Systems and Diagnostics, 1990, pp. 362–367

  10. H. Engel, Data flow transformations to detect results which are corrupted by hardware faults. Proceedings of the IEEE High-Assurance System Engineering Workshop, 1997, pp. 279–285

  11. M. Jochim, Detecting processor hardware faults by means of automatically generated virtual duplex systems. Proceedings of the International Conference on Dependable Systems and Networks, 2002, pp. 399–408

  12. S.K. Reinhardt, S.S. Mukherjee, Transient fault detection via simultaneous multithreading. Proceedings of the 27th International Symposium on Computer Architecture, 2000, pp. 25–36

  13. E. Rotenberg, AR-SMT: a microarchitectural approach to fault tolerance in microprocessors. 29th International Symposium on Fault-Tolerant Computing, 1999, pp. 84–91

  14. N. Oh, S. Mitra, E.J. McCluskey, ED4I: error detection by diverse data and duplicated instructions. IEEE Transactions on Computers 51(2), 2002, 180–199

  15. M. Hiller, Executable assertions for detecting data errors in embedded control systems. Proceedings of the IEEE International Conference on Dependable Systems and Networks, 2000, pp. 24–33

  16. J. Vinter, J. Aidemark, P. Folkesson, J. Karlsson, Reducing critical failures for control algorithms using executable assertions and best effort recovery. Proceedings of the IEEE International Conference on Dependable Systems and Networks, 2001, pp. 347–356

  17. S.S. Yau, F.-C. Chen, An approach to concurrent control flow checking. IEEE Transactions on Software Engineering 6(2), 1980, 126–137

  18. N. Oh, P.P. Shirvani, E.J. McCluskey, Control-flow checking by software signatures. IEEE Transactions on Reliability 51(2), 2002, 111–122

  19. Z. Alkhalifa, V.S.S. Nair, N. Krishnamurthy, J.A. Abraham, Design and evaluation of system-level checks for on-line control flow error detection. IEEE Transactions on Parallel and Distributed Systems 10(6), 1999, 627–641

  20. O. Goloubeva, M. Rebaudengo, M. Sonza Reorda, M. Violante, Soft-error detection using control flow assertions. Proceedings of the 18th International Symposium on Defect and Fault Tolerance in VLSI Systems, 3–5 November 2003, pp. 581–588

  21. R. Vemu, J.A. Abraham, CEDA: control-flow error detection through assertions. Proceedings of the 12th IEEE International On-Line Testing Symposium, 2006, pp. 151–158

  22. R. Vemu, J.A. Abraham, Budget-dependent control-flow error detection. Proceedings of the 14th IEEE International On-Line Testing Symposium, 2008, pp. 73–78

  23. C. Babbage, On the mathematical powers of the calculating engine, unpublished manuscript, December 1837, Oxford, Buxton Ms7, Museum of History of Science. Printed in The Origins of Digital Computers: Selected Papers, B. Randell (ed.), Springer, Berlin, 1974, pp. 17–52

  24. A. Avizienis, J.C. Laprie, Dependable computing: from concepts to design diversity. Proceedings of the IEEE 74(5), 1986, 629–638

  25. A. Avizienis, The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering 11(12), 1985, 1491–1501

  26. B. Randell, System structure for software fault tolerance. IEEE Transactions on Software Engineering 1(2), 1975, 220–232

  27. D. Pradhan, Fault-Tolerant Computer System Design. Prentice-Hall, Englewood Cliffs, NJ, 1996

  28. J.P. Kelly, T.I. McVittie, W.I. Yamamoto, Implementing design diversity to achieve fault tolerance. IEEE Software 8(4), 1991, 61–71

  29. J.H. Lala, L.S. Alger, Hardware and software fault tolerance: a unified architectural approach. Proceedings of the 18th International Symposium on Fault-Tolerant Computing, 1988, pp. 240–245

  30. C.E. Price, Fault tolerant avionics for the space shuttle. Proceedings of the 10th IEEE/AIAA Digital Avionics Systems Conference, 1991, pp. 203–206

  31. D. Briere, P. Traverse, AIRBUS A320/A330/A340 electrical flight controls: a family of fault-tolerant systems. Proceedings of the 23rd International Symposium on Fault-Tolerant Computing, 1993, pp. 616–623

  32. R. Riter, Modeling and testing a critical fault-tolerant multi-process system. Proceedings of the 25th International Symposium on Fault-Tolerant Computing, 1995, pp. 516–521

  33. G. Hagelin, ERICSSON safety system for railway control. Proceedings of the Workshop on Design Diversity in Action, Springer, Vienna, 1988, pp. 11–21

  34. H. Kantz, C. Koza, The ELEKTRA railway signalling system: field experience with an actively replicated system with diversity. Proceedings of the 25th International Symposium on Fault-Tolerant Computing, 1995, pp. 453–458

  35. A. Amendola, L. Impagliazzo, P. Marmo, G. Mongardi, G. Sartore, Architecture and safety requirements of the ACC railway interlocking system. Proceedings of IEEE International Computer Performance and Dependability Symposium, 1996, pp. 21–29

  36. A.M. Tyrrell, Recovery blocks and algorithm-based fault tolerance, EUROMICRO 96. Beyond 2000: Hardware and Software Design Strategies. Proceedings of the 22nd EuroMicro Conference, 1996, pp. 292–299

  37. K.M. Chandy, C.V. Ramamoorthy, Rollback and recovery strategies for computer programs. IEEE Transactions on Computers 21(6), 1972, 546–556

  38. W.K. Fuchs, C.-C.J. Li, CATCH – compiler-assisted techniques for checkpointing. Proceedings of the 20th Fault-Tolerant Computing Symposium, 1990, pp. 74–81

  39. J. Long, W.K. Fuchs, J.A. Abraham, Compiler-assisted static checkpoint insertion. Proceedings of the 22nd Fault-Tolerant Computing Symposium, 1992, pp. 58–65

  40. D.K. Pradhan, N.H. Vaidya, Roll-forward checkpointing scheme: a novel fault-tolerant architecture. IEEE Transactions on Computers 43(10), 1994, 1163–1174

  41. A. Ziv, J. Bruck, Performance optimization of checkpointing scheme with task duplication. IEEE Transactions on Computers 46(12), 1997, 1381–1386

  42. K.H. Huang, J.A. Abraham, Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers C-33(6), 1984, 518–528

  43. A. Roy-Chowdhury, P. Banerjee, Tolerance determination for algorithm based checks using simplified error analysis. Proceedings of the IEEE International Fault Tolerant Computing Symposium, 1993

  44. M. Rebaudengo, M. Sonza Reorda, M. Violante, A new software-based technique for low-cost fault-tolerant application. Proceedings of the IEEE Annual Reliability and Maintainability Symposium, 2003, pp. 25–28

  45. M. Rebaudengo, M. Sonza Reorda, M. Violante, A new approach to software-implemented fault tolerance. Journal of Electronic Testing: Theory and Applications 20, 2004, 433–437

  46. B. Nicolescu, R. Velazco, M. Sonza Reorda, Effectiveness and limitations of various software techniques for “soft error” detection: a comparative study. Proceedings of the IEEE 7th International On-Line Testing Workshop, 2001, pp. 172–177

Author information

Corresponding author

Correspondence to Matteo Sonza Reorda.

Copyright information

© 2011 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Rebaudengo, M., Reorda, M.S., Violante, M. (2011). Software-Level Soft-Error Mitigation Techniques. In: Nicolaidis, M. (eds) Soft Errors in Modern Electronic Systems. Frontiers in Electronic Testing, vol 41. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-6993-4_9

  • DOI: https://doi.org/10.1007/978-1-4419-6993-4_9

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4419-6992-7

  • Online ISBN: 978-1-4419-6993-4

  • eBook Packages: Engineering, Engineering (R0)
