Software Quality Journal

Volume 24, Issue 1, pp 87–113

Experiences with software-based soft-error mitigation using AN codes

  • Martin Hoffmann
  • Peter Ulbrich
  • Christian Dietrich
  • Horst Schirmeier
  • Daniel Lohmann
  • Wolfgang Schröder-Preikschat

Abstract

Arithmetic error coding schemes are a well-known and effective technique for soft-error mitigation. Although the underlying coding theory is a complex area of mathematics, its practical implementation is comparatively simple. However, compliance with the theory is easily lost on the way to an actual implementation, which ultimately jeopardizes the intended fault-tolerance characteristics and effectiveness. In this paper, we present our experiences and lessons learned from implementing arithmetic error coding schemes (AN codes) in the context of our Combined Redundancy fault-tolerance approach. We focus on the challenges and pitfalls in the transition from mathematics to machine code for a binary computer from a systems perspective. Our results show that practical misconceptions (such as the use of prime numbers) and architecture-dependent implementation glitches occur at every stage of this transition. We identify typical pitfalls and describe practical measures to find and resolve them. This allowed us to eliminate all remaining silent data corruptions in the Combined Redundancy framework, which we validated with an extensive fault-injection campaign covering the entire fault space of 1-bit and 2-bit errors.
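
For readers unfamiliar with the term, the basic idea behind an AN code can be sketched in a few lines: a value v is stored as A·v, so every valid code word is a multiple of A and a corrupted word is likely to fail the divisibility check. The following is a minimal illustration only, not the implementation described in the paper. The constant `A` and the helper names `encode`, `is_valid`, and `decode` are assumptions made for this example, and the value of `A` is not a recommendation.

```python
# Minimal AN-code sketch. A is an arbitrary odd constant chosen for
# illustration (an assumption, not a value taken from the paper).
A = 58659


def encode(value: int) -> int:
    """Encode a plain integer as a multiple of A."""
    return A * value


def is_valid(code_word: int) -> bool:
    """A code word is valid only if it is divisible by A."""
    return code_word % A == 0


def decode(code_word: int) -> int:
    """Check the code word, then recover the plain value."""
    if not is_valid(code_word):
        raise ValueError("AN-code check failed: possible bit error")
    return code_word // A


# Addition works directly on encoded values: A*x + A*y == A*(x + y).
encoded_sum = encode(20) + encode(22)

# Simulate a single bit flip in the encoded result.
corrupted = encoded_sum ^ (1 << 3)
```

Here `decode(encoded_sum)` recovers the plain sum, while `is_valid(corrupted)` is false: a single bit flip changes the word by a power of two, which is never a multiple of an odd A greater than 1. Multi-bit errors and the overflow behaviour of fixed-width machine arithmetic are not covered by this sketch.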

Keywords

Fault injection, Arithmetic code, Dependability

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Martin Hoffmann (1)
  • Peter Ulbrich (1)
  • Christian Dietrich (1)
  • Horst Schirmeier (2)
  • Daniel Lohmann (1)
  • Wolfgang Schröder-Preikschat (1)

  1. Chair of Distributed Systems and Operating Systems, Friedrich–Alexander University Erlangen–Nuremberg, Erlangen, Germany
  2. Department of Computer Science 12, Technische Universität Dortmund, Dortmund, Germany
