
Experiences with software-based soft-error mitigation using AN codes

Software Quality Journal

Abstract

Arithmetic error coding schemes are a well-known and effective technique for soft-error mitigation. Although the underlying coding theory is a complex area of mathematics, its practical implementation is comparatively simple. However, compliance with the theory is easily lost on the way to an actual implementation, which ultimately jeopardizes the intended fault-tolerance characteristics and effectiveness. In this paper, we present our experiences and lessons learned from implementing arithmetic error coding schemes (AN codes) in the context of our Combined Redundancy fault-tolerance approach. We focus on the challenges and pitfalls in the transition from mathematics to machine code for a binary computer from a systems perspective. Our results show that practical misconceptions (such as the use of prime numbers) and architecture-dependent implementation glitches occur at every stage of this transition. We identify typical pitfalls and describe practical measures to find and resolve them. This allowed us to eliminate all remaining silent data corruptions in the Combined Redundancy framework, which we validated by an extensive fault-injection campaign covering the entire fault space of 1-bit and 2-bit errors.
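As a rough illustration of the basic AN-code idea mentioned in the abstract, the following minimal sketch multiplies a value N by a constant key A and treats any word that is not a multiple of A as corrupted. The function names (encode, is_valid, decode, flip_bit) are hypothetical and are not taken from the Combined Redundancy implementation. The key 58659 is borrowed from the list of "Super A"s in the notes purely as an example value, and the 32-bit word width is an assumption.

```python
# Minimal AN-code sketch (illustrative only, not the authors' implementation).
A = 58659          # assumed constant key; an odd value from the "Super A" list
WORD_BITS = 32     # assumed machine word width for the bit-flip experiment


def encode(n: int) -> int:
    """Encode a plain value N as the code word A * N."""
    return A * n


def is_valid(code: int) -> bool:
    """A code word is valid only if it is an exact multiple of A."""
    return code % A == 0


def decode(code: int) -> int:
    """Check the code word and recover N; raise if the check fails."""
    if not is_valid(code):
        raise ValueError("corrupted code word detected")
    return code // A


def flip_bit(code: int, bit: int) -> int:
    """Model a single-bit soft error by inverting one bit of the word."""
    return code ^ (1 << bit)


# Addition can be carried out directly on code words: A*x + A*y = A*(x + y).
total = encode(20) + encode(22)
assert decode(total) == 42

# A single bit flip changes the word by +/- 2**k, which is never a multiple
# of an odd A greater than 1, so every 1-bit error fails the validity check.
all_detected = all(not is_valid(flip_bit(total, b)) for b in range(WORD_BITS))
print(all_detected)
```

The sketch only covers single-bit flips in one data word. Detection of multi-bit errors depends on the choice of A, and the sketch makes no claim about the architecture-dependent pitfalls that the paper addresses.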



Notes

  1. Named after the integers \(A\) (constant key) and \(N\) (value).

  2. Super \(A\)s: 58,659, 59,665, 63,157, 63,859, and 63,877.

  3. Based on a brute-force search. Code available at http://www4.cs.fau.de/Research/CoRed

  4. This is by definition the case for RISC systems. For a CISC architecture, like IA32, this has to be ensured explicitly.

  5. The signatures were chosen by the methods discussed in Sect. 3.2 and have a pairwise minimal Hamming distance of six.

  6. Five input parameter sets for the four equality sets, and one for signal_due().

Acknowledgments

This work was partly supported by the Bavarian Ministry of State for Economics, Traffic, and Technology under the (EU EFRE funds) Grant No. 0704/883 25 and the German Research Foundation (DFG) priority program SPP 1500 under grant no. LO 1719/1-2 and SP 968/5-2. Implementation and further experimental results: http://www4.cs.fau.de/Research/CoRed.

Author information

Corresponding author

Correspondence to Martin Hoffmann.

About this article

Cite this article

Hoffmann, M., Ulbrich, P., Dietrich, C. et al. Experiences with software-based soft-error mitigation using AN codes. Software Qual J 24, 87–113 (2016). https://doi.org/10.1007/s11219-014-9260-4

  • DOI: https://doi.org/10.1007/s11219-014-9260-4
