Transactional Encoding for Tolerating Transient Hardware Errors

  • Jons-Tobias Wamhoff
  • Mario Schwalbe
  • Rasha Faqeh
  • Christof Fetzer
  • Pascal Felber
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8255)

Abstract

The decreasing feature size of integrated circuits leads to less reliable hardware with a higher likelihood of errors. Without additional failure detection and masking mechanisms, the next generations of CPUs would be unfit at least for executing mission- and safety-critical applications. One common approach is the replicated execution of programs on redundant cores, which is increasingly difficult because most programs are non-deterministic. To detect and mask execution errors, one typically needs to execute three copies of each thread.

In this paper, we propose and evaluate transactional encoding, a novel approach to detect and mask transient hardware errors such that one can build safe applications on top of unreliable components. Transactional encoding combines arithmetic codes, which detect transient hardware errors, with transactional memory, which provides recovery from and tolerance of those errors. We present a prototype software implementation that encodes applications using an LLVM-based compiler and executes them with a customized software transactional memory algorithm. Our evaluation shows that our system can survive between 90% and 96% of transient hardware errors.
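As a rough illustration of how the two ingredients could fit together, the following Python sketch pairs a simple AN-style arithmetic code (a value is stored as a multiple of a constant, and any non-multiple signals corruption) with a retry loop that discards a private working copy when a check fails. This is not the paper's LLVM-based implementation or its transactional memory algorithm. The constant `A`, the names `encode`, `check`, `decode`, `encoded_add`, `run_transaction` and `transfer`, and the injected bit flip are all assumptions made for this example.

```python
# Minimal sketch: AN-style encoding for detection, abort-and-retry for recovery.
# All names and the constant A are hypothetical and chosen for illustration.

A = 58659  # code constant (assumption); valid code words are multiples of A


class EncodingError(Exception):
    """Raised when a value is not a valid code word."""


def encode(x):
    """Map a plain integer to its code word."""
    return x * A


def check(v):
    """Return v unchanged if it is a valid code word, otherwise raise."""
    if v % A != 0:
        raise EncodingError(f"invalid code word: {v}")
    return v


def decode(v):
    """Validate a code word and map it back to the plain integer."""
    return check(v) // A


def encoded_add(a, b):
    """Add two code words; the sum of two multiples of A is a multiple of A."""
    return check(a) + check(b)


def run_transaction(body, state, max_retries=3):
    """Run body on a private copy of state; commit only if all values check out.

    Returns the number of attempts that were needed.
    """
    for attempt in range(max_retries):
        working = dict(state)  # private working copy, discarded on abort
        try:
            body(working, attempt)
            for value in working.values():
                check(value)  # validate every value before commit
        except EncodingError:
            continue  # abort: drop the working copy and re-execute
        state.clear()
        state.update(working)  # commit
        return attempt + 1
    raise RuntimeError("transaction failed after all retries")


def transfer(working, attempt):
    """Move 10 units from account 'a' to 'b'; corrupt the first attempt."""
    amount = encode(10)
    working["a"] = encoded_add(working["a"], -amount)
    credited = encoded_add(working["b"], amount)
    if attempt == 0:
        credited ^= 1 << 3  # simulated transient bit flip (first attempt only)
    working["b"] = credited


accounts = {"a": encode(100), "b": encode(50)}
attempts = run_transaction(transfer, accounts)
print(attempts, decode(accounts["a"]), decode(accounts["b"]))
```

In this sketch the flipped bit changes the stored value by 8, which cannot be a multiple of `A`, so the commit-time check rejects the first attempt and the re-execution commits clean values. A real system would also have to handle errors in control flow and in the checks themselves, which this example does not attempt.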

Copyright information

© Springer International Publishing Switzerland 2013

Authors and Affiliations

  • Jons-Tobias Wamhoff (1)
  • Mario Schwalbe (1)
  • Rasha Faqeh (1)
  • Christof Fetzer (1)
  • Pascal Felber (2)

  1. Dresden University of Technology, Germany
  2. University of Neuchâtel, Switzerland
