
International Journal of Parallel Programming, Volume 46, Issue 2, pp 225–251

RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading

  • Saurabh Hukerikar
  • Keita Teranishi
  • Pedro C. Diniz
  • Robert F. Lucas

Abstract

In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. However, the use of complete redundancy incurs significant overhead to the application performance.

In this paper we present RedThreads, an interface that provides application-level fault detection and correction based on RMT, but applies the thread-level redundancy adaptively. We describe the RedThreads syntax and semantics, and the supporting compiler infrastructure and runtime system. Our approach enables application programmers to scope the extent of redundant computation. Additionally, the runtime system permits the use of RMT to be dynamically enabled, or disabled, based on the resiliency needs of the application and the state of the system. Our experimental results demonstrate how adaptive RMT exploits programmer insight and runtime inference to dynamically navigate the trade-off space between an application’s resilience coverage and the associated performance overhead of redundant computation.
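As a rough illustration of the idea described above, the following Python sketch runs a programmer-scoped computation in two threads, compares the results to detect an error, and uses a third execution to pick a majority value. It is not the RedThreads syntax or runtime, neither of which appears in this abstract; the names `redundant`, `_run_replicas`, and `rmt_enabled` are hypothetical, and the global flag only stands in for whatever policy a real runtime would use to enable or disable redundancy.

```python
import threading
from typing import Callable, List, TypeVar

T = TypeVar("T")

# Hypothetical runtime switch (an assumption for this sketch); a real runtime
# would derive this from the application's needs and the state of the system.
rmt_enabled = True


def _run_replicas(fn: Callable[[], T], count: int) -> List[T]:
    """Run `count` copies of fn in separate threads and collect their results."""
    results: List[T] = [None] * count  # type: ignore[list-item]

    def worker(i: int) -> None:
        results[i] = fn()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def redundant(fn: Callable[[], T]) -> T:
    """Execute fn once, or redundantly with detection and correction."""
    if not rmt_enabled:
        return fn()  # redundancy disabled: plain execution, no extra cost
    first, second = _run_replicas(fn, 2)
    if first == second:
        return first  # replicas agree: no error detected
    # Mismatch detected: run one more replica and take the majority value.
    third = fn()
    if third == first or third == second:
        return third
    raise RuntimeError("replicas disagree; no majority value")
```

Only the code passed to `redundant` pays the cost of replication, which mirrors the scoping described in the abstract; setting `rmt_enabled` to `False` falls back to a single execution.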

Keywords

Resilience, Exascale, Redundant multithreading, Programming models, Runtime systems, Fault tolerance


Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Information Sciences Institute, University of Southern California, Marina del Rey, USA
  2. Sandia National Laboratories, Livermore, USA
