The Journal of Supercomputing, Volume 72, Issue 12, pp 4662–4695

Rolex: resilience-oriented language extensions for extreme-scale systems

Article

Abstract

Future exascale high-performance computing (HPC) systems will be constructed from VLSI devices that are less reliable than those used today, and faults will become the norm rather than the exception. This will pose significant problems for system designers and programmers, who for half a century have enjoyed an execution model that assumes correct behaviour by the underlying computing system. The mean time to failure of a system scales inversely with the number of components in it; therefore, faults and the resulting system-level failures will become more frequent as systems scale in the number of processor cores and memory modules. However, not every detected error need cause catastrophic failure. Many HPC applications are inherently fault resilient, yet the application programmers who have this knowledge lack mechanisms to convey it to the system. In this paper, we present new Resilience Oriented Language Extensions (Rolex), which facilitate the incorporation of fault resilience as an intrinsic property of the application code. We describe the syntax and semantics of the language extensions as well as the implementation of the supporting compiler infrastructure and runtime system. Our experiments show that an approach that leverages the programmer’s insight to reason about the context and significance of faults to the application outcome significantly improves the probability that an application runs to a successful conclusion.
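
As a rough illustration of the scaling argument in the abstract, consider an idealized model that the source does not state: N identical components that fail independently with exponentially distributed lifetimes, where the first component failure counts as a system failure. Under those assumptions the relationship can be sketched as:

```latex
\[
  \mathrm{MTTF}_{\text{system}} \;=\; \frac{\mathrm{MTTF}_{\text{component}}}{N}
\]
% Hypothetical numbers, chosen only for illustration:
% MTTF_component = 5 years = 43{,}800 hours, N = 100{,}000 components
\[
  \mathrm{MTTF}_{\text{system}} \;=\; \frac{43{,}800\ \text{h}}{100{,}000} \;\approx\; 0.44\ \text{h} \;\approx\; 26\ \text{minutes}
\]
```

The component lifetime and component count here are illustrative assumptions, not figures from the article. Real systems include redundancy and correlated failures, so this is only a first-order estimate.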

Keywords

Resilience; Exascale; Programming models; Runtime systems; Fault tolerance

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Information Sciences Institute, University of Southern California, Marina del Rey, USA
