Concepts for OpenMP Target Offload Resilience

  • Conference paper
  • In: OpenMP: Conquering the Full Hardware Spectrum (IWOMP 2019)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11718)

Abstract

Recent reliability issues with one of the fastest supercomputers in the world, Titan at Oak Ridge National Laboratory (ORNL), demonstrated the need for resilience in large-scale heterogeneous computing. OpenMP currently does not address error and failure behavior. This paper takes a first step toward resilience for heterogeneous systems by providing the concepts for resilient OpenMP offload to devices. Using real-world error and failure observations, the paper describes the concepts and terminology for resilient OpenMP target offload, including error and failure classes and resilience strategies. It details the general-purpose computing graphics processing unit (GPGPU) errors and failures experienced in Titan. It further proposes improvements in OpenMP, including a preliminary prototype design, to support resilient offload to devices for efficient handling of errors and failures in heterogeneous high-performance computing (HPC) systems.
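The abstract alone does not specify the proposed directives; those are detailed in the paper itself. As context only, the C sketch below (not the paper's prototype) shows the host-fallback pattern that standard OpenMP already provides through the if() clause on a target construct, the closest existing analogue to a "re-execute on the host" resilience strategy. The use_device flag and the vector-add kernel are illustrative assumptions.

```c
/* Minimal sketch (not the paper's prototype): standard OpenMP host
 * fallback via the if() clause. When use_device is false, the target
 * region executes on the host instead of the device. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    if (!a || !b) return 1;
    for (size_t i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* Offload only if a device exists; otherwise fall back to the host. */
    int use_device = omp_get_num_devices() > 0;

    #pragma omp target teams distribute parallel for \
            map(tofrom: a[0:N]) map(to: b[0:N]) if(use_device)
    for (int i = 0; i < N; i++)
        a[i] += b[i];

    printf("a[0] = %.1f (expected 3.0)\n", a[0]);
    free(a);
    free(b);
    return 0;
}
```

The paper's proposed extensions go beyond this static check, covering classes of errors and failures observed at runtime and strategies such as re-execution on another device; the sketch only marks the point in today's OpenMP where such strategies would attach.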

Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U.S. Department of Energy. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).



Author information

Corresponding author: Christian Engelmann.

Copyright information

© 2019 This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply

About this paper

Cite this paper

Engelmann, C., Vallée, G.R., Pophale, S. (2019). Concepts for OpenMP Target Offload Resilience. In: Fan, X., de Supinski, B., Sinnen, O., Giacaman, N. (eds) OpenMP: Conquering the Full Hardware Spectrum. IWOMP 2019. Lecture Notes in Computer Science, vol. 11718. Springer, Cham. https://doi.org/10.1007/978-3-030-28596-8_6

  • DOI: https://doi.org/10.1007/978-3-030-28596-8_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-28595-1

  • Online ISBN: 978-3-030-28596-8

  • eBook Packages: Computer Science, Computer Science (R0)
