
Job migration in HPC clusters by means of checkpoint/restart

The Journal of Supercomputing

Abstract

Until now, jobs running on HPC clusters have been tied to the nodes where their execution started. We remove that limitation by integrating a user-level checkpoint/restart library into a resource manager, fully transparently to both the user and the running application. This opens the door to a whole new set of tools and scheduling possibilities: jobs can be checkpointed, migrated, and restarted in a different place or at a different time, while fault tolerance is provided for every job running on the cluster. This is of utmost importance for the coming generation of exascale HPC clusters, where the increasing scale and complexity of efficient scheduling make it challenging to attain the degree of parallelism demanded by applications.
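For intuition, the checkpoint-migrate-restart cycle described above can be sketched as a short orchestration script. The following is a minimal sketch, not the paper's implementation: it assumes DMTCP as the user-level checkpoint/restart library and SLURM as the resource manager; the coordinator host and port, checkpoint directory, and node names are illustrative placeholders; and the mechanism reported in the paper lives inside the resource manager itself, transparently to the user, rather than in a user-side script like this one.

```python
import subprocess

# Illustrative placeholders; a filesystem shared by source and target nodes is assumed.
CKPT_DIR = "/shared/ckpt/job42"
COORD_HOST, COORD_PORT = "node01", "7779"

def launch(app_cmd):
    """Start the application under DMTCP control so it can be checkpointed later."""
    subprocess.run(["dmtcp_launch", "--ckptdir", CKPT_DIR,
                    "--coord-host", COORD_HOST, "--coord-port", COORD_PORT] + app_cmd,
                   check=True)

def checkpoint():
    """Ask the DMTCP coordinator to write checkpoint images of every job process."""
    subprocess.run(["dmtcp_command", "--coord-host", COORD_HOST,
                    "--coord-port", COORD_PORT, "--checkpoint"], check=True)

def restart_on(target_node):
    """Resume the job on another node from the restart script DMTCP generates."""
    subprocess.run(["srun", "--nodelist", target_node,
                    CKPT_DIR + "/dmtcp_restart_script.sh"], check=True)
```

Under these assumptions, a migration amounts to calling checkpoint(), releasing the source node, and calling restart_on() with a destination picked by the scheduler; periodic calls to checkpoint() alone already yield the fault tolerance mentioned above.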



Acknowledgements

This work was partially funded by the Spanish State Research Agency projects CODEC2 (TIN2015-63562-R) and CODEC-OSE (RTI2018-096006-B-I00) with FEDER funds, and by the EU H2020 project Enerxico (Grant Agreement No. 828947); it was also supported by the RICAP Network (517RT0529) with CYTED funds.

Author information


Corresponding author

Correspondence to José A. Moríñigo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Rodríguez-Pascual, M., Cao, J., Moríñigo, J.A. et al. Job migration in HPC clusters by means of checkpoint/restart. J Supercomput 75, 6517–6541 (2019). https://doi.org/10.1007/s11227-019-02857-y

