
Job migration in HPC clusters by means of checkpoint/restart

The Journal of Supercomputing

Abstract

Until now, jobs running on HPC clusters have been tied to the nodes where their execution started. We remove that limitation by integrating a user-level checkpoint/restart library into a resource manager, fully transparently to both the user and the running application. This opens the door to a whole new set of tools and scheduling possibilities: jobs can be checkpointed, migrated, and restarted in a different place or at a different time, while fault tolerance is provided for every job running on the cluster. This is of utmost importance for the coming generation of exascale HPC clusters, where the increasing scale and complexity of efficient scheduling make it challenging to attain the degree of parallelism demanded by applications.
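For intuition, the checkpoint-migrate-restart cycle described above can be sketched as a short orchestration script. The following is a minimal sketch, not the paper's implementation: it assumes DMTCP as the user-level checkpoint/restart library and SLURM as the resource manager; the coordinator host and port, checkpoint directory, and node names are illustrative placeholders; and the mechanism reported in the paper lives inside the resource manager itself, transparently to the user, rather than in a user-side script like this one.

```python
import subprocess

# Illustrative placeholders; a filesystem shared by source and target nodes is assumed.
CKPT_DIR = "/shared/ckpt/job42"
COORD_HOST, COORD_PORT = "node01", "7779"

def launch(app_cmd):
    """Start the application under DMTCP control so it can be checkpointed later."""
    subprocess.run(["dmtcp_launch", "--ckptdir", CKPT_DIR,
                    "--coord-host", COORD_HOST, "--coord-port", COORD_PORT] + app_cmd,
                   check=True)

def checkpoint():
    """Ask the DMTCP coordinator to write checkpoint images of every job process."""
    subprocess.run(["dmtcp_command", "--coord-host", COORD_HOST,
                    "--coord-port", COORD_PORT, "--checkpoint"], check=True)

def restart_on(target_node):
    """Resume the job on another node from the restart script DMTCP generates."""
    subprocess.run(["srun", "--nodelist", target_node,
                    CKPT_DIR + "/dmtcp_restart_script.sh"], check=True)
```

Under these assumptions, a migration amounts to calling checkpoint(), releasing the source node, and calling restart_on() with a destination picked by the scheduler; periodic calls to checkpoint() alone already yield the fault tolerance mentioned above.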



Acknowledgements

This work was partially funded by the Spanish State Research Agency projects CODEC2 (TIN2015-63562-R) and CODEC-OSE (RTI2018-096006-B-I00) with FEDER funds, and by the EU H2020 project Enerxico (Grant Agreement No. 828947); it was also supported by the RICAP Network (517RT0529) with CYTED funds.

Author information


Corresponding author

Correspondence to José A. Moríñigo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Rodríguez-Pascual, M., Cao, J., Moríñigo, J.A. et al. Job migration in HPC clusters by means of checkpoint/restart. J Supercomput 75, 6517–6541 (2019). https://doi.org/10.1007/s11227-019-02857-y

