Job migration in HPC clusters by means of checkpoint/restart

  • Manuel Rodríguez-Pascual
  • Jiajun Cao
  • José A. Moríñigo
  • Gene Cooperman
  • Rafael Mayo-García


Until now, jobs running on HPC clusters have been tied to the nodes on which their execution started. We have removed that limitation by integrating a user-level checkpoint/restart library into a resource manager, in a way that is fully transparent to both the user and the running application. This opens the door to a whole new set of tools and scheduling possibilities, since jobs can be checkpointed, migrated, and restarted in a different place or at a different time, while fault tolerance is provided for every job running on the cluster. This is of utmost importance for the coming generation of exascale HPC clusters, where the increasing complexity of efficient scheduling makes it challenging to obtain the degree of parallelism demanded by applications.


Checkpoint–restart, DMTCP, Dynamic job migration, Exascale clusters



This work was partially funded by the Spanish State Research Agency projects CODEC2 (TIN2015-63562-R) and CODEC-OSE (RTI2018-096006-B-I00) with FEDER funds and the EU H2020 Project Enerxico (Grant Agreement No 828947) and supported by the RICAP Network (517RT0529) with CYTED funds.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Technology, CIEMAT, Madrid, Spain
  2. Department of Electrical and Computer Engineering, Northeastern University, Boston, USA
