In-memory application-level checkpoint-based migration for MPI programs
Process migration provides many benefits for parallel environments, including dynamic load balancing, data access locality, and fault tolerance. This paper describes an in-memory, application-level, checkpoint-based migration solution for MPI codes that uses the Hierarchical Data Format 5 (HDF5) to write the checkpoint files. The main features of the proposed solution are: transparency for the user, achieved through the use of CPPC (ComPiler for Portable Checkpointing); portability, since the application-level approach makes the solution suitable for any MPI implementation and operating system, and the HDF5 file format enables restart on different architectures; and high performance, achieved by saving the checkpoint files to memory instead of to disk through HDF5 in-memory files. Experimental results show that the in-memory approach significantly reduces the I/O cost of the migration process.
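The idea of checkpointing to memory in a portable format, and restarting from that image elsewhere, can be illustrated with a minimal sketch. This is not the paper's implementation and does not use HDF5 or CPPC: it substitutes Python's standard `struct` and `io.BytesIO` modules, and the byte layout, the function names `checkpoint_to_memory` and `restart_from_memory`, and the choice of saved state (an iteration counter plus an array of doubles) are all assumptions made for illustration only.

```python
import io
import struct

# Hypothetical layout (NOT HDF5): a little-endian header holding the
# iteration counter (int64) and the element count (uint32), followed by
# the values as float64. Fixing the byte order keeps the image readable
# on a machine with a different native endianness.
HEADER = struct.Struct("<qI")


def checkpoint_to_memory(iteration, values):
    """Serialize the application state into a bytes image without disk I/O."""
    buf = io.BytesIO()
    buf.write(HEADER.pack(iteration, len(values)))
    buf.write(struct.pack(f"<{len(values)}d", *values))
    return buf.getvalue()


def restart_from_memory(image):
    """Rebuild the application state from an image produced above."""
    iteration, count = HEADER.unpack_from(image, 0)
    values = list(struct.unpack_from(f"<{count}d", image, HEADER.size))
    return iteration, values


# The image would be handed to the target process (for example, over the
# network); here it is restored in the same process.
image = checkpoint_to_memory(42, [1.0, 2.5, -3.0])
iteration, values = restart_from_memory(image)
```

In a real migration the bytes image would be transferred to the process on the destination node rather than restored locally; the sketch only shows that the state never touches the file system and that the format does not depend on the host architecture.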
Keywords: Checkpoint, Migration, MPI, HDF5
This research was supported by the Ministry of Science and Innovation of Spain (Project TIN2010-16735) and by the Galician Government (Project 10PXIB 105180PR and consolidation program of competitive reference groups GRC2013/055).