Automatic Resource-Centric Process Migration for MPI
Conference paper
Abstract
Process migration is the ability to move a running process from one node to another, where it continues executing. The MPI standard prescribes support for process migration, but so far it has been implemented mostly via checkpoint-restart. This paper presents an automatic and transparent process migration framework that can be used for MPI processes. This framework is advantageous when migrating individual processes, for purposes such as load-balancing, is more appropriate than checkpointing the whole job. The paper describes this framework for process migration in clusters and multi-clusters, how it was tuned for Open MPI, and the performance of migrated MPI processes.
Keywords
Cluster MPI process migration load-balancing checkpoint
Copyright information
© Springer-Verlag Berlin Heidelberg 2012