
Automatic Resource-Centric Process Migration for MPI

  • Amnon Barak
  • Alexander Margolin
  • Amnon Shiloh
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7490)

Abstract

Process migration refers to the ability to move a running process from one node to another, where it continues its execution. The MPI standard prescribes support for process migration, but so far it has been implemented mostly via checkpoint-restart. This paper presents an automatic and transparent process migration framework that can be used for MPI processes. This framework is advantageous when migrating individual processes, for purposes such as load-balancing, is more appropriate than checkpointing the whole job. The paper describes the framework for process migration in clusters and multi-clusters, how it was tuned for Open MPI, and the performance of migrated MPI processes.

Keywords

Cluster, MPI, process migration, load-balancing, checkpoint

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Amnon Barak (1)
  • Alexander Margolin (1)
  • Amnon Shiloh (1)
  1. Department of Computer Science, The Hebrew University of Jerusalem, Israel

Personalised recommendations