Skip to main content

Advertisement

SpringerLink
Log in
Menu
Find a journal Publish with us
Search
Cart
Book cover

European Conference on Parallel Processing

Euro-Par 2011: Euro-Par 2011: Parallel Processing Workshops pp 322–332Cite as

  1. Home
  2. Euro-Par 2011: Parallel Processing Workshops
  3. Conference paper
Impact of Over-Decomposition on Coordinated Checkpoint/Rollback Protocol

Impact of Over-Decomposition on Coordinated Checkpoint/Rollback Protocol

  • Xavier Besseron30 &
  • Thierry Gautier31 
  • Conference paper
  • 1077 Accesses

  • 5 Citations

Part of the Lecture Notes in Computer Science book series (LNTCS,volume 7156)

Abstract

Failure free execution will become rare in the future exascale computers. Thus, fault tolerance is now an active field of research. In this paper, we study the impact of decomposing an application in much more parallelism that the physical parallelism on the rollback step of fault tolerant coordinated protocols. This over-decomposition gives the runtime a better opportunity to balance workload after failure without the need of spare nodes, while preserving performance. We show that the overhead on normal execution remains low for relevant factor of over-decomposition. With over-decomposition, restart execution on the remaining nodes after failures shows very good performance compared to classic decomposition approach: our experiments show that the execution time after restart can be reduced by 42 %. We also consider a partial restart protocol to reduce the amount of lost work in case of failure by tracking the task dependencies inside processes. In some cases and thanks to over-decomposition, this partial restart time can represent only 54 % of the global restart time.

Keywords

  • parallel computing
  • checkpoint/rollback
  • over-decomposition
  • global restart
  • partial restart

Download conference paper PDF

References

  1. Badia, R.M., Herrero, J.R., Labarta, J., Pérez, J.M., Quintana-Ortí, E.S., Quintana-Ortí, G.: Parallelizing dense and banded linear algebra libraries using smpss. Concurr. Comput. : Pract. Exper. (2009)

    Google Scholar 

  2. Besseron, X., Gautier, T.: Optimised recovery with a coordinated checkpoint/rollback protocol for domain decomposition applications. In: MCO 2008 (2008)

    Google Scholar 

  3. Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: An efficient multithreaded runtime system. Parallel and Distributed Computing (1996)

    Google Scholar 

  4. Bongo, L.A., Vinter, B., Anshus, O.J., Larsen, T., Bjorndalen, J.M.: Using overdecomposition to overlap communication latencies with computation and take advantage of smt processors. In: ICPP Workshops (2006)

    Google Scholar 

  5. Bouteiller, A., Hérault, T., Krawezik, G., Lemarinier, P., Cappello, F.: MPICH-V project: a multiprotocol automatic fault tolerant MPI. High Performance Computing Applications (2006)

    Google Scholar 

  6. Chakravorty, S., Kale, L.V.: A fault tolerant protocol for massively parallel systems. In: IPDPS (2004)

    Google Scholar 

  7. Chandy, K.M., Lamport, L.: Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems (1985)

    Google Scholar 

  8. Elnozahy, E.N., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys (2002)

    Google Scholar 

  9. Galilée, F., Roch, J.L., Cavalheiro, G., Doreille, M.: Athapascan-1: On-line building data flow graph in a parallel language. In: PACT 1998 (1998)

    Google Scholar 

  10. Gao, Q., Yu, W., Huang, W., Panda, D.K.: Application-transparent checkpoint/restart for mpi programs over infiniband. In: ICPP 2006 (2006)

    Google Scholar 

  11. Gautier, T., Besseron, X., Pigeon, L.: Kaapi: a thread scheduling runtime system for data flow computations on cluster of multi-processors. In: PASCO 2007 (2007)

    Google Scholar 

  12. Guermouche, A., Ropars, T., Brunet, E., Snir, M., Cappello, F.: Uncoordinated checkpointing without domino effect for send-deterministic message passing applications. In: IPDPS (2011)

    Google Scholar 

  13. Hursey, J., Squyres, J.M., Mattox, T.I., Lumsdaine, A.: The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In: IPDPS (2007)

    Google Scholar 

  14. Jafar, S., Krings, A.W., Gautier, T.: Flexible rollback recovery in dynamic heterogeneous grid computing. IEEE Transactions on Dependable and Secure Computing (2008)

    Google Scholar 

  15. Jafar, S., Pigeon, L., Gautier, T., Roch, J.L.: Self-adaptation of parallel applications in heterogeneous and dynamic architectures. In: ICTTA 2006 (2006)

    Google Scholar 

  16. Jose, J., Luo, M., Sur, S., Panda, D.K.: Unifying UPC and MPI Runtimes: Experience with MVAPICH. In: PGAS 2010 (2010)

    Google Scholar 

  17. Kale, L.V., Mendes, C., Meneses, E.: Adaptive runtime support for fault tolerance. Talk at Los Alamos Computer Science Symposium 2009 (2009)

    Google Scholar 

  18. Kale, L.V., Zheng, G.: Charm++ and AMPI: Adaptive runtime strategies via migratable objects. In: Advanced Computational Infrastructures for Parallel and Distributed Applications. Wiley-Interscience (2009)

    Google Scholar 

  19. Naik, V.K., Setia, S.K., Squillante, M.S.: Processor allocation in multiprogrammed distributed-memory parallel computer systems. Parallel Distributed Computing (1997)

    Google Scholar 

  20. Rabenseifner, R., Hager, G., Jost, G.: Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes. In: PDP 2009 (2009)

    Google Scholar 

  21. Song, F., YarKhan, A., Dongarra, J.: Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems. In: SC 2009 (2009)

    Google Scholar 

  22. Tamir, Y., Séquin, C.H.: Error recovery in multicomputers using global checkpoints. In: ICPP 1984 (1984)

    Google Scholar 

  23. Zheng, G., Shi, L., Kale, L.V.: FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. Cluster Computing (2004)

    Google Scholar 

Download references

Author information

Authors and Affiliations

  1. Dept. of Computer Science and Engineering, The Ohio State University, USA

    Xavier Besseron

  2. MOAIS Project, INRIA, France

    Thierry Gautier

Authors
  1. Xavier Besseron
    View author publications

    You can also search for this author in PubMed Google Scholar

  2. Thierry Gautier
    View author publications

    You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

  1. Scilytics, Koellnerhofgasse 3/15A, 1010, Vienna, Austria

    Michael Alexander

  2. ICAR-CNR, Via P. Castellino, 111, 80131, Napoli, Italy

    Pasqua D’Ambra

  3. University of Amsterdam, 1090, Amsterdam, Netherlands

    Adam Belloum

  4. Innovative Computing Laboratory, The University of Tennessee, US

    George Bosilca

  5. Department of Experimental Medicine and Clinic, University Magna Græcia, 88100, Catanzaro, Italy

    Mario Cannataro

  6. Computer Science Department, University of Pisa, Italy

    Marco Danelutto

  7. Second University of Naples, Italy

    Beniamino Di Martino

  8. TUMünchen,, Boltzmannstr. 3, ,, 85748, Garching, Germany

    Michael Gerndt

  9. Equipe Runtime, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France

    Emmanuel Jeannot & Raymond Namyst & 

  10. Equipe HIEPACS, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France

    Jean Roman

  11. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831-6164, Oak Ridge, TN, USA

    Stephen L. Scott

  12. Department of Scientific Computing, University of Vienna, Nordbergstr. 15/3C, 1090, Vienna, Austria

    Jesper Larsson Traff

  13. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831, Oak Ridge, TN, USA

    Geoffroy Vallée

  14. Technische Universität München, Germany

    Josef Weidendorfer

Rights and permissions

Reprints and Permissions

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Besseron, X., Gautier, T. (2012). Impact of Over-Decomposition on Coordinated Checkpoint/Rollback Protocol. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29740-3_36

Download citation

  • .RIS
  • .ENW
  • .BIB
  • DOI: https://doi.org/10.1007/978-3-642-29740-3_36

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29739-7

  • Online ISBN: 978-3-642-29740-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Share this paper

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

Search

Navigation

  • Find a journal
  • Publish with us

Discover content

  • Journals A-Z
  • Books A-Z

Publish with us

  • Publish your research
  • Open access publishing

Products and services

  • Our products
  • Librarians
  • Societies
  • Partners and advertisers

Our imprints

  • Springer
  • Nature Portfolio
  • BMC
  • Palgrave Macmillan
  • Apress
  • Your US state privacy rights
  • Accessibility statement
  • Terms and conditions
  • Privacy policy
  • Help and support

167.114.118.210

Not affiliated

Springer Nature

© 2023 Springer Nature