On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications

Ropars, Thomas; Guermouche, Amina; Uçar, Bora; Meneses, Esteban; Kalé, Laxmikant V.; Cappello, Franck

doi:10.1007/978-3-642-23400-2_53

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6852)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

Fault tolerance is becoming a major concern in HPC systems. The two traditional approaches for message passing applications, coordinated checkpointing and message logging, have severe scalability issues. Coordinated checkpointing protocols make all processes roll back after a failure. Message logging protocols log a huge amount of data and can induce an overhead on communication performance. Hierarchical rollback-recovery protocols based on the combination of coordinated checkpointing and message logging are an alternative. These partial message logging protocols are based on process clustering: only messages between clusters are logged, which limits the consequences of a failure to one cluster. These protocols work efficiently only if one can find clusters of processes in the application such that the ratio of logged messages is very low. We study the communication patterns of message passing HPC applications to show that partial message logging is suitable in most cases. We propose a partitioning algorithm to find suitable clusters of processes given the communication pattern of an application. Finally, we evaluate the efficiency of partial message logging using two state-of-the-art protocols on a set of representative applications.
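
To make the "ratio of logged messages" concrete, the following minimal sketch computes the fraction of communicated volume that crosses a cluster boundary for a given clustering. It is only an illustration under stated assumptions, not the partitioning algorithm proposed in the paper: the ring communication pattern, the contiguous block clustering, and the function names `logged_ratio`, `ring_pattern`, and `block_clusters` are all hypothetical.

```python
from typing import List, Sequence


def logged_ratio(comm: Sequence[Sequence[float]], cluster_of: Sequence[int]) -> float:
    """Fraction of communicated volume that crosses a cluster boundary.

    comm[i][j] is the volume (e.g. bytes) sent from process i to process j.
    cluster_of[i] is the cluster id assigned to process i.
    """
    total = 0.0
    logged = 0.0
    for src, row in enumerate(comm):
        for dst, volume in enumerate(row):
            total += volume
            if cluster_of[src] != cluster_of[dst]:
                logged += volume  # inter-cluster message: it would be logged
    return logged / total if total else 0.0


def ring_pattern(n: int, volume: float = 1.0) -> List[List[float]]:
    """Hypothetical pattern: each process sends `volume` to both ring neighbours."""
    comm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        comm[i][(i + 1) % n] = volume
        comm[i][(i - 1) % n] = volume
    return comm


def block_clusters(n: int, k: int) -> List[int]:
    """Assign n processes to k contiguous, equal-size clusters (assumes k divides n)."""
    size = n // k
    return [i // size for i in range(n)]


if __name__ == "__main__":
    comm = ring_pattern(16)
    for k in (1, 2, 4, 16):
        print(k, logged_ratio(comm, block_clusters(16, k)))
```

In this toy pattern the two extremes behave like the traditional approaches: a single cluster logs nothing but rolls back every process, while one cluster per process logs every message. Intermediate clusterings trade rollback scope against logged volume, which is the trade-off a partitioning algorithm has to balance.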

Author information

Authors and Affiliations

INRIA Saclay-Île de France, France
Thomas Ropars, Amina Guermouche & Franck Cappello
Université Paris-Sud, France
Amina Guermouche
CNRS and ENS Lyon, France
Bora Uçar
University of Illinois at Urbana-Champaign, USA
Esteban Meneses, Laxmikant V. Kalé & Franck Cappello

Editor information

Editors and Affiliations

Equipe Runtime, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France
Emmanuel Jeannot & Raymond Namyst
Equipe HIEPACS, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France
Jean Roman

Cite this paper

Ropars, T., Guermouche, A., Uçar, B., Meneses, E., Kalé, L.V., Cappello, F. (2011). On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications. In: Jeannot, E., Namyst, R., Roman, J. (eds) Euro-Par 2011 Parallel Processing. Euro-Par 2011. Lecture Notes in Computer Science, vol 6852. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23400-2_53

DOI: https://doi.org/10.1007/978-3-642-23400-2_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23399-9
Online ISBN: 978-3-642-23400-2
eBook Packages: Computer Science; Computer Science (R0)
