On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications

  • Thomas Ropars
  • Amina Guermouche
  • Bora Uçar
  • Esteban Meneses
  • Laxmikant V. Kalé
  • Franck Cappello
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6852)


Fault tolerance is becoming a major concern in HPC systems. The two traditional approaches for message passing applications, coordinated checkpointing and message logging, have severe scalability issues: coordinated checkpointing protocols make all processes roll back after a failure, while message logging protocols log a huge amount of data and can induce an overhead on communication performance. Hierarchical rollback-recovery protocols, which combine coordinated checkpointing and message logging, are an alternative. These partial message logging protocols are based on process clustering: only messages between clusters are logged, which limits the consequences of a failure to one cluster. Such protocols work efficiently only if the processes of an application can be grouped into clusters such that the ratio of logged messages is very low. We study the communication patterns of message passing HPC applications and show that partial message logging is suitable in most cases. We propose a partitioning algorithm that finds suitable clusters of processes given the communication pattern of an application. Finally, we evaluate the efficiency of partial message logging using two state-of-the-art protocols on a set of representative applications.
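To make the notion of a "ratio of logged messages" concrete, the following is a minimal sketch, not the partitioning algorithm or either protocol evaluated in the paper. It assumes a hypothetical communication matrix `comm`, where `comm[i][j]` is the volume sent from process `i` to process `j`, and a hypothetical cluster assignment `cluster`. It also assumes the ratio is measured by volume rather than by message count. The function name `logged_ratio` and the example values are illustrative only.

```python
from typing import Sequence


def logged_ratio(comm: Sequence[Sequence[int]], cluster: Sequence[int]) -> float:
    """Return the fraction of communicated volume that crosses a cluster boundary.

    Under partial message logging, only inter-cluster messages are logged,
    so this fraction is the share of traffic that would have to be logged.
    """
    total = 0
    logged = 0
    for src, row in enumerate(comm):
        for dst, volume in enumerate(row):
            total += volume
            # A message is logged only when sender and receiver
            # belong to different clusters.
            if cluster[src] != cluster[dst]:
                logged += volume
    return logged / total if total else 0.0


# Hypothetical 4-process pattern (assumed values, not measured data):
# heavy traffic inside pairs (0,1) and (2,3), light traffic between pairs.
comm = [
    [0, 90, 0, 10],
    [90, 0, 10, 0],
    [0, 10, 0, 90],
    [10, 0, 90, 0],
]

two_clusters = [0, 0, 1, 1]   # processes 0,1 in one cluster; 2,3 in the other
one_cluster = [0, 0, 0, 0]    # behaves like pure coordinated checkpointing
singletons = [0, 1, 2, 3]     # behaves like full message logging

ratio = logged_ratio(comm, two_clusters)
print(f"logged ratio with two clusters: {ratio:.2f}")
```

The two extreme assignments show the trade-off described above: a single cluster logs nothing but rolls back every process after a failure, while singleton clusters confine rollback to one process but log all traffic. A good clustering sits between the two, keeping the logged ratio low while limiting rollback to one cluster.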





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Thomas Ropars (1)
  • Amina Guermouche (1, 2)
  • Bora Uçar (3)
  • Esteban Meneses (4)
  • Laxmikant V. Kalé (4)
  • Franck Cappello (1, 4)

  1. INRIA Saclay-Île de France, France
  2. Université Paris-Sud, France
  3. CNRS and ENS Lyon, France
  4. University of Illinois at Urbana-Champaign, USA
