Abstract
Fault tolerance is becoming a major concern in HPC systems. The two traditional approaches for message-passing applications, coordinated checkpointing and message logging, have severe scalability issues. Coordinated checkpointing protocols make all processes roll back after a failure. Message logging protocols log a huge amount of data and can add overhead to communication. Hierarchical rollback-recovery protocols, which combine coordinated checkpointing and message logging, are an alternative. These partial message logging protocols are based on process clustering: only messages between clusters are logged, so that the consequences of a failure are confined to one cluster. Such protocols work efficiently only if the processes of an application can be grouped into clusters such that the ratio of logged messages is very low. We study the communication patterns of message-passing HPC applications to show that partial message logging is suitable in most cases. We propose a partitioning algorithm that finds suitable clusters of processes given the communication pattern of an application. Finally, we evaluate the efficiency of partial message logging using two state-of-the-art protocols on a set of representative applications.
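As a rough illustration of the quantity that process clustering tries to keep low, the sketch below computes the share of communication volume that crosses cluster boundaries, which is the part a partial message logging protocol would have to log. The function name `logged_ratio`, the 4-process communication matrix, and the cluster assignments are hypothetical and chosen only for illustration; this is not the partitioning algorithm proposed in the paper.

```python
from typing import Sequence


def logged_ratio(comm: Sequence[Sequence[float]], cluster_of: Sequence[int]) -> float:
    """Return the fraction of communicated volume sent between different clusters.

    comm[i][j] is the volume sent from process i to process j (assumed input).
    cluster_of[i] is the cluster identifier assigned to process i.
    """
    total = 0.0
    logged = 0.0
    for src, row in enumerate(comm):
        for dst, volume in enumerate(row):
            total += volume
            # Only inter-cluster messages need to be logged.
            if cluster_of[src] != cluster_of[dst]:
                logged += volume
    return logged / total if total else 0.0


# Hypothetical pattern: processes 0-1 and 2-3 exchange most of the data.
comm = [
    [0, 10, 1, 0],
    [10, 0, 0, 1],
    [1, 0, 0, 10],
    [0, 1, 10, 0],
]

ratio_pairs = logged_ratio(comm, [0, 0, 1, 1])    # two clusters of two processes
ratio_single = logged_ratio(comm, [0, 1, 2, 3])   # one process per cluster
```

With one process per cluster every message is logged, which corresponds to full message logging; with a single cluster nothing is logged, which corresponds to plain coordinated checkpointing. Clusterings of interest sit between these extremes and keep the ratio low.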
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ropars, T., Guermouche, A., Uçar, B., Meneses, E., Kalé, L.V., Cappello, F. (2011). On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications. In: Jeannot, E., Namyst, R., Roman, J. (eds) Euro-Par 2011 Parallel Processing. Euro-Par 2011. Lecture Notes in Computer Science, vol 6852. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23400-2_53
DOI: https://doi.org/10.1007/978-3-642-23400-2_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23399-9
Online ISBN: 978-3-642-23400-2
eBook Packages: Computer Science, Computer Science (R0)