HADAB: Enabling Fault Tolerance in Parallel Applications Running in Distributed Environments

Boccia, Vania; Carracciuolo, Luisa; Laccetti, Giuliano; Lapegna, Marco; Mele, Valeria

doi:10.1007/978-3-642-31464-3_71

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7203)

Included in the following conference series:

International Conference on Parallel Processing and Applied Mathematics

Abstract

Developing reliable and efficient scientific software for distributed computing environments requires identifying and analyzing the issues involved in designing and deploying algorithms for high-performance computing architectures and in integrating them into distributed contexts. In these environments, the efficiency and availability of resources can change unexpectedly because of overloading or failure of the computing nodes, the interconnection network, or both. This scenario calls for mechanisms that let the software “survive” such unexpected events while still making effective use of the computing resources. Although researchers have worked on these problems for years, fault tolerance remains an open problem for some classes of applications. Here we focus on the design and deployment of a checkpointing/migration system that enables fault tolerance in parallel applications running in distributed environments. In particular, we describe HADAB, a new hybrid checkpointing strategy, and its deployment in a meaningful case study: the PETSc Conjugate Gradient algorithm implementation. The related tests were performed on the distributed infrastructure of the University of Naples (the S.Co.P.E. infrastructure).
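
To make the idea of checkpointing an iterative solver concrete, the following is a minimal sketch of generic application-level, disk-based checkpoint/restart applied to a serial Conjugate Gradient loop. It is not HADAB and does not use PETSc or MPI: the abstract does not specify the hybrid strategy, so the function names (`cg`, `save_checkpoint`, `load_checkpoint`), the checkpoint interval `ckpt_every`, and the `fail_at` failure-injection parameter are all assumptions introduced for illustration only.

```python
import os
import pickle


def matvec(A, v):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def save_checkpoint(path, state):
    # Write to a temporary file and rename it, so that a crash during
    # the write never leaves a truncated checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved solver state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def cg(A, b, ckpt_path, ckpt_every=2, tol=1e-12, max_iter=100, fail_at=None):
    """Conjugate Gradient for a symmetric positive definite A.

    Resumes from ckpt_path when a checkpoint is present. fail_at is a
    hypothetical hook that raises at a given iteration to mimic a fault.
    """
    state = load_checkpoint(ckpt_path)
    if state is None:
        x = [0.0] * len(b)   # initial guess
        r = list(b)          # residual b - A x for x = 0
        p = list(r)          # first search direction
        k = 0
    else:
        x, r, p, k = state["x"], state["r"], state["p"], state["k"]
    rs = dot(r, r)
    while k < max_iter and rs > tol:
        if fail_at is not None and k == fail_at:
            raise RuntimeError("simulated node failure")
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
        k += 1
        if k % ckpt_every == 0:
            # x, r, p and k are all the state CG needs to resume.
            save_checkpoint(ckpt_path, {"x": x, "r": r, "p": p, "k": k})
    return x, k
```

Calling `cg` a second time with the same `ckpt_path` after a failure resumes from the last saved iteration rather than from the initial guess. In a parallel setting each process would have to save its own portion of the vectors in a coordinated way, and a migration system would additionally move the checkpoint to the resources where the job restarts; neither aspect is modeled here.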

Author information

Authors and Affiliations

Dept. of Applied Mathematics, University of Naples Federico II, Naples, 80126, Complesso Universitario Monte S. Angelo, Via Cintia, Italy
Vania Boccia, Giuliano Laccetti, Marco Lapegna & Valeria Mele
Italian National Research Council, Italy
Luisa Carracciuolo

Editor information

Editors and Affiliations

Institute of Computer and Information Science, Czestochowa University of Technology, Dabrowskiego 69, 42-201, Czestochowa, Poland
Roman Wyrzykowski & Konrad Karczewski
Electrical Engineering and Computer Science Department, University of Tennessee, 1122 Volunteer Blvd, 37996-3450, Knoxville, TN, USA
Jack Dongarra
Department of Informatics and Mathematical Modeling, Technical University of Denmark, Richard Petersens Plads, Building 321, 2800, Kongens Lyngby, Denmark
Jerzy Waśniewski

About this paper

Cite this paper

Boccia, V., Carracciuolo, L., Laccetti, G., Lapegna, M., Mele, V. (2012). HADAB: Enabling Fault Tolerance in Parallel Applications Running in Distributed Environments. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (eds) Parallel Processing and Applied Mathematics. PPAM 2011. Lecture Notes in Computer Science, vol 7203. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31464-3_71

DOI: https://doi.org/10.1007/978-3-642-31464-3_71
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31463-6
Online ISBN: 978-3-642-31464-3
eBook Packages: Computer Science, Computer Science (R0)
