
HADAB: Enabling Fault Tolerance in Parallel Applications Running in Distributed Environments

  • Vania Boccia
  • Luisa Carracciuolo
  • Giuliano Laccetti
  • Marco Lapegna
  • Valeria Mele
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7203)

Abstract

Developing reliable and efficient scientific software for distributed computing environments requires identifying and analyzing issues related to the design and deployment of algorithms for high-performance computing architectures, and to their integration in distributed contexts. In these environments, the efficiency and availability of resources can change unexpectedly because of overloading or failure of the computing nodes or of the interconnection network. This scenario calls for mechanisms that let the software “survive” such unexpected events while still making effective use of the computing resources. Although researchers have worked on these problems for years, fault tolerance remains an open problem for some classes of applications. Here we focus on the design and deployment of a checkpointing/migration system that enables fault tolerance in parallel applications running in distributed environments. In particular, we describe HADAB, a new hybrid checkpointing strategy, and its deployment in a meaningful case study: the PETSc implementation of the Conjugate Gradient algorithm. Testing was performed on the University of Naples distributed infrastructure (the S.Co.P.E. infrastructure).
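To make the idea of checkpointing an iterative solver concrete, the following is a minimal sketch only. It is not HADAB and does not use PETSc or MPI: it assumes a single process, plain Python lists, and a simple disk-based checkpoint of the Conjugate Gradient state. All names (`cg`, `save_checkpoint`, `load_checkpoint`, the `every` and `fail_at` parameters) are hypothetical and chosen for illustration.

```python
import os
import pickle
import tempfile


def matvec(A, v):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def save_checkpoint(path, state):
    # Write to a temporary file and rename it, so that a crash during
    # the write never leaves a truncated checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def cg(A, b, ckpt_path, every=2, tol=1e-10, max_iter=100, fail_at=None):
    """Conjugate Gradient for a symmetric positive definite A.

    Resumes from ckpt_path if a checkpoint exists; otherwise starts at x = 0.
    fail_at (hypothetical) simulates a failure at the start of that iteration.
    """
    state = load_checkpoint(ckpt_path)
    if state is None:
        r = list(b)  # residual b - A*x with x = 0
        state = {"k": 0, "x": [0.0] * len(b), "r": r, "p": list(r),
                 "rs": dot(r, r)}
    k, x, r, p, rs = (state[n] for n in ("k", "x", "r", "p", "rs"))

    while k < max_iter and rs ** 0.5 > tol:
        if fail_at is not None and k == fail_at:
            raise RuntimeError("simulated node failure")
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
        k += 1
        if k % every == 0:
            # The iteration count, iterate, residual and search direction
            # are all that is needed to resume exactly where we stopped.
            save_checkpoint(ckpt_path,
                            {"k": k, "x": x, "r": r, "p": p, "rs": rs})
    return x, k


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    b = [1.0, 2.0, 3.0]
    path = os.path.join(tempfile.mkdtemp(), "cg.ckpt")
    try:
        cg(A, b, path, every=1, fail_at=2)  # first run is interrupted
    except RuntimeError:
        pass
    x, k = cg(A, b, path, every=1)  # second run resumes from the checkpoint
    print(x, k)
```

In this sketch the checkpoint interval `every` trades checkpoint overhead against the amount of work lost on failure; a distributed solver would additionally have to coordinate the checkpoint across processes, which is outside the scope of the example.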

Keywords

Fault tolerance, checkpointing, PETSc library, HPC, distributed environments

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Vania Boccia (1)
  • Luisa Carracciuolo (2)
  • Giuliano Laccetti (1)
  • Marco Lapegna (1)
  • Valeria Mele (1)

  1. Dept. of Applied Mathematics, University of Naples Federico II, Naples, Italy
  2. Italian National Research Council, Italy
