Extended mpiJava for Distributed Checkpointing and Recovery

  • Emilio Hernández
  • Yudith Cardinale
  • Wilmer Pereira
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4192)

Abstract

In this paper we describe an mpiJava extension that implements a parallel checkpointing/recovery service. The facility is transparent to applications, i.e. no instrumentation is needed. Checkpoints are taken in a distributed fashion: each process takes its local checkpoint independently. This approach reduces communication between processes, and there is no need for a central server for checkpoint storage. We present experiments which suggest that this extended MPI functionality does not incur a significant performance penalty, apart from the well-known cost of generating the local checkpoints.
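The idea of independent local checkpoints can be illustrated with a minimal Java sketch. It is not the mechanism described in the paper: the paper's service is transparent, whereas this sketch saves state explicitly, and the names `LocalCheckpoint`, `ProcessState`, `save` and `restore` are hypothetical. It assumes each process has a rank and writes its state to its own file in local storage, so no coordination messages and no central checkpoint server are involved.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative only: explicit, per-process checkpointing via Java serialization.
final class LocalCheckpoint {

    // Hypothetical per-process state; a real application would hold its own data here.
    static final class ProcessState implements Serializable {
        private static final long serialVersionUID = 1L;
        final int rank;        // identifier of the process (assumed, e.g. an MPI rank)
        long logicalClock;     // local counter, advanced by the application
        double[] data = new double[0];

        ProcessState(int rank) {
            this.rank = rank;
        }
    }

    // Each process writes only its own file; no other process is contacted.
    static void save(ProcessState state, Path dir) throws IOException {
        Path tmp = dir.resolve("ckpt-" + state.rank + ".tmp");
        Path dst = dir.resolve("ckpt-" + state.rank + ".bin");
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(tmp))) {
            out.writeObject(state);
        }
        // Write to a temporary file first so a crash mid-write keeps the previous checkpoint.
        Files.move(tmp, dst, StandardCopyOption.REPLACE_EXISTING);
    }

    // Recovery reads back the last checkpoint written by the process with this rank.
    static ProcessState restore(int rank, Path dir) throws IOException, ClassNotFoundException {
        Path src = dir.resolve("ckpt-" + rank + ".bin");
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(src))) {
            return (ProcessState) in.readObject();
        }
    }
}
```

Because every rank touches only its own file, the checkpoint itself adds no inter-process traffic. Making a set of such independent checkpoints mutually consistent at recovery time is a separate problem that this sketch does not address.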

Keywords

Control Information, Parallel Application, Application Thread, Logical Clock, High Performance Computing Application
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Emilio Hernández (1)
  • Yudith Cardinale (1)
  • Wilmer Pereira (1)

  1. Departamento de Computación y Tecnología de la Información, Universidad Simón Bolívar, Caracas, Venezuela
