Abstract
In this paper we describe an mpiJava extension that implements a parallel checkpointing/recovery service. The facility is transparent to applications, i.e., no instrumentation is needed. Checkpoints are taken in a distributed fashion: each process takes its local checkpoints independently. This approach reduces communication between processes and removes the need for a central checkpoint-storage server. We present experiments which suggest that the extended MPI functionality carries no significant performance penalty beyond the well-known cost of generating the local checkpoints.
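To make "independent local checkpoints without a central server" concrete, the following is a minimal sketch and not the extension described in the abstract. It assumes plain Java serialization and a per-rank file on local storage. The class `LocalCheckpoint`, the `State` fields, and the file naming scheme are all hypothetical. It is also explicit rather than transparent: the application calls `save` itself, which the actual facility is said to avoid. Message handling and consistency across processes are ignored entirely.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LocalCheckpoint {
    // Hypothetical per-process state; the paper does not specify what is saved.
    public static class State implements Serializable {
        private static final long serialVersionUID = 1L;
        public final int rank;
        public final int iteration;
        public final double[] data;

        public State(int rank, int iteration, double[] data) {
            this.rank = rank;
            this.iteration = iteration;
            this.data = data;
        }
    }

    // Each rank owns its own file, so no process coordinates with another
    // and no central storage server is involved (naming scheme is assumed).
    public static Path fileFor(Path dir, int rank) {
        return dir.resolve("ckpt-rank-" + rank + ".bin");
    }

    // Write to a temporary file first, then rename it over the previous
    // checkpoint, so a crash mid-write leaves the old checkpoint in place.
    public static void save(Path dir, State s) throws IOException {
        Path target = fileFor(dir, s.rank);
        Path tmp = dir.resolve(target.getFileName() + ".tmp");
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(tmp))) {
            out.writeObject(s);
        }
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Recovery reads back only the local file belonging to the given rank.
    public static State restore(Path dir, int rank) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(fileFor(dir, rank)))) {
            return (State) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("ckpt");
        // Two simulated ranks checkpoint at different points in their work.
        save(dir, new State(0, 10, new double[] {1.0, 2.0}));
        save(dir, new State(1, 7, new double[] {3.0}));
        State s = restore(dir, 1);
        System.out.println("rank " + s.rank + " resumes at iteration " + s.iteration);
    }
}
```

In the sketch the two ranks checkpoint at unrelated iterations, which is the point of the independent approach. A real rollback-recovery protocol additionally has to determine a consistent set of local checkpoints and deal with messages in flight.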
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hernández, E., Cardinale, Y., Pereira, W. (2006). Extended mpiJava for Distributed Checkpointing and Recovery. In: Mohr, B., Träff, J.L., Worringen, J., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2006. Lecture Notes in Computer Science, vol 4192. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11846802_27
DOI: https://doi.org/10.1007/11846802_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39110-4
Online ISBN: 978-3-540-39112-8
eBook Packages: Computer Science, Computer Science (R0)