ISPA 2004: Parallel and Distributed Processing and Applications, pp. 594-603
A Novel Checkpoint Mechanism Based on Job Progress Description for Computational Grid
Abstract
In this paper, we argue that application-level uncoordinated checkpointing with user-defined checkpoint data is the favorable approach in grid environments, where heterogeneity is pervasive. We present a novel application-level uncoordinated checkpoint protocol based on a Job Progress Description (JPD), which is composed of a Job Progress Record Object and a group of Job Progress State Objects. These two kinds of objects act as the checkpoint data for the job, and their methods can be used as checkpoint APIs. Extended with sender-based message logging, the protocol can be used by message-passing applications in a computational grid. Emulation with a class of master-worker message-passing applications shows that this checkpointing protocol can dramatically reduce the wall time of the application when failures occur.
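The abstract does not show the JPD interface, so the following is only a minimal sketch of the general idea of application-level checkpointing with user-defined data, not the authors' implementation. All names (`JobProgressState`, `JobProgressRecord`, `run_job`, the JSON file layout, the `fail_after` parameter) are hypothetical and chosen for illustration; sender-based message logging is not covered.

```python
import json
import os
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class JobProgressState:
    """Hypothetical state object: user-defined data for one unit of work."""
    task_id: int
    done: bool = False
    result: Optional[int] = None


@dataclass
class JobProgressRecord:
    """Hypothetical record object: holds the job's state objects and persists them."""
    path: str
    states: dict = field(default_factory=dict)

    def save(self):
        # Write to a temporary file and rename it, so that a crash
        # during the write never leaves a partially written checkpoint.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({str(k): asdict(v) for k, v in self.states.items()}, f)
        os.replace(tmp, self.path)

    @classmethod
    def load(cls, path, n_tasks):
        # Start from a fresh record, then overlay any saved progress.
        record = cls(path, {i: JobProgressState(i) for i in range(n_tasks)})
        if os.path.exists(path):
            with open(path) as f:
                for key, value in json.load(f).items():
                    record.states[int(key)] = JobProgressState(**value)
        return record


def run_job(path, n_tasks, fail_after=None):
    """Run all unfinished tasks and return how many were executed in this run.

    fail_after is an illustrative knob: raise a simulated failure once
    that many tasks have been executed in the current run.
    """
    record = JobProgressRecord.load(path, n_tasks)
    executed = 0
    for state in record.states.values():
        if state.done:
            continue  # already completed before the failure; skip on recovery
        if fail_after is not None and executed == fail_after:
            raise RuntimeError("simulated failure")
        state.result = state.task_id ** 2  # stand-in for real work
        state.done = True
        record.save()  # checkpoint after each completed task
        executed += 1
    return executed
```

In this sketch a restarted job calls `run_job` with the same path and only the tasks not yet marked done are executed again, which is the source of the wall-time saving that the abstract refers to. Because the checkpoint holds only application-defined data in a portable format rather than a process image, it does not depend on the architecture of the node that wrote it.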
Keywords
Computational Grid; Grid Environment; Recovery Service; Failure Sequence; Checkpoint Mechanism