A Novel Checkpoint Mechanism Based on Job Progress Description for Computational Grid

  • Chunjiang Li
  • Xuejun Yang
  • Nong Xiao
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3358)

Abstract

In this paper, we argue that application-level uncoordinated checkpointing with user-defined checkpoint data is the preferable approach in grid environments, where heterogeneity is pervasive. We present a novel application-level uncoordinated checkpoint protocol based on the Job Progress Description (JPD), which is composed of a Job Progress Record Object and a group of Job Progress State Objects. These two kinds of objects act as the checkpoint data for the job, and their methods can be used as checkpoint APIs. Extended with sender-based message logging, the protocol can be used by message-passing applications in a computational grid. Emulation with a class of master-worker message-passing applications shows that this checkpointing protocol can dramatically reduce the wall-clock time of the application when failures occur.
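To make the idea of user-defined checkpoint data concrete, the following is a minimal Python sketch of application-level checkpointing for a single task. It is not the paper's protocol or API: the class names only echo the abstract's terminology, and every field, method, and helper (`partial_sum`, `next_item`, `save`, `load`, `run_task`, `fail_at`) is a hypothetical choice made for illustration. Sender-based message logging is not covered.

```python
import json
import os
import tempfile
from dataclasses import dataclass, field, asdict


@dataclass
class JobProgressState:
    """User-defined state for one task (hypothetical layout)."""
    task_id: int
    partial_sum: int = 0
    next_item: int = 0


@dataclass
class JobProgressRecord:
    """Job-wide record of finished tasks and per-task states (hypothetical)."""
    completed: list = field(default_factory=list)
    states: dict = field(default_factory=dict)  # task_id -> JobProgressState

    def save(self, path):
        """Write the checkpoint atomically: temp file first, then rename."""
        data = {
            "completed": self.completed,
            "states": {str(k): asdict(v) for k, v in self.states.items()},
        }
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        os.replace(tmp, path)

    @classmethod
    def load(cls, path):
        """Return the saved record, or an empty one if no checkpoint exists."""
        if not os.path.exists(path):
            return cls()
        with open(path) as f:
            data = json.load(f)
        states = {int(k): JobProgressState(**v)
                  for k, v in data["states"].items()}
        return cls(completed=data["completed"], states=states)


def run_task(record, task_id, items, path, fail_at=None):
    """Sum `items`, checkpointing after each item; resume from saved state.

    `fail_at` simulates a node failure before the item at that index.
    """
    state = record.states.setdefault(task_id, JobProgressState(task_id))
    while state.next_item < len(items):
        if fail_at is not None and state.next_item == fail_at:
            raise RuntimeError("simulated node failure")
        state.partial_sum += items[state.next_item]
        state.next_item += 1
        record.save(path)
    if task_id not in record.completed:
        record.completed.append(task_id)
    record.save(path)
    return state.partial_sum
```

Because the checkpoint holds only application-defined values serialized as JSON, a restarted process on a different machine architecture can load it and continue from `next_item`, which is the portability argument for application-level checkpointing in heterogeneous environments.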

Keywords

Computational Grid, Grid Environment, Recovery Service, Failure Sequence, Checkpoint Mechanism
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Chunjiang Li (1)
  • Xuejun Yang (1)
  • Nong Xiao (1)
  1. School of Computer, National University of Defense Technology, Changsha, China
