An optimistic checkpoint mechanism based on job characteristics and resource availability for dynamic grids

Wuhan University Journal of Natural Sciences

Abstract

This paper proposes an optimistic checkpoint mechanism for dynamic grids (OCM4G) based on job characteristics and resource availability. Using knowledge of both, OCM4G determines whether to checkpoint a given job running on a given resource node and establishes optimal aperiodic checkpoint intervals. We evaluate OCM4G on a real grid environment (ChinaGrid); the results show that it performs better than periodic checkpointing and than the analytical method of calculating aperiodic checkpoint intervals.
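For orientation, the periodic baseline that the abstract compares against can be sketched with Young's well-known first-order approximation, which places checkpoints every sqrt(2 * C * MTBF) seconds for a checkpoint cost C and a mean time between failures MTBF. The sketch below is not OCM4G, whose algorithm the abstract does not describe. The function names, the parameter values, and the simple "skip checkpointing for short jobs" rule are assumptions made purely for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimum periodic interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds.
    mtbf_s: mean time between failures of the resource node, in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def should_checkpoint(job_runtime_s: float,
                      checkpoint_cost_s: float,
                      mtbf_s: float) -> bool:
    """Hypothetical decision rule (an assumption, not the paper's):
    checkpoint only when the job is expected to outlast one interval,
    since a shorter job would finish before its first checkpoint.
    """
    return job_runtime_s > young_interval(checkpoint_cost_s, mtbf_s)


if __name__ == "__main__":
    # Illustrative values only: 50 s per checkpoint, MTBF of 10,000 s.
    interval = young_interval(50.0, 10_000.0)
    print(f"periodic interval: {interval:.0f} s")
    print("10-minute job:", should_checkpoint(600.0, 50.0, 10_000.0))
    print("2-hour job:", should_checkpoint(7_200.0, 50.0, 10_000.0))
```

A periodic scheme like this uses one fixed interval per node, whereas an aperiodic scheme such as the one the abstract describes varies the interval with job characteristics and resource availability.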

Author information

Corresponding author

Correspondence to Yongcai Tao.

Additional information

Foundation item: Supported by the National Natural Science Foundation of China (90412010, 60603058, and 60673174) and by the Program for New Century Excellent Talents in University of the Ministry of Education of China (NCET-07-0334).

Biography: TAO Yongcai, male, Ph.D., research interests: cluster and grid computing, fault tolerance, Web services, and cloud computing.

About this article

Cite this article

Tao, Y., Jin, H., Wu, S. et al. An optimistic checkpoint mechanism based on job characteristics and resource availability for dynamic grids. Wuhan Univ. J. Nat. Sci. 16, 213–222 (2011). https://doi.org/10.1007/s11859-011-0739-6

  • DOI: https://doi.org/10.1007/s11859-011-0739-6
