An optimistic checkpoint mechanism based on job characteristics and resource availability for dynamic grids

Tao, Yongcai; Jin, Hai; Wu, Song; Shi, Xuanhua

doi:10.1007/s11859-011-0739-6

Wuhan University Journal of Natural Sciences, Volume 16, pages 213–222 (2011)

Published: 01 June 2011

Abstract

This paper proposes an optimistic checkpoint mechanism for dynamic grids (OCM4G) based on job characteristics and resource availability. Using knowledge of both, OCM4G determines whether to checkpoint a given job running on a given resource node and, if so, establishes optimal aperiodic checkpoint intervals. We evaluate OCM4G on a real grid environment (ChinaGrid); the results show that OCM4G achieves better performance than both the periodic checkpoint and the analytical method of calculating aperiodic checkpoint intervals.
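
For orientation, the periodic baseline mentioned in the abstract can be sketched in a few lines. The example below is not OCM4G: it uses Young's well-known first-order approximation for a fixed checkpoint interval, sqrt(2 × checkpoint cost × mean time between failures), plus a deliberately simple decision rule. The function names, the rule in `should_checkpoint`, and the sample numbers are all assumptions made for illustration and do not come from the paper.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the periodic checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds.
    mtbf_s: mean time between failures of the resource node, in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def should_checkpoint(job_runtime_s: float, checkpoint_cost_s: float,
                      mtbf_s: float) -> bool:
    """Hypothetical rule (an assumption, not the paper's criterion):
    skip checkpointing when the job would finish before the first
    periodic checkpoint is due."""
    return job_runtime_s > young_interval(checkpoint_cost_s, mtbf_s)


def periodic_checkpoint_times(job_runtime_s: float, checkpoint_cost_s: float,
                              mtbf_s: float) -> list[float]:
    """Checkpoint instants (seconds of useful work completed) for a job,
    or an empty list when the rule above says not to checkpoint."""
    if not should_checkpoint(job_runtime_s, checkpoint_cost_s, mtbf_s):
        return []
    interval = young_interval(checkpoint_cost_s, mtbf_s)
    times = []
    k = 1
    while k * interval < job_runtime_s:
        times.append(k * interval)
        k += 1
    return times


if __name__ == "__main__":
    # Illustrative numbers only: 50 s checkpoint cost, 10 000 s MTBF.
    print(young_interval(50.0, 10_000.0))
    print(periodic_checkpoint_times(3_500.0, 50.0, 10_000.0))
    print(periodic_checkpoint_times(500.0, 50.0, 10_000.0))
```

A mechanism such as the one the abstract describes would replace both the fixed interval and the yes/no rule with estimates derived from observed job characteristics and resource availability; this sketch only shows what the fixed-interval baseline looks like.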

Author information

Authors and Affiliations

School of Information Engineering, Zhengzhou University, Zhengzhou, 450000, Henan, China
Yongcai Tao
Services Computing Technology and System Lab, Cluster and Grid Computing Lab, Huazhong University of Science and Technology, Wuhan, 430074, Hubei, China
Yongcai Tao, Hai Jin, Song Wu & Xuanhua Shi

Corresponding author

Correspondence to Yongcai Tao.

Additional information

Foundation item: Supported by the National Natural Science Foundation of China (90412010, 60603058, and 60673174), and the Ministry of Education of China and Program for New Century Excellent Talents in University (NCET-07-0334).

Biography: TAO Yongcai, male, Ph.D.; research interests: cluster and grid computing, fault tolerance, Web services, and cloud computing.

About this article

Cite this article

Tao, Y., Jin, H., Wu, S. et al. An optimistic checkpoint mechanism based on job characteristics and resource availability for dynamic grids. Wuhan Univ. J. Nat. Sci. 16, 213–222 (2011). https://doi.org/10.1007/s11859-011-0739-6

Received: 19 October 2010
Published: 01 June 2011
Issue Date: June 2011
DOI: https://doi.org/10.1007/s11859-011-0739-6

Key words

CLC number

TP 302.1
