Abstract
This paper proposes an optimistic checkpoint mechanism for dynamic grids (OCM4G) based on job characteristics and resource availability. Using knowledge of both, OCM4G determines whether to checkpoint a given job running on a given resource node and establishes optimal aperiodic checkpoint intervals. We evaluate OCM4G on a real grid environment (ChinaGrid), and the results show that it outperforms both periodic checkpointing and the analytical method of calculating aperiodic checkpoint intervals.
Additional information
Foundation item: Supported by the National Natural Science Foundation of China (90412010, 60603058, and 60673174), and the Ministry of Education of China and Program for New Century Excellent Talents in University (NCET-07-0334)
Biography: TAO Yongcai, male, Ph. D., research direction: cluster and grid computing, fault-tolerance, Web services and cloud computing.
About this article
Cite this article
Tao, Y., Jin, H., Wu, S. et al. An optimistic checkpoint mechanism based on job characteristics and resource availability for dynamic grids. Wuhan Univ. J. Nat. Sci. 16, 213–222 (2011). https://doi.org/10.1007/s11859-011-0739-6
DOI: https://doi.org/10.1007/s11859-011-0739-6