Lightweight Virtual Machine Checkpoint and Rollback for Long-running Applications

Cui, Lei; Hao, Zhiyu; Li, Lun; Fei, Haiqiang; Ding, Zhenquan; Li, Bo; Liu, Peng

doi:10.1007/978-3-319-27137-8_42

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9530)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

Checkpoint/rollback is an effective way to ensure that long-running applications can complete in the face of failures. However, it does not come for free. While it is being checkpointed or rolled back, the application suffers long downtime and a performance penalty, both of which add to its execution time. The problem is worse in virtualized environments, mainly because virtual machines are heavyweight. This paper proposes warmCR, a lightweight checkpoint/rollback system for virtual machines that aims to reduce the overhead it adds to application execution time. First, warmCR uses a redirect-on-write approach to create the disk checkpoint and a copy-on-write method to create the memory checkpoint live, so that both the downtime and the checkpoint duration are reduced. Second, we propose a working-set-based rollback approach that provides short downtime without compromising application performance. Third, we propose workload-aware batched processing to trade off downtime against performance loss. In addition to presenting warmCR, we detail its implementation and provide extensive experimental results that demonstrate its efficiency and effectiveness.
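
The two checkpointing techniques named in the abstract can be illustrated with a toy model. The sketch below is not warmCR's implementation, which this page does not describe. The class names `CowMemory` and `RowDisk`, their methods, and the list and dict data structures are hypothetical and chosen only for illustration. It shows the general idea: copy-on-write saves a page's old content just before its first post-checkpoint write, while redirect-on-write leaves old disk blocks untouched and sends new writes to a fresh layer.

```python
class CowMemory:
    """Toy page store with a copy-on-write checkpoint (illustrative only)."""

    def __init__(self, pages):
        self.pages = list(pages)  # live guest memory, one entry per page
        self.checkpoint = {}      # page index -> content at checkpoint time
        self.pending = set()      # pages not yet saved into the checkpoint

    def begin_checkpoint(self):
        # Only bookkeeping happens here, so the pause stays short.
        self.checkpoint = {}
        self.pending = set(range(len(self.pages)))

    def write(self, index, value):
        # Preserve the old content before the first write after the checkpoint.
        if index in self.pending:
            self.checkpoint[index] = self.pages[index]
            self.pending.discard(index)
        self.pages[index] = value

    def save_in_background(self, budget):
        # Copy up to `budget` untouched pages while the guest keeps running.
        for index in sorted(self.pending)[:budget]:
            self.checkpoint[index] = self.pages[index]
            self.pending.discard(index)

    def snapshot(self):
        # Valid only once every page has been saved.
        assert not self.pending, "checkpoint still in progress"
        return [self.checkpoint[i] for i in range(len(self.pages))]


class RowDisk:
    """Toy block store with a redirect-on-write checkpoint (illustrative only)."""

    def __init__(self, blocks):
        self.layers = [dict(enumerate(blocks))]  # oldest layer first

    def begin_checkpoint(self):
        # Freeze the current top layer; later writes go to a new empty layer.
        self.layers.append({})
        return len(self.layers) - 2  # index of the frozen layer

    def write(self, index, value):
        self.layers[-1][index] = value

    def read(self, index, upto=None):
        # Search from the newest visible layer down to the base layer.
        visible = self.layers if upto is None else self.layers[: upto + 1]
        for layer in reversed(visible):
            if index in layer:
                return layer[index]
        raise KeyError(index)
```

In this model neither `begin_checkpoint` copies any data, which is why both techniques keep the pause short: the copying cost is deferred to the first write (memory) or avoided entirely (disk).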

Acknowledgement

We would like to thank the anonymous reviewers for their valuable comments and help in improving this paper. This work is supported by the National Key Technology Support Program under grant No. 2012BAH46B02.

Author information

Authors and Affiliations

Institute of Information Engineering, Chinese Academy of Sciences, Beijing, 100093, China
Lei Cui, Zhiyu Hao, Lun Li, Haiqiang Fei & Zhenquan Ding
School of Computer Science and Engineering, Beihang University, Beijing, 100191, China
Bo Li & Peng Liu

Corresponding author

Correspondence to Zhiyu Hao.

Editor information

Editors and Affiliations

Central South University, Changsha, China
Guojun Wang
The University of Sydney, Sydney, New South Wales, Australia
Albert Zomaya
University of Murcia, Murcia, Murcia, Spain
Gregorio Martinez
Hunan University, Changsha, China
Kenli Li

About this paper

Cite this paper

Cui, L. et al. (2015). Lightweight Virtual Machine Checkpoint and Rollback for Long-running Applications. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9530. Springer, Cham. https://doi.org/10.1007/978-3-319-27137-8_42

DOI: https://doi.org/10.1007/978-3-319-27137-8_42
Published: 16 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27136-1
Online ISBN: 978-3-319-27137-8
eBook Packages: Computer Science, Computer Science (R0)
