Lightweight Virtual Machine Checkpoint and Rollback for Long-running Applications

  • Lei Cui
  • Zhiyu Hao
  • Lun Li
  • Haiqiang Fei
  • Zhenquan Ding
  • Bo Li
  • Peng Liu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9530)


Checkpoint/rollback is an effective approach to guarantee that long-running applications can complete in the face of failures. However, it does not come for free. The application suffers long downtime and a performance penalty while it is being checkpointed or rolled back, which adds overhead to application execution time. The problem is worse in virtualized environments, mainly because virtual machines are heavyweight. This paper proposes warmCR, a lightweight checkpoint/rollback system for virtual machines that aims to reduce the overhead it adds to application execution time. First, warmCR employs the redirect-on-write approach to create disk checkpoints and leverages the copy-on-write method to create memory checkpoints live, so that both the downtime and the checkpoint duration are reduced. Second, we propose a working-set-based rollback approach that provides short downtime without compromising application performance. Third, we propose workload-aware batched processing to achieve a trade-off between downtime and performance loss. In addition to presenting warmCR, we detail its implementation and provide extensive experimental results to demonstrate its efficiency and effectiveness.
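To make the copy-on-write idea concrete, the following is a minimal sketch of a live memory checkpoint over a toy page array. It is not warmCR's implementation: the class `CowCheckpoint`, its methods, and the `batch` parameter are hypothetical names chosen for illustration, and a real hypervisor would rely on page write protection rather than an explicit `write` call.

```python
class CowCheckpoint:
    """Toy copy-on-write checkpoint over a list of guest 'pages' (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory      # live guest memory, mutated in place
        self.saved = {}           # page index -> content at checkpoint time
        self.pending = set()      # pages not yet copied into the checkpoint

    def begin(self):
        # Checkpoint starts: every page is treated as write-protected.
        self.saved = {}
        self.pending = set(range(len(self.memory)))

    def write(self, page, data):
        # Simulated write fault: save the old content before overwriting it.
        if page in self.pending:
            self.saved[page] = self.memory[page]
            self.pending.discard(page)
        self.memory[page] = data

    def background_copy(self, batch):
        # Background saver: copy up to `batch` untouched pages per call.
        for page in sorted(self.pending)[:batch]:
            self.saved[page] = self.memory[page]
            self.pending.discard(page)
        return not self.pending   # True once the checkpoint is complete

    def image(self):
        # Checkpoint image: memory exactly as it was when begin() was called.
        return [self.saved[i] for i in range(len(self.memory))]


mem = ["a", "b", "c", "d"]
ckpt = CowCheckpoint(mem)
ckpt.begin()
ckpt.write(2, "C")                # the guest keeps running during the checkpoint
while not ckpt.background_copy(batch=2):
    pass
print(ckpt.image(), mem)
```

In this sketch the guest is paused only for `begin()`, and later writes pay a one-time copy per page. That is the general reason copy-on-write can shorten downtime compared with stopping the guest for the whole save.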


Keywords: Checkpoint, Rollback, Virtual machine, Reliability, Long-running application



We would like to thank the anonymous reviewers for their valuable comments and help in improving this paper. This work is supported by the National Key Technology Support Program under grant No. 2012BAH46B02.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Lei Cui (1)
  • Zhiyu Hao (1, email author)
  • Lun Li (1)
  • Haiqiang Fei (1)
  • Zhenquan Ding (1)
  • Bo Li (2)
  • Peng Liu (2)

  1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
  2. School of Computer Science and Engineering, Beihang University, Beijing, China
