Springer Nature is making SARS-CoV-2 and COVID-19 research free. View research | View latest news | Sign up for updates

Asymmetric virtual machine replication for low latency and high available service

  • 66 Accesses


Providing fault tolerance support to client-to-server applications is critical in the data center and cloud computing environments. Virtualization provides a direct way of achieving high availability by encapsulating the protected applications into the virtual machine and by periodically checkpointing the entire virtual machine (VM) state to the backup replication. However, existing VM replication solutions suffer from either excessive checkpointing overhead and network latency or unnecessary CPU resources consumption in backup replication. In this study, we exploit the ingredients of output packets and consider that the replication system maintains external consistency if the pre-released packets originate the already synchronized states. Furthermore, we transform the active-active primary and slave VM combination into an active-semiactive one by shrinking the number of active virtual CPUs (vCPUs) in the slave VM. The former optimization mechanism improves the performance in read-mostly client-to-server networked applications, whereas the latter one relieves the problem of double scheduling in the slave host. Therefore, we proposed the COLO++ system which is built over COLO and is a non-stop service solution with coarse-grained lock-stepping VMs for client-to-server systems. The two plus signs represent two of the optimizations. Experimental results using COLO++ implemented on KVM and Linux depict that it achieves nearly native VM performance under read-mostly workloads, as well as lower scheduling overhead in backup replication.

This is a preview of subscription content, log in to check access.


  1. 1

    Jiang B, Ravindran B, Kim C. Lightweight live migration for high availability cluster service. In: Proceedings of the 12th International Conference on Stabilization, Safety, and Security of Distributed Systems, New York, 2010. 420–434

  2. 2

    Mullender S. Distributed systems. United States of America: ACM Press, 1993: 12

  3. 3

    Kivity A, Kamay Y, Laor D, et al. Kvm: the Linux virtual machine monitor. In: Proceedings of the Linux Symposium, Ottawa, 2007. 1: 225–230

  4. 4

    Barham P, Dragovic B, Fraser K, et al. Xen and the art of virtualization. In: Proceedings of the ACM SIGOPS Operating Systems Review, New York, 2003. 164–177

  5. 5

    Bressoud T C, Schneider F B. Hypervisor-based fault tolerance. ACM Trans Comput Syst, 1996, 14: 80–107

  6. 6

    Cully B, Lefebvre G, Meyer D, et al. Remus: high availability via asynchronous virtual machine replication. In: Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, San Francisco, 2008. 161–174

  7. 7

    Dong Y Z, Ye W, Jiang Y H, et al. Colo: coarse-grained lock-stepping virtual machines for non-stop service. In: Proceedings of the 4th Annual Symposium on Cloud Computing, Santa Clara, 2013. 3

  8. 8

    Clark C, Fraser K, Hand S, et al. Live migration of virtual machines. In: Proceedings of the 2nd Symposium on Networked Systems Design and Implementation. Berkeley: USENIX Association, 2005. 2: 273–286

  9. 9

    Elnozahy E N M, Alvisi L, Wang Y M, et al. A survey of rollback-recovery protocols in message-passing systems. ACM Comput Surv, 2002, 34: 375–408

  10. 10

    Friebel T, Biemueller S. How to deal with lock holder preemption. In: Proceedings of Xen Summit North America, Boston, 2008. 164

  11. 11

    Enck W, Gilbert P, Han S, et al. TaintDroid: an information-flow tracking system for realtime privacy monitoring on smartphones. ACM Trans Comput Syst, 2014, 32: 5

  12. 12

    Song X, Shi J, Chen H, et al. Schedule processes, not VCPUs. In: Proceedings of the 4th Asia-Pacific Workshop on Systems, New York, 2013. 1

  13. 13

    Cheng L, Rao J, Lau F. vScale: automatic and efficient processor scaling for SMP virtual machines. In: Proceedings of the 11th European Conference on Computer Systems, New York, 2016. 2

  14. 14

    Russell R. Virtio: towards a de-facto standard for virtual I/O devices. ACM SIGOPS Oper Syst Rev, 2008, 42: 95–103

  15. 15

    Intel R. Page modification logging for virtual machine monitor white paper. Intel Whitepaper, 2015. https://doi.org/www.intel.com/content/www/us/en/processors/page-modification-logging-vmm-white-paper.html

  16. 16

    Intel R 82576 and 82599 Gigabit Ethernet controller datashee. Intel Whitepaper, 2002. https://doi.org/www.intel.com/content/www/us/en/embedded/products/networking/82599-10-gbe-controller-datasheet.html

  17. 17

    Fitzpatrick B. Distributed caching with memcached. Linux J, 2004, 2004: 5

  18. 18

    Kopytov A. SysBench: a system performance benchmark. https://doi.org/sysbench.sourceforge.net, 2004

  19. 19

    Bienia C, Kumar S, Singh J P, et al. The PARSEC benchmark suite: characterization and architectural implications. In: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, New York, 2008. 72–81

  20. 20

    Castro M, Liskov B. Practical byzantine fault tolerance and proactive recovery. ACM Trans Comput Syst, 2002, 20: 398–461

  21. 21

    Lamport L, Shostak R, Pease M. The Byzantine generals problem. ACM Trans Program Lang Syst, 1982, 4: 382–401

  22. 22

    Schneider F B. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput Surv, 1990, 22: 299–319

  23. 23

    Bernick D, Bruckert B, Vigna P D, et al. NonStop/spl reg/advanced architecture. In: Proceedings of the International Conference on Dependable Systems and Networks, Yokohama, 2005. 12–21

  24. 24

    Webber S, Beirne J. The stratus architecture. In: Proceedings of the 21st International Symposium on Fault-Tolerant Computing, Montr´eal, 1991. 79–85

  25. 25

    Jeffery C M, Figueiredo R J O. A flexible approach to improving system reliability with virtual lockstep. IEEE Trans Dependable Secure Comput, 2012, 9: 2–15

  26. 26

    Scales D J, Nelson M, Venkitachalam G. The design of a practical system for fault-tolerant virtual machines. ACM SIGOPS Operat Syst Rev, 2010, 44: 30–39

  27. 27

    Reiser H P, Kapitza R. Hypervisor-based efficient proactive recovery. In: Proceedings of the 26th IEEE International Symposium on Reliable Distributed Systems. Washington: IEEE Computer Society, 2007. 83–92

  28. 28

    Minhas U F, Rajagopalan S, Cully B, et al. RemusDB: transparent high availability for database systems. VLDB J, 2013, 22: 29–45

  29. 29

    Lu M, Chiueh T. Fast memory state synchronization for virtualization-based fault tolerance. In: Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks, Lisbon, 2009. 534–543

  30. 30

    Zhu J, Dong W, Jiang Z F, et al. Improving the performance of hypervisor-based fault tolerance. In: Proceedings of the International Symposium on Parallel and Distributed Processing, Atlanta, 2010. 1–10

  31. 31

    Huang D, He B, Miao C. A survey of resource management in multi-tier web applications. IEEE Commun Surv Tut, 2014, 16: 1574–1590

  32. 32

    Liu H, He B. VMbuddies: coordinating live migration of multi-tier applications in cloud environments. IEEE Trans Parallel Distrib Syst, 2015, 26: 1192–1205

Download references

Author information

Correspondence to Haibo Chen.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Chen, R., Chen, H. Asymmetric virtual machine replication for low latency and high available service. Sci. China Inf. Sci. 61, 092110 (2018). https://doi.org/10.1007/s11432-017-9292-9

Download citation


  • virtualization
  • fault tolerance
  • VM replication
  • memory
  • CPU scheduling