Empirical Software Engineering, Volume 23, Issue 3, pp 1490–1518

Empirical study on the discrepancy between performance testing results from virtual and physical environments

  • Muhammad Moiz Arif
  • Weiyi Shang
  • Emad Shihab


Large software systems often undergo performance tests to ensure that they can handle expected loads. These tests often consume large amounts of computing resources and time, since heavy loads need to be generated. To make matters worse, the ever-evolving field requires frequent updates to the performance testing environment. In practice, virtual machines (VMs) are widely used to provide flexible and less costly environments for performance tests. However, the use of VMs may introduce confounding overhead (e.g., higher than expected memory utilization with unstable I/O traffic) to the testing environment and lead to unrealistic performance testing results. Yet little research has studied the impact of using VMs on the results of performance testing activities. To evaluate the discrepancy between the performance testing results from virtual and physical environments, we perform a case study on two open source systems, namely Dell DVD Store (DS2) and CloudStore. We conduct the same performance tests in both virtual and physical environments and compare the results based on the three aspects that are typically examined for performance testing results: 1) a single performance metric (e.g., CPU time from the virtual environment vs. CPU time from the physical environment), 2) the relationships among performance metrics (e.g., the correlation between CPU and I/O), and 3) performance models that are built to predict system performance.
Our results show that 1) a single metric from virtual and physical environments does not follow the same distribution, hence practitioners cannot simply use a scaling factor to compare performance between environments; 2) correlations among performance metrics in virtual environments differ from those in physical environments; and 3) statistical models built from the performance metrics of virtual environments differ from the models built from physical environments, suggesting that practitioners cannot use performance testing results across virtual and physical environments. To help practitioners leverage performance testing results in both environments, we investigate ways to reduce the discrepancy. We find that the discrepancy can be reduced by normalizing performance metrics based on deviance. Overall, we suggest that practitioners should not use performance testing results from virtual environments under the simple assumption of a straightforward performance overhead. Instead, practitioners should consider leveraging normalization techniques to reduce the discrepancy before examining performance testing results from virtual and physical environments.
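The abstract does not give the normalization formula, so the following is only a minimal sketch of one plausible deviance-based normalization: centering each metric on its median and dividing by its median absolute deviation (MAD). The choice of MAD, the function name `mad_normalize`, and the CPU readings are all assumptions made for illustration; they are not data or code from the study.

```python
from statistics import median


def mad_normalize(values):
    """Center on the median and scale by the median absolute deviation.

    Assumption: MAD stands in for the "deviance" mentioned in the abstract.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        raise ValueError("median absolute deviation is zero; cannot scale")
    return [(v - center) / mad for v in values]


# Hypothetical CPU readings (percent) from the same test in each environment.
physical_cpu = [20.0, 22.0, 21.0, 25.0, 23.0]
virtual_cpu = [41.0, 47.0, 44.0, 56.0, 50.0]

physical_norm = mad_normalize(physical_cpu)
virtual_norm = mad_normalize(virtual_cpu)

print(physical_norm)
print(virtual_norm)
```

In this made-up data the virtual readings differ from the physical ones by both an offset and a change in spread, which a single scaling factor cannot remove; after normalization the two series become directly comparable. Real measurements would not be expected to line up this cleanly.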


Keywords: Software performance engineering; Software performance analysis and testing on virtual environments



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
