Empirical study on the discrepancy between performance testing results from virtual and physical environments

Abstract

Large software systems often undergo performance tests to ensure their capability to handle expected loads. These performance tests often consume large amounts of computing resources and time, since heavy loads need to be generated. Making matters worse, the ever-evolving field requires frequent updates to the performance testing environment. In practice, virtual machines (VMs) are widely used to provide flexible and less costly environments for performance tests. However, the use of VMs may introduce confounding overhead (e.g., a higher than expected memory utilization with unstable I/O traffic) to the testing environment and lead to unrealistic performance testing results. Yet, little research has studied the impact of using VMs on the results of performance testing activities. To evaluate the discrepancy between performance testing results from virtual and physical environments, we perform a case study on two open source systems, namely the Dell DVD Store (DS2) and CloudStore. We conduct the same performance tests in both virtual and physical environments and compare the performance testing results based on the three aspects that are typically examined for such results: 1) single performance metrics (e.g., CPU time from the virtual environment vs. CPU time from the physical environment), 2) the relationships among performance metrics (e.g., the correlation between CPU and I/O), and 3) performance models that are built to predict system performance. Our results show that 1) a single metric from the virtual and physical environments does not follow the same distribution, hence practitioners cannot simply use a scaling factor to compare performance between environments; 2) correlations among performance metrics in virtual environments differ from those in physical environments; and 3) statistical models built on performance metrics from virtual environments differ from the models built from physical environments, suggesting that practitioners cannot directly carry performance testing results across virtual and physical environments. To assist practitioners in leveraging performance testing results from both environments, we investigate ways to reduce the discrepancy. We find that the discrepancy can be reduced by normalizing performance metrics based on deviance. Overall, we suggest that practitioners should not use performance testing results from a virtual environment under the simple assumption of a straightforward performance overhead. Instead, they should consider applying normalization techniques to reduce the discrepancy before comparing performance testing results from virtual and physical environments.
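
The abstract mentions three kinds of comparisons (single metrics, correlations among metrics, and prediction models) and a deviance-based normalization. The sketch below is not the authors' analysis (their scripts are linked under Notes); it is a minimal, hypothetical Python illustration of how such comparisons might be run on one performance counter, using synthetic data and interpreting "normalizing based on deviance" as centering each metric and scaling by its own spread, which is an assumption made only for illustration.

```python
# Minimal sketch (not the authors' code): compares a synthetic performance
# counter collected in a "virtual" and a "physical" environment along the
# three aspects described in the abstract. All values are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical performance counters sampled during the same load test.
cpu_virtual  = rng.normal(loc=55, scale=12, size=300)   # % CPU in the VM
cpu_physical = rng.normal(loc=40, scale=5,  size=300)   # % CPU on bare metal
io_virtual   = 0.8 * cpu_virtual  + rng.normal(0, 8, 300)
io_physical  = 0.5 * cpu_physical + rng.normal(0, 4, 300)

# 1) Single metric: do the two samples follow the same distribution?
ks_stat, ks_p = stats.ks_2samp(cpu_virtual, cpu_physical)
print(f"KS test on raw CPU: statistic={ks_stat:.3f}, p={ks_p:.3g}")

# 2) Relationship among metrics: compare CPU-vs-I/O correlations
#    in each environment.
rho_v, _ = stats.spearmanr(cpu_virtual, io_virtual)
rho_p, _ = stats.spearmanr(cpu_physical, io_physical)
print(f"Spearman rho (virtual)={rho_v:.2f}, (physical)={rho_p:.2f}")

# 3) Deviance-style normalization (assumed here to mean centering each
#    metric and scaling by its own spread) before re-comparing.
def normalize(x):
    return (x - np.mean(x)) / np.std(x)

ks_stat_n, ks_p_n = stats.ks_2samp(normalize(cpu_virtual),
                                   normalize(cpu_physical))
print(f"KS test on normalized CPU: statistic={ks_stat_n:.3f}, p={ks_p_n:.3g}")
```

In this sketch, a large Kolmogorov–Smirnov statistic on the raw counters together with diverging Spearman correlations would correspond to the kind of discrepancy the abstract describes, while the re-test on normalized values shows where a normalization step would slot into the comparison.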

Notes

  1. The complete results, data and scripts are shared online at http://das.encs.concordia.ca/members/moiz-arif/

Author information

Corresponding author

Correspondence to Muhammad Moiz Arif.

Additional information

Communicated by: Mark Grechanik

About this article

Cite this article

Arif, M.M., Shang, W. & Shihab, E. Empirical study on the discrepancy between performance testing results from virtual and physical environments. Empir Software Eng 23, 1490–1518 (2018). https://doi.org/10.1007/s10664-017-9553-x
