Rigorous performance engineering traditionally assumes measuring on bare-metal environments to control for as many confounding factors as possible. Unfortunately, some researchers and practitioners might not have access to, knowledge of, or funds for dedicated performance-testing hardware, making public clouds an attractive alternative. However, shared public cloud environments are inherently unpredictable in terms of the system performance they provide. In this study, we explore the effects of cloud environments on the variability of performance test results and to what extent slowdowns can still be reliably detected even in a public cloud. We focus on software microbenchmarks as an example of performance tests and execute extensive experiments on three well-known public cloud services (AWS, GCE, and Azure), using three cloud instance types per service. We also compare the results to a hosted bare-metal offering from IBM Bluemix. In total, we gathered more than 4.5 million unique microbenchmarking data points from benchmarks written in Java and Go. We find that the variability of results differs substantially between benchmarks and instance types (with coefficients of variation ranging from 0.03% to over 100%). However, executing test and control experiments on the same instances (in randomized order) allows us to detect slowdowns of 10% or less with high confidence, using state-of-the-art statistical tests (i.e., Wilcoxon rank-sum and overlapping bootstrapped confidence intervals). Finally, our results indicate that, of these two tests, Wilcoxon rank-sum manages to detect smaller slowdowns in cloud environments.
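The two detection techniques named in the abstract can be sketched in a few lines of code. The following is a minimal, self-contained illustration — not the authors' actual analysis pipeline — of comparing a control and a test sample of benchmark latencies from the same instance, using a stdlib-only Wilcoxon rank-sum (Mann–Whitney U) test with the normal approximation and a percentile-bootstrap confidence-interval overlap check. The sample data, sample sizes, and significance threshold are hypothetical.

```python
import math
import random

def rank_sum_p_value(control, test):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) p-value via the
    normal approximation (reasonable for samples of ~20 or more)."""
    combined = sorted((v, i) for i, v in enumerate(control + test))
    # Assign ranks over the pooled sample, averaging ties.
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg
        i = j + 1
    n1, n2 = len(control), len(test)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def bootstrap_ci(sample, reps=2000, alpha=0.05, seed=42):
    """95% percentile-bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(sample, k=len(sample))) / len(sample)
        for _ in range(reps))
    return means[int(reps * alpha / 2)], means[int(reps * (1 - alpha / 2))]

# Hypothetical latency samples (ms) from one instance, randomized order
# assumed: the test version has a ~10% slowdown injected.
rng = random.Random(0)
control = [rng.gauss(100.0, 5.0) for _ in range(50)]
test = [rng.gauss(110.0, 5.0) for _ in range(50)]

p = rank_sum_p_value(control, test)
c_lo, c_hi = bootstrap_ci(control)
t_lo, t_hi = bootstrap_ci(test)
overlap = c_hi >= t_lo and t_hi >= c_lo

print(f"rank-sum p = {p:.4f} -> slowdown detected: {p < 0.05}")
print(f"control CI = ({c_lo:.1f}, {c_hi:.1f}), test CI = ({t_lo:.1f}, {t_hi:.1f})")
print(f"CIs overlap: {overlap} -> slowdown detected: {not overlap}")
```

With a clear 10% shift and 50 iterations per group, both checks flag the slowdown; the interesting cases studied in the paper are the noisier benchmark/instance combinations where the two approaches start to disagree.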
The research leading to these results has received funding from the Swiss National Science Foundation (SNF) under project MINCA – Models to Increase the Cost Awareness of Cloud Developers (no. 165546), the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, and Chalmers' ICT Area of Advance. Further, we are grateful to the anonymous reviewers for their feedback, which helped to significantly improve this paper's quality.
Communicated by: Vittorio Cortellessa
Laaber, C., Scheuner, J. & Leitner, P. Software microbenchmarking in the cloud. How bad is it really? Empir Software Eng 24, 2469–2508 (2019). https://doi.org/10.1007/s10664-019-09681-1
Keywords
- Performance testing
- Performance-regression detection