Software microbenchmarking in the cloud. How bad is it really?


Rigorous performance engineering traditionally assumes measuring on bare-metal environments to control for as many confounding factors as possible. Unfortunately, some researchers and practitioners might not have the access, knowledge, or funds to operate dedicated performance-testing hardware, making public clouds an attractive alternative. However, shared public cloud environments are inherently unpredictable in terms of the system performance they provide. In this study, we explore the effects of cloud environments on the variability of performance test results and to what extent slowdowns can still be reliably detected even in a public cloud. We focus on software microbenchmarks as an example of performance tests and execute extensive experiments on three different well-known public cloud services (AWS, GCE, and Azure) using three different cloud instance types per service. We also compare the results to a hosted bare-metal offering from IBM Bluemix. In total, we gathered more than 4.5 million unique microbenchmarking data points from benchmarks written in Java and Go. We find that the variability of results differs substantially between benchmarks and instance types (by a coefficient of variation from 0.03% to > 100%). However, executing test and control experiments on the same instances (in randomized order) allows us to detect slowdowns of 10% or less with high confidence, using state-of-the-art statistical tests (i.e., Wilcoxon rank-sum and overlapping bootstrapped confidence intervals). Finally, our results indicate that the Wilcoxon rank-sum test manages to detect smaller slowdowns than overlapping confidence intervals in cloud environments.
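The detection approach summarized above can be sketched in a few lines: quantify per-benchmark variability as a coefficient of variation, and flag a slowdown when bootstrapped confidence intervals for a test and a control measurement series do not overlap. The sketch below is illustrative Python, not the authors' analysis code; it uses a simplified percentile bootstrap of the mean, whereas the paper's analysis relies on hierarchical bootstrapping over instances, trials, and iterations.

```python
import random
import statistics

def coefficient_of_variation(samples):
    # CV = standard deviation / mean; the study reports this per
    # benchmark and instance type (ranging from 0.03% to > 100%).
    return statistics.stdev(samples) / statistics.mean(samples)

def bootstrap_ci(samples, level=0.95, iterations=1000, rng=None):
    # Percentile bootstrap confidence interval for the mean:
    # resample with replacement, collect the resample means, and
    # take the central `level` fraction of their distribution.
    rng = rng or random.Random(42)  # fixed seed for reproducibility
    means = sorted(
        statistics.mean(rng.choices(samples, k=len(samples)))
        for _ in range(iterations)
    )
    lo = means[int((1 - level) / 2 * iterations)]
    hi = means[int((1 + level) / 2 * iterations) - 1]
    return lo, hi

def slowdown_detected(control, test):
    # Declare a slowdown (or speedup) when the two bootstrapped
    # confidence intervals do not overlap at all.
    c_lo, c_hi = bootstrap_ci(control)
    t_lo, t_hi = bootstrap_ci(test)
    return t_lo > c_hi or c_lo > t_hi
```

For example, with 50 control measurements around 100 ms and 50 test measurements around 110 ms (a 10% slowdown) and modest noise, the intervals separate and the slowdown is flagged, while comparing a series against itself is not.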







The research leading to these results has received funding from the Swiss National Science Foundation (SNF) under project MINCA – Models to Increase the Cost Awareness of Cloud Developers (no. 165546), the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, and Chalmers' ICT Area of Advance. Further, we are grateful to the anonymous reviewers for their feedback, which helped to significantly improve this paper's quality.

Author information



Corresponding author

Correspondence to Christoph Laaber.


Communicated by: Vittorio Cortellessa

Cite this article


Laaber, C., Scheuner, J. & Leitner, P. Software microbenchmarking in the cloud. How bad is it really? Empir Software Eng 24, 2469–2508 (2019).



Keywords

  • Performance testing
  • Microbenchmarking
  • Cloud
  • Performance-regression detection