Software microbenchmarking in the cloud. How bad is it really?

Laaber, Christoph; Scheuner, Joel; Leitner, Philipp

doi:10.1007/s10664-019-09681-1

Published: 17 April 2019

Volume 24, pages 2469–2508, (2019)

Empirical Software Engineering

Abstract

Rigorous performance engineering traditionally assumes measuring on bare-metal environments to control for as many confounding factors as possible. Unfortunately, some researchers and practitioners might not have access, knowledge, or funds to operate dedicated performance-testing hardware, making public clouds an attractive alternative. However, shared public cloud environments are inherently unpredictable in terms of the system performance they provide. In this study, we explore the effects of cloud environments on the variability of performance test results and to what extent slowdowns can still be reliably detected even in a public cloud. We focus on software microbenchmarks as an example of performance tests and execute extensive experiments on three different well-known public cloud services (AWS, GCE, and Azure) using three different cloud instance types per service. We also compare the results to a hosted bare-metal offering from IBM Bluemix. In total, we gathered more than 4.5 million unique microbenchmarking data points from benchmarks written in Java and Go. We find that the variability of results differs substantially between benchmarks and instance types (by a coefficient of variation from 0.03% to > 100%). However, executing test and control experiments on the same instances (in randomized order) allows us to detect slowdowns of 10% or less with high confidence, using state-of-the-art statistical tests (i.e., Wilcoxon rank-sum and overlapping bootstrapped confidence intervals). Finally, our results indicate that Wilcoxon rank-sum manages to detect smaller slowdowns in cloud environments.
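
The three statistics named in the abstract (coefficient of variation, Wilcoxon rank-sum, and overlapping bootstrapped confidence intervals) can be illustrated with a minimal sketch. This is not the authors' analysis pipeline. The data are synthetic, and the function names, the sample size of 50, the 99% confidence level, and the 1000 bootstrap iterations are all assumptions made for illustration. The bootstrap here is a plain percentile bootstrap of the mean, and the rank-sum p-value uses the normal approximation without a tie correction of the variance.

```python
import math
import random
import statistics


def coefficient_of_variation(samples):
    """Return the CV in percent: sample standard deviation over the mean."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)


def bootstrap_ci_of_mean(samples, rng, iterations=1000, confidence=0.99):
    """Percentile bootstrap confidence interval for the mean (assumed setup)."""
    n = len(samples)
    means = sorted(
        statistics.mean(rng.choices(samples, k=n)) for _ in range(iterations)
    )
    lower_idx = int((1.0 - confidence) / 2.0 * iterations)
    upper_idx = iterations - 1 - lower_idx
    return means[lower_idx], means[upper_idx]


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values and give each the average rank.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_x - mean_w) / sd_w
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic execution times: the "test" version is 10% slower on average.
    control = [rng.gauss(100.0, 5.0) for _ in range(50)]
    test = [rng.gauss(110.0, 5.0) for _ in range(50)]

    print("CV of control (%):", coefficient_of_variation(control))
    print("rank-sum p-value:", rank_sum_p_value(control, test))
    ci_control = bootstrap_ci_of_mean(control, rng)
    ci_test = bootstrap_ci_of_mean(test, rng)
    print("CIs overlap:", intervals_overlap(ci_control, ci_test))
```

Under these assumptions, a slowdown is reported when the rank-sum p-value falls below the chosen significance level, or when the two bootstrapped intervals do not overlap. Measurements taken in a real cloud environment are typically noisier and not normally distributed, so the clean separation in this synthetic example should not be read as representative.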

Acknowledgements

The research leading to these results has received funding from the Swiss National Science Foundation (SNF) under project MINCA – Models to Increase the Cost Awareness of Cloud Developers (no. 165546), the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, and Chalmers' ICT Area of Advance. Further, we are grateful to the anonymous reviewers for their feedback, which helped to significantly improve this paper's quality.

Author information

Authors and Affiliations

Department of Informatics, University of Zurich, Zurich, Switzerland
Christoph Laaber
Software Engineering Division, Chalmers | University of Gothenburg, Gothenburg, Sweden
Joel Scheuner & Philipp Leitner

Corresponding author

Correspondence to Christoph Laaber.

Additional information

Communicated by: Vittorio Cortellessa

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Laaber, C., Scheuner, J. & Leitner, P. Software microbenchmarking in the cloud. How bad is it really?. Empir Software Eng 24, 2469–2508 (2019). https://doi.org/10.1007/s10664-019-09681-1

Published: 17 April 2019
Issue Date: 15 August 2019
DOI: https://doi.org/10.1007/s10664-019-09681-1
