
Empirical Software Engineering, Volume 24, Issue 4, pp 2469–2508

Software microbenchmarking in the cloud. How bad is it really?

  • Christoph Laaber
  • Joel Scheuner
  • Philipp Leitner

Abstract

Rigorous performance engineering traditionally assumes measuring on bare-metal environments to control for as many confounding factors as possible. Unfortunately, some researchers and practitioners might not have the access, knowledge, or funds to operate dedicated performance-testing hardware, making public clouds an attractive alternative. However, shared public cloud environments are inherently unpredictable in terms of the system performance they provide. In this study, we explore the effects of cloud environments on the variability of performance test results and to what extent slowdowns can still be reliably detected even in a public cloud. We focus on software microbenchmarks as an example of performance tests and execute extensive experiments on three well-known public cloud services (AWS, GCE, and Azure), using three different cloud instance types per service. We also compare the results to a hosted bare-metal offering from IBM Bluemix. In total, we gathered more than 4.5 million unique microbenchmarking data points from benchmarks written in Java and Go. We find that the variability of results differs substantially between benchmarks and instance types, with coefficients of variation ranging from 0.03% to more than 100%. However, executing test and control experiments on the same instances (in randomized order) allows us to detect slowdowns of 10% or less with high confidence, using state-of-the-art statistical tests (i.e., Wilcoxon rank-sum and overlapping bootstrapped confidence intervals). Finally, our results indicate that the Wilcoxon rank-sum test manages to detect smaller slowdowns in cloud environments than the approach based on overlapping confidence intervals.
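To make the detection approach described above concrete, the following is a minimal sketch, assuming Python with NumPy and SciPy and entirely hypothetical benchmark timings (it is not the authors' analysis code or dataset). It computes the coefficient of variation of a sample and flags a slowdown between a control run and a test run in two ways: a Wilcoxon rank-sum (Mann-Whitney U) test, and a check for non-overlapping bootstrapped confidence intervals of the mean.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def coefficient_of_variation(x):
    # Relative variability in percent: sample standard deviation / mean * 100.
    return 100.0 * np.std(x, ddof=1) / np.mean(x)

def bootstrap_ci_of_mean(x, n_boot=10000, level=0.95):
    # Percentile bootstrap confidence interval of the mean.
    means = np.array([np.mean(rng.choice(x, size=len(x), replace=True))
                      for _ in range(n_boot)])
    tail = (1.0 - level) / 2.0
    return np.percentile(means, [100 * tail, 100 * (1 - tail)])

def detect_slowdown(control, test, alpha=0.01):
    # (a) Wilcoxon rank-sum / Mann-Whitney U test on the two samples.
    _, p = stats.mannwhitneyu(control, test, alternative="two-sided")
    wilcoxon_detects = p < alpha
    # (b) Non-overlapping bootstrapped confidence intervals of the mean.
    c_lo, c_hi = bootstrap_ci_of_mean(control)
    t_lo, t_hi = bootstrap_ci_of_mean(test)
    ci_detects = c_hi < t_lo or t_hi < c_lo
    return wilcoxon_detects, ci_detects

# Hypothetical execution times (ns/op) with a simulated 10% slowdown in "test".
control = rng.normal(loc=100.0, scale=5.0, size=200)
test = rng.normal(loc=110.0, scale=5.0, size=200)
print("CV of control sample: %.2f%%" % coefficient_of_variation(control))
print("slowdown detected (Wilcoxon, CI overlap):", detect_slowdown(control, test))

In the paper's setting, the control and test samples would be gathered on the same cloud instances in randomized order, which is what makes such a comparison robust against instance-to-instance performance differences.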

Keywords

Performance testing · Microbenchmarking · Cloud · Performance-regression detection


Acknowledgements

The research leading to these results has received funding from the Swiss National Science Foundation (SNF) under project MINCA – Models to Increase the Cost Awareness of Cloud Developers (no. 165546), the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, and the Chalmers ICT Area of Advance. Further, we are grateful to the anonymous reviewers for their feedback, which helped to significantly improve this paper's quality.


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Informatics, University of Zurich, Zurich, Switzerland
  2. Software Engineering Division, Chalmers | University of Gothenburg, Gothenburg, Sweden
