Automated Software Engineering, Volume 24, Issue 1, pp 189–231

Continuous validation of performance test workloads

  • Mark D. Syer
  • Weiyi Shang
  • Zhen Ming Jiang
  • Ahmed E. Hassan


The rise of large-scale software systems poses many new challenges for the software performance engineering field. Failures in these systems are often associated with performance issues rather than with feature bugs. Therefore, performance testing has become essential to ensuring the problem-free operation of these systems. However, the performance testing process faces a major challenge: evolving field workloads, in terms of evolving feature sets and usage patterns, often lead to “outdated” tests that are not reflective of the field. Hence, performance analysts must continually validate whether their tests are still reflective of the field. Such validation may be performed by comparing execution logs from the test and the field. However, the size and unstructured nature of execution logs make such a comparison infeasible without automated support. In this paper, we propose an automated approach that validates whether a performance test resembles the field workload and, if not, determines how they differ. Performance analysts can then update their tests to eliminate such differences, hence creating more realistic tests. We perform six case studies on two large systems: one open-source system and one enterprise system. Our approach identifies differences between performance tests and the field with a precision of 92%, compared to only 61% for the state of the practice and 19% for a conventional statistical comparison.
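
As a rough illustration of what comparing execution logs from the test and the field can look like, the sketch below contrasts the relative frequency of each event type in two small logs and reports the events whose share of the workload differs noticeably. It is not the approach proposed in the paper, whose details the abstract does not give. The log format (event name as the first token of each line), the function names, the sample logs, and the 0.10 threshold are all assumptions made for this example.

```python
from collections import Counter


def event_frequencies(log_lines):
    """Return the relative frequency of each event type in a log.

    Assumption: the event type is the first whitespace-separated token
    of each non-empty log line.
    """
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}


def workload_differences(test_log, field_log, threshold=0.10):
    """Return events whose workload share differs by more than `threshold`.

    The value for each event is (field share - test share), so a positive
    number means the event is under-represented in the test.
    """
    test_freq = event_frequencies(test_log)
    field_freq = event_frequencies(field_log)
    diffs = {}
    for event in sorted(set(test_freq) | set(field_freq)):
        delta = field_freq.get(event, 0.0) - test_freq.get(event, 0.0)
        if abs(delta) > threshold:
            diffs[event] = delta
    return diffs


# Hypothetical logs, invented for illustration only.
test_log = ["login user=1", "browse user=1", "browse user=2", "logout user=1"]
field_log = ["login user=7", "search user=7", "search user=8", "browse user=7"]

for event, delta in workload_differences(test_log, field_log).items():
    print(f"{event}: field share minus test share = {delta:+.2f}")
```

In this toy case the field exercises a `search` event that the test never issues, which is the kind of difference an analyst would use to update the test. Real execution logs are far larger and less structured, which is why the abstract argues that automated support is needed.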


Keywords: Performance testing, Continuous testing, Workload characterization, Workload comparison, Execution logs



We would like to thank BlackBerry for providing access to the enterprise system used in our case study. The findings and opinions expressed in this paper are those of the authors and do not necessarily represent or reflect those of BlackBerry and/or its subsidiaries and affiliates. Moreover, our results do not reflect the quality of BlackBerry’s products. We would also like to thank Microsoft Azure for (1) providing us access to a large-scale deployment and (2) working closely with us to set up and troubleshoot our deployment.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Mark D. Syer (1, corresponding author)
  • Weiyi Shang (1)
  • Zhen Ming Jiang (2)
  • Ahmed E. Hassan (1)

  1. Software Analysis and Intelligence Lab (SAIL), School of Computing, Queen’s University, Kingston, Canada
  2. Department of Electrical Engineering & Computer Science, York University, Toronto, Canada
