Computer comparisons in the presence of performance variation

  • Samuel Irving
  • Bin Li
  • Shaoming Chen
  • Lu Peng
  • Weihua Zhang
  • Lide Duan
Research Article


Performance variability, stemming from nondeterministic hardware and software behaviors or from deterministic behaviors such as measurement bias, is a well-known phenomenon in computer systems. It makes comparing computer performance metrics more difficult and is likely to become an even greater concern as interest in Big Data analytics increases. Conventional methods use various measures (such as the geometric mean) to quantify performance across benchmarks and compare computers without considering this variability, which may lead to wrong conclusions. In this paper, we propose three resampling methods for performance evaluation and comparison: a randomization test for a general performance comparison between two computers, bootstrapping confidence estimation, and an empirical distribution with a five-number summary for performance evaluation. The results show that, for both PARSEC and high-variance BigDataBench benchmarks: 1) the randomization test substantially improves our chance of identifying a performance difference between two computers when the difference is not large; 2) bootstrapping confidence estimation provides an accurate confidence interval for the performance comparison measure (e.g., the ratio of geometric means); and 3) when the difference is very small, a single test is often not enough to reveal the nature of the computer performance, owing to the variability of computer systems. We further propose using the empirical distribution to evaluate computer performance and a five-number summary to summarize it. We use published SPEC 2006 results to investigate the sources of performance variation by predicting performance and relative variation for 8,236 machines. We achieve a correlation of 0.992 between predicted and measured performance and a correlation of 0.5 between predicted and measured relative variation. Finally, we propose a novel biplotting technique to visualize the effectiveness of benchmarks and to cluster machines by behavior. We illustrate the results and conclusions through detailed Monte Carlo simulation studies and real examples.
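The three resampling ideas named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' procedure: it assumes one run time per benchmark per machine, swaps machine labels within each benchmark for the randomization test, resamples benchmarks in pairs for a percentile bootstrap, and summarizes the resampled ratios with a five-number summary. All function names and the run times are hypothetical.

```python
import math
import random
import statistics


def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def gm_ratio(times_a, times_b):
    """Ratio of geometric means of per-benchmark run times (A over B)."""
    return geometric_mean(times_a) / geometric_mean(times_b)


def randomization_test(times_a, times_b, trials=2000, seed=0):
    """Two-sided p-value for the null hypothesis 'A and B perform the same'.

    Assumed scheme: under the null, the two machine labels are exchangeable
    within each benchmark, so each pair is swapped with probability 1/2 and
    the statistic |log(GM ratio)| is recomputed.
    """
    rng = random.Random(seed)
    observed = abs(math.log(gm_ratio(times_a, times_b)))
    extreme = 0
    for _ in range(trials):
        perm_a, perm_b = [], []
        for a, b in zip(times_a, times_b):
            if rng.random() < 0.5:
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        if abs(math.log(gm_ratio(perm_a, perm_b))) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (trials + 1)


def bootstrap_ci(times_a, times_b, trials=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the GM ratio.

    Benchmarks are resampled with replacement, keeping A/B pairs together.
    Returns (lower, upper, sorted list of resampled ratios).
    """
    rng = random.Random(seed)
    n = len(times_a)
    ratios = []
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]
        ratios.append(gm_ratio([times_a[i] for i in idx],
                               [times_b[i] for i in idx]))
    ratios.sort()
    lower = ratios[int((1 - level) / 2 * trials)]
    upper = ratios[int((1 + level) / 2 * trials) - 1]
    return lower, upper, ratios


def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Made-up run times in seconds for eight benchmarks (illustration only).
machine_a = [12.1, 30.4, 8.7, 45.0, 19.3, 27.8, 6.2, 51.6]
machine_b = [13.0, 29.1, 9.9, 47.2, 21.0, 27.5, 7.1, 55.3]

p_value = randomization_test(machine_a, machine_b)
lower, upper, ratios = bootstrap_ci(machine_a, machine_b)
print("GM ratio (A/B):", gm_ratio(machine_a, machine_b))
print("randomization p-value:", p_value)
print("95% bootstrap interval:", (lower, upper))
print("five-number summary of resampled ratios:", five_number_summary(ratios))
```

Working on the log of the ratio makes the test statistic symmetric in the two machines, so swapping labels only flips its sign. With repeated runs per benchmark, the same functions could be applied to each set of runs to build an empirical distribution of the ratio.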


Keywords: performance of systems; variation; performance attributes; measurement, evaluation, modeling, simulation of multiple-processor systems; experimental design; Big Data



This work was supported in part by the National High Technology Research and Development Program of China (2015AA015303), the National Natural Science Foundation of China (Grant No. 61672160), the Shanghai Science and Technology Development Funds (17511102200), and the National Science Foundation (NSF) (CCF-1017961, CCF-1422408, and CNS-1527318). We acknowledge the computing resources provided by the Louisiana Optical Network Initiative (LONI) HPC team. Finally, we appreciate the invaluable comments from the anonymous reviewers.

Supplementary material

11704_2018_7319_MOESM1_ESM.pdf (200 kb)
Supplementary material, approximately 201 KB.



Copyright information

© Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Samuel Irving (1, 2)
  • Bin Li (1)
  • Shaoming Chen (1)
  • Lu Peng (1)
  • Weihua Zhang (2, 3, 4), corresponding author
  • Lide Duan (5)

  1. Louisiana State University, Baton Rouge, USA
  2. Shanghai Institute of Intelligent Electronics & Systems, Shanghai, China
  3. Software School, Fudan University, Shanghai, China
  4. Shanghai Key Laboratory of Data Science, Fudan University, Shanghai, China
  5. University of Texas at San Antonio, San Antonio, USA
