Reliable benchmarking: requirements and solutions

  • Dirk Beyer
  • Stefan Löwe
  • Philipp Wendler
Regular Paper


Benchmarking is a widely used method in experimental computer science, in particular, for the comparative evaluation of tools and algorithms. As a consequence, a number of questions need to be answered in order to ensure proper benchmarking, resource measurement, and presentation of results, all of which are essential for researchers, tool developers, and users, as well as for tool competitions. We identify a set of requirements that are indispensable for reliable benchmarking and resource measurement of time and memory usage of automatic solvers, verifiers, and similar tools, and discuss limitations of existing methods and benchmarking tools. Fulfilling these requirements in a benchmarking framework can (on Linux systems) currently only be done by using the cgroup and namespace features of the kernel. We developed BenchExec, a ready-to-use, tool-independent, and open-source implementation of a benchmarking framework that fulfills all presented requirements, making reliable benchmarking and resource measurement easy. Our framework is able to work with a wide range of different tools, has proven its reliability and usefulness in the International Competition on Software Verification, and is used by several research groups worldwide to ensure reliable benchmarking. Finally, we present guidelines on how to present measurement results in a scientifically valid and comprehensible way.


Benchmarking, Resource measurement, Process control, Process isolation, Container, Competition



We thank Hubert Garavel, Jiri Slaby, and Aaron Stump for their helpful comments regarding BenchKit, cgroups, and StarExec, respectively; Armin Größlinger for his ideas on what to investigate regarding the performance influence of using multiple CPUs in Sect. 4.8; and all contributors to BenchExec.



Copyright information

© Springer-Verlag GmbH Germany 2017

Authors and Affiliations

  1. LMU Munich, Munich, Germany
  2. One Logic, Passau, Germany
