Validation Benchmarks and Related Metrics

  • Nicole J. Saam
Part of the Simulation Foundations, Methods and Applications book series (SFMA)


This chapter proposes benchmarking as an important, versatile and promising method for validating simulation models with an empirical target; simulation models that only explore the consequences of theoretical assumptions are excluded. A conceptual framework and descriptive theory of benchmarking in simulation validation is developed. Sources of benchmarks are outstanding experimental or observational data, stylized facts or other characteristics of the target. They are outstanding because they are more effective, more reliable or more efficient than other such data, stylized facts or characteristics. Benchmarks are set in a benchmarking process which offers a pathway to support the establishment of norms and standards in simulation validation. Benchmarks are indispensable in maintaining large simulation systems, e.g. for automatic quality checking of large-scale forecasts and for testing forecasting-system upgrades.
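The core operation behind benchmarking metrics of this kind is a comparative one: a simulation model is not judged against the observations alone, but against how well a simple reference model reproduces the same observations. A minimal sketch of this idea (not taken from the chapter; the persistence benchmark, the MSE-based skill score and all data values are illustrative assumptions commonly used in forecasting studies) might look as follows:

```python
# Sketch: scoring a simulation model against a naive benchmark model.
# A model "beats the benchmark" when its skill score is positive:
#   skill = 1 - MSE(model) / MSE(benchmark)
# The benchmark here is a persistence forecast (next value = last
# observed value), a common baseline in forecast verification.

def mse(pred, obs):
    """Mean squared error between two equal-length sequences."""
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)

def skill_score(model_pred, benchmark_pred, obs):
    """Positive values mean the model outperforms the benchmark;
    zero means no improvement; negative means it does worse."""
    return 1.0 - mse(model_pred, obs) / mse(benchmark_pred, obs)

# Illustrative observations and forecasts (made-up numbers).
obs         = [2.0, 2.5, 3.1, 2.8, 3.4]
persistence = [1.8, 2.0, 2.5, 3.1, 2.8]  # previous value as forecast
model       = [2.1, 2.4, 3.0, 2.9, 3.3]  # simulation model output

print(round(skill_score(model, persistence, obs), 3))  # close to 1: large improvement
```

The design point is that the benchmark fixes the zero of the scale: the same MSE value can signal a good or a poor model depending on how hard the target is to predict, and dividing by the benchmark's error makes scores comparable across targets.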


Keywords: Validation benchmarks · Touchstone · Yardstick · Engineering reference standard · Benchmarking · Benchmarking metrics



The author thanks Claus Beisbart and William Oberkampf for helpful discussions concerning this manuscript.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Institute for Sociology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
