Many artificial intelligence algorithms or models are ultimately designed for prediction. A prediction algorithm, wherever it may reside (in a computer, or in a forecaster's head), is subject to a set of tests aimed at assessing its goodness. The specific choice of tests is contingent on many factors, including the nature of the problem and the specific facet of goodness being examined. This chapter discusses some of these tests. For a more in-depth exposition, the reader is directed to the references, and to two books in particular: Wilks (1995) and Jolliffe and Stephenson (2003). The body of knowledge aimed at assessing the goodness of predictions is referred to as performance assessment in most fields; in atmospheric circles, however, it is generally called verification. In this chapter, I consider only a few of the numerous performance measures treated in the literature; my emphasis is on ways of assessing their uncertainty (i.e., statistical significance).
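To make the notion of "uncertainty of a performance measure" concrete, here is a minimal sketch, not from the chapter itself, using synthetic data: the sampling uncertainty of a simple measure such as the mean squared error (MSE) can be assessed with a bootstrap confidence interval, in the spirit of Efron and Tibshirani (1993). All names and numbers here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical verification data: 200 observations and imperfect forecasts.
obs = rng.normal(size=200)
fcst = obs + rng.normal(scale=0.5, size=200)

def mse(f, o):
    """Mean squared error of forecasts f against observations o."""
    return np.mean((f - o) ** 2)

# Bootstrap: resample (forecast, observation) pairs with replacement and
# recompute the measure, approximating its sampling distribution.
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(obs), size=len(obs))
    boot.append(mse(fcst[idx], obs[idx]))
boot = np.array(boot)

# Percentile-based 95% confidence interval for the MSE.
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"MSE = {mse(fcst, obs):.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

The width of the interval, rather than the point estimate alone, is what allows one to judge whether an apparent difference between two prediction algorithms is statistically meaningful.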

Here, prediction (or forecast) does not necessarily refer to the prediction of the future state of some variable. It refers to the estimation of the state of one variable from information on another variable; the two variables may or may not be contemporaneous. What is required, however, is that the data on which the performance of the algorithm is assessed be as independent as possible from the data on which the algorithm is developed or fine-tuned; otherwise, the performance estimate will be optimistically biased, which is undesirable; see Section 2.6 in Chapter 2.
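The independence requirement can be sketched as follows; this is an illustrative example with synthetic data, not taken from the chapter. A model is fit on one portion of the data, and its performance is reported only on a held-out portion that played no role in the fitting; the training-set error is the optimistically biased quantity one should avoid reporting.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: predictor x and predictand y with a linear relationship.
x = rng.uniform(-1, 1, size=300)
y = 2.0 * x + rng.normal(scale=0.3, size=300)

# Hold out an independent test set; the fit never sees these cases.
n_train = 200
x_train, y_train = x[:n_train], y[:n_train]
x_test, y_test = x[n_train:], y[n_train:]

# Fit a simple least-squares line on the training portion only.
slope, intercept = np.polyfit(x_train, y_train, 1)

# The training error is optimistically biased toward the fitted data;
# the test-set error is the honest performance estimate.
mse_train = np.mean((slope * x_train + intercept - y_train) ** 2)
mse_test = np.mean((slope * x_test + intercept - y_test) ** 2)
print(f"training MSE = {mse_train:.3f}, test MSE = {mse_test:.3f}")
```

With a flexible model (say, a high-degree polynomial or a large neural network) the gap between training and test error grows, which is precisely why independent data are required for honest verification.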






References

  1. Baldwin, M. E., Lakshmivarahan, S., & Kain, J. S. (2002). Development of an "events-oriented" approach to forecast verification. 15th Conference on Numerical Weather Prediction, San Antonio, TX, August 12–16, 2002. Available at
  2. Brown, B. G., Bullock, R., Davis, C. A., Gotway, J. H., Chapman, M., Takacs, A., Gilleland, E., Mahoney, J. L., & Manning, K. (2004). New verification approaches for convective weather forecasts. Preprints, 11th Conference on Aviation, Range, and Aerospace, Hyannis, MA, October 3–8.
  3. Casati, B., Ross, G., & Stephenson, D. B. (2004). A new intensity-scale approach for the verification of spatial precipitation forecasts. Meteorological Applications, 11, 141–154.
  4. Devore, J., & Farnum, N. (2005). Applied statistics for engineers and scientists. Belmont, CA: Thomson Learning.
  5. Doswell, C. A., III, Davies-Jones, R., & Keller, D. (1990). On summary measures of skill in rare event forecasting based on contingency tables. Weather and Forecasting, 5, 576–585.
  6. Du, J., & Mullen, S. L. (2000). Removal of distortion error from an ensemble forecast. Monthly Weather Review, 128, 3347–3351.
  7. Ebert, E. E., & McBride, J. L. (2000). Verification of precipitation in weather systems: Determination of systematic errors. Journal of Hydrology, 239, 179–202.
  8. Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. London: Chapman & Hall.
  9. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27, 861–874.
  10. Ferro, C. (2007). Comparing probabilistic forecasting systems with the Brier score. Weather and Forecasting, 22, 1076–1088.
  11. Gandin, L. S., & Murphy, A. (1992). Equitable skill scores for categorical forecasts. Monthly Weather Review, 120, 361–370.
  12. Gerrity, J. P., Jr. (1992). A note on Gandin and Murphy's equitable skill score. Monthly Weather Review, 120, 2707–2712.
  13. Glahn, H. R., & Lowry, D. A. (1972). The use of Model Output Statistics (MOS) in objective weather forecasting. Journal of Applied Meteorology, 11, 1203–1211.
  14. Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378.
  15. Gneiting, T., Raftery, A. E., Westveld, A. H., & Goldman, T. (2005). Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review, 133, 1098–1118.
  16. Gneiting, T., Balabdaoui, F., & Raftery, A. E. (2007). Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2), 243–268.
  17. Good, P. I. (2005a). Permutation, parametric and bootstrap tests of hypotheses (3rd ed.). New York: Springer. ISBN 0-387-98898-X.
  18. Good, P. I. (2005b). Introduction to statistics through resampling methods and R/S-PLUS. Hoboken, NJ: Wiley. ISBN 0-471-71575-1.
  19. Hamill, T. M. (1997). Reliability diagrams for multicategory probabilistic forecasts. Weather and Forecasting, 12, 736–741.
  20. Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. Springer Series in Statistics. New York: Springer.
  21. Heidke, P. (1926). Berechnung des Erfolges und der Güte der Windstärkevorhersagen im Sturmwarnungsdienst. Geografiska Annaler, 8, 301–349.
  22. Jolliffe, I. T. (2007). Uncertainty and inference for verification measures. Weather and Forecasting, 22, 633–646.
  23. Jolliffe, I. T., & Stephenson, D. B. (2003). Forecast verification: A practitioner's guide in atmospheric science. Chichester: Wiley.
  24. Livezey, R. E. (2003). Categorical events. Chapter 4 in Jolliffe, I. T., & Stephenson, D. B. (Eds.), Forecast verification: A practitioner's guide in atmospheric science. West Sussex, England: Wiley.
  25. Macskassy, S. A., & Provost, F. (2004). Confidence bands for ROC curves: Methods and an empirical study. First Workshop on ROC Analysis in AI, ECAI-2004, Spain.
  26. Marzban, C. (1998). Scalar measures of performance in rare-event situations. Weather and Forecasting, 13, 753–763.
  27. Marzban, C. (2004). The ROC curve and the area under it as a performance measure. Weather and Forecasting, 19(6), 1106–1114.
  28. Marzban, C., & Lakshmanan, V. (1999). On the uniqueness of Gandin and Murphy's equitable performance measures. Monthly Weather Review, 127(6), 1134–1136.
  29. Marzban, C., & Sandgathe, S. (2006). Cluster analysis for verification of precipitation fields. Weather and Forecasting, 21(5), 824–838.
  30. Marzban, C., & Sandgathe, S. (2008). Cluster analysis for object-oriented verification of fields: A variation. Monthly Weather Review, 136, 1013–1025.
  31. Marzban, C., & Stumpf, G. J. (1998). A neural network for damaging wind prediction. Weather and Forecasting, 13, 151–163.
  32. Marzban, C., & Witt, A. (2001). A Bayesian neural network for hail size prediction. Weather and Forecasting, 16(5), 600–610.
  33. Marzban, C., Sandgathe, S., & Lyons, H. (2008). An object-oriented verification of three NWP model formulations via cluster analysis: An objective and a subjective analysis. Monthly Weather Review, 136, 3392–3407.
  34. Murphy, A. H. (1991). Forecast verification: Its complexity and dimensionality. Monthly Weather Review, 119, 1590–1601.
  35. Murphy, A. H. (1993). What is a good forecast? An essay on the nature of goodness in weather forecasting. Weather and Forecasting, 8, 281–293.
  36. Murphy, A. H., & Epstein, E. S. (1967). A note on probabilistic forecasts and "hedging". Journal of Applied Meteorology, 6, 1002–1004.
  37. Murphy, A. H., & Winkler, R. L. (1987). A general framework for forecast verification. Monthly Weather Review, 115, 1330–1338.
  38. Murphy, A. H., & Winkler, R. L. (1992). Diagnostic verification of probability forecasts. International Journal of Forecasting, 7, 435–455.
  39. Nachamkin, J. E. (2004). Mesoscale verification using meteorological composites. Monthly Weather Review, 132, 941–955.
  40. Raftery, A. E., Gneiting, T., Balabdaoui, F., & Polakowski, M. (2005). Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review, 133, 1155–1174.
  41. Richardson, D. S. (2000). Skill and relative economic value of the ECMWF ensemble prediction system. Quarterly Journal of the Royal Meteorological Society, 126, 649–667.
  42. Roebber, P. J., & Bosart, L. F. (1996). The complex relationship between forecast skill and forecast value: A real-world analysis. Weather and Forecasting, 11, 544–559.
  43. Roulston, M. S., & Smith, L. A. (2002). Evaluating probabilistic forecasts using information theory. Monthly Weather Review, 130, 1653–1660.
  44. Seaman, R., Mason, I., & Woodcock, F. (1996). Confidence intervals for some performance measures of Yes-No forecasts. Australian Meteorological Magazine, 45, 49–53.
  45. Stephenson, D. B., Casati, B., & Wilson, C. (2004). Verification of rare extreme events. WMO Verification Workshop, Montreal, September 13–17.
  46. Venugopal, V., Basu, S., & Foufoula-Georgiou, E. (2005). A new metric for comparing precipitation patterns with an application to ensemble forecasts. Journal of Geophysical Research, 110, D08111. DOI: 10.1029/2004JD005395.
  47. Wilks, D. S. (1995). Statistical methods in the atmospheric sciences (467 pp.). San Diego, CA: Academic Press.
  48. Wilks, D. S. (2001). A skill score based on economic value for probability forecasts. Meteorological Applications, 8, 209–219.
  49. Wilson, L. J., Burrows, W. R., & Lanzinger, A. (1999). A strategy for verification of weather element forecasts from an ensemble prediction system. Monthly Weather Review, 127, 956–970.

Copyright information

© Springer Science+Business Media B.V. 2009

Authors and Affiliations

  1. Applied Physics Laboratory and Department of Statistics, University of Washington, Seattle, USA
