Machine Learning, Volume 106, Issue 11, pp 1817–1837

Statistical comparison of classifiers through Bayesian hierarchical modelling

  • Giorgio Corani
  • Alessio Benavoli
  • Janez Demšar
  • Francesca Mangili
  • Marco Zaffalon


The accuracy of two competing classifiers is usually compared using null hypothesis significance tests. Such tests, however, suffer from important shortcomings, which can be overcome by switching to Bayesian hypothesis testing. We propose a Bayesian hierarchical model that jointly analyzes the cross-validation results obtained by two classifiers on multiple data sets. The model estimates the difference between the classifiers on each individual data set more accurately than the traditional approach of averaging the cross-validation results independently on each data set. It does so by jointly analyzing the results obtained on all data sets and applying shrinkage to the estimates. Finally, the model returns the posterior probability that the accuracies of the two classifiers are practically equivalent or significantly different.
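The effect of shrinkage and of a region of practical equivalence can be illustrated with a far simpler model than the hierarchical one summarized above. The following Python sketch is not the authors' model. It assumes a normal-normal model with plug-in (empirical Bayes) hyperparameters, strictly positive standard errors, and a hypothetical equivalence region of ±0.01 on the accuracy difference. The function names `shrink` and `rope_probabilities`, and the sample numbers, are invented for illustration.

```python
import math
from statistics import fmean


def normal_cdf(x, mean, sd):
    """CDF of a normal distribution; a zero sd is treated as a point mass."""
    if sd == 0.0:
        return 1.0 if x >= mean else 0.0
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def shrink(diffs, std_errs):
    """Partially pool per-data-set mean accuracy differences.

    diffs[i]    : mean cross-validation difference on data set i
    std_errs[i] : its standard error (assumed known and > 0)
    Returns the grand mean, the between-data-set variance, and a list of
    (shrunken mean, posterior sd) pairs, one per data set.
    """
    n = len(diffs)
    mu = fmean(diffs)
    # Method-of-moments estimate of the between-data-set variance,
    # floored at zero (an assumption of this sketch, not of the paper).
    sample_var = sum((d - mu) ** 2 for d in diffs) / (n - 1)
    tau2 = max(sample_var - fmean(s * s for s in std_errs), 0.0)
    estimates = []
    for d, s in zip(diffs, std_errs):
        w = tau2 / (tau2 + s * s)  # weight given to the data set's own mean
        estimates.append((w * d + (1.0 - w) * mu, math.sqrt(w) * s))
    return mu, tau2, estimates


def rope_probabilities(mean, sd, rope=0.01):
    """Probabilities that the difference is below, inside, or above the
    equivalence region [-rope, rope] under a normal approximation."""
    p_left = normal_cdf(-rope, mean, sd)
    p_right = 1.0 - normal_cdf(rope, mean, sd)
    return p_left, 1.0 - p_left - p_right, p_right


if __name__ == "__main__":
    diffs = [0.05, -0.02, 0.01, 0.03]   # made-up accuracy differences
    std_errs = [0.02, 0.02, 0.02, 0.02]  # made-up standard errors
    mu, tau2, estimates = shrink(diffs, std_errs)
    for raw, (mean, sd) in zip(diffs, estimates):
        print(raw, round(mean, 4), rope_probabilities(mean, sd))
```

Each shrunken estimate lies between the raw per-data-set mean and the grand mean, and moves closer to the grand mean when the data set's own standard error is large relative to the spread across data sets. A full Bayesian treatment would place priors on the hyperparameters instead of plugging in point estimates.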



The research in this paper has been partially supported by the Swiss NSF grants no. IZKSZ2_162188 and no. 200021_146606.



Copyright information

© The Author(s) 2017

Authors and Affiliations

  • Giorgio Corani (1)
  • Alessio Benavoli (1)
  • Janez Demšar (2)
  • Francesca Mangili (1)
  • Marco Zaffalon (1)

  1. Istituto Dalle Molle di Studi sull’Intelligenza Artificiale (IDSIA), Scuola Universitaria Professionale della Svizzera Italiana (SUPSI), Università della Svizzera Italiana (USI), Manno, Switzerland
  2. Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
