Decoding Machine Learning Benchmarks

  • Conference paper
Intelligent Systems (BRACIS 2020)

Abstract

Despite the availability of benchmark machine learning (ML) repositories (e.g., UCI, OpenML), there is not yet a standard evaluation strategy capable of identifying which set of datasets should serve as a gold standard for testing different ML algorithms. In recent studies, Item Response Theory (IRT) has emerged as a new approach to elucidating what makes a good ML benchmark. This work applied IRT to explore the well-known OpenML-CC18 benchmark and to identify how suitable it is for the evaluation of classifiers. Several classifiers, ranging from classical ones to ensembles, were evaluated using IRT models, which can simultaneously estimate dataset difficulty and classifiers’ ability. The Glicko-2 rating system was applied on top of IRT to summarize the innate ability and aptitude of classifiers. It was observed that not all datasets from OpenML-CC18 are really useful for evaluating classifiers. Most datasets evaluated in this work (84%) contain mostly easy instances (e.g., only around 10% of their instances are difficult). Also, 80% of the instances in half of this benchmark are highly discriminating, which can be of great use for pairwise algorithm comparison but is not useful for pushing classifiers’ abilities. This paper presents this new evaluation methodology based on IRT, as well as the tool decodIRT, developed to guide IRT estimation over ML benchmarks.
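
To make the idea of jointly describing instance difficulty and classifier ability concrete, the following is a minimal sketch of an item characteristic curve. It assumes the three-parameter logistic (3PL) IRT model, with instances treated as items and classifiers as respondents; the abstract does not state which IRT model was fitted, and the function name `icc_3pl` and all parameter values are hypothetical, chosen only for illustration. It is not the decodIRT implementation.

```python
import math


def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (assumed here).

    theta: ability of the respondent (here, a classifier)
    a: discrimination of the item (here, an instance)
    b: difficulty of the item
    c: guessing parameter (lower asymptote of the curve)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item parameters, not taken from the paper.
easy_item = {"a": 1.5, "b": -2.0, "c": 0.1}
hard_item = {"a": 1.5, "b": 2.0, "c": 0.1}

# Compare how three hypothetical ability levels fare on each item.
for theta in (-1.0, 0.0, 1.0):
    p_easy = icc_3pl(theta, **easy_item)
    p_hard = icc_3pl(theta, **hard_item)
    print(f"theta={theta:+.1f}  P(easy)={p_easy:.3f}  P(hard)={p_hard:.3f}")
```

Under this model, an easy item is answered correctly with high probability across the whole ability range, so it separates strong from weak respondents poorly, whereas an item with high discrimination separates respondents whose abilities lie close to its difficulty.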

Notes

  1. Link to access OpenML-CC18: https://www.openml.org/s/99.

  2. Link to the source code: https://github.com/LucasFerraroCardoso/IRT_OpenML.

  3. All classification results can be obtained at https://github.com/LucasFerraroCardoso/IRT_OpenML/tree/master/benchmarking.

  4. All data generated can be accessed at https://github.com/LucasFerraroCardoso/IRT_OpenML/tree/master/BRACIS.

Author information

Correspondence to Lucas F. F. Cardoso.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Cardoso, L.F.F., Santos, V.C.A., Francês, R.S.K., Prudêncio, R.B.C., Alves, R.C.O. (2020). Decoding Machine Learning Benchmarks. In: Cerri, R., Prati, R.C. (eds) Intelligent Systems. BRACIS 2020. Lecture Notes in Computer Science, vol 12320. Springer, Cham. https://doi.org/10.1007/978-3-030-61380-8_28

  • DOI: https://doi.org/10.1007/978-3-030-61380-8_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-61379-2

  • Online ISBN: 978-3-030-61380-8

  • eBook Packages: Computer Science (R0)