Decoding Machine Learning Benchmarks

Cardoso, Lucas F. F.; Santos, Vitor C. A.; Francês, Regiane S. Kawasaki; Prudêncio, Ricardo B. C.; Alves, Ronnie C. O.

doi:10.1007/978-3-030-61380-8_28

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12320)

Included in the following conference series: Brazilian Conference on Intelligent Systems


Abstract

Despite the availability of benchmark machine learning (ML) repositories (e.g., UCI, OpenML), there is not yet a standard evaluation strategy capable of indicating which set of datasets should serve as a gold standard for testing different ML algorithms. In recent studies, Item Response Theory (IRT) has emerged as a new approach to elucidate what makes a good ML benchmark. This work applied IRT to the well-known OpenML-CC18 benchmark to assess how suitable it is for evaluating classifiers. Several classifiers, ranging from classical models to ensembles, were evaluated using IRT models, which simultaneously estimate dataset difficulty and classifier ability. The Glicko-2 rating system was applied on top of IRT to summarize the innate ability and aptitude of the classifiers. It was observed that not all datasets in OpenML-CC18 are really useful for evaluating classifiers. Most of the datasets evaluated in this work (84%) consist mainly of easy instances (e.g., only around 10% of their instances are difficult). In addition, in half of the benchmark's datasets, 80% of the instances are highly discriminating, which is of great use for pairwise algorithm comparison but does little to push classifiers' abilities. This paper presents this new IRT-based evaluation methodology as well as decodIRT, a tool developed to guide IRT estimation over ML benchmarks.
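As a rough illustration of the kind of model the abstract refers to, the Python sketch below evaluates the three-parameter logistic (3PL) item characteristic curve, a standard IRT formulation in which each instance is treated as an item and each classifier as a respondent. The abstract does not state which IRT model or parameter values were used, so the choice of 3PL, the function name icc_3pl, and the item parameters shown are assumptions made purely for illustration; this is not the decodIRT implementation.

```python
import math


def icc_3pl(theta, a, b, c=0.0):
    """3PL item characteristic curve.

    theta: ability of the respondent (here, a classifier)
    a: discrimination of the item (here, a test instance)
    b: difficulty of the item
    c: guessing parameter (lower asymptote of the curve)
    Returns the probability of a correct response.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item parameters, chosen only to contrast the three cases.
easy_item = {"a": 1.0, "b": -2.0, "c": 0.0}   # low difficulty
hard_item = {"a": 1.0, "b": 2.0, "c": 0.0}    # high difficulty
sharp_item = {"a": 4.0, "b": 0.0, "c": 0.0}   # high discrimination

for theta in (-1.0, 0.0, 1.0):
    print(
        f"theta={theta:+.1f}  "
        f"easy={icc_3pl(theta, **easy_item):.3f}  "
        f"hard={icc_3pl(theta, **hard_item):.3f}  "
        f"sharp={icc_3pl(theta, **sharp_item):.3f}"
    )
```

Under these assumed parameters, the easy item is answered correctly with high probability across the whole ability range, so it separates classifiers poorly, while the highly discriminating item separates abilities just below and just above its difficulty. This is consistent with the abstract's remark that discriminating instances help pairwise comparison without necessarily challenging the strongest classifiers.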


Notes

1. Link to access OpenML-CC18: https://www.openml.org/s/99
2. Link to the source code: https://github.com/LucasFerraroCardoso/IRT_OpenML
3. All classification results can be obtained at https://github.com/LucasFerraroCardoso/IRT_OpenML/tree/master/benchmarking
4. All data generated can be accessed at https://github.com/LucasFerraroCardoso/IRT_OpenML/tree/master/BRACIS


Author information

Authors and Affiliations

Faculdade de Computação, Universidade Federal do Pará, Belém, Brazil
Lucas F. F. Cardoso, Vitor C. A. Santos & Regiane S. Kawasaki Francês
Centro de Informática, Universidade Federal de Pernambuco, Recife, Brazil
Ricardo B. C. Prudêncio
Instituto Tecnológico Vale, Belém, Brazil
Ronnie C. O. Alves


Corresponding author

Correspondence to Lucas F. F. Cardoso.

Editor information

Editors and Affiliations

Federal University of São Carlos, São Carlos, Brazil
Ricardo Cerri
Federal University of ABC, Santo Andre, Brazil
Ronaldo C. Prati


About this paper

Cite this paper

Cardoso, L.F.F., Santos, V.C.A., Francês, R.S.K., Prudêncio, R.B.C., Alves, R.C.O. (2020). Decoding Machine Learning Benchmarks. In: Cerri, R., Prati, R.C. (eds) Intelligent Systems. BRACIS 2020. Lecture Notes in Computer Science, vol 12320. Springer, Cham. https://doi.org/10.1007/978-3-030-61380-8_28


DOI: https://doi.org/10.1007/978-3-030-61380-8_28
Published: 13 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61379-2
Online ISBN: 978-3-030-61380-8
eBook Packages: Computer Science, Computer Science (R0)
