The Prevalence of Errors in Machine Learning Experiments

  • Martin Shepperd (corresponding author)
  • Yuchen Guo
  • Ning Li
  • Mahir Arzoky
  • Andrea Capiluppi
  • Steve Counsell
  • Giuseppe Destefanis
  • Stephen Swift
  • Allan Tucker
  • Leila Yousefi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11871)

Abstract

Context: Conducting experiments is central to machine learning research, in order to benchmark, evaluate and compare learning algorithms. Consequently, it is important that we conduct reliable, trustworthy experiments.

Objective: We investigate the incidence of errors in a sample of machine learning experiments in the domain of software defect prediction. Our focus is on simple arithmetical and statistical errors.

Method: We analyse 49 papers describing 2456 individual experimental results from a previously undertaken systematic review comparing supervised and unsupervised defect prediction classifiers. We extract the confusion matrices and test for relevant constraints, e.g., the marginal probabilities must sum to one. We also check for multiple statistical significance testing errors.
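
To make the confusion matrix checks concrete, the following is a minimal sketch (in Python) of the kind of arithmetic constraint checking described above. It is not the tooling used in the study: the function name, the choice of reported quantities (dataset size, defect rate, recall, precision) and the rounding tolerance are illustrative assumptions.

    # Illustrative sketch only: names, parameters and tolerances are hypothetical.
    def check_confusion_matrix(tp, fp, fn, tn,
                               reported_n=None,
                               reported_defect_rate=None,
                               reported_recall=None,
                               reported_precision=None,
                               tol=0.005):
        """Flag arithmetic inconsistencies in a reported binary confusion matrix.

        Cells may have been reconstructed from rounded performance figures, so
        they are accepted as floats and compared within a rounding tolerance.
        """
        issues = []
        cells = (tp, fp, fn, tn)
        if any(c < 0 for c in cells):
            issues.append("negative cell count")
        n = sum(cells)
        if reported_n is not None and abs(n - reported_n) > 0.5:
            issues.append("cells sum to %.1f but reported n = %d" % (n, reported_n))
        # The marginal probabilities of the two actual classes must sum to one,
        # so the reported defect rate must equal (TP + FN) / n.
        if reported_defect_rate is not None and n > 0:
            defect_rate = (tp + fn) / n
            if abs(defect_rate - reported_defect_rate) > tol:
                issues.append("defect rate %.3f vs reported %.3f"
                              % (defect_rate, reported_defect_rate))
        # Metrics recomputed from the cells must agree with the reported values.
        if reported_recall is not None and (tp + fn) > 0:
            recall = tp / (tp + fn)
            if abs(recall - reported_recall) > tol:
                issues.append("recall %.3f vs reported %.3f" % (recall, reported_recall))
        if reported_precision is not None and (tp + fp) > 0:
            precision = tp / (tp + fp)
            if abs(precision - reported_precision) > tol:
                issues.append("precision %.3f vs reported %.3f"
                              % (precision, reported_precision))
        return issues

    # Example with invented numbers: the cells cannot reproduce the reported recall.
    print(check_confusion_matrix(40, 10, 20, 130, reported_n=200, reported_recall=0.80))
    # -> ['recall 0.667 vs reported 0.800']

Checks of this kind can only expose internal inconsistencies; they cannot confirm that an internally consistent result is correct.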

Results: We find that a total of 22 out of 49 papers contain demonstrable errors. Of these, 7 contained statistical errors and 16 contained confusion matrix inconsistencies (one paper contained both classes of error).
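
The statistical errors relate to the multiple significance testing checks described in the method. As an illustration only, and not the procedure used in the study, the sketch below applies the standard Benjamini-Hochberg step-up correction to a hypothetical family of p-values from pairwise classifier comparisons; the numbers are invented.

    def benjamini_hochberg(p_values, alpha=0.05):
        """Decide which of a family of hypotheses to reject while controlling
        the false discovery rate at level alpha (Benjamini-Hochberg step-up)."""
        m = len(p_values)
        order = sorted(range(m), key=lambda i: p_values[i])
        # Find the largest rank k with p_(k) <= (k / m) * alpha ...
        k_max = 0
        for rank, idx in enumerate(order, start=1):
            if p_values[idx] <= rank / m * alpha:
                k_max = rank
        # ... and reject exactly the k_max smallest p-values.
        reject = [False] * m
        for rank, idx in enumerate(order, start=1):
            reject[idx] = rank <= k_max
        return reject

    # Hypothetical p-values from ten pairwise classifier comparisons: four fall
    # below 0.05 individually, but only two survive the FDR correction.
    p = [0.001, 0.008, 0.02, 0.04, 0.06, 0.10, 0.20, 0.35, 0.50, 0.80]
    print(benjamini_hochberg(p))
    # -> [True, True, False, False, False, False, False, False, False, False]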

Conclusions: Whilst some errors may be of a relatively trivial nature, e.g., transcription errors, their presence does not engender confidence. We strongly urge researchers to follow open science principles so that errors can be more easily detected and corrected, and thus, as a community, we can reduce this worryingly high error rate in our computational experiments.

Keywords

Classifier · Computational experiment · Reliability · Error

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Martin Shepperd (1), corresponding author
  • Yuchen Guo (2)
  • Ning Li (3)
  • Mahir Arzoky (1)
  • Andrea Capiluppi (1)
  • Steve Counsell (1)
  • Giuseppe Destefanis (1)
  • Stephen Swift (1)
  • Allan Tucker (1)
  • Leila Yousefi (1)
  1. Brunel University London, London, UK
  2. Xi’an Jiaotong University, Xi’an, China
  3. Northwestern Polytechnical University, Xi’an, China
