
Bugs in machine learning-based systems: a faultload benchmark

Published in: Empirical Software Engineering (2023)

Abstract

The rapid adoption of Machine Learning (ML) in various domains has drawn increasing attention to the quality of ML components. Consequently, a growing number of techniques and tools aim to improve the quality of ML components and to integrate them safely into ML-based systems. Although most of these tools rely on bugs' lifecycle, there is no standard benchmark of bugs against which to assess their performance, compare them, and discuss their strengths and weaknesses. In this study, we first investigate the reproducibility and verifiability of bugs in ML-based systems and identify the most important factors affecting each. We then explore the challenges of building a benchmark of bugs in ML-based software systems and present a faultload benchmark, named defect4ML, that satisfies all the criteria of a standard benchmark: relevance, reproducibility, fairness, verifiability, and usability. This benchmark contains 100 bugs reported by ML developers on GitHub and Stack Overflow for two of the most popular ML frameworks, TensorFlow and Keras. defect4ML also addresses important challenges in Software Reliability Engineering of ML-based software systems: 1) fast-evolving frameworks, by providing bugs for different framework versions; 2) code portability, by delivering similar bugs across ML frameworks; 3) bug reproducibility, by providing fully reproducible bugs with complete information about the required dependencies and data; and 4) lack of detailed information on bugs, by linking each bug to its origin. defect4ML can help ML-based systems practitioners and researchers assess their testing tools and techniques.
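To make the nature of these faults concrete, below is a small, hypothetical example (not drawn from defect4ML itself) of the kind of silent Keras bug such a benchmark catalogues: a multi-class classifier compiled with a binary loss trains without raising any error, yet optimizes the wrong objective and can report misleadingly high accuracy.

    # Hypothetical illustration, not a bug from defect4ML: a multi-class
    # Keras model compiled with a binary loss. Training proceeds without
    # any error, which is what makes this class of bug hard to detect.
    import numpy as np
    from tensorflow import keras

    x = np.random.rand(100, 20).astype("float32")
    y = keras.utils.to_categorical(np.random.randint(0, 5, size=100), num_classes=5)

    model = keras.Sequential([
        keras.layers.Input(shape=(20,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(5, activation="softmax"),
    ])

    # Buggy version: 'binary_crossentropy' treats each of the 5 outputs as an
    # independent binary decision, so the reported accuracy is misleading.
    # model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    # Fixed version: the loss must match the one-hot multi-class labels.
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(x, y, epochs=1, batch_size=16, verbose=0)

Faithfully reproducing even such a small example depends on the exact framework version installed, which is precisely why the benchmark records dependency information per bug.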


Notes

  1. http://defect4aitesting.soccerlab.polymtl.ca/

  2. https://caffe.berkeleyvision.org/

  3. https://keras.io/

  4. https://www.tensorflow.org/

  5. https://github.com/Theano/Theano

  6. http://torch.ch/

  7. https://github.com/

  8. https://stackoverflow.com/

  9. https://data.stackexchange.com/stackoverflow/query/new

  10. https://keras.io/api/datasets/

  11. In the requirements.txt file, which contains detailed dependency information for each bug.

  12. In the conf.ini file, which contains the required configuration for each bug (see the sketch after these notes).
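As a rough sketch of how these two per-bug files could be consumed, the snippet below prepares a bug's environment by installing its pinned dependencies and then loading its configuration. The directory layout and the section/key names (e.g., a [bug] section) are assumptions made for illustration, not the benchmark's documented schema.

    # Illustrative sketch only: installs a bug's pinned dependencies from its
    # requirements.txt and reads its conf.ini. The '[bug]' section name is a
    # hypothetical placeholder, not defect4ML's documented format.
    import configparser
    import subprocess
    import sys

    def prepare_bug_environment(bug_dir: str) -> dict:
        # Install the exact dependency versions recorded for this bug.
        subprocess.check_call([
            sys.executable, "-m", "pip", "install",
            "-r", f"{bug_dir}/requirements.txt",
        ])
        # Load the bug's configuration file.
        config = configparser.ConfigParser()
        config.read(f"{bug_dir}/conf.ini")
        return dict(config["bug"]) if "bug" in config else {}

    # Hypothetical usage: settings = prepare_bug_environment("bugs/tensorflow/bug_001")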


Acknowledgements

This work was supported by the Fonds de Recherche du Québec (FRQ), the Canadian Institute for Advanced Research (CIFAR), and the DEEL project CRDPJ 537462-18, funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Consortium for Research and Innovation in Aerospace in Québec (CRIAQ), together with its industrial partners Thales Canada inc., Bell Textron Canada Limited, CAE inc., and Bombardier inc.

Author information

Corresponding author

Correspondence to Mohammad Mehdi Morovati.

Additional information

Communicated by: Andrea Stocco, Onn Shehory, Gunel Jahangirova, Vincenzo Riccio

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Software Testing in the Machine Learning Era.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Morovati, M.M., Nikanjam, A., Khomh, F. et al. Bugs in machine learning-based systems: a faultload benchmark. Empir Software Eng 28, 62 (2023). https://doi.org/10.1007/s10664-023-10291-1

