Abstract
The rapid growth of Machine Learning (ML) applications in different domains, especially in safety-critical areas, increases the need for reliable ML components, i.e., software components that operate based on ML. Since corrective maintenance, i.e., identifying and resolving system bugs, is a key task in the software development process for delivering reliable software components, it is necessary to investigate the usage of ML components from the software maintenance perspective. Understanding the characteristics of bugs and the maintenance challenges in ML-based systems can help developers of these systems decide where to focus maintenance and testing efforts, by giving insights into the most error-prone components, the most common bugs, etc. In this paper, we investigate the characteristics of bugs in ML-based software systems and the difference between ML and non-ML bugs from the maintenance viewpoint. We extracted 447,948 GitHub repositories that used one of the three most popular ML frameworks, i.e., TensorFlow, Keras, and PyTorch. After multiple filtering steps, we selected the top 300 repositories with the highest number of closed issues. We manually inspected the extracted repositories to exclude non-ML-based systems. We then manually inspected 386 sampled issues reported in the identified ML-based systems to determine whether they affect ML components or not. Our analysis shows that nearly half of the real issues reported in ML-based systems are ML bugs, indicating that ML components are more error-prone than non-ML components. Next, we thoroughly examined 109 identified ML bugs to identify their root causes and symptoms and to calculate their required fixing time. The results also reveal that ML bugs have significantly different characteristics compared to non-ML bugs in terms of the complexity of bug-fixing (number of commits, changed files, and changed lines of code).
Based on our results, fixing ML bugs is more costly than fixing non-ML bugs, and ML components are more error-prone than non-ML components. Hence, paying significant attention to the reliability of ML components is crucial in ML-based systems. These results deepen the understanding of ML bugs, and we hope that our findings help shed light on opportunities for designing effective tools for testing and debugging ML-based systems.
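To make the comparison of bug-fixing complexity more concrete, the following is a minimal sketch of how two groups of bug fixes could be compared on a metric such as the number of changed files. The abstract does not name the statistical procedure; Cliff's delta is used here only as an assumption, chosen because the reference list cites it (Cliff 1993; Macbeth et al. 2011). The function names, the magnitude thresholds (a commonly used convention, often attributed to Romano et al. 2006), and the sample values are all hypothetical and are not taken from the study's dataset.

```python
from typing import Sequence


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Cliff's delta: P(x > y) - P(x < y), estimated over all (x, y) pairs."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def magnitude(delta: float) -> str:
    """Map |delta| to a qualitative label (assumed, commonly used thresholds)."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"


# Hypothetical numbers of changed files per bug fix (illustration only).
ml_changed_files = [3, 5, 8, 2, 7]
non_ml_changed_files = [1, 2, 2, 4, 1]

delta = cliffs_delta(ml_changed_files, non_ml_changed_files)
print(f"Cliff's delta = {delta:.2f} ({magnitude(delta)})")
```

A positive delta means that fixes in the first group tend to touch more files than fixes in the second group. Because the measure relies only on pairwise ordering, it makes no normality assumption, which suits skewed metrics such as commit or line counts.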
Data Availability Statement
The dataset generated during the current study is available in the replication package, which is accessible via (Replication package 2023).
References
Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M et al (2016) Tensorflow: a system for large-scale machine learning. In: 12th USENIX symposium on operating systems design and implementation (OSDI 16). Savannah, GA, USA, USENIX, pp 265–283
Add typing for trainer.logger (2021) https://github.com/Lightning-AI/lightning/pull/11114
Aithal SG, Rao AB, Singh S (2021) Automatic question-answer pairs generation and question similarity mechanism in question answering system. Appl Intell 51(11):8484–8497
Amershi S, Begel A, Bird C, DeLine R, Gall H, Kamar E, Nagappan N, Nushi B, Zimmermann T (2019) Software engineering for machine learning: a case study. In: 2019 IEEE/ACM 41st International conference on software engineering: software engineering in practice (ICSE-SEIP), IEEE, pp 291–300
Anvik J, Hiew L, Murphy GC (2005) Coping with an open bug repository. In: Proceedings of the 2005 OOPSLA workshop on eclipse technology exchange, ser. eclipse ’05. New York, NY, USA: Association for Computing Machinery, pp 35–39. [Online]. Available: https://doi.org/10.1145/1117696.1117704
Arcuri A, Briand L (2014) A hitchhiker’s guide to statistical tests for assessing randomized algorithms in software engineering. Softw Test Verif Rel 24(3):219–250
Bennett KH, Rajlich VT (2000) Software maintenance and evolution: a roadmap. In: Proceedings of the conference on the future of software engineering, pp 73–87
Bosu A, Carver JC, Bird C, Orbeck J, Chockley C (2016) Process aspects and social dynamics of contemporary code review: insights from open source development and industrial practice at microsoft. IEEE Trans Softw Eng 43(1):56–75
Bosu A, Carver JC (2014) Impact of developer reputation on code review outcomes in oss projects: an empirical investigation. In: Proceedings of the 8th ACM/IEEE international symposium on empirical software engineering and measurement, pp 1–10
Bug fix for static pulse shapes (2018). https://github.com/cms-sw/cmssw/pull/23001
Bug: mismatch value with speechbrain.nnet.pooling.statisticalpooling (2021) https://github.com/speechbrain/speechbrain/issues/1048
Cao J, Chen B, Sun C, Hu L, Peng X (2021) Characterizing performance bugs in deep learning systems. [Online]. Available: arXiv:2112.01771
Carta S, Corriga A, Ferreira A, Podda AS, Recupero DR (2021) A multi-layer and multi-ensemble stock trader using deep learning and deep reinforcement learning. Appl Intell 51(2):889–905
Chaturvedi K, Kapur P, Anand S, Singh V (2014) Predicting the complexity of code changes using entropy based measures. Int J Syst Assur Eng Manag 5(2):155–164
Chen Z, Cao Y, Liu Y, Wang H, Xie T, Liu X (2020) A comprehensive study on challenges in deploying deep learning based software. In: Proceedings of the 28th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, ser. ESEC/FSE 2020. New York, NY, USA: Association for Computing Machinery, pp 750–762. [Online]. Available: https://doi.org/10.1145/3368089.3409759
Chollet F et al (2018) Keras: the python deep learning library. Michigan, United States, pp ascl–1806
Cliff N (1993) Dominance statistics: Ordinal analyses to answer ordinal questions. Psychol Bull 114(3):494
developer guideline documentation G (2021) Github rest api. https://developer.github.com/v3/. Accessed 27 July 2021
Falotico R, Quatto P (2015) Fleiss’ kappa statistic without paradoxes. Quality & Quantity 49(2):463–470
Fix default ckpt path when logger exists (2020). https://github.com/Lightning-AI/lightning/pull/771
Fix docs for early stopping (2020). https://github.com/Lightning-AI/lightning/pull/865
Fix model architecture for deployment to onnx (2023). https://github.com/mehta-lab/microDL/pull/234
Galin D (2004) Software quality assurance: from theory to implementation. pearson.com: Pearson education
Gensim library (2022) https://radimrehurek.com/gensim_3.8.3/index.html
GitHub (2021) Github graphql api documentation. https://docs.github.com/en/graphql. Accessed 27 July 2021
Github (2022) https://github.com/
Grubb P, Takang AA (2003) Software maintenance: concepts and practice. World Scientific
Gupta S (2021) What is the best language for machine learning? https://www.springboard.com/blog/data-science/best-language-for-machine-learning. Accessed 06 Oct 2021
Hanam Q, Brito FSdM, Mesbah A (2016) Discovering bug patterns in javascript. In: Proceedings of the 2016 24th ACM SIGSOFT international symposium on foundations of software engineering, pp 144–156
HanLP (2021) Hanlp: han language processing. https://github.com/hankcs/HanLP. Accessed 01 Nov 2021
Hartling L, Hamm M, Milne A, Vandermeer B, Santaguida PL, Ansari M, Tsertsvadze A, Hempel S, Shekelle P, Dryden DM (2012) Validity and inter-rater reliability testing of quality assessment instruments
Hata H, Kula RG, Ishio T, Treude C (2021) Same file, different changes: the potential of meta-maintenance on github. In: Proceedings of the 43rd international conference on software engineering, IEEE. 3 Park Avenue, New York NY 10016-5997, USA: IEEE Press, pp 773–784. [Online]. Available: https://doi.org/10.1109/ICSE43902.2021.00076
Hu X, Chu L, Pei J, Liu W, Bian J (2021) Model complexity of deep learning: a survey. Knowl Inf Syst 63:2585–2619
Humbatova N, Jahangirova G, Bavota G, Riccio V, Stocco A, Tonella P (2020) Taxonomy of real faults in deep learning systems. In: Proceedings of the ACM/IEEE 42nd international conference on software engineering, ser. ICSE ’20. New York, USA: Association for Computing Machinery, pp 1110–1121. [Online]. Available: https://doi.org/10.1145/3377811.3380395
Humbatova N, Jahangirova G, Tonella P (2021) Deepcrime: mutation testing of deep learning systems based on real faults. In: Proceedings of the 30th ACM SIGSOFT international symposium on software testing and analysis, ser. ISSTA 2021. New York, USA: Association for Computing Machinery, pp 67–78. [Online]. Available: https://doi.org/10.1145/3460319.3464825
IEEE (2010) ISO/IEC/IEEE International Standard - Systems and software engineering – Vocabulary. 3 Park Avenue, New York 10016-5997, USA: IEEE
IEEE (2017) IEEE recommended practice on software reliability. 3 Park Avenue, New York 10016-5997, USA: IEEE
Interactive learning has server error hosting on docker (2019) https://github.com/RasaHQ/rasa/issues/4142
Islam MJ, Nguyen G, Pan R, Rajan H (2019) A comprehensive study on deep learning bug characteristics. In: Proceedings of the 2019 27th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, ser. ESEC/FSE 2019. New York, USA: Association for Computing Machinery, pp 510–520. [Online]. Available: https://doi.org/10.1145/3338906.3338955
Islam MJ, Pan R, Nguyen G, Rajan H (2020) Repairing deep neural networks: fix patterns and challenges. In: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ser. ICSE ’20. New York, USA: Association for Computing Machinery, pp 1135–1146. [Online]. Available: https://doi.org/10.1145/3377811.3380378
Jia L, Zhong H, Wang X, Huang L, Lu X (2021) The symptoms, causes, and repairs of bugs inside a deep learning library. J Syst Softw 177:110935
Joshua G, Yang F, Junjie S, Sumaya A, Yuan Chen X, Alfred Q (2020) A comprehensive study of autonomous vehicle bugs. In: Proceedings of the ACM/IEEE 42nd international conference on software engineering, ser. ICSE ’20. New York, USA: Association for Computing Machinery, pp 385–396. [Online]. Available: https://doi.org/10.1145/3377811.3380397
Kampenes VB, Dybå T, Hannay JE, Sjøberg DI (2007) A systematic review of effect size in software engineering experiments. Inf Softw Technol 49(11):1073–1086. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0950584907000195
Keras (2022) Formal documentation of keras apis. https://keras.io/api/models/
Kononenko O, Rose T, Baysal O, Godfrey M, Theisen D, De Water B (2018) Studying pull request merges: a case study of shopify’s active merchant. In: Proceedings of the 40th international conference on software engineering: software engineering in practice. 3 Park Avenue, New York 10016-5997, USA: IEEE, pp 124–133
Krishna R, Agrawal A, Rahman A, Sobran A, Menzies T (2018) What is the connection between issues, bugs, and enhancements? In: 2018 IEEE/ACM 40th International conference on software engineering: software engineering in practice track (ICSE-SEIP). 3 Park Avenue, New York 10016-5997, USA: IEEE, pp 306–315
Lau K-K, di Cola S (2017) An introduction to component-based software development. World Scientific Publishing Co Pte Ltd: World Scientific, [Online]. Available: https://www.worldscientific.com/doi/abs/10.1142/10486
Lenarduzzi V, Lomio F, Moreschini S, Taibi D, Tamburri DA (2021) Software quality for ai: where we are now? In: Winkler D, Biffl S, Mendez D, Wimmer M, Bergsmann J (eds) Software quality: future perspectives on software engineering quality. Springer International Publishing, Cham, pp 43–53
Liu Z, Li D, Ge SS, Tian F (2020) Small traffic sign detection from large image. Appl Intell 50(1):1–13
Liu C, Lu J, Li G, Yuan T, Li L, Tan F, Yang J, You L, Xue J (2021) Detecting tensorflow program bugs in real-world industrial environment. In: 2021 36th IEEE/ACM International conference on automated software engineering (ASE). 3 Park Avenue, New York 10016-5997, USA: IEEE, pp 55–66
Li S, Wu Y, Liu Y, Wang D, Wen M, Tao Y, Sui Y, Liu Y (2020) An exploratory study of bugs in extended reality applications on the web. In: 2020 IEEE 31st International symposium on software reliability engineering (ISSRE). 3 Park Avenue, New York 10016–5997, USA: IEEE, pp 172–183
Loading a checkpoint that was saved in pl < 1.2 still breaks (2021) https://github.com/Lightning-AI/lightning/issues/7400
Loading saved destvi.from_rna_model dosn’t consider two anndatas used to create model (2021) https://github.com/scverse/scvi-tools/issues/1087
Long G, Chen T (2022) On reporting performance and accuracy bugs for deep learning frameworks: an exploratory study from github. [Online]. Available: arXiv:2204.07893
Lyu MR (2007) Software reliability engineering: a roadmap. In: Future of software engineering (FOSE’07). 3 Park Avenue, New York 10016-5997, USA: IEEE, pp 153–170
Macbeth G, Razumiejczyk E, Ledesma RD (2011) Cliff’s delta calculator: a non-parametric effect size program for two groups of observations. Universitas Psychologica 10(2):545–555
Maddila C, Bansal C, Nagappan N (2019) Predicting pull request completion time: a case study on large scale cloud services. In: Proceedings of the 2019 27th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering, pp 874–882
Mallet library (2022) https://mimno.github.io/Mallet/
Martínez-Fernández S, Bogner J, Franch X, Oriol M, Siebert J, Trendowicz A, Vollmer AM, Wagner S (2021) Software engineering for ai-based systems: a survey
Menzies T (2019) The five laws of se for ai. IEEE Softw 37(1):81–85
Morovati MM, Nikanjam A, Khomh F, Jiang ZM (2023) Bugs in machine learning-based systems: a faultload benchmark. Empir Softw Eng 28(3):62
Ni Z, Li B, Sun X, Chen T, Tang B, Shi X (2020) Analyzing bug fix for automatic bug cause classification. J Syst Softw 163:110538
Nikanjam A, Braiek HB, Morovati MM, Khomh F (2021) Automatic fault detection for deep learning programs using graph transformations. ACM Trans Softw Eng Methodol (TOSEM) 31(1):1–27
Nikanjam A, Morovati MM, Khomh F, Ben Braiek H (2022) Faults in deep reinforcement learning programs: a taxonomy and a detection approach. Autom Softw Eng 29(1):1–32
NVIDIA (2021) Nvtabular. https://github.com/NVIDIA-Merlin/NVTabular. Accessed 01 Nov 2021
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) Pytorch: an imperative style, high-performance deep learning library. Ithaca, NY, United States
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830. [Online]. Available: http://scikit-learn.sourceforge.net
Pps: fix of ispixelhit (2021) https://github.com/cms-sw/cmssw/pull/35089
Quach S, Lamothe M, Kamei Y, Shang W (2021) An empirical study on the use of szz for identifying inducing changes of non-functional bugs. Empir Softw Eng 26(4):1–25
RasaHQ (2021a) https://github.com/RasaHQ/rasa/issues/8541. Accessed 01 Nov 2021
RasaHQ (2021b) https://github.com/RasaHQ/rasa/issues/4730. Accessed 01 Nov 2021
reduce_lr_on_plateau can’t find validation metrics in most recent release (2021) https://github.com/scverse/scvi-tools/issues/1112
Replication package (2023) https://github.com/ML-Bugs-2022/Replication-Package
Riccio V, Jahangirova G, Stocco A, Humbatova N, Weiss M, Tonella P (2020) Testing machine learning based systems: a systematic mapping. Empir Softw Eng 25(6):5193–5254
Rivera-Landos E, Khomh F, Nikanjam A (2021) The challenge of reproducible ml: an empirical study on the impact of bugs
Romano J, Kromrey JD, Coraggio J, Skowronek J, Devine L (2006) Exploring methods for evaluating group differences on the nsse and other surveys: are the t-test and cohen’sd indices the most appropriate choices. In: Annual meeting of the Southern association for institutional research. The Pennsylvania State University, Citeseer, pp 1–51
Romano A, Liu X, Kwon Y, Wang W (2021) An empirical study of bugs in webassembly compilers. In: 2021 36th IEEE/ACM International conference on automated software engineering (ASE). 3 Park Avenue, New York NY 10016-5997, USA: IEEE, pp 42–54
Schober P, Boer C, Schwarte LA (2018) Correlation coefficients: appropriate use and interpretation. Anesth Analg 126(5):1763–1768
Schoop E, Huang F, Hartmann B (2021) Umlaut: debugging deep learning programs using program structure and model behavior. In: Proceedings of the 2021 CHI conference on human factors in computing systems, ser. CHI ’21. New York, USA: Association for Computing Machinery, [Online]. Available: https://doi.org/10.1145/3411764.3445538
Sculley D, Holt G, Golovin D, Davydov E, Phillips T, Ebner D, Chaudhary V, Young M, Crespo J-F, Dennison D (2015) Hidden technical debt in machine learning systems. Adv Neural Inf Process Syst 28
Seaman CB (1999) Qualitative methods in empirical studies of software engineering. IEEE Trans Softw Eng 25(4):557–572
Shen Q, Ma H, Chen J, Tian Y, Cheung S-C, Chen X (2021) A comprehensive study of deep learning compiler bugs. In: Proceedings of the 29th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, ser. ESEC/FSE 2021. New York, USA: Association for Computing Machinery, pp 968–980. [Online]. Available: https://doi.org/10.1145/3468264.3468591
some tests fails with pytorch 1.6 (2020) https://github.com/speechbrain/speechbrain/issues/248
Tagra A, Zhang H, Rajbahadur GK, Hassan AE (2022) Revisiting reopened bugs in open source software systems. Empir Softw Eng 27(4):1–34
Tambon F, Nikanjam A, An L, Khomh F, Antoniol G (2021) Silent bugs in deep learning frameworks: an empirical study of keras and tensorflow
Tan L, Liu C, Li Z, Wang X, Zhou Y, Zhai C (2014) Bug characteristics in open source software. Empir Softw Eng 19:1665–1705
Tensorflow (2021) https://github.com/tensorflow/models. Accessed 01 Nov 2021
Tensorpack (2021) https://github.com/tensorpack/tensorpack. Accessed 01 Nov 2021
tf.function not used for model inference (2022). https://github.com/RasaHQ/rasa/issues/10728
ultralytics (2021) Yolov3. https://github.com/ultralytics/yolov3. Accessed 01 Nov 2021
Vlasic B, Boudette NE (2016) Self-driving tesla was involved in fatal crash, us says. The New York Times Company New York, NY, USA. [Online]. Available: https://www.nytimes.com/2016/07/01/business/self-driving-tesla-fatal-crash-investigation.html
Voskoglou C (2017) What is the best programming language for machine learning. Towards data science. [Online]. Available: https://towardsdatascience.com/what-is-the-best-programming-language-for-machine-learning-a745c156d6b7
Wang H, Pham H et al (2006) Reliability and optimal maintenance. Springer International Publishing, Springer, p 14197
Wardat M, Cruz BD, Le W, Rajan H (2022) Deepdiagnosis: automatically diagnosing faults and recommending actionable fixes in deep learning programs. In: Proceedings of the 44th international conference on software engineering, pp 561–572
Wardat M, Le W, Rajan H (2021) Deeplocalize: fault localization for deep neural networks. In: 2021 IEEE/ACM 43rd International conference on software engineering (ICSE). 3 Park Avenue, New York 10016-5997, USA: IEEE, pp 251–262
Wirsansky E (2020) Hands-on genetic algorithms with Python: applying genetic algorithms to solve real-world deep learning and artificial intelligence problems. Packt Publishing Ltd
Write complete json log after training (2019) https://github.com/snorkel-team/snorkel/pull/1445
Yan M, Chen J, Zhang X, Tan L, Wang G, Wang Z (2021) Exposing numerical bugs in deep learning via gradient back-propagation. In: Proceedings of the 29th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, ser. ESEC/FSE 2021. New York, USA: Association for Computing Machinery, pp 627–638. [Online]. Available: https://doi.org/10.1145/3468264.3468612
Yang Y, He T, Feng Y, Liu S, Xu B (2022) Mining python fix patterns via analyzing fine-grained source code changes. Empir Softw Eng 27(2):1–37
Yao Y, Xiao Z, Wang B, Viswanath B, Zheng H, Zhao BY (2017) Complexity vs. performance: empirical analysis of machine learning as a service. In: Proceedings of the 2017 internet measurement conference, pp 384–397
Zhang JM, Harman M, Ma L, Liu Y (2022) Machine learning testing: survey, landscapes and horizons. IEEE Trans Softw Eng 48(01):1–36
Zhang Y, Chen Y, Cheung S-C, Xiong Y, Zhang L (2018) An empirical study on tensorflow program bugs. In: Proceedings of the 27th ACM SIGSOFT international symposium on software testing and analysis. 3 Park Avenue, New York 10016-5997, USA: IEEE, pp 129–140
Zhang T, Gao C, Ma L, Lyu M, Kim M (2019) An empirical study of common challenges in developing deep learning applications. In: 2019 IEEE 30th International symposium on software reliability engineering (ISSRE). 3 Park Avenue, New York 10016–5997, USA: IEEE, pp 104–115
Zhang X, Zhai J, Ma S, Shen C (2021) Autotrainer: an automatic dnn training problem detection and repair system. In: 2021 IEEE/ACM 43rd International conference on software engineering (ICSE). IEEE, pp 359–371
Zimmermann T, Nagappan N, Guo PJ, Murphy B (2012) Characterizing and predicting which bugs get reopened. In: 2012 34th International conference on software engineering (ICSE). IEEE, pp 1074–1083
Acknowledgements
This work was supported by the Fonds de Recherche du Québec (FRQ), the Canadian Institute for Advanced Research (CIFAR), and the DEEL project CRDPJ 537462-18 funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Consortium for Research and Innovation in Aerospace in Québec (CRIAQ), together with its industrial partners Thales Canada inc, Bell Textron Canada Limited, CAE inc and Bombardier inc.
Additional information
Communicated by: Bowen Xu, Xiaofei Xie, Maxime Cordy, and Bibi Stamatia
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article belongs to the Topical Collection: Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE).
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Morovati, M.M., Nikanjam, A., Tambon, F. et al. Bug characterization in machine learning-based systems. Empir Software Eng 29, 14 (2024). https://doi.org/10.1007/s10664-023-10400-0
DOI: https://doi.org/10.1007/s10664-023-10400-0