Abstract
Software developers attempt to reproduce software bugs to understand their erroneous behaviours and to fix them. Unfortunately, they often fail to reproduce (or fix) them, which leads to faulty, unreliable software systems. To date, however, little research has examined what makes software bugs non-reproducible. In this article, we conduct a multimodal study to better understand the non-reproducibility of software bugs. First, we perform an empirical study using 576 non-reproducible bug reports from two popular software systems (Firefox, Eclipse) and identify 11 key factors that might make a reported bug non-reproducible. Second, we conduct a user study involving 13 professional developers, in which we investigate how developers cope with non-reproducible bugs. We found that they either close these bugs or solicit further information, which involves long deliberations and counter-productive manual searches. Third, we offer several actionable insights on how to avoid non-reproducibility (e.g., a false-positive bug report detector) and improve the reproducibility of reported bugs (e.g., a sandbox for bug reproduction) by combining our analyses from multiple studies (e.g., empirical study, developer study). Fourth, we explain the differences between reproducible and non-reproducible bug reports by systematically interpreting multiple machine learning models that classify these reports with high accuracy. We found that links to existing bug reports might help improve the reproducibility of a reported bug. Finally, we automatically detect the bug reports connected to a non-reproducible bug and demonstrate how 93 bugs connected to 71 non-reproducible bugs from our dataset can offer complementary information (e.g., attachments, screenshots, program flows).
Acknowledgment
This work was supported by Fonds de Recherche du Quebec (FRQ), the Natural Sciences and Engineering Research Council of Canada (NSERC), and a tenure-track startup grant from the Faculty of Computer Science, Dalhousie University, Canada. We would also like to thank all the anonymous respondents to the survey.
Additional information
Communicated by: Zhenchang Xing and Kelly Blincoe
This article belongs to the Topical Collection: Software Maintenance and Evolution (ICSME)
Appendix A: Feature Importance from Models Trained with Extended Dataset
About this article
Cite this article
Rahman, M.M., Khomh, F. & Castelluccio, M. Works for Me! Cannot Reproduce – A Large Scale Empirical Study of Non-reproducible Bugs. Empir Software Eng 27, 111 (2022). https://doi.org/10.1007/s10664-022-10153-2
DOI: https://doi.org/10.1007/s10664-022-10153-2