Works for Me! Cannot Reproduce – A Large Scale Empirical Study of Non-reproducible Bugs

Rahman, Mohammad M.; Khomh, Foutse; Castelluccio, Marco

doi:10.1007/s10664-022-10153-2

Published: 30 May 2022

Volume 27, article number 111 (2022)
Empirical Software Engineering

Abstract

Software developers attempt to reproduce software bugs to understand their erroneous behaviours and to fix them. Unfortunately, they often fail to reproduce (or fix) them, which leads to faulty, unreliable software systems. To date, however, little research has been done to understand what makes software bugs non-reproducible. In this article, we conduct a multimodal study to better understand the non-reproducibility of software bugs. First, we perform an empirical study using 576 non-reproducible bug reports from two popular software systems (Firefox, Eclipse) and identify 11 key factors that might make a reported bug non-reproducible. Second, we conduct a user study involving 13 professional developers in which we investigate how developers cope with non-reproducible bugs. We found that they either close these bugs or solicit further information, which involves long deliberations and counter-productive manual searches. Third, by combining our analyses from multiple studies (e.g., empirical study, developer study), we offer several actionable insights on how to avoid non-reproducibility (e.g., false-positive bug report detector) and improve the reproducibility of reported bugs (e.g., sandbox for bug reproduction). Fourth, we explain the differences between reproducible and non-reproducible bug reports by systematically interpreting multiple machine learning models that classify these reports with high accuracy. We found that links to existing bug reports might help improve the reproducibility of a reported bug. Finally, we automatically detect the bug reports connected to a non-reproducible bug and further demonstrate how 93 bugs connected to 71 non-reproducible bugs from our dataset can offer complementary information (e.g., attachments, screenshots, program flows).
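The abstract does not say how connected bug reports are detected, so the following is only a minimal sketch of one plausible approach: scanning report text for Bugzilla-style references such as "bug 205". The pattern `BUG_REF`, the function `connected_bugs`, and the sample reports are all hypothetical and are not taken from the article.

```python
import re
from collections import defaultdict

# Assumption: connections appear in the text as "bug 1234" or "Bug #1234".
# This pattern is illustrative, not the authors' detection method.
BUG_REF = re.compile(r"\bbug\s*#?\s*(\d+)\b", re.IGNORECASE)


def connected_bugs(reports):
    """Map each report ID to the set of other bug IDs mentioned in its text."""
    links = defaultdict(set)
    for bug_id, text in reports.items():
        for match in BUG_REF.finditer(text):
            ref = int(match.group(1))
            if ref != bug_id:  # ignore self-references
                links[bug_id].add(ref)
    return dict(links)


# Made-up sample reports, keyed by hypothetical bug ID.
reports = {
    101: "Cannot reproduce. Possibly related to bug 205; see also Bug #317.",
    205: "Crash on startup, stack trace attached.",
    317: "Same symptom as bug 101 after the update.",
}
links = connected_bugs(reports)
```

In this toy data, the connected report 205 carries a stack trace that the non-reproducible report 101 lacks, which is the kind of complementary information the abstract mentions.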

Acknowledgment

This work was supported by the Fonds de Recherche du Québec (FRQ), the Natural Sciences and Engineering Research Council of Canada (NSERC), and a tenure-track startup grant from the Faculty of Computer Science, Dalhousie University, Canada. We would also like to thank all the anonymous respondents to the survey.

Author information

Authors and Affiliations

Dalhousie University, Halifax, Canada
Mohammad M. Rahman
Polytechnique Montréal, Montréal, Canada
Foutse Khomh
Mozilla Corporation, Mountain View, California, USA
Marco Castelluccio

Corresponding author

Correspondence to Mohammad M. Rahman.

Additional information

Communicated by: Zhenchang Xing and Kelly Blincoe

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Software Maintenance and Evolution (ICSME)

Appendix A: Feature Importance from Models Trained with Extended Dataset

About this article

Cite this article

Rahman, M.M., Khomh, F. & Castelluccio, M. Works for Me! Cannot Reproduce – A Large Scale Empirical Study of Non-reproducible Bugs. Empir Software Eng 27, 111 (2022). https://doi.org/10.1007/s10664-022-10153-2

Accepted: 22 March 2022
Published: 30 May 2022
DOI: https://doi.org/10.1007/s10664-022-10153-2
