Works for Me! Cannot Reproduce – A Large Scale Empirical Study of Non-reproducible Bugs

Empirical Software Engineering


Software developers attempt to reproduce software bugs to understand their erroneous behaviours and to fix them. Unfortunately, they often fail to reproduce (or fix) them, which leads to faulty, unreliable software systems. However, to date, little research has been done to understand what makes software bugs non-reproducible. In this article, we conduct a multimodal study to better understand the non-reproducibility of software bugs. First, we perform an empirical study using 576 non-reproducible bug reports from two popular software systems (Firefox, Eclipse) and identify 11 key factors that might make a reported bug non-reproducible. Second, we conduct a user study involving 13 professional developers in which we investigate how developers cope with non-reproducible bugs. We found that they either close these bugs or solicit further information, which involves long deliberations and counter-productive manual searches. Third, by combining our analyses from multiple studies (e.g., empirical study, developer study), we offer several actionable insights on how to avoid non-reproducibility (e.g., false-positive bug report detector) and improve the reproducibility of reported bugs (e.g., sandbox for bug reproduction). Fourth, we explain the differences between reproducible and non-reproducible bug reports by systematically interpreting multiple machine learning models that classify these reports with high accuracy. We found that links to existing bug reports might help improve the reproducibility of a reported bug. Finally, we automatically detect the bug reports connected to a non-reproducible bug and further demonstrate how 93 bugs connected to 71 non-reproducible bugs from our dataset can offer complementary information (e.g., attachments, screenshots, program flows).


This work was supported by the Fonds de Recherche du Quebec (FRQ), the Natural Sciences and Engineering Research Council of Canada (NSERC), and a tenure-track startup grant from the Faculty of Computer Science, Dalhousie University, Canada. We also thank all the anonymous respondents to the survey.

Author information

Corresponding author

Correspondence to Mohammad M. Rahman.

Additional information

Communicated by: Zhenchang Xing and Kelly Blincoe

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Software Maintenance and Evolution (ICSME)

Appendix A: Feature Importance from Models Trained with Extended Dataset

Fig. 15: Feature importance using bee swarm plot (RandomForest model)

Fig. 16: Feature importance using bee swarm plot (Logistic Regression model)
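
Bee swarm plots of this kind typically display one attribution value per feature per bug report. As a rough illustration of what such an attribution is, the following minimal sketch computes exact Shapley values for a single prediction by enumerating all feature subsets. It is not the authors' pipeline: the scoring function `toy_score`, its weights, and the feature names (`has_linked_reports`, `has_steps_to_reproduce`, `has_attachment`) are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance via subset enumeration.

    Features outside a subset take their value from `baseline`.
    Cost grows exponentially with the number of features, so this
    is only practical for a handful of features.
    """
    names = list(instance)
    n = len(names)

    def value(subset):
        # Mix instance values (for features in `subset`) with baseline values.
        mixed = {f: (instance[f] if f in subset else baseline[f]) for f in names}
        return predict(mixed)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for k in range(n):
            # Shapley weight for subsets of size k that exclude f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


# Hypothetical toy model: names and weights are assumptions, not from the study.
def toy_score(x):
    return (0.5 * x["has_linked_reports"]
            + 0.3 * x["has_steps_to_reproduce"]
            + 0.2 * x["has_linked_reports"] * x["has_attachment"])


instance = {"has_linked_reports": 1, "has_steps_to_reproduce": 0, "has_attachment": 1}
baseline = {"has_linked_reports": 0, "has_steps_to_reproduce": 0, "has_attachment": 0}

phi = shapley_values(toy_score, instance, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
```

The attributions sum to the difference between the score of the instance and the score of the baseline, which is the property that lets a plot stack per-feature contributions for each report. Libraries built for tree ensembles use faster approximations or model-specific algorithms rather than this brute-force enumeration.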

About this article

Cite this article

Rahman, M.M., Khomh, F. & Castelluccio, M. Works for Me! Cannot Reproduce – A Large Scale Empirical Study of Non-reproducible Bugs. Empir Software Eng 27, 111 (2022).
