DENATURE: duplicate detection and type identification in open source bug repositories

  • Original Article
  • International Journal of System Assurance Engineering and Management

Abstract

Software projects rely on bug tracking systems to guide software maintenance activities. Bug reports submitted to bug repositories carry the critical information about the nature of a crash. This information is free-form text submitted by users or developers. Bug repositories accumulate a large number of bug reports, and many of the submitted reports are mere duplicates of existing bugs. Furthermore, not all non-duplicate bugs are reproducible. This paper introduces DENATURE, a two-step framework for detecting duplicate reports and identifying bug type. The proposed framework will help to minimize the time and developer effort spent on resolving bug reports, which will in turn improve overall software quality. Information retrieval techniques are used to find duplicate bugs, and machine learning classification techniques are used to identify the type of each bug report. In our experiments, the proposed framework achieved a prediction accuracy of up to 88.81%.
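The two-step idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which retrieval model or classifier DENATURE uses, so the sketch assumes TF-IDF cosine similarity for step 1 and a multinomial naive Bayes classifier for step 2, and it assumes that "bug type" means reproducible versus non-reproducible. The names `find_duplicate`, `NaiveBayes`, and `triage`, the 0.5 similarity threshold, and the sample reports are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    # Smoothed IDF keeps terms that occur in every document non-zero.
    return [
        {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_duplicate(new_report, existing_reports, threshold=0.5):
    """Step 1 (assumed): index of the most similar existing report, or None."""
    if not existing_reports:
        return None
    vectors = tfidf_vectors(existing_reports + [new_report])
    scores = [cosine(vectors[-1], v) for v in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None


class NaiveBayes:
    """Step 2 (assumed): multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {t for counts in self.word_counts.values() for t in counts}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        tokens = [t for t in tokenize(text) if t in self.vocab]

        def log_score(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / total)
            return score + sum(math.log((counts[t] + 1) / denom) for t in tokens)

        return max(self.class_counts, key=log_score)


def triage(new_report, existing_reports, classifier):
    """Run step 1, and run step 2 only when the report is not a duplicate."""
    dup = find_duplicate(new_report, existing_reports)
    if dup is not None:
        return ("duplicate", dup)
    return (classifier.predict(new_report), None)


# Hypothetical sample data, for illustration only.
existing = [
    "Editor crashes when opening large XML file",
    "Toolbar icons missing after update",
]
clf = NaiveBayes().fit(
    [
        "steps to reproduce always crash on save",
        "null pointer exception every time on save",
        "intermittent freeze cannot reproduce",
        "random hang works for me",
    ],
    ["reproducible", "reproducible", "non-reproducible", "non-reproducible"],
)
print(triage("Crash when opening a large XML file in editor", existing, clf))
print(triage("Intermittent freeze on startup, cannot reproduce reliably", existing, clf))
```

Running the duplicate check first means the classifier only sees reports that still need a developer's attention, which matches the ordering the abstract describes. A real system would tune the threshold and train on labeled repository data.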

Funding

The authors did not receive support from any organization for the submitted work.

Author information

Corresponding author

Correspondence to Anjali Goyal.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and animal rights

This study did not involve any human participants or animals.

Informed consent

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chauhan, R., Sharma, S. & Goyal, A. DENATURE: duplicate detection and type identification in open source bug repositories. Int J Syst Assur Eng Manag 14 (Suppl 1), 275–292 (2023). https://doi.org/10.1007/s13198-023-01855-x

  • DOI: https://doi.org/10.1007/s13198-023-01855-x
