Abstract
Many recent software engineering papers have examined duplicate issue reports. Thus far, duplicate reports have been considered a hindrance to developers and a drain on their resources. As a result, prior research in this area focuses on proposing automated approaches to accurately identify duplicate reports. However, no study has attempted to quantify the actual effort that is spent on identifying duplicate issue reports. In this paper, we empirically examine the effort that is needed for manually identifying duplicate reports in four open source projects, i.e., Firefox, SeaMonkey, Bugzilla and Eclipse-Platform. Our results show that: (i) More than 50% of the duplicate reports are identified within half a day. Most of the duplicate reports are identified without any discussion and with the involvement of very few people; (ii) A classification model built using a set of factors that are extracted from duplicate issue reports classifies duplicates according to the effort that is needed to identify them with a precision of 0.60 to 0.77, a recall of 0.23 to 0.96, and an ROC area of 0.68 to 0.80; and (iii) Factors that capture the developer awareness of the duplicate issue’s peers (i.e., other duplicates of that issue) and the textual similarity of a new report to prior reports are the most influential factors in our models. Our findings highlight the need for effort-aware evaluation of approaches that identify duplicate issue reports, since identifying a considerable number of duplicate reports (over 50%) appears to be a relatively trivial task for developers. To better assist developers, research on identifying duplicate issue reports should put greater emphasis on helping developers identify effort-consuming duplicate issues.
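As a reading aid for the three measures quoted above, the following is a minimal sketch of how precision, recall and ROC area can be computed for a binary classifier. It is not the authors' implementation (the reference list points to the randomForest and pROC R packages); the function names, the label convention (1 = effort-consuming duplicate, 0 = otherwise) and the toy data are assumptions made purely for illustration.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def roc_area(y_true, scores):
    """ROC area as the probability that a randomly chosen positive
    is scored above a randomly chosen negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = effort-consuming duplicate (assumed labeling).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]          # classifier confidence
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

precision, recall = precision_recall(y_true, y_pred)
auc = roc_area(y_true, scores)
print(round(precision, 2), round(recall, 2), round(auc, 2))
```

An ROC area of 0.5 corresponds to random guessing and 1.0 to perfect ranking, which is the scale on which the reported range of 0.68 to 0.80 should be read.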

Notes
Issue triaging is the task of determining if an issue report describes a meaningful new problem or enhancement, so it can be assigned to an appropriate developer for further handling (Anvik et al. 2006).
Replication package: http://sailhome.cs.queensu.ca/replication/EMSE2015_DuplicateReports/
Release notes for Bugzilla 4.0: https://www.bugzilla.org/releases/4.0/release-notes.html
References
Aggarwal K, Rutgers T, Timbers F, Hindle A, Greiner R, Stroulia E (2015) Detecting duplicate bug reports with software engineering domain knowledge. In: SANER 2015: International conference on software analysis, evolution and reengineering. IEEE, pp 211–220
Alipour A, Hindle A, Stroulia E (2013) A contextual approach towards more accurate duplicate bug report detection. In: MSR 2013: Proceedings of the 10th working conference on mining software repositories, pp 183–192
Angrist JD, Pischke JS (2008) Mostly harmless econometrics: An empiricist’s companion. Princeton University Press, Princeton
Anvik J, Hiew L, Murphy GC (2005) Coping with an open bug repository. In: Eclipse 2005: Proceedings of the 2005 OOPSLA Workshop on Eclipse Technology eXchange. ACM, pp 35–39
Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug? In: ICSE 2006: Proceedings of the 28th international conference on software engineering. ACM, pp 361–370
Bertram D, Voida A, Greenberg S, Walker R (2010) Communication, collaboration, and bugs: The social nature of issue tracking in small, collocated teams. In: CSCW 2010: Proceedings of the ACM conference on computer supported cooperative work. ACM, pp 291–300
Bettenburg N, Just S, Schröter A, Weiss C, Premraj R, Zimmermann T (2007) Quality of bug reports in Eclipse. In: Eclipse 2007: Proceedings of the 2007 OOPSLA Workshop on Eclipse Technology eXchange. ACM, New York, pp 21–25
Bettenburg N, Just S, Schröter A, Weiss C, Premraj R, Zimmermann T (2008a) What makes a good bug report? In: SIGSOFT ’08/FSE-16: Proceedings of the 16th ACM SIGSOFT international symposium on foundations of software engineering. ACM, New York, pp 308–318
Bettenburg N, Premraj R, Zimmermann T, Kim S (2008b) Duplicate bug reports considered harmful ... really? In: ICSM 2008: Proceedings of the IEEE international conference on software maintenance. IEEE, pp 337–345
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Cavalcanti YC, Da Mota Silveira Neto PA, de Almeida ES, Lucrédio D, da Cunha CEA, de Lemos Meira SR (2010) One step more to understand the bug report duplication problem. In: SBES 2010: Brazilian symposium on software engineering. IEEE, pp 148–157
Cavalcanti YC, Neto PAdMS, Lucrédio D, Vale T, de Almeida ES, de Lemos Meira SR (2013) The bug report duplication problem: an exploratory study. Softw Qual J 21(1):39–66
Chavent M, Kuentz V, Liquet B, Saracco J (2015) Variable Clustering. http://svitsrv25.epfl.ch/R-doc/library/Hmisc/html/varclus.html
Davidson JL, Mohan N, Jensen C (2011) Coping with duplicate bug reports in free/open source software projects. In: VL/HCC 2011: IEEE symposium on visual languages and Human-Centric computing. IEEE, pp 101–108
Deerwester SC, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
Ghotra B, McIntosh S, Hassan AE (2015) Revisiting the impact of classification techniques on the performance of defect prediction models. In: ICSE 2015: Proceedings of the 37th international conference on software engineering
Jalbert N, Weimer W (2008) Automated duplicate detection for bug tracking systems. In: DSN 2008: Proceedings of the IEEE international conference on dependable systems and networks with FTCS and DCC. IEEE, pp 52–61
Jiang Y, Cukic B, Menzies T (2008) Can data transformation help in the detection of fault-prone modules? In: Proceedings of the 2008 workshop on Defects in large software systems. ACM, pp 16–20
Kamei Y, Matsumoto S, Monden A, Matsumoto Ki, Adams B, Hassan AE (2010) Revisiting common bug prediction findings using effort-aware models. In: ICSM 2010: IEEE international conference on software maintenance. IEEE, pp 1–10
Kampstra P, et al. (2008) Beanplot: A boxplot alternative for visual comparison of distributions
Kanaris I, Kanaris K, Houvardas I, Stamatatos E (2007) Words versus character n-grams for anti-spam filtering. Int J Artif Intell Tools 16(06):1047–1067
Kanerva P, Kristofersson J, Holst A (2000) Random indexing of text samples for latent semantic analysis. In: Proceedings of the 22nd annual conference of the cognitive science society, vol 1036. Citeseer
Kaushik N, Tahvildari L (2012) A comparative study of the performance of IR models on duplicate bug detection. In: CSMR 2012: Proceedings of the 16th European conference on software maintenance and reengineering. IEEE Computer Society, pp 159–168
Koponen T (2006) Life cycle of defects in open source software projects. In: Open Source Systems. Springer, pp 195–200
Lazar A, Ritchey S, Sharif B (2014) Improving the accuracy of duplicate bug report detection using textual similarity measures. In: MSR 2014: Proceedings of the 11th working conference on mining software repositories. ACM, pp 308–311
Lerch J, Mezini M (2013) Finding duplicates of your yet unwritten bug report. In: CSMR 2013: 17th European conference on software maintenance and reengineering. IEEE, pp 69–78
Lessmann S, Baesens B, Mues C, Pietsch S (2008) Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Trans Softw Eng 34(4):485–496
Liaw A, Wiener M (2014) Random Forest R package. http://cran.r-project.org/web/packages/randomForest/randomForest.pdf
McIntosh S, Kamei Y, Adams B, Hassan AE (2015) An empirical study of the impact of modern code review practices on software quality. Empir Softw Eng 1–44
Mitchell MW (2011) Bias of the random forest out-of-bag (OOB) error for certain input parameters. Open J Stat 1(03):205
Mockus A, Weiss DM (2000) Predicting risk of software changes. Bell Labs Tech J 5(2):169–180
Nagwani NK, Singh P (2009) Weight similarity measurement model based, object oriented approach for bug databases mining to detect similar and duplicate bugs. In: ICAC 2009: Proceedings of the international conference on advances in computing, communication and control. ACM, pp 202–207
Neter J, Kutner MH, Nachtsheim CJ, Wasserman W (1996) Applied linear statistical models, vol 4. Irwin, Chicago
Prifti T, Banerjee S, Cukic B (2011) Detecting bug duplicate reports through local references. In: PROMISE 2011: Proceedings of the 7th international conference on predictive models in software engineering. ACM, pp 8:1–8:9
Robertson S, Zaragoza H, Taylor M (2004) Simple BM25 extension to multiple weighted fields. In: CIKM 2004: Proceedings of the Thirteenth ACM international conference on information and knowledge management. ACM, pp 42–49
Robnik-Šikonja M (2004) Improving random forests. In: Machine Learning: ECML 2004. Springer, pp 359–370
Runeson P, Alexandersson M, Nyholm O (2007) Detection of duplicate defect reports using natural language processing. In: ICSE 2007: Proceedings of the 29th international conference on software engineering. IEEE Computer Society, pp 499–510
Scott A, Knott M (1974) A cluster analysis method for grouping means in the analysis of variance. Biometrics 507–512
Sun C, Lo D, Khoo SC, Jiang J (2011) Towards more accurate retrieval of duplicate bug reports. In: ASE 2011: Proceedings of the 26th IEEE/ACM international conference on automated software engineering. IEEE, pp 253–262
Sun C, Lo D, Wang X, Jiang J, Khoo SC (2010) A discriminative model approach for accurate duplicate bug report retrieval. In: ICSE 2010: Proceedings of the 32nd ACM/IEEE international conference on software engineering. ACM, pp 45–54
Sureka A, Jalote P (2010) Detecting duplicate bug report using character n-gram-based features. In: APSEC 2010: Proceedings of the Asia Pacific software engineering conference. IEEE Computer Society, pp 366–374
Tantithamthavorn C, McIntosh S, Hassan AE, Ihara A, Matsumoto Ki, Ghotra B, Kamei Y, Adams B, Morales R, Khomh F, et al. (2015) The impact of mislabelling on the performance and interpretation of defect prediction models. In: ICSE 2015: Proceedings of the 37th international conference on software engineering
Wang X, Zhang L, Xie T, Anvik J, Sun J (2008) An approach to detecting duplicate bug reports using natural language and execution information. In: ICSE 2008: Proceedings of the 30th international conference on software engineering. ACM, pp 461–470
Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, Müller M (2015) pROC R package. http://cran.r-project.org/web/packages/pROC/pROC.pdf
Additional information
Communicated by: Emerson Murphy-Hill
Cite this article
Rakha, M.S., Shang, W. & Hassan, A.E. Studying the needed effort for identifying duplicate issues. Empir Software Eng 21, 1960–1989 (2016). https://doi.org/10.1007/s10664-015-9404-6
DOI: https://doi.org/10.1007/s10664-015-9404-6