
Studying the needed effort for identifying duplicate issues


Many recent software engineering papers have examined duplicate issue reports. Thus far, duplicate reports have been considered a hindrance to developers and a drain on their resources. As a result, prior research in this area has focused on proposing automated approaches to accurately identify duplicate reports. However, no prior study has attempted to quantify the actual effort that is spent on identifying duplicate issue reports. In this paper, we empirically examine the effort that is needed to manually identify duplicate reports in four open source projects: Firefox, SeaMonkey, Bugzilla, and Eclipse-Platform. Our results show that: (i) more than 50% of the duplicate reports are identified within half a day, and most duplicate reports are identified without any discussion and with the involvement of very few people; (ii) a classification model built using a set of factors extracted from duplicate issue reports classifies duplicates according to the effort needed to identify them with a precision of 0.60 to 0.77, a recall of 0.23 to 0.96, and an ROC area of 0.68 to 0.80; and (iii) factors that capture developer awareness of the duplicate issue’s peers (i.e., other duplicates of that issue) and the textual similarity of a new report to prior reports are the most influential factors in our models. Our findings highlight the need for effort-aware evaluation of approaches that identify duplicate issue reports, since identifying a considerable proportion of duplicate reports (over 50%) appears to be a relatively trivial task for developers. To better assist developers, research on identifying duplicate issue reports should put greater emphasis on assisting developers in identifying the effort-consuming duplicate issues.
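The abstract's second finding rests on evaluating a classifier of identification effort by its precision and recall for the effort-consuming class. The following is a minimal illustrative sketch of that evaluation; the labels and predictions are made-up placeholder data, not the paper's dataset, and the helper function is hypothetical, not part of the paper's replication package:

```python
# Sketch of the effort-classification evaluation described in the abstract.
# Labels: 1 = "effort-consuming" duplicate, 0 = quickly identified duplicate.

def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive (effort-consuming) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy ground truth and predictions from a hypothetical model.
y_true = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0, 1, 0]
p, r = precision_recall(y_true, y_pred)
print(round(p, 2), round(r, 2))  # prints: 0.8 0.8
```

In the paper's setting, the predictions would come from a random forest trained on factors such as peer awareness and textual similarity, and ROC area would be computed over the model's class probabilities rather than hard labels.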






  3. Issue triaging is the task of determining if an issue report describes a meaningful new problem or enhancement, so it can be assigned to an appropriate developer for further handling (Anvik et al. 2006).

  4. Replication package:

  5. Release notes for Bugzilla 4.0:


  • Aggarwal K, Rutgers T, Timbers F, Hindle A, Greiner R, Stroulia E (2015) Detecting duplicate bug reports with software engineering domain knowledge. In: SANER 2015: International conference on software analysis, evolution and reengineering. IEEE, pp 211–220

  • Alipour A, Hindle A, Stroulia E (2013) A contextual approach towards more accurate duplicate bug report detection. In: MSR 2013: Proceedings of the 10th working conference on mining software repositories, pp 183–192

  • Angrist JD, Pischke JS (2008) Mostly harmless econometrics: An empiricist’s companion. Princeton university press, Princeton


  • Anvik J, Hiew L, Murphy GC (2005) Coping with an open bug repository. In: Eclipse 2005: Proceedings of the 2005 OOPSLA Workshop on Eclipse Technology eXchange. ACM, pp 35–39

  • Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug?. In: ICSE 2006: Proceedings of the 28th international conference on software engineering. ACM, pp 361–370

  • Bertram D, Voida A, Greenberg S, Walker R (2010) Communication, collaboration, and bugs: The social nature of issue tracking in small, collocated teams. In: CSCW 2010: Proceedings of the ACM conference on computer supported cooperative work. ACM, pp 291–300

  • Bettenburg N, Just S, Schröter A, Weiss C, Premraj R, Zimmermann T (2007) Quality of bug reports in eclipse. In: Eclipse 2007: Proceedings of the 2007 OOPSLA Workshop on Eclipse Technology eXchange. ACM, New York, pp 21–25


  • Bettenburg N, Just S, Schröter A, Weiss C, Premraj R, Zimmermann T (2008a) What makes a good bug report?. In: SIGSOFT ’08/FSE-16: Proceedings of the 16th ACM SIGSOFT international symposium on foundations of software engineering. ACM, New York, pp 308–318


  • Bettenburg N, Premraj R, Zimmermann T, Kim S (2008b) Duplicate bug reports considered harmful really?. In: ICSM 2008: Proceedings of the IEEE international conference on software maintenance. IEEE, pp 337–345

  • Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022


  • Breiman L (2001) Random forests. Mach Learn 45(1):5–32


  • Cavalcanti YC, Da Mota Silveira Neto PA, de Almeida ES, Lucrédio D, da Cunha CEA, de Lemos Meira SR (2010) One step more to understand the bug report duplication problem. In: SBES 2010: Brazilian symposium on software engineering. IEEE, pp 148–157

  • Cavalcanti YC, Neto PAdMS, Lucrédio D, Vale T, de Almeida ES, de Lemos Meira SR (2013) The bug report duplication problem: an exploratory study. Softw Qual J 21(1):39–66


  • Chavent M, Kuentz V, Liquet B, Saracco J (2015) Variable Clustering.

  • Davidson JL, Mohan N, Jensen C (2011) Coping with duplicate bug reports in free/open source software projects. In: VL/HCC 2011: IEEE symposium on visual languages and Human-Centric computing. IEEE, pp 101–108

  • Deerwester SC, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent semantic analysis. JASIS 41(6):391–407


  • Ghotra B, McIntosh S, Hassan AE (2015) Revisiting the impact of classification techniques on the performance of defect prediction models. In: ICSE 2015: Proceedings of the 37th international conference on software engineering

  • Jalbert N, Weimer W (2008) Automated duplicate detection for bug tracking systems. In: DSN 2008: Proceedings of the IEEE international conference on dependable systems and networks with FTCS and DCC. IEEE, pp 52–61

  • Jiang Y, Cukic B, Menzies T (2008) Can data transformation help in the detection of fault-prone modules?. In: Proceedings of the 2008 workshop on Defects in large software systems. ACM, pp 16–20

  • Kamei Y, Matsumoto S, Monden A, Matsumoto Ki, Adams B, Hassan AE (2010) Revisiting common bug prediction findings using effort-aware models. In: ICSM 2010: IEEE international conference on software maintenance. IEEE, pp 1–10

  • Kampstra P, et al. (2008) Beanplot: A boxplot alternative for visual comparison of distributions

  • Kanaris I, Kanaris K, Houvardas I, Stamatatos E (2007) Words versus character n-grams for anti-spam filtering. Int J Artif Intell Tools 16(06):1047–1067


  • Kanerva P, Kristofersson J, Holst A (2000) Random indexing of text samples for latent semantic analysis. In: Proceedings of the 22nd annual conference of the cognitive science society, vol 1036. Citeseer

  • Kaushik N, Tahvildari L (2012) A comparative study of the performance of IR models on duplicate bug detection. In: CSMR 2012: Proceedings of the 16th European conference on software maintenance and reengineering. IEEE Computer Society, pp 159–168

  • Koponen T (2006) Life cycle of defects in open source software projects. In: Open Source Systems. Springer, pp 195–200

  • Lazar A, Ritchey S, Sharif B (2014) Improving the accuracy of duplicate bug report detection using textual similarity measures. In: MSR 2014: Proceedings of the 11th working conference on mining software repositories. ACM, pp 308–311

  • Lerch J, Mezini M (2013) Finding duplicates of your yet unwritten bug report. In: CSMR 2013: 17th European conference on software maintenance and reengineering. IEEE, pp 69–78

  • Lessmann S, Baesens B, Mues C, Pietsch S (2008) Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Trans Softw Eng 34(4):485–496


  • Liaw A, Wiener M (2014) Random Forest R package.

  • McIntosh S, Kamei Y, Adams B, Hassan AE (2015) An empirical study of the impact of modern code review practices on software quality. Empirical Software Engineering 1–44

  • Mitchell MW (2011) Bias of the random forest out-of-bag (oob) error for certain input parameters. Open J Stat 1(03):205


  • Mockus A, Weiss DM (2000) Predicting risk of software changes. Bell Labs Tech J 5(2):169–180


  • Nagwani NK, Singh P (2009) Weight similarity measurement model based, object oriented approach for bug databases mining to detect similar and duplicate bugs. In: ICAC 2009: Proceedings of the international conference on advances in computing, communication and control. ACM, pp 202–207

  • Neter J, Kutner MH, Nachtsheim CJ, Wasserman W (1996) Applied linear statistical models, vol. 4. Irwin Chicago

  • Prifti T, Banerjee S, Cukic B (2011) Detecting bug duplicate reports through local references. In: PROMISE 2011: Proceedings of the 7th international conference on predictive models in software engineering. ACM, pp 8:1–8:9

  • Robertson S, Zaragoza H, Taylor M (2004) Simple BM25 extension to multiple weighted fields. In: CIKM 2004: Proceedings of the thirteenth ACM international conference on information and knowledge management. ACM, pp 42–49

  • Robnik-Šikonja M (2004) Improving random forests. In: Machine Learning: ECML 2004. Springer, pp 359–370

  • Runeson P, Alexandersson M, Nyholm O (2007) Detection of duplicate defect reports using natural language processing. In: ICSE 2007: Proceedings of the 29th international conference on software engineering. IEEE Computer Society, pp 499–510

  • Scott A, Knott M (1974) A cluster analysis method for grouping means in the analysis of variance. Biometrics 507–512

  • Sun C, Lo D, Khoo SC, Jiang J (2011) Towards more accurate retrieval of duplicate bug reports. In: ASE 2011: Proceedings of the 26th IEEE/ACM international conference on automated software engineering. IEEE, pp 253–262

  • Sun C, Lo D, Wang X, Jiang J, Khoo SC (2010) A discriminative model approach for accurate duplicate bug report retrieval. In: ICSE 2010: Proceedings of the 32nd ACM/IEEE international conference on software engineering. ACM, pp 45–54

  • Sureka A, Jalote P (2010) Detecting duplicate bug report using character n-gram-based features. In: APSEC 2010: Proceedings of the Asia Pacific software engineering conference. IEEE Computer Society, pp 366–374

  • Tantithamthavorn C, McIntosh S, Hassan AE, Ihara A, Matsumoto Ki, Ghotra B, Kamei Y, Adams B, Morales R, Khomh F, et al. (2015) The impact of mislabelling on the performance and interpretation of defect prediction models. In: ICSE 2015: Proceedings of the 37th international conference on software engineering

  • Wang X, Zhang L, Xie T, Anvik J, Sun J (2008) An approach to detecting duplicate bug reports using natural language and execution information. In: ICSE 2008: Proceedings of the 30th international conference on software engineering. ACM, pp 461–470

  • Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, Müller M (2015) pROC R package.


Author information



Corresponding author

Correspondence to Mohamed Sami Rakha.

Additional information

Communicated by: Emerson Murphy-Hill



Cite this article

Rakha, M.S., Shang, W. & Hassan, A.E. Studying the needed effort for identifying duplicate issues. Empir Software Eng 21, 1960–1989 (2016).
