Empirical Software Engineering, Volume 21, Issue 5, pp 1960–1989

Studying the needed effort for identifying duplicate issues

  • Mohamed Sami Rakha
  • Weiyi Shang
  • Ahmed E. Hassan


Many recent software engineering papers have examined duplicate issue reports. Thus far, duplicate reports have been considered a hindrance to developers and a drain on their resources. As a result, prior research in this area focuses on proposing automated approaches to accurately identify duplicate reports. However, no studies have attempted to quantify the actual effort that is spent on identifying duplicate issue reports. In this paper, we empirically examine the effort that is needed for manually identifying duplicate reports in four open source projects, i.e., Firefox, SeaMonkey, Bugzilla and Eclipse-Platform. Our results show that: (i) More than 50% of the duplicate reports are identified within half a day. Most of the duplicate reports are identified without any discussion and with the involvement of very few people; (ii) A classification model built using a set of factors that are extracted from duplicate issue reports classifies duplicates according to the effort that is needed to identify them with a precision of 0.60 to 0.77, a recall of 0.23 to 0.96, and an ROC area of 0.68 to 0.80; and (iii) Factors that capture the developer awareness of the duplicate issue’s peers (i.e., other duplicates of that issue) and textual similarity of a new report to prior reports are the most influential factors in our models. Our findings highlight the need for effort-aware evaluation of approaches that identify duplicate issue reports, since the identification of a considerable number of duplicate reports (over 50%) appears to be a relatively trivial task for developers. To better assist developers, research on identifying duplicate issue reports should put greater emphasis on assisting developers in identifying effort-consuming duplicate issues.
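For readers unfamiliar with the precision and recall figures quoted above, the following minimal sketch shows how the two measures are computed for a two-class effort classification. It is an illustration only and not the authors' model: the class names ("hard" and "easy"), the function name, and the label lists are all hypothetical and were invented for this example.

```python
def precision_recall(actual, predicted, positive="hard"):
    """Return (precision, recall) for the class named by `positive`."""
    pairs = list(zip(actual, predicted))
    # True positives: effort-consuming duplicates that were classified as such.
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    # False positives: other duplicates wrongly classified as effort-consuming.
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    # False negatives: effort-consuming duplicates that the classifier missed.
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels, invented for illustration (not data from the paper).
actual = ["hard", "hard", "easy", "easy", "hard", "easy"]
predicted = ["hard", "easy", "easy", "hard", "hard", "easy"]

precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision answers "of the duplicates flagged as effort-consuming, how many really were?", while recall answers "of the truly effort-consuming duplicates, how many were flagged?". The reported recall range (0.23 to 0.96) is much wider than the precision range, which is why the two are reported separately.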


  • Mining software repositories
  • Automated detection of duplicate issues
  • Software issue reports
  • Effort based analysis
  • Duplicate bug reports



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Mohamed Sami Rakha (1)
  • Weiyi Shang (1)
  • Ahmed E. Hassan (1)
  1. Software Analysis and Intelligence Lab (SAIL), School of Computing, Queen’s University, Kingston, Canada
