Same-Same But Different: On Understanding Duplicates in Stack Overflow

Abstract

Stack Overflow (SO) is one of the most popular online sites for asking and answering developers’ questions. New posts that cover exactly the same knowledge as previously posted questions are closed and deleted by the community. However, new posts that are very similar to previous questions but phrased slightly differently are kept and tagged as duplicates, since they might include additional information, hints, or keywords. In this paper, we study exact duplicates and similar duplicates in SO to gain insights into their properties and content and to understand how the community distinguishes useful from useless (i.e., to be deleted) redundant knowledge. We identified several interesting trends. Unique questions are significantly longer than others. Original questions get answered faster, receive more answers, and are viewed more frequently than exact and similar duplicates. When comparing the overlapping text in duplicate pairs, we found almost no difference between exact and similar duplicates. In both cases, about 20–25 % of the question text and 40 % of the tags are identical in an original and its duplicate. However, the answers to the duplicates seem much more diverse, with only 5–6 % repeated text.
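
The overlap figures above compare the text and tags of an original question with those of its duplicate. The abstract does not say which measure the study uses, so the following is only a minimal sketch that assumes a Jaccard-style set overlap on word tokens and tags. The helper names `tokens` and `jaccard` and the example question pair are hypothetical and are not taken from the paper or its data.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical original/duplicate pair, invented for illustration only.
original_text = "How do I reverse a list in Python?"
duplicate_text = "Reverse order of a Python list"
original_tags = {"python", "list", "reverse"}
duplicate_tags = {"python", "list"}

# Assumed overlap measure: share of distinct tokens (or tags) common to both posts.
text_overlap = jaccard(tokens(original_text), tokens(duplicate_text))
tag_overlap = jaccard(original_tags, duplicate_tags)

print(f"text overlap: {text_overlap:.0%}, tag overlap: {tag_overlap:.0%}")
```

A set-based measure ignores word order and repetition. If the study counted repeated passages or shared word sequences instead, the resulting percentages would differ from what this sketch produces.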

Author information

Correspondence to Mathias Ellmann.

Cite this article

Ellmann, M. Same-Same But Different: On Understanding Duplicates in Stack Overflow. Informatik Spektrum 42, 266–286 (2019). https://doi.org/10.1007/s00287-019-01185-y
