Abstract
Stack Overflow (SO) is one of the most popular online sites for asking and answering developers’ questions. New posts that cover exactly the same knowledge as previously posted questions are closed and deleted by the community. However, new posts that are very similar to previous questions but phrased slightly differently are kept and tagged as duplicates, since they might include additional information, hints, or keywords. In this paper, we study exact duplicates and similar duplicates in SO to gain insight into their properties and content and to understand how the community distinguishes useful redundant knowledge from useless redundant knowledge (i.e., knowledge to be deleted). We identified several interesting trends. Unique questions are significantly longer than others. Original questions are answered faster, receive more answers, and are viewed more frequently than exact and similar duplicates. When comparing the overlapping text in duplicate pairs, we found almost no difference between exact and similar duplicates. In both cases, about 20–25 % of the question text and 40 % of the tags are identical in an original and its duplicate. However, the answers to the duplicates seem much more diverse, with only 5–6 % repeated text.
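To make the overlap figures above more concrete, the following is a minimal sketch of how the share of identical tags or words in an original/duplicate pair could be computed. The abstract does not state which overlap measure the study uses, so Jaccard similarity over sets is an assumption here, as are the simple regex tokenizer, the helper names `jaccard` and `tokens`, and the two example posts, which are invented for illustration and not taken from the study's data.

```python
import re


def jaccard(a, b):
    """Jaccard similarity of two collections: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # convention chosen here: two empty sets share nothing
    return len(set_a & set_b) / len(set_a | set_b)


def tokens(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Hypothetical original/duplicate pair (illustrative only).
original = {
    "title": "How do I sort a list in Python?",
    "tags": ["python", "list", "sorting"],
}
duplicate = {
    "title": "Sorting a Python list in place",
    "tags": ["python", "list", "sort", "arrays"],
}

tag_overlap = jaccard(original["tags"], duplicate["tags"])
text_overlap = jaccard(tokens(original["title"]), tokens(duplicate["title"]))

print(f"tag overlap:  {tag_overlap:.0%}")
print(f"text overlap: {text_overlap:.0%}")
```

A set-based measure like this ignores word order and repetition. Measures based on shared word sequences or on normalized text (stemming, stop-word removal) would give different numbers, so the percentages printed by this sketch are not comparable to those reported in the abstract.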
Cite this article
Ellmann, M. Same-Same But Different: On Understanding Duplicates in Stack Overflow. Informatik Spektrum 42, 266–286 (2019). https://doi.org/10.1007/s00287-019-01185-y