Abstract
Stack Overflow (SO) is one of the most popular online sites for asking and answering developers’ questions. New posts that cover exactly the same knowledge as previously posted questions get closed and deleted by the community. However, new posts that are very similar to previous questions but phrased slightly differently are kept and tagged as duplicates, since they might include additional information, hints, or keywords. In this paper, we study exact duplicates and similar duplicates in SO to gain insights into their properties and content and to understand how the community distinguishes useful from useless (i.e., to be deleted) redundant knowledge. We identified several interesting trends. Unique questions are significantly longer than others. Original questions get answered faster, receive more answers, and are viewed more frequently than exact and similar duplicates. When comparing the overlapping text in duplicate pairs, we found almost no difference between exact and similar duplicates. In both cases, about 20–25 % of the question text and 40 % of the tags are identical in an original and its duplicate. However, the answers to the duplicates seem much more diverse, with only 5–6 % repeated text.
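To make the notion of overlap between an original question and its duplicate concrete, the following is a minimal sketch in Python. The abstract does not state how overlap was measured, so the choice of Jaccard similarity over word-token sets is an assumption made here for illustration, as are the helper names `tokens` and `jaccard` and the example question pair.

```python
import re


def tokens(text):
    """Lowercase a text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical question pair and tags, invented for illustration only;
# they are not taken from the study's data set.
original = "How do I reverse a list in Python?"
duplicate = "What is the best way to reverse a Python list?"

text_overlap = jaccard(tokens(original), tokens(duplicate))
tag_overlap = jaccard({"python", "list"}, {"python", "list", "reverse"})

print(f"text overlap: {text_overlap:.2f}")
print(f"tag overlap: {tag_overlap:.2f}")
```

A set-based measure ignores word order and repetition, so it captures shared vocabulary rather than copied passages; a study of repeated text might equally use n-grams or a sequence-based measure.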
Cite this article
Ellmann, M. Same-Same But Different: On Understanding Duplicates in Stack Overflow. Informatik Spektrum 42, 266–286 (2019). https://doi.org/10.1007/s00287-019-01185-y
DOI: https://doi.org/10.1007/s00287-019-01185-y