EnTagRec⁺⁺: An enhanced tag recommendation system for software information sites

Wang, Shaowei; Lo, David; Vasilescu, Bogdan; Serebrenik, Alexander

doi:10.1007/s10664-017-9533-1

Published: 21 July 2017

Volume 23, pages 800–832 (2018)
Empirical Software Engineering

Shaowei Wang (ORCID: orcid.org/0000-0003-3823-1771)¹, David Lo², Bogdan Vasilescu³ & Alexander Serebrenik⁴

Abstract

Software engineers share experiences with modern technologies on software information sites such as Stack Overflow. These sites allow developers to label posted content, referred to as software objects, with short descriptions known as tags. Tags improve the organization of questions and make them easier for users to browse. However, tags assigned to objects tend to be noisy, and some objects are not well tagged. For instance, 14.7% of the questions posted on Stack Overflow in 2015 needed tag re-editing after the initial assignment. To improve the quality of tags on software information sites, we propose EnTagRec⁺⁺, an advanced version of our prior work EnTagRec. Unlike EnTagRec, EnTagRec⁺⁺ not only integrates the historical tag assignments to software objects, but also leverages information about users and an initial set of tags that a user may provide for tag recommendation. We evaluate its performance on five software information sites: Stack Overflow, Ask Ubuntu, Ask Different, Super User, and Freecode. Even without considering an initial set of tags provided by the user, it achieves Recall@5 scores of 0.821, 0.822, 0.891, 0.818 and 0.651, and Recall@10 scores of 0.873, 0.886, 0.956, 0.887 and 0.761, on Stack Overflow, Ask Ubuntu, Ask Different, Super User, and Freecode, respectively. Averaged across the five datasets, it improves upon TagCombine, the prior state-of-the-art approach, by 29.3% in Recall@5 and 14.5% in Recall@10. Moreover, the performance of our approach is further boosted if users provide some initial tags that it can leverage to infer additional tags: when an initial set of tags is given, Recall@5 improves by 10%.
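
The Recall@k figures above can be made concrete with a small sketch. It assumes the common definition used in tag recommendation studies: for each software object, the fraction of its true tags that appear among the top-k recommended tags, averaged over all objects. The function names and the example tags are hypothetical and are not taken from the article.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], truth: Set[str], k: int) -> float:
    """Fraction of the true tags that appear among the top-k recommended tags.

    Assumed definition: |top-k ∩ truth| / |truth| for a single object.
    """
    if not truth:
        raise ValueError("truth must contain at least one tag")
    top_k = set(ranked[:k])
    return len(top_k & truth) / len(truth)


def mean_recall_at_k(
    recommendations: Sequence[Sequence[str]],
    truths: Sequence[Set[str]],
    k: int,
) -> float:
    """Average the per-object Recall@k over a collection of objects."""
    if len(recommendations) != len(truths) or not truths:
        raise ValueError("need one non-empty truth set per recommendation list")
    scores = [recall_at_k(r, t, k) for r, t in zip(recommendations, truths)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical question with three true tags, two of which are recommended.
    ranked = ["java", "android", "xml", "sqlite", "eclipse"]
    truth = {"java", "android", "json"}
    print(recall_at_k(ranked, truth, k=5))
```

Under this definition, a Recall@5 of 0.821 would mean that, on average, about 82% of an object's true tags appear among the five recommended tags.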

Notes

http://meta.stackexchange.com/questions/tagged/tags
http://meta.stackexchange.com/questions/206907/how-are-suggested-tags-chosen
Since the implementation of Stack Overflow’s proprietary system is, to the best of our knowledge, not documented publicly, a meaningful comparison was not possible.
http://sourceforge.net/
http://marketplace.eclipse.org/
http://snipplr.com/
http://stackoverflow.com/q/5550896
http://stackoverflow.com/q/2058138
https://data.stackexchange.com/stackoverflow/queries
http://stackoverflow.com/users/137369/thirler?tab=tags
Based on http://www.textfixer.com/resources/common-english-words.txt
http://nlp.stanford.edu/software/tagger.shtml
Our experiments show that the effectiveness of UIC substantially degrades if it takes into consideration all tags.
By construction, γ is an extra weight given to some of the tags in \(T_{\mathit{BIC} \cup \mathit{FIC}}\).
Since \(\mathit{EnTagRec}^{++}_{o}(t)\) is itself a probability score, it could also be expressed as a function of only three coefficients α′, β′, and γ′, with the fourth being automatically 1 − α′ − β′ − γ′. We chose the four-coefficient expression to better reflect the four components of EnTagRec⁺⁺.
https://sites.google.com/site/wswshaoweiwang/projects/entagrec
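
The note on coefficients above can be illustrated with a minimal sketch. It assumes only that the final score is a weighted linear combination of four per-tag component scores; the component dictionaries, weight values, and function names below are hypothetical and do not reproduce the article's actual components or tuned weights. Rescaling four non-negative weights so that they sum to 1 leaves the tag ranking unchanged, and the fourth weight is then determined by the other three.

```python
from typing import Dict, List, Sequence


def combine_scores(
    components: Sequence[Dict[str, float]], weights: Sequence[float]
) -> Dict[str, float]:
    """Weighted linear combination of per-tag scores from several components."""
    if len(components) != len(weights):
        raise ValueError("need exactly one weight per component")
    tags = set().union(*components)
    return {
        tag: sum(w * comp.get(tag, 0.0) for w, comp in zip(weights, components))
        for tag in tags
    }


def normalise_weights(weights: Sequence[float]) -> List[float]:
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def rank_tags(scores: Dict[str, float]) -> List[str]:
    """Order tags by descending score, breaking ties alphabetically."""
    return sorted(scores, key=lambda tag: (-scores[tag], tag))


if __name__ == "__main__":
    # Four hypothetical component outputs for one software object.
    components = [
        {"java": 0.9, "android": 0.2},
        {"java": 0.1, "android": 0.7, "xml": 0.3},
        {"android": 0.5},
        {"xml": 0.4, "java": 0.2},
    ]
    raw = [0.4, 0.8, 0.2, 0.6]      # four free coefficients (illustrative values)
    norm = normalise_weights(raw)   # three free coefficients; the fourth is 1 minus the rest
    print(rank_tags(combine_scores(components, raw)))
    print(rank_tags(combine_scores(components, norm)))
```

Both calls print the same ordering, which is why the choice between the three-coefficient and four-coefficient forms is a matter of presentation rather than of ranking behaviour.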

References

Al-Kofahi JM, Tamrawi A, Nguyen TT, Nguyen HA, Nguyen TN (2010) Fuzzy set approach for automatic tagging in evolving software. In: ICSM, pp 1–10
Antoniol G, Canfora G, Casazza G, De Lucia A, Merlo E (2002) Recovering traceability links between code and documentation. IEEE Trans Softw Eng 28(10):970–983
Asuncion HU, Asuncion AU, Taylor RN (2010) Software traceability with topic modeling. In: ICSE, pp 95–104
Baldi P, Lopes CV, Linstead E, Bajracharya SK (2008) A theory of aspects as latent topics. In: OOPSLA, pp 543–562
Bazelli B, Hindle A, Stroulia E (2013) On the personality traits of stackoverflow users. In: 2013 IEEE international conference on software maintenance, pp 460–463
Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Ann Stat 29:1165–1188
Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. JMLR 13:281–305
Bindelli S, Criscione C, Curino C, Drago ML, Eynard D, Orsi G (2008) Improving search and navigation by combining ontologies and social tags. In: On the move to meaningful internet systems, OTM 2008 Workshops, OTM confederated international workshops and posters, ADI, AWeSoMe, COMBEK, EI2N, IWSSA, MONET, OnToContent + QSI, ORM, PerSys, RDDS, SEMELS, and SWWS 2008, Monterrey, Mexico, November 9-14, 2008. Proceedings, pp 76–85
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. JMLR 3:993–1022
Brandt J, Guo PJ, Lewenstein J, Dontcheva M, Klemmer SR (2009) Two studies of opportunistic programming: interleaving web foraging, learning, and writing code. In: CHI. ACM, pp 1589–1598
Cabot J, Izquierdo JLC, Cosentino V, Rolandi B (2015) Exploring the use of labels to categorize issues in open-source software projects. In: 22nd IEEE international conference on software analysis, evolution, and reengineering, SANER 2015. Montreal, QC, Canada, March 2-6, 2015, pp 550–554
Capobianco G, Lucia AD, Oliveto R, Panichella A, Panichella S (2013) Improving IR-based traceability recovery via noun-based indexing of software artifacts. J Softw Evol Process 25(7):743–762
Cress U, Held C, Kimmerle J (2013) The collective knowledge of social tags: direct and indirect influences on navigation, learning, and information processing. Comput Educ 60(1):59–73
Crestani F (1997) Application of spreading activation techniques in information retrieval. Artif Intell Rev 11(6):453–482
Gelman A, Carlin J, Stern H, Rubin D (2003) Bayesian data analysis. CRC Press
Ghamrawi N, McCallum A (2005) Collective multi-label classification. In: CIKM, pp 195–200
Golder SA, Huberman BA (2006) Usage patterns of collaborative tagging systems. J Inf Sci 32(2):198–206
Grissom RJ, Kim JJ (2005) Effect sizes for research. A broad practical approach
Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques. Morgan Kaufmann Publishers Inc
Held C, Kimmerle J, Cress U (2012) Learning by foraging: the impact of individual knowledge and social tags on web navigation processes. Comput Hum Behav 28(1):34–40
Hong L, Davison BD (2010) Empirical study of topic modeling in twitter. In: Proceedings of the first workshop on social media analytics, SOMA ’10, pp 80–88
Jäschke R, Marinho LB, Hotho A, Schmidt-Thieme L, Stumme G (2007) Tag recommendations in folksonomies. In: PKDD
Jmac (2013) Select and display ‘suggested tags’ for all posts based on related questions (or other logic). http://meta.stackexchange.com/q/196702/182512
Joorabchi A, English M, Mahdi AE (2015) Automatic mapping of user tags to wikipedia concepts: the case of a q&a website – stackoverflow. J Inf Sci 41(5):570–583
Her J (2011) Tag recommendations for stack overflow. http://meta.stackexchange.com/q/88611/182512
Lukins SK, Kraft NA, Etzkorn LH (2010) Bug localization using latent dirichlet allocation. Inf Softw Technol 52(9):972–990
Panichella A, Dit B, Oliveto R, Di Penta M, Poshyvanyk D, Lucia AD (2013) How to effectively use topic models for software engineering tasks? An approach based on genetic algorithms. In: ICSE, pp 522–531
Pletea D, Vasilescu B, Serebrenik A (2014) Security and emotion: Sentiment analysis of security discussions on github. In: Proceedings of the 11th working conference on mining software repositories, MSR 2014. ACM, New York, pp 348–351
Porter MF (1997) An algorithm for suffix stripping. In: Readings in information retrieval. Morgan Kaufmann, pp 313–316
Puurula A (2011) Mixture models for multi-label text classification. In: 10th New Zealand computer science research student conference
Ramage D, Hall D, Nallapati R, Manning CD (2009) Labeled lda: a supervised topic model for credit attribution in multi-labeled corpora. In: EMNLP ’09, pp 248–256
Rebouças M, Pinto G, Ebert F, Torres W, Serebrenik A, Castor F (2016) An empirical study on the usage of the swift programming language. In: 2016 IEEE 23rd international conference on software analysis, evolution, and reengineering (SANER), pp 634–638
Samaniego FI (2010) A comparison of the bayesian and frequentist approaches to estimation. Series in Statistics, Springer
Shokripour R, Anvik J, Kasirun ZM, Zamani S (2013) Why so complicated? Simple term filtering and weighting for location-based bug report assignment recommendation. In: MSR
Sigurbjörnsson B, van Zwol R (2008) Flickr tag recommendation based on collective knowledge. In: WWW ’08, pp 327–336
Storey M-A, Ryall J, Singer J, Myers D, Cheng L-T, Muller M (2009) How software developers use tagging to support reminding and refinding. IEEE Trans Softw Eng 35:470–483
Storey M-A, Treude C, van Deursen A, Cheng L-T (2010) The impact of social media on software engineering practices and tools. In: FoSER ’10, pp 359–364
Thung F, Lo D, Jiang L (2012) Detecting similar applications with collaborative tagging. In: ICSM, pp 600–603
Toutanova K, Klein D, Manning CD, Singer Y (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In: HLT-NAACL
Treude C, Storey M-A (2009) How tagging helps bridge the gap between social and technical aspects in software development. In: ICSE ’09, pp 12–22
Treude C, Storey M-A (2012) Work item tagging: communicating concerns in collaborative software development. IEEE Trans Softw Eng 38(1):19–34
Vasilescu B, Serebrenik A, Devanbu PT, Filkov V (2014) How social Q&A sites are changing knowledge sharing in open source software communities. In: CSCW, pp 342–354
Vasilescu B, Serebrenik A, van den Brand MGJ (2013) The babel of software development: linguistic diversity in open source. In: Jatowt A, Lim E-P, Ding Y, Miura A, Tezuka T, Dias G, Tanaka K, Flanagin A, Dai BT (eds) Proceedings of the social informatics: 5th international conference, SocInfo 2013, Kyoto, Japan, November 25-27, 2013. Springer International Publishing, pp 391–404
Vogt CC, Cottrell GW (1999) Fusion via a linear combination of scores. Inf Retr 1(3):151–173
Wang S, Lo D, Jiang L (2012) Inferring semantically related software terms and their taxonomy by leveraging collaborative tagging. In: ICSM, pp 604–607
Wang S, Lo D, Vasilescu B, Serebrenik A (2014) EnTagRec: an enhanced tag recommendation system for software information sites. In: 30th IEEE international conference on software maintenance and evolution, Victoria, BC, Canada, September 29 - October 3, 2014. IEEE Computer Society, pp 291–300
Wang W, Niu N, Liu H, Wu Y (2015) Tagging in assisted tracing. In: 2015 IEEE/ACM 8th international symposium on software and systems traceability, pp 8–14
Wang X-Y, Xia X, Lo D (2015) Tagcombine: recommending tags to contents in software information sites. J Comput Sci Technol 30(5):1017–1035
Warbox D (2009) Auto-tagging. http://meta.stackoverflow.com/questions/1377/auto-tagging
Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bull 1(4):80–83
Xia X, Lo D, Wang X, Zhou B (2013) Tag recommendation in software information sites. In: MSR ’13, pp 287–296
Zangerle E, Gassler W, Specht G (2011) Using tag recommendations to homogenize folksonomies in microblogging environments. In: SocInfo’11, pp 113–126
Zubiaga A (2012) Enhancing navigation on wikipedia with social tags. CoRR, arXiv:1202.5469

Author information

Authors and Affiliations

SAIL, Queen’s University, Kingston, Canada
Shaowei Wang
School of Information Systems, Singapore Management University, Singapore, Singapore
David Lo
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Bogdan Vasilescu
Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands
Alexander Serebrenik

Corresponding author

Correspondence to Shaowei Wang.

Additional information

Communicated by: Romain Robbes

About this article

Cite this article

Wang, S., Lo, D., Vasilescu, B. et al. EnTagRec⁺⁺: An enhanced tag recommendation system for software information sites. Empir Software Eng 23, 800–832 (2018). https://doi.org/10.1007/s10664-017-9533-1

Issue Date: April 2018
