Identifying self-admitted technical debt in open source projects using text mining

Huang, Qiao; Shihab, Emad; Xia, Xin; Lo, David; Li, Shanping

doi:10.1007/s10664-017-9522-4

Identifying self-admitted technical debt in open source projects using text mining

Published: 09 May 2017

Volume 23, pages 418–451, (2018)
Cite this article

Empirical Software Engineering Aims and scope Submit manuscript

Qiao Huang¹,
Emad Shihab²,
Xin Xia^1,3,
David Lo⁴ &
…
Shanping Li¹

3184 Accesses
119 Citations
2 Altmetric
Explore all metrics

Abstract

Technical debt is a metaphor to describe the situation in which long-term code quality is traded for short-term goals in software projects. Recently, the concept of self-admitted technical debt (SATD) was proposed, which considers debt that is intentionally introduced, e.g., in the form of quick or temporary fixes. Prior work on SATD has shown that source code comments can be used to successfully detect SATD, however, most current state-of-the-art classification approaches of SATD rely on manual inspection of the source code comments. In this paper, we proposed an automated approach to detect SATD in source code comments using text mining. In our approach, we utilize feature selection to select useful features for classifier training, and we combine multiple classifiers from different source projects to build a composite classifier that identifies SATD comments in a target project. We investigate the performance of our approach on 8 open source projects that contain 212,413 comments. Our experimental results show that, on every target project, our approach outperforms the state-of-the-art and the baselines approaches in terms of F1-score. The F1-score achieved by our approach ranges between 0.518 - 0.841, with an average of 0.737, which improves over the state-of-the-art approach proposed by Potdar and Shihab by 499.19%. When compared with the text mining-based baseline approaches, our approach significantly improves the average F1-score by at least 58.49%. When compared with a natural language processing-based baseline, our approach also significantly improves its F1-score by 27.95%. Our proposed approach can be used by project personnel to effectively identify SATD with minimal manual effort.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Identifying self-admitted technical debt in issue tracking systems using machine learning

Article Open access 10 July 2022

Identifying Self-admitted Technical Debt with Context-Based Ladder Network

Wait for it: identifying “On-Hold” self-admitted technical debt

Article Open access 04 August 2020

Notes

derived after the manual examination of more than 100k comments
http://tartarus.org/martin/PorterStemmer/
The authors also classify the type of technical debt, but we do not leverage the type in our paper
https://www.openhub.net
https://github.com/tkdsheep/TechnicalDebt

References

Arisholm E, Briand LC, Fuglerud M (2007) Data mining techniques for building fault-proneness models in telecom java software. In: The 18th IEEE international symposium on software reliability (ISSRE’07), IEEE, pp 215–224
Chapter Google Scholar
Bavota G, Russo B (2016) A large-scale empirical study on self-admitted technical debt. In: Proceedings of the 13th international conference on mining software repositories, MSR ’16, pp 315–326
Google Scholar
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
MathSciNet MATH Google Scholar
Brown N, Cai Y, Guo Y, Kazman R, Kim M, Kruchten P, Lim E, MacCormack A, Nord R, Ozkaya I et al (2010) Managing technical debt in software-reliant systems. In: Proceedings of the FSE/SDP workshop on Future of software engineering research ACM, pp 47–52
Chapter Google Scholar
Cohen J (1968) Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychol Bull 70(4):213
Article Google Scholar
Cunningham W (1993) The wycash portfolio management system. ACM SIGPLAN OOPS Messenger 4(2):29–30
Article Google Scholar
Fluri B, Wursch M, Gall HC (2007) Do code and comments co-evolve? On the relation between source code and comment changes. In: 14th working conference on reverse engineering (WCRE 2007). IEEE, pp 70–79
Chapter Google Scholar
Guo Y, Seaman C, Gomes R, Cavalcanti A, Tonin G, Da Silva FQ, Santos AL, Siebra C (2011) Tracking technical debtan exploratory case study. In: 2011 27th IEEE international conference on software maintenance (ICSM). IEEE, pp 528–531
Chapter Google Scholar
Haiduc S, Aponte J, Marcus A (2010) Supporting program comprehension with source code summarization. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering-volume 2. ACM, pp 223–226
Google Scholar
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The weka data mining software: an update. ACM SIGKDD Explorations Newsletter 11(1):10–18
Article Google Scholar
Hall MA (1999) Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato
Han J, Kamber M, Pei J (2006) Data mining: concepts and techniques. Morgan Kaufmann
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
Article Google Scholar
Jiang T, Tan L, Kim S (2013) Personalized defect prediction. In: 2013 IEEE/ACM 28th international conference on automated software engineering (ASE). IEEE, pp 279–289
Chapter Google Scholar
Khamis N, Witte R, Rilling J (2010) Automatic quality assessment of source code comments: the javadocminer. In: International conference on application of natural language to information systems. Springer, pp 68–79
Google Scholar
Kruchten P, Nord RL, Ozkaya I, Falessi D (2013) Technical debt: towards a crisper definition report on the 4th international workshop on managing technical debt. ACM SIGSOFT Softw Eng Notes 38(5):51–54
Article Google Scholar
Lim E, Taksande N, Seaman C (2012) A balancing act: what software practitioners have to say about technical debt. Softw IEEE 29(6):22–27
Article Google Scholar
Maldonado E, Shihab E (2015) Detecting and quantifying different types of self-admitted technical debt. In: Proceedings of the 7th IEEE international workshop on managing technical debt (MTD’15), pp 9– 15
Google Scholar
Maldonado E, Shihab E, Tsantalis N (2017) Using natural language processing to automatically detect self-admitted technical debt. IEEE Transactions on Software Engineering
Malik H, Chowdhury I, Tsou HM, Jiang ZM, Hassan AE (2008) Understanding the rationale for updating a functions comment. In: IEEE international conference on software maintenance, 2008. ICSM 2008. IEEE, pp 167–176
Google Scholar
Marcus A, Maletic JI (2003) Recovering documentation-to-source-code traceability links using latent semantic indexing. In: Proceedings of the 25th international conference on software engineering, 2003. IEEE, pp 125–135
Chapter Google Scholar
Marinescu R (2004) Detection strategies: Metrics-based rules for detecting design flaws. In: Proceedings of the 20th IEEE international conference on software maintenance, 2004. IEEE, pp 350–359
Chapter Google Scholar
Marinescu R, Ganea G, Verebi I (2010) incode: Continuous quality assessment and improvement. In: 2010 14th European conference on software maintenance and reengineering (CSMR). IEEE, pp 274–275
Chapter Google Scholar
McCallum A, Nigam K et al (1998) A comparison of event models for naive bayes text classification. In: AAAI-98 Workshop on learning for text categorization. Citeseer, vol 752, pp 41–48
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
Google Scholar
Nguyen AT, Nguyen TT, Nguyen TN, Lo D, Sun C (2012) Duplicate bug report detection with a combination of information retrieval and topic modeling. In: Proceedings of the 27th IEEE/ACM international conference on automated software engineering (ASE), 2012. IEEE, pp 70–79
Google Scholar
Padioleau Y, Tan L, Zhou Y (2009) Listening to programmers taxonomies and characteristics of comments in operating system code. In: Proceedings of the 31st international conference on software engineering, IEEE computer society, pp 331–341
Google Scholar
Potdar A, Shihab E (2014) An exploratory study on self-admitted technical debt. In: 2014 IEEE international conference on software maintenance and evolution (ICSME). IEEE, pp 91–100
Chapter Google Scholar
Rahman F, Posnett D, Devanbu P (2012) Recalling the imprecision of cross-project defect prediction. In: Proceedings of the ACM SIGSOFT 20th international symposium on the foundations of software engineering. ACM, p 61
Google Scholar
Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620
Article MATH Google Scholar
Seaman C, Guo Y, Izurieta C, Cai Y, Zazworka N, Shull F, Vetrò A (2012) Using technical debt data in decision making: Potential decision approaches. In: Proceedings of the 3rd international workshop on managing technical debt. IEEE Press, pp 45–48
Google Scholar
Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv (CSUR) 34(1):1–47
Article Google Scholar
Shihab E, Ihara A, Kamei Y, Ibrahim WM, Ohira M, Adams B, Hassan AE, Ki M (2013) Studying re-opened bugs in open source software. Empir Softw Eng 18(5):1005–1042
Article Google Scholar
Storey MA, Ryall J, Bull RI, Myers D, Singer J (2008) Todo or to bug. In: 2008 ACM/IEEE 30th international conference on software engineering. ICSE08. IEEE, pp 251–260
Google Scholar
Sun C, Lo D, Wang X, Jiang J, Khoo SC (2010) A discriminative model approach for accurate duplicate bug report retrieval. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering. ACM, vol 1, pp 45-54
Sun C, Lo D, Khoo SC, Jiang J (2011) Towards more accurate retrieval of duplicate bug reports. In: Proceedings of the 2011 26th IEEE/ACM international conference on automated software engineering. IEEE Computer Society, pp 253–262
Google Scholar
Tan L, Yuan D, Krishna G, Zhou Y (2007) /* icomment: Bugs or bad comments?*. In: ACM SIGOPS operating systems review. ACM, vol 41, pp 145–158
Tan SH, Marinov D, Tan L, Leavens GT (2012) @ Tcomment: Testing javadoc comments to detect comment-code inconsistencies. In: 2012 IEEE Fifth international conference on software testing, Verification and Validation, IEEE, pp 260–269
Chapter Google Scholar
Tian Y, Lawall J, Lo D (2012) Identifying linux bug fixing patches. In: 2012 34th international conference on software engineering, (ICSE). IEEE, pp 386–396
Chapter Google Scholar
Valdivia Garcia H, Shihab E (2014) Characterizing and predicting blocking bugs in open source projects. In: Proceedings of the 11th working conference on mining software repositories. ACM, pp 72–81
Google Scholar
Vassallo C, Zampetti F, Romano D, Beller M, Panichella A, Penta MD, Zaidman A (2016) Continuous delivery practices in a large financial organization. In: Proceedings of the international conference on software maintenance and evolution (ICSME), ICSME ’16, p To Appear
Google Scholar
Wehaibi S, Shihab E, Guerrouj L (2016) Examining the impact of self-admitted technical debt on software quality. In: Proceedings of the 23rd IEEE international conference on software analysis, evolution, and reengineering (SANER’16)
Google Scholar
Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bull 1(6):80–83
Article Google Scholar
Xia X, Lo D, Wang X, Zhou B (2013) Tag recommendation in software information sites. In: Proceedings of the 10th working conference on mining software repositories. IEEE Press, pp 287–296
Google Scholar
Xia X, Lo D, Qiu W, Wang X, Zhou B (2014) Automated configuration bug report prediction using text mining. In: 2014 IEEE 38th annual computer software and applications conference (COMPSAC). IEEE, pp 107–116
Google Scholar
Xia X, Lo D, Shihab E, Wang X, Yang X (2015a) Elblocker: Predicting blocking bugs with ensemble imbalance learning. Inf Softw Technol 61:93–106
Article Google Scholar
Xia X, Lo D, Shihab E, Wang X, Zhou B (2015b) Automatic, high accuracy prediction of reopened bugs. Autom Softw Eng 22(1):75–109
Article Google Scholar
Xia X, Lo D, Wang X, Yang X (2015c) Who should review this change?: Putting text and file location analyses together for more accurate recommendations. In: 2015 IEEE international conference on software maintenance and evolution (ICSME). IEEE, pp 261–270
Chapter Google Scholar
Xia X, Lo D, Wang X, Zhou B (2015d) Dual analysis for recommending developers to resolve bugs. J Softw Evol Process 27(3):195–220
Article Google Scholar
Xia X, Lo D, Pan SJ, Nagappan N, Wang X (2016a) Hydra: Massively compositional model for cross-project defect prediction. IEEE Trans Softw Eng 42 (10):977–998
Article Google Scholar
Xia X, Lo D, Wang X, Yang X (2016b) Collective personalized change classification with multiobjective search. IEEE Trans Reliab 65(4):1810–1829
Article Google Scholar
Xia X, Shihab E, Kamei Y, Lo D, Wang X (2016c) Predicting crashing releases of mobile applications. In: Proceedings of the 10th ACM/IEEE international symposium on empirical software engineering and measurement. ACM, p 29
Google Scholar
Xia X, Lo D, Ding Y, Al-Kofahi JM, Nguyen TN, Wang X (2017) Improving automated bug triaging with specialized topic model. IEEE Trans Softw Eng 43(3):272–297
Article Google Scholar
Xu B, Ye D, Xing Z, Xia X, Chen G, Li S (2016) Predicting semantically linkable knowledge in developer online forums via convolutional neural network. In: Proceedings of the 31st IEEE/ACM international conference on automated software engineering. ACM, pp 51–62
Google Scholar
Yang X, Lo D, Xia X, Bao L, Sun J (2016) Combining word embedding with information retrieval to recommend similar bug reports. In: 2016 IEEE 27th international symposium on software reliability engineering (ISSRE). IEEE, pp 127–137
Chapter Google Scholar
Yang XL, Lo D, Xia X, Huang Q, Sun JL (2017) High-impact bug report identification with imbalanced learning strategies. J Comput Sci Technol 32:1
Article Google Scholar
Yang Y, Pedersen JO (1997) A comparative study on feature selection in text categorization. In: ICML, vol 97, pp 412–420
Zazworka N, Shaw MA, Shull F, Seaman C (2011) Investigating the impact of design debt on software quality. In: Proceedings of the 2nd workshop on managing technical debt. ACM, pp 17– 23
Google Scholar
Zazworka N, Spínola RO, Vetro A, Shull F, Seaman C (2013) A case study on effectively identifying technical debt. In: Proceedings of the 17th international conference on evaluation and assessment in software engineering. ACM, pp 42–47
Google Scholar
Zhang Y, Lo D, Xia X, Le TDB, Scanniello G, Sun J (2016) Inferring links between concerns and methods with multi-abstraction vector space model. In: 2016 IEEE international conference on software maintenance and evolution (ICSME). IEEE, pp 110–121
Chapter Google Scholar
Zhou J, Zhang H, Lo D (2012) Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports. In: 2012 34th international conference on software engineering (ICSE). IEEE, pp 14–24
Chapter Google Scholar

Download references

Acknowledgements

The authors thank to all the developers who participated in this study. This research was supported by NSFC Program (No. 61602403 and 61572426), and National Key Technology R&D Program of the Ministry of Science and Technology of China (No. 2015BAH17F01).

Author information

Authors and Affiliations

College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Qiao Huang, Xin Xia & Shanping Li
Data-driven Analysis of Software (DAS) Lab at the Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
Emad Shihab
Department of Computer Science, University of British Columbia, Vancouver, Canada
Xin Xia
School of Information Systems, Singapore Management University, Singapore, Singapore
David Lo

Authors

Qiao Huang
View author publications
You can also search for this author in PubMed Google Scholar
Emad Shihab
View author publications
You can also search for this author in PubMed Google Scholar
Xin Xia
View author publications
You can also search for this author in PubMed Google Scholar
David Lo
View author publications
You can also search for this author in PubMed Google Scholar
Shanping Li
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xin Xia.

Additional information

Communicated by: Andrian Marcus

Rights and permissions

Reprints and permissions

About this article

Cite this article

Huang, Q., Shihab, E., Xia, X. et al. Identifying self-admitted technical debt in open source projects using text mining. Empir Software Eng 23, 418–451 (2018). https://doi.org/10.1007/s10664-017-9522-4

Download citation

Published: 09 May 2017
Issue Date: February 2018
DOI: https://doi.org/10.1007/s10664-017-9522-4

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Identifying self-admitted technical debt in open source projects using text mining

Abstract

Access this article

Similar content being viewed by others

Identifying self-admitted technical debt in issue tracking systems using machine learning

Identifying Self-admitted Technical Debt with Context-Based Ladder Network

Wait for it: identifying “On-Hold” self-admitted technical debt

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Identifying self-admitted technical debt in open source projects using text mining

Abstract

Access this article

Similar content being viewed by others

Identifying self-admitted technical debt in issue tracking systems using machine learning

Identifying Self-admitted Technical Debt with Context-Based Ladder Network

Wait for it: identifying “On-Hold” self-admitted technical debt

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation