Quantifying and characterizing clones of self-admitted technical debt in build systems

Xiao, Tao; Zeng, Zhili; Wang, Dong; Hata, Hideaki; McIntosh, Shane; Matsumoto, Kenichi

doi:10.1007/s10664-024-10449-5

Published: 26 February 2024

Volume 29, article number 54, (2024)

Empirical Software Engineering

Tao Xiao ORCID: orcid.org/0000-0003-4070-585X¹,
Zhili Zeng²,
Dong Wang³,
Hideaki Hata⁴,
Shane McIntosh² &
Kenichi Matsumoto¹

Abstract

Self-Admitted Technical Debt (SATD) annotates development decisions that intentionally exchange long-term software artifact quality for short-term goals. Recent work explores the existence of SATD clones (duplicate or near duplicate SATD comments) in source code. Cloning of SATD in build systems (e.g., CMake and Maven) may propagate suboptimal design choices, threatening qualities of the build system that stakeholders rely upon (e.g., maintainability, reliability, repeatability). Hence, we conduct a large-scale study on 50,608 SATD comments extracted from Autotools, CMake, Maven, and Ant build systems to investigate the prevalence of SATD clones and to characterize their incidences. We observe that: (i) prior work suggests that 41–65% of SATD comments in source code are clones, but in our studied build system context, the rates range from 62% to 95%, suggesting that SATD clones are a more prevalent phenomenon in build systems than in source code; (ii) statements surrounding SATD clones are highly similar, with 76% of occurrences having similarity scores greater than 0.8; (iii) a quarter of SATD clones are introduced by the author of the original SATD statements; and (iv) among the most commonly cloned SATD comments, external factors (e.g., platform and tool configuration) are the most frequent locations, limitations in tools and libraries are the most frequent causes, and developers often copy SATD comments that describe issues to be fixed later. Our work presents the first step toward systematically understanding SATD clones in build systems and opens up avenues for future work, such as distinguishing different SATD clone behavior, as well as designing an automated recommendation system for repaying SATD effectively based on resolved clones.
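To make the notion of "duplicate or near duplicate SATD comments" and a clone rate concrete, the following is a minimal sketch, not the procedure used in the study. It assumes a bag-of-words representation, cosine similarity, and single-linkage grouping; the 0.8 threshold is borrowed from the similarity cutoff mentioned in finding (ii) purely for illustration, and the function names and sample comments are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(comment):
    """Lower-case a comment and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9_]+", comment.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def clone_groups(comments, threshold=0.8):
    """Group comments linked by pairwise similarity >= threshold.

    Assumption: single-linkage grouping via union-find; the threshold
    is illustrative, not a value prescribed by the study.
    """
    bags = [Counter(tokens(c)) for c in comments]
    parent = list(range(len(comments)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(bags)):
        for j in range(i + 1, len(bags)):
            if cosine(bags[i], bags[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(comments)):
        groups.setdefault(find(i), []).append(i)
    # Only groups with at least two members count as clones.
    return [g for g in groups.values() if len(g) > 1]


def clone_rate(comments, threshold=0.8):
    """Fraction of comments that belong to some clone group."""
    if not comments:
        return 0.0
    cloned = sum(len(g) for g in clone_groups(comments, threshold))
    return cloned / len(comments)


# Hypothetical SATD comments, invented for illustration only.
sample = [
    "TODO: remove this workaround once the plugin is fixed",
    "TODO: remove this workaround once the plugin is fixed",
    "TODO remove this workaround once the plugin gets fixed",
    "FIXME: hard-coded path breaks on Windows",
]
```

On this sample, `clone_groups(sample)` places the first three comments in one group (one exact duplicate and one near duplicate), so `clone_rate(sample)` is 0.75. The pairwise comparison is quadratic in the number of comments, so a corpus of tens of thousands of comments would likely need a more scalable approach.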

Data Availability

To support open science, we publish a full replication package (Xiao et al. 2023b) online, including all datasets, manually labeled data, supplementary materials (e.g., additional results), and scripts. This replication package is also available at https://github.com/NAIST-SE/SATDClonesInBuildSystem.

References

Ahasanuzzaman M, Asaduzzaman M, Roy CK, Schneider KA (2016) Mining duplicate questions of stack overflow. In: Proceedings of the 13th IEEE/ACM working conference on mining software repositories, IEEE, pp 402–412
Alves NS, Ribeiro LF, Caires V, Mendes TS, Spínola RO (2014) Towards an ontology of terms on technical debt. In: Proceedings of the sixth international workshop on managing technical debt, pp 1–7
Bavota G, Russo B (2016) A large-scale empirical study on self-admitted technical debt. In: Proceedings of the 13th international conference on mining software repositories, pp 315–326
Bettenburg N, Premraj R, Zimmermann T, Kim S (2008) Duplicate bug reports considered harmful... really? In: Proceedings of the 24th IEEE international conference on software maintenance, pp 337–345
Cliff N (1993) Dominance statistics: ordinal analyses to answer ordinal questions. Psychol Bull 114:494–509
Cunningham W (1992) The WyCASH portfolio management system. SIGPLAN OOPS Mess 4:29–30
Dabic O, Aghajani E, Bavota G (2021) Sampling projects in GitHub for MSR studies. In: 2021 IEEE/ACM 18th international conference on mining software repositories (MSR), pp 560–564
Eisenhardt KM (1989) Building theories from case study research. Acad Manage Rev 14:532–550
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining, pp 226–231
Fisher RA (1970) Statistical methods for research workers. In: Breakthroughs in statistics: methodology and distribution, Springer, pp 66–70
Gallaba K, McIntosh S (2018) Use and misuse of continuous integration features: an empirical study of projects that (mis)use Travis CI. IEEE Trans Softw Eng 46:33–50
Guo Z, Liu S, Liu J, Li Y, Chen L, Lu H, Zhou Y, Xu B (2019) MAT: a simple yet strong baseline for identifying self-admitted technical debt. arXiv:1910.13238
Hirao T, McIntosh S, Ihara A, Matsumoto K (2019) The review linkage graph for code review analytics: a recovery approach and empirical study. In: Proceedings of the 2019 27th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, pp 578–589
Hong Y, Tantithamthavorn C, Thongtanunam P, Aleti A (2022) CommentFinder: a simpler, faster, more accurate code review comments recommendation. In: Proceedings of the 30th ACM joint European software engineering conference and symposium on the foundations of software engineering, pp 507–519
Huang Q, Shihab E, Xia X, Lo D, Li S (2018) Identifying self-admitted technical debt in open source projects using text mining. Empir Softw Eng 23:418–451
Juergens E (2011) Research in cloning beyond code: a first roadmap. In: Proceedings of the 5th international workshop on software clones, pp 67–68
Kalliamvakou E, Gousios G, Blincoe K, Singer L, German DM, Damian D (2014) The promises and perils of mining GitHub. In: Proceedings of the 11th working conference on mining software repositories, pp 92–101
Kamienski A, Hindle A, Bezemer CP (2023) Analyzing techniques for duplicate question detection on Q&A websites for game developers. Empir Softw Eng 28:1–41
Kashiwa Y, Nishikawa R, Kamei Y, Kondo M, Shihab E, Sato R, Ubayashi N (2022) An empirical study on self-admitted technical debt in modern code review. Inf Softw Technol 146:106855
Koschke R (2007) Survey of research on software clones. In: Dagstuhl seminar proceedings
Kumfert G, Epperly T (2002) Software in the DOE: the hidden overhead of “the build”. Tech. rep., Lawrence Livermore National Lab., CA (US)
Li Z, Yu Y, Zhou M, Wang T, Yin G, Lan L, Wang H (2022) Redundancy, context, and preference: an empirical study of duplicate pull requests in OSS projects. IEEE Trans Softw Eng 48:1309–1335
Li J, Ernst MD (2012) CBCD: cloned buggy code detector. In: Proceedings of the 34th international conference on software engineering, pp 310–320
Liu Z, Huang Q, Xia X, Shihab E, Lo D, Li S (2018) SATD detector: a text-mining-based self-admitted technical debt detection tool. In: Proceedings of the 40th international conference on software engineering: companion proceedings, pp 9–12
Maipradit R, Treude C, Hata H, Matsumoto K (2020) Wait for it: identifying “on hold” self-admitted technical debt. Empir Softw Eng 25:3770–3798
Maipradit R, Lin B, Nagy C, Bavota G, Lanza M, Hata H, Matsumoto K (2020a) Automated identification of on-hold self-admitted technical debt. In: Proceedings of the 20th IEEE international working conference on source code analysis and manipulation, pp 54–64
Maldonado EdS, Shihab E (2015) Detecting and quantifying different types of self-admitted technical debt. In: 2015 IEEE 7th international workshop on managing technical debt (MTD), IEEE, pp 9–15
Maldonado EdS, Shihab E, Tsantalis N (2017) Using natural language processing to automatically detect self-admitted technical debt. IEEE Trans Softw Eng 43:1044–1062
Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 18:50–60
Manning C, Klein D (2003) Optimization, maxent models, and conditional estimation without magic. In: Proceedings of the 2003 conference of the North American chapter of the association for computational linguistics on human language technology: tutorials - vol 5, pp 8
McIntosh S, Adams B, Nguyen TH, Kamei Y, Hassan AE (2011) An empirical study of build maintenance effort. In: Proceedings of the 33rd international conference on software engineering, pp 141–150
McIntosh S, Poehlmann M, Juergens E, Mockus A, Adams B, Hassan AE, Haupt B, Wagner C (2014) Collecting and leveraging a benchmark of build system clones to aid in quality assessments. In: Companion proceedings of the 36th international conference on software engineering, pp 145–154
Miyake Y, Amasaki S, Aman H, Yokogawa T (2017) A replicated study on relationship between code quality and method comments, pp 17–30
Mondal M, Roy B, Roy CK, Schneider KA (2019) An empirical study on bug propagation through code cloning. J Syst Softw 158:110407
Muse BA, Nagy C, Cleve A, Khomh F, Antoniol G (2022) FIXME: synchronize with database! an empirical study of data access self-admitted technical debt. Empir Softw Eng 27:130
Nejati M, Alfadel M, McIntosh S (2023) Code review of build system specifications: prevalence, purposes, patterns, and perceptions. In: Proceedings of the 44th international conference on software engineering, to appear
Potdar A, Shihab E (2014) An exploratory study on self-admitted technical debt. In: Proceedings of the 30th IEEE international conference on software maintenance and evolution, pp 91–100
Reimers N, Gurevych I (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp 3982–3992
Ren X, Xing Z, Xia X, Lo D, Wang X, Grundy J (2019) Neural network-based detection of self-admitted technical debt: From performance to explainability. ACM Trans Softw Eng Methodol 28
Rigby PC, Storey MA (2011) Understanding broadcast based peer review on open source software projects. In: Proceedings of the 33rd international conference on software engineering, pp 541–550
Romano J, Kromrey JD, Coraggio J, Skowronek J, Devine L (2006) Exploring methods for evaluating group differences on the NSSE and other surveys: are the t-test and Cohen’s d indices the most appropriate choices? In: Annual meeting of the southern association for institutional research, pp 1–51
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Roy CK, Cordy JR (2007) A survey on software clone detection research. Queen’s Sch Comput Tech Rep 541:64–68
Scikit-Learn library (2023a) CountVectorizer. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
Scikit-Learn library (2023b) DBSCAN. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
Sierra G, Shihab E, Kamei Y (2019) A survey of self-admitted technical debt. J Syst Softw 152:70–82
Smith P (2011) Software build systems: principles and experience. Addison-Wesley Professional
Tsuru T, Nakagawa T, Matsumoto S, Higo Y, Kusumoto S (2021) Type-2 code clone detection for Dockerfiles. In: Proceedings of the 15th IEEE international workshop on software clones, pp 1–7
van Bladel B, Demeyer S (2020) Clone detection in test code: an empirical evaluation. In: Proceedings of the 27th IEEE international conference on software analysis, evolution and reengineering, pp 492–500
Vidoni M (2021) Self-admitted technical debt in R packages: an exploratory study. In: Proceedings of the 18th IEEE/ACM international conference on mining software repositories, pp 179–189
Viera AJ, Garrett JM et al (2005) Understanding interobserver agreement: the kappa statistic. Fam Med 37:360–363
Wang D, Kula RG, Ishio T, Matsumoto K (2021) Automatic patch linkage detection in code review using textual content and file location features. Inf Softw Technol 139:106637
Wehaibi S, Shihab E, Guerrouj L (2016) Examining the impact of self-admitted technical debt on software quality. In: Proceedings of the 23rd IEEE international conference on software analysis, evolution, and reengineering, pp 179–188
Xavier L, Montandon JE, Ferreira F, Brito R, Valente MT (2022) On the documentation of self-admitted technical debt in issues. Empir Softw Eng 27:1–34
Xavier L, Ferreira F, Brito R, Valente MT (2020) Beyond the code: mining self-admitted technical debt in issue tracker systems. In: Proceedings of the 17th IEEE/ACM international conference on mining software repositories, pp 137–146
Xiao T, Wang D, McIntosh S, Hata H, Kula RG, Ishio T, Matsumoto K (2022) Characterizing and mitigating self-admitted technical debt in build systems. IEEE Trans Softw Eng 48:4214–4228
Xiao T, Baltes S, Hata H, Treude C, Kula RG, Ishio T, Matsumoto K (2023) 18 million links in commit messages: purpose, evolution, and decay. Empir Softw Eng 28:91
Xiao T, Zeng Z, Wang D, Hata H, McIntosh S, Matsumoto K (2023b) Replication package. https://doi.org/10.5281/zenodo.10055463
Yasmin J, Sheikhaei MS, Tian Y (2022) A first look at duplicate and near-duplicate self-admitted technical debt comments. In: 2022 IEEE/ACM 30th international conference on program comprehension (ICPC), pp 614–618
Zampetti F, Fucci G, Serebrenik A, Di Penta M (2021) Self-admitted technical debt practices: a comparison between industry and open-source. Empir Softw Eng 26:1–32
Zanaty FE, Hirao T, McIntosh S, Ihara A, Matsumoto K (2018) An empirical study of design discussions in code review. In: Proceedings of the 12th ACM/IEEE international symposium on empirical software engineering and measurement

Acknowledgements

This work was supported by JSPS Grant-in-Aid for JSPS Fellows JP23KJ1589, JSPS KAKENHI Grant Numbers JP20H05706, JP23K16864, and JST PRESTO Grant Number JPMJPR22P6.

Author information

Authors and Affiliations

Nara Institute of Science and Technology, Ikoma, Japan
Tao Xiao & Kenichi Matsumoto
University of Waterloo, Waterloo, Canada
Zhili Zeng & Shane McIntosh
College of Intelligence and Computing, Tianjin University, Tianjin, China
Dong Wang
Shinshu University, Matsumoto, Japan
Hideaki Hata

Corresponding authors

Correspondence to Tao Xiao or Dong Wang.

Ethics declarations

Conflicts of interest

The authors declare that Hideaki Hata is a member of the EMSE Editorial Board. All co-authors have seen and agree with the contents of the manuscript, and there is no financial interest to report.

Additional information

Communicated by: Davide Falessi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xiao, T., Zeng, Z., Wang, D. et al. Quantifying and characterizing clones of self-admitted technical debt in build systems. Empir Software Eng 29, 54 (2024). https://doi.org/10.1007/s10664-024-10449-5

Accepted: 23 January 2024
Published: 26 February 2024
DOI: https://doi.org/10.1007/s10664-024-10449-5
