Skip to main content
Log in

Quantifying and characterizing clones of self-admitted technical debt in build systems

  • Published:
Empirical Software Engineering Aims and scope Submit manuscript

Abstract

Self-Admitted Technical Debt (SATD) annotates development decisions that intentionally exchange long-term software artifact quality for short-term goals. Recent work explores the existence of SATD clones (duplicate or near duplicate SATD comments) in source code. Cloning of SATD in build systems (e.g., CMake and Maven) may propagate suboptimal design choices, threatening qualities of the build system that stakeholders rely upon (e.g., maintainability, reliability, repeatability). Hence, we conduct a large-scale study on 50,608 SATD comments extracted from Autotools, CMake, Maven, and Ant build systems to investigate the prevalence of SATD clones and to characterize their incidences. We observe that: (i) prior work suggests that 41–65% of SATD comments in source code are clones, but in our studied build system context, the rates range from 62% to 95%, suggesting that SATD clones are a more prevalent phenomenon in build systems than in source code; (ii) statements surrounding SATD clones are highly similar, with 76% of occurrences having similarity scores greater than 0.8; (iii) a quarter of SATD clones are introduced by the author of the original SATD statements; and (iv) among the most commonly cloned SATD comments, external factors (e.g., platform and tool configuration) are the most frequent locations, limitations in tools and libraries are the most frequent causes, and developers often copy SATD comments that describe issues to be fixed later. Our work presents the first step toward systematically understanding SATD clones in build systems and opens up avenues for future work, such as distinguishing different SATD clone behavior, as well as designing an automated recommendation system for repaying SATD effectively based on resolved clones.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6

Similar content being viewed by others

Data Availability

To support the open science, we publish a full replication package (Xiao et al. 2023b) online, including all the datasets, manually labeled data, supplementary materials (e.g., additional results), and scripts. This replication package is also available at https://github.com/NAIST-SE/SATDClonesInBuildSystem.

Notes

  1. https://www.gnu.org/software/m4/manual/html_node/Dnl.html

  2. https://docs.python.org/3/library/xml.etree.elementtree.html

  3. https://github.com/v6d-io/v6d/blob/9c6bb0c2cb31c54d6e798555d5a92cb067490083/CMakeLists.txt#L263

  4. https://github.com/turican0/remc2/blob/92e8b255b696fdfb4942324bab5e5f092c4ed69f/sdl/configure.in#L8

References

  • Ahasanuzzaman M, Asaduzzaman M, Roy CK, Schneider KA (2016) Mining duplicate questions of stack overflow. In: Proceedings of the 13th IEEE/ACM working conference on mining software repositories, IEEE, pp 402–412

  • Alves NS, Ribeiro LF, Caires V, Mendes TS, Spínola RO (2014) Towards an ontology of terms on technical debt. In: Proceedings of the sixth international workshop on managing technical debt, pp 1–7

  • Bavota G, Russo B (2016) A large-scale empirical study on self-admitted technical debt. In: Proceedings of the 13th international conference on mining software repositories, pp 315–326

  • Bettenburg N, Premraj R, Zimmermann T, Kim S (2008) Duplicate bug reports considered harmful... really? In: Proceedings of the 24th IEEE international conference on software maintenance, pp 337–345

  • Cliff N (1993) Dominance statistics: ordinal analyses to answer ordinal questions. Psychol Bull 114:494–509

    Article  Google Scholar 

  • Cunningham W (1992) The WyCASH portfolio management system. SIGPLAN OOPS Mess 4:29–30

    Article  Google Scholar 

  • Dabic O, Aghajani E, Bavota G (2021) Sampling projects in GitHub for MSR studies. In: 2021 IEEE/ACM 18th international conference on mining software repositories (MSR), pp 560–564

  • Eisenhardt KM (1989) Building theories from case study research. Acad Manage Rev 14:532–550

    Article  Google Scholar 

  • Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining, p 226-231

  • Fisher RA (1970) Statistical methods for research workers. In: Breakthroughs in statistics: methodology and distribution, Springer, pp 66–70

  • Gallaba K, McIntosh S (2018) Use and misuse of continuous integration features: an empirical study of projects that (MIS) use TRAVIS CI. IEEE Trans Softw Eng 46:33–50

    Article  Google Scholar 

  • Guo Z, Liu S, Liu J, Li Y, Chen L, Lu H, Zhou Y, Xu B (2019) MAT: a simple yet strong baseline for identifying self-admitted technical debt. arXiv:1910.13238

  • Hirao T, McIntosh S, Ihara A, Matsumoto K (2019) The review linkage graph for code review analytics: a recovery approach and empirical study. In: Proceedings of the 2019 27th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, p 578-589

  • Hong Y, Tantithamthavorn C, Thongtanunam P, Aleti A (2022) CommentFinder: a simpler, faster, more accurate code review comments recommendation. In: Proceedings of the 30th ACM joint European software engineering conference and symposium on the foundations of software engineering, pp 507–519

  • Huang Q, Shihab E, Xia X, Lo D, Li S (2018) Identifying self-admitted technical debt in open source projects using text mining. Empiri Softw Eng 23:418–451

    Article  Google Scholar 

  • Juergens E (2011) Research in cloning beyond code: a first roadmap. In: Proceedings of the 5th international workshop on software clones, pp 67–68

  • Kalliamvakou E, Gousios G, Blincoe K, Singer L, German DM, Damian D (2014) The promises and perils of mining GitHub. In: Proceedings of the 11th working conference on mining software repositories, pp 92–101

  • Kamienski A, Hindle A, Bezemer CP (2023) Analyzing techniques for duplicate question detection on Q &A websites for game developers. Empir Softw Eng 28:1–41

    Article  Google Scholar 

  • Kashiwa Y, Nishikawa R, Kamei Y, Kondo M, Shihab E, Sato R, Ubayashi N (2022) An empirical study on self-admitted technical debt in modern code review. Inf Softw Technol 146:106855

    Article  Google Scholar 

  • Koschke R (2007) Survey of research on software clones. In: Dagstuhl seminar proceedings

  • Kumfert G, Epperly T (2002) Software in the DOE: the hidden overhead of “the build”. Tech. rep., Lawrence Livermore National Lab., CA (US)

  • Li Z, Yu Y, Zhou M, Wang T, Yin G, Lan L, Wang H (2022) Redundancy, context, and preference: an empirical study of duplicate pull requests in OSS projects. IEEE Trans Softw Eng 48:1309–1335

    Article  Google Scholar 

  • Li J, Ernst MD (2012) CBCD: cloned buggy code detector. In: Proceedings of the 34th international conference on software engineering, pp 310–320

  • Liu Z, Huang Q, Xia X, Shihab E, Lo D, Li S (2018) SATD detector: a text-mining-based self-admitted technical debt detection tool. In: Proceedings of the 40th international conference on software engineering: companion proceeedings, pp 9–12

  • Maipradit R, Treude C, Hata H, Matsumoto K (2020) Wait for it: identifying “on hold’’ self-admitted technical debt. Empir Softw Eng 25:3770–3798

    Article  Google Scholar 

  • Maipradit R, Lin B, Nagy C, Bavota G, Lanza M, Hata H, Matsumoto K (2020a) Automated identification of on-hold self-admitted technical debt. In: Proceedings of the 20th IEEE international working conference on source code analysis and manipulation, pp 54–64

  • Maldonado EdS, Shihab E (2015) Detecting and quantifying different types of self-admitted technical debt. In: 2015 IEEE 7Th international workshop on managing technical debt (MTD), IEEE, pp 9–15

  • Maldonado EdS, Shihab E, Tsantalis N (2017) Using natural language processing to automatically detect self-admitted technical debt. IEEE Trans Softw Eng 43:1044–1062

    Article  Google Scholar 

  • Mann HB, Whitney DR (1947) Ann Math Stat 18:50–60

    Article  Google Scholar 

  • Manning C, Klein D (2003) Optimization, maxent models, and conditional estimation without magic. In: Proceedings of the 2003 conference of the North American chapter of the association for computational linguistics on human language technology: tutorials - vol 5, pp 8

  • McIntosh S, Adams B, Nguyen TH, Kamei Y, Hassan AE (2011) An empirical study of build maintenance effort. In: Proceedings of the 33rd international conference on software engineering, pp 141–150

  • McIntosh S, Poehlmann M, Juergens E, Mockus A, Adams B, Hassan AE, Haupt B, Wagner C (2014) Collecting and leveraging a benchmark of build system clones to aid in quality assessments. In: Companion proceedings of the 36th international conference on software engineering, pp 145–154

  • Miyake Y, Amasaki S, Aman H, Yokogawa T (2017) A replicated study on relationship between code quality and method comments, pp 17–30

  • Mondal M, Roy B, Roy CK, Schneider KA (2019) An empirical study on bug propagation through code cloning. J Syst Softw 158:110407

    Article  Google Scholar 

  • Muse BA, Nagy C, Cleve A, Khomh F, Antoniol G (2022) FIXME: synchronize with database! an empirical study of data access self-admitted technical debt. Empir Softw Eng 27:130

    Article  Google Scholar 

  • Nejati M, Alfadel M, McIntosh S (2023) Code review of build system specifications: prevalence, purposes, patterns, and perceptions. In: Proceedings of the 44th international conference on software engineering, p To appear

  • Potdar A, Shihab E (2014) An exploratory study on self-admitted technical debt. In: Proceedings of the 30th IEEE international conference on software maintenance and evolution, pp 91–100

  • Reimers N, Gurevych I (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp 3982–3992

  • Ren X, Xing Z, Xia X, Lo D, Wang X, Grundy J (2019) Neural network-based detection of self-admitted technical debt: From performance to explainability. ACM Trans Softw Eng Methodol 28

  • Rigby PC, Storey MA (2011) Understanding broadcast based peer review on open source software projects. In: Proceedings of the 33rd international conference on software engineering, pp 541–550

  • Romano J, Kromrey JD, Coraggio J, Skowronek J, Devine L (2006) Exploring methods for evaluating group differences on the NSSE and other surveys: are the t-test and Cohen’s d indices the most appropriate choices? In: Annual meeting of the southern association for institutional research, pp 1–51

  • Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

    Article  Google Scholar 

  • Roy CK, Cordy JR (2007) A survey on software clone detection research. Queen’s Sch Comput Tech Rep 541:64–68

    Google Scholar 

  • Scikit-Learn library (2023a) Countvectorizer. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

  • Scikit-Learn library (2023b) Dbscan. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

  • Sierra G, Shihab E, Kamei Y (2019) A survey of self-admitted technical debt. J Syst Softw 152:70–82

    Article  Google Scholar 

  • Smith P (2011) Software build systems: principles and experience. Addison-Wesley Professional

  • Tsuru T, Nakagawa T, Matsumoto S, Higo Y, Kusumoto S (2021) Type-2 code clone detection for Dockerfiles. In: Proceedings of the 15th IEEE international workshop on software clones, pp 1–7

  • van Bladel B, Demeyer S (2020) Clone detection in test code: an empirical evaluation. In: Proceedings of the 27th IEEE international conference on software analysis, evolution and reengineering, pp 492–500

  • Vidoni M (2021) Self-admitted technical debt in R packages: an exploratory study. In: Proceedings of the 18th IEEE/ACM international conference on mining software repositories, pp 179–189

  • Viera AJ, Garrett JM et al (2005) Understanding interobserver agreement: the kappa statistic. Fam Med 37:360–363

    Google Scholar 

  • Wang D, Kula RG, Ishio T, Matsumoto K (2021) Automatic patch linkage detection in code review using textual content and file location features. Inf Softw Technol 139:106637

    Article  Google Scholar 

  • Wehaibi S, Shihab E, Guerrouj L (2016) Examining the impact of self-admitted technical debt on software quality. In: Proceedings of the 23rd IEEE international conference on software analysis, evolution, and reengineering, pp 179–188

  • Xavier L, Montandon JE, Ferreira F, Brito R, Valente MT (2022) On the documentation of self-admitted technical debt in issues. Empir Softw Eng 27:1–34

    Article  Google Scholar 

  • Xavier L, Ferreira F, Brito R, Valente MT (2020) Beyond the code: mining self-admitted technical debt in issue tracker systems. In: Proceedings of the 17th IEEE/ACM international conference on mining software repositories, pp 137–146

  • Xiao T, Wang D, McIntosh S, Hata H, Kula RG, Ishio T, Matsumoto K (2022) Characterizing and mitigating self-admitted technical debt in build systems. IEEE Trans Softw Eng 48:4214–4228

    Article  Google Scholar 

  • Xiao T, Baltes S, Hata H, Treude C, Kula RG, Ishio T, Matsumoto K (2023) 18 million links in commit messages: purpose, evolution, and decay. Empir Softw Eng 28:91

    Article  Google Scholar 

  • Xiao T, Zeng Z, Wang D, Hata H, McIntosh S, Matsumoto K (2023b) Replication package. https://doi.org/10.5281/zenodo.10055463

  • Yasmin J, Sheikhaei MS, Tian Y (2022) A first look at duplicate and near-duplicate self-admitted technical debt comments. In: 2022 IEEE/ACM 30th international conference on program comprehension (ICPC), pp 614–618

  • Zampetti F, Fucci G, Serebrenik A, Di Penta M (2021) Self-admitted technical debt practices: a comparison between industry and open-source. Empir Softw Eng 26:1–32

    Article  Google Scholar 

  • Zanaty FE, Hirao T, McIntosh S, Ihara A, Matsumoto K (2018) An empirical study of design discussions in code review. In: Proceedings of the 12th ACM/IEEE international symposium on empirical software engineering and measurement

Download references

Acknowledgements

This work was supported by JSPS Grant-in-Aid for JSPS Fellows JP23KJ1589, JSPS KAKENHI Grant Numbers JP20H05706, JP23K16864, and JST PRESTO Grant Number JPMJPR22P6.

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Tao Xiao or Dong Wang.

Ethics declarations

Conflicts of interest

The authors declare that Hideaki Hata, is a member of the EMSE Editorial Board. All co-authors have seen and agree with the contents of the manuscript and there is no financial interest to report.

Additional information

Communicated by: Davide Falessi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Xiao, T., Zeng, Z., Wang, D. et al. Quantifying and characterizing clones of self-admitted technical debt in build systems. Empir Software Eng 29, 54 (2024). https://doi.org/10.1007/s10664-024-10449-5

Download citation

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s10664-024-10449-5

Keywords

Navigation