1 Introduction

Software maintenance and evolution tasks require considerable effort due to activities such as the continuous modification of the code to create new features, correct faults, or improve quality attributes (Lehman, 1996). However, developers need to prioritize these activities, and sometimes implementing a new feature has a higher priority than refactoring (Martini et al., 2015). Consequently, refactoring, despite its importance, is deprioritized and the related activities are postponed, which can lead to the introduction of so-called technical debt (Cunningham, 1992).

Technical debt (TD) represents “sub-optimal design or implementation solutions that yield a benefit in the short term but make changes more costly or even impossible in the medium to long term” (Avgeriou et al., 2016b). In other words, TD symbolizes the tacit compromise between delivering fast and producing high-quality code. However, the more debt accumulates in the code, the more severe the consequences become. The code eventually becomes unmanageable, and refactoring activities become difficult and complex to carry out (Li et al., 2015; Lenarduzzi et al., 2021).

In the vast majority of cases, developers accumulate TD unconsciously, without realizing that the task they are performing will increase the overall amount of TD in the code. In other cases, developers explicitly “admit” TD by adding a comment (such as “FIXME” or “TODO”) stating that something is not right and should be fixed as soon as possible next to the portion of code responsible for the TD. This kind of debt is called “self-admitted” TD (SATD) (Maldonado & Shihab, 2015).

A specific sub-type of SATD comments is keyword-labeled SATD (KL-SATD) comments. These are SATD comments that contain a SATD-related keyword such as “TODO” or “FIXME” (Rantala et al., 2020). An argument in favor of SATD is that, since it is intentionally introduced, declared, and documented, it should be simpler to identify and remove than non-admitted TD.

In this study, we analyzed how SATD relates to code-related TD, utilizing TD-related metrics calculated by SonarQube. We focused on the introduction and removal of KL-SATD comments in the source code. We designed and conducted an empirical study of 33 open-source projects from the technical debt dataset by Lenarduzzi et al. (2019b). We detected and extracted keyword-labeled SATD comments and determined when each comment was introduced and removed. First, we considered the corresponding SonarQube reports (TD index or “sqale index,” reliability remediation effort, and security remediation effort) at the introduction and removal of KL-SATD. Second, we considered the relationship between code-level issues reported by SonarQube and the introduction and removal of KL-SATD. Third, we investigated whether KL-SATD comments are in the context of the reported SonarQube issues, and whether these comments address that specific issue.

In principle, one could think that the addition of SATD, e.g., comments with “TODO” or “FIXME,” is related to an increase in technical debt measured from the source code while the removal of SATD comments would be related to the decrease in technical debt measured from the source code.

Investigating whether a correlation exists between KL-SATD comments and code-level technical debt is important, as this can help in creating new tools related to TD and SATD. One example could be automated SATD generation based on code metrics or the creation of a SATD prediction tool based on code-level TD metrics.

In addition, we need to understand whether static analysis tools can capture the essence of a SATD comment. If they cannot, they should be used with care when trying to predict SATD. This information can also help developers understand whether SATD and static code metrics point out the same problems in the code. Even if they point to different things, they can still be correlated: if you see a house with a broken window, you can anticipate that the house might have other problems due to lack of maintenance, such as a leaking roof. Furthermore, tools or guidelines derived from this work can benefit developers. For example, discovering TD types that are correlated with SATD removal can guide developers to pay extra attention to not introducing these code smells while refactoring their code.

The contribution of this paper is threefold:

  • Investigating the relationship between SonarQube technical debt (sqale index) and remediation efforts and KL-SATD introduction and removal

  • Investigating the relationships between SonarQube issues and KL-SATD introduction and removal

  • Qualitative analysis of KL-SATD comments and their relation to SonarQube issues

The remainder of this paper is structured as follows. In Section 2, we introduce the background of this work. Section 3 describes the case study design, while Section 4 presents the obtained results. Section 5 discusses the results and the benefits of our open-source work to practitioners and researchers, and Section 6 identifies threats to validity. Section 7 describes the related work, while Section 8 draws conclusions and highlights future work.

2 Background

Here, we present the background to our study. We start by introducing the concepts of TD and SATD and then proceed with the description of SonarQube, one of the most widely adopted automated static analysis tools for TD detection.

2.1 Technical debt

Software companies need to manage sub-optimal solutions. The presence of TD is inevitable (Martini et al., 2015) and even desirable under some circumstances (Besker et al., 2018) for several reasons, which may often be related to unpredictable business or environmental forces internal or external to the organization.

The concept of TD was introduced by Cunningham (1992) as “The debt incurred through the speeding up of software project development which results in a number of deficiencies ending up in high maintenance overheads.” Later, Avgeriou et al. (2016a) defined it as “A collection of design or implementation constructs that are expedient in the short term, but set up a technical context that can make future changes more costly or impossible. TD presents an actual or contingent liability whose impact is limited to internal system qualities, primarily maintainability and evolvability.”

Technical debt (TD) can be considered as a metaphor to represent sub-optimal design or implementation solutions that yield a benefit in the short term but make changes more costly or even impossible in the medium to long term (Avgeriou et al., 2016b).

In their systematic mapping study consisting of 94 selected studies, Li et al. (2015) classified technical debt into ten different categories. In their classification, they included: requirement TD, architectural TD, design TD, code TD, test TD, build TD, documentation TD, infrastructure TD, versioning TD, and defect TD.

2.2 Self-admitted technical debt

Potdar and Shihab (2014) introduced the concept of self-admitted technical debt (SATD). This concept refers to a sub-type of TD where the developer leaves a note of the TD's presence, for example, in a code comment. Rantala et al. (2020) further refined the concept by introducing keyword-labeled SATD (KL-SATD), which refers to code comments containing a specific keyword such as “TODO,” “FIXME,” or “HACK.”

Previous work by Wehaibi et al. (2016) investigated the relationship between defects and SATD. Their results show that when SATD is introduced into a project, the files containing SATD have a higher number of defect-fixing activities than before. However, overall, SATD changes had a lower defect-inducing rate than changes without SATD. They did not discover any statistical differences between defect severities and SATD. However, their defect severity classification is based on issue tracking reports and not on automatic tool classification as in our case. Finally, they show that SATD introduction can lead to a rise in the complexity of the software.

Bavota and Russo (2016) conducted a large empirical study of SATD in which they also labeled a statistically significant sample of all SATD comments into different TD categories. They discovered that in the majority of cases the SATD comment implied code debt, followed by equal amounts of defect debt and requirement debt, then design debt, documentation debt, and finally test debt. It is worth noting that they base their classifications on analyzing the SATD comments, while our work classifies SATD based on the description of a TD issue obtained from SonarQube. They did not discover any correlations between code file quality and SATD instances. Finally, they discovered that once SATD is introduced, it can remain in the system for a long time, and that projects show an increasing trend in SATD introductions over their lifetime.

Work by Iammarino et al. (2021) analyzed the relationship between refactoring actions and SATD removals. They discovered that refactoring actions co-occur more with SATD removals than other changes. At the same time, only a small portion of refactoring actions remove SATD. Most of them occur by chance or as a result of the SATD removal.

Finally, Tan et al. (2020) investigated how developers self-fix SATD. They investigated 20 Python projects with SonarQube and discovered that SonarQube issues labeled as defects had the highest self-fixing rate while testing-related issues had the lowest fixing rate. Our work does not look at self-fixing rates, but rather we look at what kind of a relationship different kinds of issues have in SATD introduction and removal in Java projects.

2.3 SonarQube

Code and design TD items are detectable with different automated static analysis tools (ASATs) (Avgeriou et al., 2021). SonarQube is one of the ASATs most widely adopted by developers (Vassallo et al., 2019; Avgeriou et al., 2021); it is provided as a service by the sonarcloud.io platform, or it can be downloaded and executed on a private server.

SonarQube provides a TD index and two remediation estimates (Footnote 1). The TD index (also called sqale index) is related to “the effort (minutes) to fix all code smells.” We use the term sqale index throughout the paper. The remediation estimates are related to “the effort to fix all bug issues” (reliability remediation effort) and “the effort to fix all vulnerability issues” (security remediation effort). In addition to these metrics, SonarQube verifies the code's compliance against a specific set of “coding rules” defined for the most common development languages. If the analyzed source code violates a coding rule, or if a metric is outside a predefined threshold (also named “quality gate”), SonarQube generates a “TD issue” (Footnote 2). The coding rules included in SonarQube are related to maintainability, reliability, and security. Therefore, they mirror the sqale index and the two remediation efforts.

3 The empirical study

We designed and conducted the empirical study by following the guidelines proposed by Runeson and Höst (2009). In this section, we present the goal, research questions, and metrics for the case study. Based on them, we outline the study context, the data collection, and the data analysis.

3.1 Goal and research questions

We formulated our goal according to the goal question metrics (GQM) approach (Basili et al., 1994). The goal here is defined to have a purpose, process, and specific viewpoint. The questions are derived from the goal, and these questions are then answered using a set of metrics.

We aim at analyzing the introduction and removal of keyword-labeled self-admitted technical debt, for the purpose of evaluating its relationship with the technical debt and remediation estimates and the “TD issues” reported by SonarQube, from the point of view of developers, in the context of projects written in Java.

We further divide the general research problem into the following five research questions:

RQ\(_1\): What is the relationship between introducing a keyword-labeled self-admitted technical debt comment and the technical debt and remediation efforts calculated by SonarQube?

RQ\(_2\): What is the relationship between removing a keyword-labeled self-admitted technical debt comment and the technical debt and remediation efforts calculated by SonarQube?

RQ\(_3\): What is the relationship between introducing a keyword-labeled self-admitted technical debt comment and SonarQube issues?

RQ\(_4\): What is the relationship between removing a keyword-labeled self-admitted technical debt comment and SonarQube issues?

RQ\(_5\): What is the overlap between keyword-labeled self-admitted technical debt comments and SonarQube issues?

Investigating what kind of SonarQube reports are connected to KL-SATD can help us understand the characteristics underlying KL-SATD introduction and removal. It is natural to think that increases and decreases in KL-SATD would be reflected in TD increases and decreases captured via program analysis and consequently in the SonarQube reports.

3.2 Context

For this study, we used the projects included in the technical debt dataset gathered by Lenarduzzi et al. (2019b). The complete description of the data can be found in the original paper, but to make this paper self-contained, we briefly describe it here. The dataset contains 33 Java projects from the Apache Software Foundation (ASF) repository. Projects were selected based on “criterion sampling” (Patton, 2002), keeping projects that fulfill all of the following criteria: developed in Java, older than 3 years, more than 500 commits and 100 classes, and usage of an issue tracking system with at least 100 reported issues.

All of the commits in every repository were analyzed with SonarQube. Each of these analyses produces a report that includes, at the commit level, the sqale index, the remediation efforts, and the lines of code; at the file level, 23 anti-patterns and code smells detected by Ptidej; and, finally, violations of 1817 different SonarQube rules (issues) in the code.

As noted in the original paper for the dataset (Lenarduzzi et al., 2019b), the metrics were not available for every commit in every repository due to compilation problems. The whole dataset consists of 77,929 commits, and of these, 7526 commits had zeroes as values for lines of code, sqale index, and the remediation efforts. We excluded these commits as erroneous.

It is also important to note that we considered only data related to the master branches of the projects. Therefore, data points related to other branches were not considered.

3.3 Data collection

Here, we describe our data collection methods. We first explain our approach for extracting KL-SATD comments, and when we consider them to be removed. After this, we look at the data collected from the technical debt dataset.

3.3.1 KL-SATD comment extraction

To extract KL-SATD, we identify the commits where KL-SATD keywords appear by analyzing the source code and extracting single-line and block comments from it. We consider multiple adjacent single-line comments to be one single comment (instead of multiple single-line comments).

In this paper, we focus on KL-SATD comments containing either of the keywords “TODO” or “FIXME.” Prior work by Ren et al. (2019) identified them as highly related to SATD, reporting that comments with these keywords had a probability of over 0.97 of containing SATD. These two keywords are also explicitly mentioned in SonarQube's rules (Footnote 3).
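To illustrate this extraction step, the following R sketch flags single-line Java comments that contain one of the two keywords. It is a simplification of our pipeline (block comments and the merging of adjacent single-line comments are omitted), and the file path and column names are purely illustrative.

    # Minimal sketch: flag single-line Java comments containing a KL-SATD keyword.
    # Block comments and the merging of adjacent single-line comments are omitted here.
    extract_kl_satd <- function(java_file) {
      lines <- readLines(java_file, warn = FALSE)
      comment_idx <- grep("//", lines)                              # lines with a single-line comment
      comments <- sub(".*?//", "", lines[comment_idx], perl = TRUE) # text after the comment marker
      keep <- grepl("\\b(TODO|FIXME)\\b", comments, perl = TRUE)    # keyword-labeled comments only
      data.frame(file = java_file,
                 line = comment_idx[keep],
                 comment = trimws(comments[keep]),
                 stringsAsFactors = FALSE)
    }

    # Example usage on a hypothetical file:
    # extract_kl_satd("src/main/java/Example.java")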

For tracking the files, we loosely follow the methods of Maldonado et al. (2017). We consider the version of the file where the KL-SATD comment was added as the introduction moment for KL-SATD. Similarly, we consider the commit that either removes the KL-SATD comment or deletes the whole file as the removal moment.

We tracked all the individual files through the whole project’s life cycle, taking also into account possible renaming actions of the individual files.

3.3.2 Detecting sqale index, remediation efforts, and issues

From the technical debt dataset, we collected several different metrics. For commit-level analyses, we considered the sqale index and the two remediation efforts (reliability remediation effort and security remediation effort). The SonarQube website (Footnote 4) defines these metrics as follows:

  • Sqale index is the effort in minutes to fix all code smells.

  • Reliability remediation effort is the effort in minutes to fix all bug issues.

  • Security remediation effort is the effort in minutes to fix all vulnerability issues.

We also include lines of code (LOC) in the commit-level analyses as a baseline metric. LOC has been shown to have an association with several software quality issues such as bugs (Valdivia-Garcia et al., 2018) and faults (Ostrand et al., 2005). In addition, LOC has been used in mixed-model analysis for investigating co-occurrences of refactoring actions and SATD removal (Iammarino et al., 2021).

3.4 Data analysis

We investigated our RQs by comparing the changes in different metrics. For commit-level metrics (sqale index and the reliability and security remediation efforts), we compare against the previous commit from the main branch of the project. For file-level metrics (issues), we compare the individual files to their previous versions. Thus, we create pairs where one part represents the situation before KL-SATD addition or deletion and the other represents the situation after one of these actions has happened. However, if one of these pairs had missing data, we opted to delete the whole pair from the dataset. We opted for deletion, as other methods such as replacing missing values with the mean or median are not applicable due to the nature of the paired data and the difference in the numbers of different pairs. Some pairs were from an earlier point in the project's lifetime and some from later dates, so their respective numbers can be quite different.
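As an illustration of this pairwise deletion, the following R sketch (using dplyr and hypothetical column and data frame names) drops every pair in which either member has a missing metric value; it is not our exact implementation.

    library(dplyr)

    # Sketch: drop any before/after pair in which either member has missing values.
    # Column names (pair_id, sqale_index, ...) are illustrative.
    complete_pairs <- pairs_df %>%
      group_by(pair_id) %>%
      filter(!any(is.na(sqale_index)),
             !any(is.na(reliability_remediation_effort)),
             !any(is.na(security_remediation_effort)),
             !any(is.na(ncloc))) %>%
      ungroup()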

3.4.1 Commit-level commit pair information

Table 1 lists a summary of the commit-level pairs both when KL-SATD was added and when it was deleted. The “KL-SATD in commits” columns list, at the commit level, how many KL-SATD comments were added or deleted. From this frequency information, we list the minimum, median, mean, and maximum amounts.

Table 1 Summary of commit information

For commit pairs, which deal with KL-SATD addition, we have 6904 commit pairs. Most of the commit pairs had relatively few KL-SATD introductions (\(Mdn = 1\), \(M = 3.667\)). There were also large commit pairs present, where the number of new KL-SATD additions was very large (\(max. = 481\)).

The commits dealing with KL-SATD deletion showed similar results as the ones with KL-SATD additions. Here, we have 5301 commit pairs. We see again that most of the commits had very few KL-SATD comments removed (\(Mdn = 1\), \(M = 3.226\)). The largest commit pairs had again many KL-SATD comments removed (\(max. = 389\)).

In both cases, the data seems highly skewed. We performed the Shapiro-Wilk test for normality against the null hypothesis that the data is normally distributed. In both cases the results were statistically significant (\(p < 0.001\)). Therefore, we can reject the null hypotheses, and conclude that in both cases the data is not normally distributed.
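A minimal R sketch of this normality check is shown below (the vector name is hypothetical). Note that R's shapiro.test accepts at most 5000 observations, so larger vectors need to be subsampled or tested with an alternative procedure.

    # Sketch: Shapiro-Wilk test on the per-pair counts of added (or removed) KL-SATD comments.
    x <- additions_per_pair                      # hypothetical numeric vector of counts
    if (length(x) > 5000) x <- sample(x, 5000)   # shapiro.test() allows 3..5000 observations
    shapiro.test(x)                              # p < 0.05 -> reject the normality hypothesis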

3.4.2 File-level commit pair information

Table 2 lists a summary of the file-level pairs both when KL-SATD was added and when it was deleted. The “KL-SATD in commits” columns list, at the commit and file level, how many KL-SATD comments were added or deleted. From this frequency information, we list the minimum, median, mean, and maximum amounts.

Table 2 Summary of file-pair information

The pair amount is significantly lower when compared to the pairs listed in Section 3.4.1. This is due to the fact that, for file-level information, the possibility of having pairs deleted due to missing data is increased.

For file pairs, which deal with KL-SATD addition, we have a total of 825 commit pairs. Most of the pairs have relatively few KL-SATD introductions (\(Mdn = 1\), \(M = 2.035\)). The largest amount of additions in a file pair was 66 KL-SATD comment additions.

The file pairs dealing with KL-SATD deletion showed similar results as the ones with KL-SATD additions. Here, we have 1096 commit pairs. We see again that most of the pairs have very few KL-SATD comments removed (\(Mdn = 1\), \(M = 4.097\)). The largest pair had again many KL-SATD comments removed (\(max. = 306\)).

In both cases, the data seems again highly skewed. We perform the Shapiro-Wilk test for normality against the null hypothesis that the data is normally distributed. In both cases, the results are again statistically significant (\(p < 0.001\)), and we can reject the null hypotheses. Therefore, we conclude that in both cases the data is not normally distributed.

3.4.3 Data normalization

We perform analysis using sqale index, reliability remediation effort, and security remediation effort on commit-level analysis, and TD issues on file-level analysis. Since the data set consists of several projects with varying sizes and also varying files within them, we performed data normalization before running the analysis. This normalization is performed differently depending on whether the analysis is done on the commit or file level.

Commit-level analysis is done using the sqale index, reliability remediation effort, and security remediation effort as the metrics. The dataset consists of several projects of varying sizes, which means there are large variations not only within the projects but also between them. We normalize all of the metrics within each project utilizing min-max normalization, so that within each project the values of the metrics are scaled between 0 and 1. This is a typical normalization that has been used in conjunction with mixed models before (see, e.g., Iammarino et al. (2021)).

The file-level analysis is done using SonarQube issues as the metrics. The data set consists of several projects with varying sizes, as well as several files with varying sizes. This means there can be large variations not only within the projects but also between the files in them. We normalize all of the metrics again with min-max normalization within files in projects. This means that the metrics will be normalized between 0 and 1 for each file within each project.
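A sketch of both normalization steps in R (with dplyr and hypothetical column names) could look as follows; files with constant metric values are mapped to zero to avoid division by zero.

    library(dplyr)

    # Min-max scaling to [0, 1]; constant vectors are mapped to 0 to avoid division by zero.
    minmax <- function(x) {
      if (max(x) == min(x)) return(rep(0, length(x)))
      (x - min(x)) / (max(x) - min(x))
    }

    # Commit level: normalize within each project.
    commit_norm <- commit_df %>%
      group_by(project) %>%
      mutate(across(c(sqale_index, reliability_remediation_effort,
                      security_remediation_effort, ncloc), minmax)) %>%
      ungroup()

    # File level: normalize issue counts within each file of each project.
    file_norm <- file_df %>%
      group_by(project, file) %>%
      mutate(across(starts_with("issue_"), minmax)) %>%
      ungroup()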

3.4.4 Dealing with multi-collinearity

Before running the analysis on the commit level with the sqale index and the two remediation efforts, we wanted to avoid multi-collinearity issues between the different metrics. This was done by utilizing the redun function from the Hmisc package in R. We used the default threshold for the cutoff (\(R^{2} \ge 0.9\)).
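A sketch of this redundancy check in R is shown below; the data frame and column names are illustrative.

    library(Hmisc)

    # Redundancy analysis of the commit-level predictors (default cutoff R^2 >= 0.9).
    red <- redun(~ ncloc + sqale_index + reliability_remediation_effort +
                   security_remediation_effort,
                 data = commit_norm, r2 = 0.9)
    print(red)   # lists predictors flagged as redundant, if any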

Looking at the impact of KL-SATD comment introduction on the commit level (RQ\(_1\)), the cutoff results for the sqale index and lines of code are 0.001 apart from one another. Therefore, we opted to run the analysis with both of them included as well as excluding one of them at a time. The other metrics were further away from the cutoff point and were therefore always included in the analyses.

For the impact of KL-SATD comment removal on the commit level (RQ\(_2\)), the analysis did not yield any redundant metrics, but lines of code and the sqale index were again very close to each other and to the cutoff point (lines of code = 0.884, sqale index = 0.883). Therefore, we again opted to run the analysis with both of them included as well as excluding one of them at a time. The other metrics were further away from the cutoff point and were therefore always included in the analyses.

3.4.5 Generalized linear mixed model analysis

Our data comes from a large dataset composed of several projects. These projects are not equal in size, so the data is not equally distributed. Within each project, we have pairs of data that measure the change in different predictors at either the commit or file level. This means that for each pair we have a repeated measure, with measuring points just before and just after the KL-SATD comment was introduced or removed. Both the projects and the pairs are a source of randomness in the data, and we can think of the projects and the pairs within them as nested random effects. For these circumstances, we elected to build a generalized linear mixed model in R.

The dependent variable is always the introduction or removal of KL-SATD (0 or 1), and the independent variables are either the sqale index and the two remediation efforts (commit level) or TD issues (file level). The independent variables can have either positive or negative relationships with the dependent variable, meaning that there exists a connection between the independent variable and the dependent variable. The relationship is positive when the estimated coefficient of the independent variable is above zero, and negative when it is below zero. A positive relationship signifies that an increase in the independent variable has a relationship with the KL-SATD introduction or removal. Vice versa, a negative relationship signifies that a decrease in the independent variable has a connection with the KL-SATD introduction or removal.
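The following R sketch, based on lme4, shows the general shape of such a model for the commit-level KL-SATD introduction case; the variable names are illustrative, and the sketch is not necessarily the exact specification behind the reported models.

    library(lme4)

    # Binomial GLMM: the outcome indicates whether the observation belongs to a
    # KL-SATD-introducing change, with pairs nested within projects as random intercepts.
    m_intro <- glmer(kl_satd_added ~ sqale_index + reliability_remediation_effort +
                       security_remediation_effort + ncloc +
                       (1 | project / pair_id),
                     data = commit_norm, family = binomial)
    summary(m_intro)   # positive fixed-effect estimates relate increases to KL-SATD introduction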

To find the best model, we evaluate them against each other using Akaike information criterion (AIC) (Akaike, 1998). It is used to find a model with the lowest information loss and is therefore suitable for evaluating different models with each other. We also report other estimation metrics, as they may be of interest to researchers more accustomed to using them instead of AIC.

We calculate the evidence ratio (ER) (Kenneth & David, 2002) for every model. The evidence ratio is calculated as follows:

$$\begin{aligned} ER = \frac{exp(-\frac{1}{2}\Delta _{min})}{exp(-\frac{1}{2}\Delta _i)} \end{aligned}$$
(1)

where \(\Delta _{min}\) is the difference between the lowest AIC score and itself, meaning it is always 0, and \(\Delta _i\) is the difference between the AIC\(_i\) score and the lowest AIC score. The ER therefore expresses how many times better the lowest-AIC model is, compared to the other models, at minimizing the loss of information. As noted by Kenneth and David (2002), it is recommended to look at the differences between AIC scores rather than at the evidence ratio alone.
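Assuming three fitted candidate models m1, m2, and m3 (hypothetical names, e.g., variants of the model sketched above), the AIC differences and evidence ratios of Eq. (1) can be computed as follows.

    # AIC differences and evidence ratios per Eq. (1): since Delta_min = 0,
    # ER_i simplifies to exp(0.5 * Delta_i).
    aics  <- c(model1 = AIC(m1), model2 = AIC(m2), model3 = AIC(m3))
    delta <- aics - min(aics)
    er    <- exp(0.5 * delta)      # ER = 1 for the best model, larger for the others
    data.frame(AIC = aics, delta = delta, evidence_ratio = er)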

3.5 Qualitative analysis

In this work, we investigate the relationship between KL-SATD comments and SonarQube reports with statistical analysis. However, a statistical relationship can exist even when KL-SATD comments and SonarQube point out different problems in the code. The KL-SATD and program analysis-based TD may or may not reflect the same latent property of TD in the source code. Comments like “TODO: This method is too long” are likely to be found by SonarQube’s code analysis while comments like “TODO: We need to update to the latest version of the library” are unlikely to be found by SonarQube’s source code analysis. Furthermore, our results may reflect the fact that the appearing issues are not related to the actual KL-SATD comments but rather appear elsewhere in the file at the same time.

To investigate this matter, we conducted an exploratory qualitative analysis on a random sample of comments. We wanted to find out how KL-SATD comments and issues overlap: specifically, how many KL-SATD comments lie in the context of an issue, and how many of these comments address the issue they are in the context of.

3.5.1 Definition of the context of an issue

To investigate whether a KL-SATD comment lies in the context of an issue, we first need to define what context means in our research. We define that a KL-SATD comment is in the context of an issue if it fulfills one of the three following criteria:

  1. The KL-SATD comment is within a structure of an issue (line 236 in Fig. 1).

  2. The KL-SATD comment is present on code lines right before an issue (line 239 in Fig. 1).

  3. The KL-SATD comment is present on code lines right after an issue (line 250 in Fig. 1).

The first criterion, “KL-SATD comment is within a structure of an issue,” refers to a case where an issue spans multiple lines of code and the KL-SATD comment is placed on one of these lines. One example of such a case is a structure containing many nested if-statements within which a developer has left a KL-SATD comment. We define that, since the comment is within this structure, it is in the context of the said issue. The next two criteria, “KL-SATD comment is present on code lines right before/after an issue,” are valid for multi-line issues, but they also account for single-line issues. An example of a single-line issue with a KL-SATD comment in its context would be enforcing the naming convention of a variable: if there is a KL-SATD comment right before or after this variable, then it is in the scope of that particular issue. Figure 1 demonstrates these three rules with a single imaginary example of an issue.

Fig. 1 Three different contexts of an issue
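The three criteria can be operationalized, for example, as the following R helper, given the line number of a comment and the start and end lines of an issue in the same file; the one-line tolerance for “right before/after” is an illustrative choice rather than the exact rule used in the annotation.

    # Sketch: is a KL-SATD comment in the context of an issue spanning issue_start..issue_end?
    in_context <- function(comment_line, issue_start, issue_end, tolerance = 1) {
      within_issue <- comment_line >= issue_start && comment_line <= issue_end   # criterion 1
      right_before <- comment_line < issue_start &&
                      issue_start - comment_line <= tolerance                    # criterion 2
      right_after  <- comment_line > issue_end &&
                      comment_line - issue_end <= tolerance                      # criterion 3
      within_issue || right_before || right_after
    }

    in_context(120, 118, 130)   # TRUE: the comment lies within the issue (criterion 1)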

3.5.2 Addressing an issue

Looking at all three comments in Fig. 1, even though they are all within the context of the same issue, only one of them addresses it. The KL-SATD comment on line 236 states that something should be done to clean up the code below, e.g., too many nested if-statements. Therefore, even if there are three KL-SATD comments within the same context, only one of them addresses the issue. This is an important distinction to make: it shows that this one comment is likely to also be flagged by the static analysis tool, whereas the other two KL-SATD comments talk about properties that cannot be captured with the tool.

3.5.3 Sample size determination

In our qualitative analysis, we assess whether KL-SATD comments are in the context of a SonarQube issue and whether they address it. We are therefore dealing with a dichotomous variable. In addition, we do not know in how many cases this is true. We wanted to take a sample from this population with an unknown distribution that would guarantee a 95% confidence level. Therefore, we determined the sample size using the following formula:

$$\begin{aligned} \textit{n} = p(1-p)(\frac{Z}{E})^2 \end{aligned}$$
(2)

where p is the unknown population proportion, Z is the value from the standard normal distribution that reflects the desired confidence level, and E is the desired margin of error. To maximize the sample size, we used 0.5 for the value of p, together with Z = 1.96 (95% confidence) and a margin of error E of 5%. This gives us a sample size of 385 cases.
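For completeness, the calculation behind the 385 cases is reproduced below in R.

    # Worked example of Eq. (2) with p = 0.5, Z = 1.96 (95% confidence), E = 0.05.
    p <- 0.5; Z <- 1.96; E <- 0.05
    n <- p * (1 - p) * (Z / E)^2   # 384.16
    ceiling(n)                     # 385 cases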

3.5.4 Annotation instructions and test annotation

Before annotating all of these cases, all of the authors were given a small test sample of 30 cases along with the annotation instructions. The instructions are listed in the Appendix. Disagreements discovered in this initial round were discussed among all of the authors to reach a consensus. After this, the main author proceeded to annotate the remaining 355 cases. The full annotated sample set is available in our replication package.

3.6 Replicability

To allow our study to be replicated, we will publish a package allowing for the replication of the results of this paper (Footnote 5).

4 Results

This section answers our RQs. We start by looking at the relationship between KL-SATD comment introductions and removals and the sqale index and the two remediation efforts at the commit level (RQ\(_1\) & RQ\(_2\)). Then, we move on to examining the relationship between KL-SATD comment introductions and removals and TD issues at the file level (RQ\(_3\) & RQ\(_4\)).

4.1 The relationship of KL-SATD with technical debt and remediation efforts (RQ\(_1\) & RQ\(_2\))

We look at what kind of relationship the introduction or removal of KL-SATD comments has with technical debt and the two remediation efforts. We examine this using generalized linear mixed model analysis with the normalized metric values. To determine which of the remediation efforts was the most significant in predicting KL-SATD introduction, we ran a generalized linear mixed model analysis on the commit level for the three remediation effort variables as well as for lines of code, which serves as our baseline. We constructed three different models for both introduction and removal: one model has all of the metrics, the second model excludes ncloc, and the third model excludes the sqale index. The results of these models are presented in Tables 3, 4, and 5 for KL-SATD introduction and in Tables 6, 7, and 8 for KL-SATD removal. We first go over the results regarding KL-SATD introduction and after that the results for KL-SATD removal.

Table 3 Remediation efforts for KL-SATD introduction - Model 1 (RQ\(_1\))
Table 4 Remediation efforts for KL-SATD introduction - Model 2 (RQ\(_1\))
Table 5 Remediation efforts for KL-SATD introduction - Model 3 (RQ\(_1\))

Looking at the results, we can see that the lowest AIC score (19,090.1) is achieved when all of the metrics are included in the model (Model 1). To verify the result, we calculated the ER for the other models in relation to Model 1. The ER for Model 2 was 4.7, and for Model 3 it was 420,836. This means that the likelihood of either of these two models producing a result with as little information loss as Model 1 is low.

Findings 1: The generalized linear mixed model analysis shows that, when compared to the other remediation efforts, only the sqale index has a positive relationship with KL-SATD introduction (RQ\(_1\)), while reliability remediation effort and security remediation effort do not have a statistically significant effect. The lines of code metric has a negative relationship. This finding connects KL-SATD comments to the appearance of code smells rather than to bugs or security issues. Also, adding more and more code lines to projects tends to have a negative relationship with KL-SATD appearance.

Similarly to the introduction of KL-SATD comments, we now look at what kind of relationship the removal of KL-SATD comments has with technical debt and the two remediation efforts. The results are presented in Tables 6, 7, and 8.

Table 6 Remediation efforts for KL-SATD removal - Model 4 (RQ\(_2\))
Table 7 Remediation efforts for KL-SATD removal - Model 5 (RQ\(_2\))
Table 8 Remediation efforts for KL-SATD removal - Model 6 (RQ\(_2\))

The results show that again, the model with the lowest AIC (14,622.1) is Model 4, which has all of the metrics included in it. Calculating the ER shows that other models have a low probability of minimizing information loss, with Model 5 having an ER of 380,788 and Model 6 having an ER of 1212. Therefore, we consider Model 4 to be the one to use regarding KL-SATD removal.

Model 4 has three statistically significant predictors. The lines of code metric has a positive relationship with KL-SATD comment removal, while the sqale index and reliability remediation effort have negative relationships. Therefore, an increase in the lines of code seems to have a positive relationship with KL-SATD removal.

Findings 2: The generalized linear mixed model analysis shows that the lines of code metric has a positive relationship with KL-SATD removal (RQ\(_2\)), while the sqale index and reliability remediation effort have negative relationships. The results indicate that the growth of the project and the fixing of both maintainability issues and bugs are correlated with the removal of KL-SATD comments.

4.2 The relationship of KL-SATD with technical debt and issues (RQ\(_3\) & RQ\(_4\))

We look at SonarQube's issues and how they relate at the file level to KL-SATD introduction, again building a model using generalized linear mixed model analysis. The results are summarized in Table 9, where we present only the statistically significant predictors to save space.

Table 9 Issues for KL-SATD introduction - Model 7 (RQ\(_3\))

Looking at the random effects, the results show that the largest source of variance is found between projects (2.879e\(-\)02), and the second-largest source is the files within the projects (1.369e\(-\)02). Of all the issues used in Model 7 presented in Table 9, 13 issues have a statistically significant correlation with the appearance of KL-SATD comments. Of these, 9 have a positive relationship and 4 have a negative relationship. Table 10 lists these issues in the same order as in Table 9 but shows their names, the direction of their relationship to KL-SATD introduction (positive or negative), their severity classification, and their type.

Table 10 Issue summary in relation to introduction of KL-SATD (RQ\(_3\))

The issues are almost exclusively of the type code smell, with only one issue labeled as a vulnerability. The severity of the statistically significant issues varied from minor to critical. In total, there are 6 minor level code smells, 6 major level code smells, and 1 critical level code smell. From the issues with a positive relationship, 4 are minor code smells, 4 are major code smells, and 1 is a critical code smell. From the issues with the negative relationship, 1 is a minor code smell, 1 is a minor vulnerability, and 2 are major code smells.

Findings 3: In conclusion, the introduction of a KL-SATD comment seems to mainly have a positive relationship with the appearance of code smells (RQ\(_3\)). The issues vary in their severity from minor to critical.

For the removal of KL-SATD comments, we follow the same procedure as described above for their introduction. The results are summarized in Table 11, where we again present only the statistically significant predictors in order to save space.

Table 11 Issues for KL-SATD removal - Model 8 (RQ\(_4\))

The results from the random effects show that the largest source of variance was found in the files within the projects (6.633e\(-\)01) and that the next-largest source was again the projects themselves (3.414e\(-\)02). Model 8 in Table 11 lists a total of 22 issues that are deemed statistically significant for KL-SATD removal. Of these, 9 have a positive relationship with KL-SATD removal, while 13 have a negative relationship. Table 12 lists these issues, along with their relationship (positive or negative), the normalized odds ratio, their severity, and their type.

Table 12 Issue summary in relation to removal of KL-SATD (RQ\(_4\))

The issues are again overwhelmingly classified as code smells, with 2 issues labeled as vulnerabilities. The severity of the issues again varies a lot. A total of 9 issues are labeled as minor in their severity, 8 as major, 3 as critical, and 2 as blocker. The severity of the issues with positive relationships is as follows: 4 are minor, 3 are major, and 2 are critical. Of the issues that have a negative relationship, 5 are minor, 5 are major, 1 is critical, and 2 are blockers.

Findings 4: In conclusion, the removal of a KL-SATD comment seems to have a mixed relationship with code smells (RQ\(_4\)), as there are almost equal amounts of positive and negative relationships. A slight majority of the code smells have a negative relationship with KL-SATD removal. The issues vary in their severity from minor to blocker.

4.3 Qualitative analysis of keyword-labeled self-admitted technical debt comments and sonar issues (RQ\(_5\))

Our work relating to SonarQube issues is based on a dataset with file-level granularity. To investigate whether these issues have a KL-SATD comment in their context and whether that comment addresses that particular issue, we performed an annotation and analysis on a randomly selected sample of the KL-SATD-introducing comments. We labeled each sampled comment according to whether it was in the context of an issue and whether it addressed that particular issue. There can be cases where the comment is, e.g., inside a very deeply nested structure of statements but, rather than making a note of this, talks about something else in the code.

4.3.1 Annotation results

A total of 138 comments were in the context of at least one SonarQube issue, and a total of 56 addressed the issue they were in the context of. As a result, a total of 35.84% (0.3584) of the KL-SATD comments are in the context of a SonarQube issue with a 95% confidence interval of (0.3104, 0.4064). Similarly, 14.55% (0.1455) of KL-SATD comments addressed at least one SonarQube issue with a 95% confidence interval of (0.1103, 0.1807). Table 13 lists all the SonarQube issues that had a KL-SATD comment in their context. The first column describes the issue, and the second column shows how many times that issue was present in the random sample. The third column shows how many times that issue's context had a KL-SATD comment, while the fourth column shows how many times the KL-SATD comments addressed the issue they were in the context of. It was possible for a KL-SATD comment to be in the context of several issues and for the comment to address none, one, or more than one of them.
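The reported proportions and their normal-approximation 95% confidence intervals can be reproduced with the short R sketch below.

    # Normal-approximation 95% confidence interval for a sample proportion.
    prop_ci <- function(successes, n, z = 1.96) {
      p  <- successes / n
      se <- sqrt(p * (1 - p) / n)
      c(estimate = p, lower = p - z * se, upper = p + z * se)
    }

    prop_ci(138, 385)   # in the context of an issue: ~0.358 (0.310, 0.406)
    prop_ci(56, 385)    # addresses an issue:         ~0.145 (0.110, 0.181)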

Table 13 Related issues and addressing issues

The most frequent issue that had a KL-SATD comment in its context is the one addressing the complexity of a method. This is explained by our annotation instructions, where a KL-SATD comment is considered to be in the context of an issue when that comment resides within the issue. Therefore, whenever a comment lies within a complex method, it gets tagged as being in that issue's context. However, only two of the comments addressed the complexity. The second most frequent issue that had KL-SATD comments in its context dealt with commented-out code lines. Here, a large number of the KL-SATD comments also addressed this issue, meaning that the developers left a note regarding the removal or alteration of the code snippet in question.

On the other hand, numerous SonarQube issues do not have any KL-SATD comments in their context. There can be several explanations for this. The first one is that many of these issues go unnoticed during development work, which in turn can be due to several possible scenarios. First, SonarQube produces a lot of issues for a commit, which can be seen as noise by the developers; noise is one of the reasons developers dislike using static analysis tools, e.g., for bug hunting (Johnson et al., 2013). Second, the developers may be using another tool for static analysis. Other tools can be configured differently, and issues reported by SonarQube either have different definitions in them or are not present at all. Lastly, developers might not use static analysis tools at all.

The second reason why many issues are not marked with a KL-SATD comment might be that developers do not deem it necessary to mark them. This, however, can lead to a situation where a lot of technical debt goes unnoticed and is ignored when it should be dealt with.

4.3.2 Comments which are in the context of an issue and address it

As shown in Table 13, nearly 36% of the KL-SATD comments were in the context of an issue, while almost 15% addressed at least one of the issues. Looking at the comments from both of these cases can shed light on their differences and on the possible hidden qualities of the KL-SATD comments that lie in the context of a SonarQube issue without addressing it.

The KL-SATD comments addressing an issue were relatively short, with the exception of commented-out code lines. Here the KL-SATD comment is combined with the commented-out code, as they are located adjacent to each other. The comments were also very concrete in nature, using language like “check category,” “write test,” “throw an exception,” “fix exception,” “evaluate other error codes,” and so on. There were also cases where a catch block or method was generated automatically, resulting in issues of empty nested blocks of code, where catch blocks did not throw the right exception or just printed a stack trace, and so on.

The KL-SATD comments which are in the context of an issue but do not address it are also relatively short. They also use concise language like “fix config versions,” “allow repository overriding,” “validate against meta data,” and “Don't add port.” However, these comments seem to address deeper problems in the code, rather than just formal ones. There are also comments that point out future plans, such as “once JDK 8+ becomes the minimum for this project, make it a default method instead of this class,” and “this is currently not supported. We may wish to add this support in the future.” The KL-SATD comments which are in the context of an issue but do not address it therefore seem to deal with deeper and more complicated problems, future plans, or other things which cannot be captured with static analysis tools.

5 Discussions

In this section, we discuss our findings in more detail. We start with the analysis of technical debt and the two remediation efforts for both KL-SATD introduction and removal. We then take a closer look at the issues chosen as predictors on both project and file levels; specifically, we compare the similarities of the predictors for KL-SATD introduction and removal. We also performed a qualitative analysis of SonarQube issues and KL-SATD comments to see whether the comments were in the context of specific issues and whether they addressed them. Lastly, we present the implications of our work for practitioners and researchers.

5.1 Detecting keyword-labeled self-admitted technical debt comment from sonar measures

The analyses done with a generalized linear mixed model show that the sqale index is related to both KL-SATD introduction and removal. The best-performing model for KL-SATD introduction assigned the sqale index a statistically significant positive relationship. For KL-SATD removal, the sqale index again had a statistically significant relationship, but this time it was negative. Reliability remediation effort was present in KL-SATD removal, where it had a statistically significant negative relationship.

The results indicate that changes in sqale index seem to capture the changes in KL-SATD introduction and removal. When the sqale index increases, the odds of having a new KL-SATD comment increase also, and vice versa for the decrease. This connects KL-SATD comment appearance and disappearance to code smells and how they are introduced and removed from the projects. Reliability remediation effort had a statistically significant negative relationship with KL-SATD removal. This ties KL-SATD removal to fixing bugs. But whether bugs are fixed because there are KL-SATD comments, or if the comments get removed as a side-effect of bug fixing, we cannot say.

The possible problem with all of these metrics lies in their granularity, as they are only available at the project level and not at the file level. Therefore, inserting a KL-SATD comment into one file in a commit might not generate enough change in the metric, while at the same time all the other changes made to other files can cause unexpected swings in the metric.

The lines of code metric was found to have a statistically significant positive relationship with KL-SATD removal. This means that as the projects age and grow, KL-SATD comments get removed from them.

As for KL-SATD introduction, the lines of code metric cannot be considered statistically significant. The reason lies in its volatility depending on which other metrics were included in the model. When the sqale index was in the model, lines of code had a negative relationship; when it was not, lines of code had a positive relationship with KL-SATD introduction. Due to this change from positive to negative, we cannot consider lines of code a reliable metric for KL-SATD introduction.

5.2 Detecting keyword-labeled self-admitted technical debt comment from sonar issues

The results when looking at KL-SATD introduction and removal showed that they were almost exclusively connected to issues labeled as code smells. In both cases, the severity of the issues varied a lot from minor to critical or even blocker.

Shared issues for introduction and removal

Looking at the issues identified when KL-SATD comments were introduced and when they were removed, we see that they share some issues. Table 14 lists all of the shared issues as well as the direction of their relationship, whether positive or negative.

Table 14 Shared issues between KL-SATD introduction and removal

There are in total 6 shared issues between KL-SATD introduction and removal. All of them had the same direction in their relationship with both KL-SATD introduction and removal, which means that they correlate with KL-SATD activity in general: whether a KL-SATD comment was added or removed cannot be told from these issues alone.

The issues themselves seem to deal with complexity, error handling, dead code, bloated code, and coding conventions. These seem to indicate common problems, which can appear when KL-SATD comments are introduced or deleted. Therefore, it is possible that when introducing new functionality, the developer creates a class that is too complex. Similarly, when a developer removes a KL-SATD comment and fixes that issue, it is possible that they inadvertently create too much complexity in a class. This is merely an example and requires more in-depth analysis in future work.

The shared issues show that simply utilizing the appearance or disappearance of specific issues for predicting KL-SATD introduction or removal in a project might not be a very accurate method. It is important always to consider both cases to see the similarities and differences, and these might vary from project to project. However, even if we cannot use the appearance or disappearance of issues for predictions, we can still infer information from them.

5.3 Overlap between KL-SATD comments and SonarQube issues

It appears that SonarQube issues and KL-SATD comments have limited overlap, i.e., they are complementary to each other, or to a large degree mutually exclusive. We found that while 36% of KL-SATD comments are within the context of SonarQube issues, only 15% of KL-SATD comments address a particular SonarQube issue. This means that 85% of the KL-SATD comments in our study are beyond the detection capabilities of SonarQube and that only 15% of KL-SATD comments report the same thing that SonarQube is reporting. Therefore, to get a holistic picture of code maintainability, one should look at both KL-SATD comments and SonarQube issues, as they highlight different aspects of code maintainability. This finding is in line with past studies on software defect detection showing that peer reviews, analysis tools, and testing find different defect types (Basili & Selby, 1987; Boehm & Basili, 2001; Wagner et al., 2005). Even studies focusing solely on software testing have shown that different techniques (Leon & Podgurski, 2003) and even different individuals (Mäntylä & Itkonen, 2013; Farooq & Quadri, 2013) find different defects. So although the finding that SonarQube issues and KL-SATD comments have limited overlap may at first sound surprising, after considering the past work in different sub-topics of software engineering, it becomes reasonable or perhaps even expected.

One may wonder how we can find a statistically significant relationship at the file level between SonarQube reports and KL-SATD comments while also finding limited overlap in a more detailed analysis. We think there might be a latent (hidden) factor explaining both the creation of KL-SATD and SonarQube issues. Usual suspects for such latent factors would be time pressure during development, lack of developer knowledge, or software evolution that causes the software system to become increasingly complex according to Lehman's laws. Thus, both SonarQube reports and KL-SATD would simply be manifestations of this latent factor, but future work is needed to address this hypothesis.

5.4 Implications for practitioners

Our work presents practitioners with an overview of how KL-SATD is related to technical debt. The key findings are that KL-SATD instances are mainly correlated with code smells of varying severity and that there is a gap between KL-SATD comments and technical debt detected via static analysis tools.

The mixed model analysis shows that the appearance of KL-SATD comments has a positive relationship with code smells. This means that whenever a developer sees a KL-SATD comment, it is likely correlated with a code smell rather than with a vulnerability or a bug. This can mean that KL-SATD comments do not necessarily present an immediate threat to the development. However, the classification of SonarQube issues with regard to whether or not they are bug-inducing has been called into question in earlier research by Lenarduzzi et al. (2020). Therefore, they should still be dealt with swiftly to retain a high standard of software quality.

The deletion of KL-SATD comments does not necessarily lead to better quality in code. This is evident from the mixed model results, where some issues had a positive relationship with KL-SATD removal. Therefore, practitioners should in general be mindful of not introducing new issues while repairing the existing problems and should pay extra attention to the issues reported here.

As shown by the qualitative analysis, there is a substantial gap between SonarQube issues and KL-SATD comments when considering whether the comments are in the context of a specific issue. There is an additional gap between the issues and whether the comment addresses that issue or not.

There is a possibility that this is intentional, and rather than focusing on commenting code smells, the developers instead focus on the functionality of the software. The notable exception in this matter is the code lines that are commented out. Nearly 30% of commented-out code lines had KL-SATD comments in their context, and from these almost all also addressed this particular smell. Typically, the comment was referring to removing the old code after it was deemed unnecessary and therefore commented out.

Nonetheless, as discussed earlier, even if SATD comments and code metrics do not necessarily point to the exact same spot, the emergence of code smells can indicate that other quality issues are present as well.

As an overview, the most direct takeaway for practitioners is that a lot of SonarQube issues are not addressed with KL-SATD comments in any way and can therefore even be invisible to the developers. The visibility of TD issues depends on the analysis tools used, which can differ substantially from one another.

5.5 Implications for researchers

The results of our study can aid researchers to establish new research topics related to technical debt and static analysis tools. Our work shows that KL-SATD comments are primarily correlated with code smells, as evidenced by the mixed model results. However, at the same time care should be taken when creating such models using static analysis tools, as there can be overlap with the results when considering whether the KL-SATD comment was added or removed.

Our qualitative analysis shows that there is a large gap between KL-SATD comments and SonarQube issues. Only around 36% of KL-SATD comments are in the context of a SonarQube issue, and 15% directly address one of the issues they are in the context of. This indicates that the essence of KL-SATD comments cannot be fully captured by analyzing the metrics produced by static analysis tools. Therefore, utilizing purely static analysis tools to create, e.g., a machine learning model for automatic KL-SATD comment creation to increase TD visibility does not necessarily lead to the best results. This calls for further research and new ways of analyzing the relationship between KL-SATD comments and source code.

6 Threats to validity

In this section, we discuss the threats to validity, including internal, external, construct validity, and reliability.

Construct validity

Threats relating to construct validity deal with the theory and observations. SonarQube is one of the most adopted static analysis tools by developers (Vassallo et al., 2019; Avgeriou et al., 2021). To conduct our research, we used a large technical debt dataset published previously by Lenarduzzi et al. (2019b). For a full description of the dataset, we refer the reader to the original paper. We cannot completely exclude the presence of false positives or false negatives in the detected warnings of that dataset; further analyses on these aspects are part of our future research agenda. As for code smells, the dataset was created employing a manually-validated oracle, hence avoiding possible issues due to the presence of false positives and negatives.

The dataset was created using SonarQube's default settings, and there exists the possibility that individual projects would have benefited from individually tuned settings. Furthermore, the projects present in the dataset did not use SonarQube in their development when the data was gathered and analyzed. Therefore, the results do not reflect how developers would work when using SonarQube; rather, they show how developers using different static analysis tools, or no tools at all, would work, and how SonarQube would reflect on their work. Previous research papers suffer from this same issue; see, e.g., Palomba et al. (2018).

Internal validity

The factors of internal validity deal with possible issues within our study. For the data analysis, the missing values in the dataset present a possible threat to validity. As mentioned in Section 3, not all of the commits were successfully analyzed by SonarQube. This means that the dataset has commits with missing values, which can affect the analyses.

The second threat to internal validity comes from a specific extract operation in which a developer moves a KL-SATD comment, or several comments, from one file to another. To avoid this duplication, we consider the cases where a file has been renamed within a commit. Cases where a piece of code, along with its SATD comment, has been copied to a completely different file are not accounted for. Tracking this copying is not trivial and would warrant a complete research paper on its own.

There can also be theoretical cases where a developer changes a large number of lines in one method to fix bugs and removes only one SATD comment in a different method, because they notice that the SATD comment was not removed in a previous commit. In such a case, the SATD comment would not be related to the lines used for the calculation of the sqale index, the reliability remediation effort, and the security remediation effort. Tackling the issue of whether KL-SATD comments were removed this way is out of the scope of this study.

External validity

The external validity threats deal with issues relating to the generalization of the results. Our study considered the 33 Java open-source software projects, with different scopes and characteristics, included in the technical debt dataset. All 33 projects are part of the Apache Software Foundation, which incubates only systems that follow specific and strict quality rules. Our empirical study was therefore not based on a single application domain: the selected projects stem from a very large set of application domains, ranging from external libraries, frameworks, and web utilities to large computational infrastructures.

The dataset only included Java projects. We are aware that different programming languages and projects at different maturity levels could yield different results.

Conclusion validity

Finally, threats relating to conclusion validity deal with issues between the experiment and the results. We adopted the generalized mixed model as our analysis tool, as it has previously been used successfully in SATD-related tasks, such as investigating refactoring actions and SATD removals by Iammarino et al. (2021). We also addressed possible issues due to the multi-collinearity of the commit-level metrics and possible data imbalance problems due to the different sizes of the projects. Lastly, we employed AIC and the evidence ratio as our measures for evaluating the different models. As noted in the earlier literature (Kenneth & David, 2002), it is the relative values of the models that are important, not the individual AIC values, as these can vary considerably. We recognize that using other statistical or machine-learning techniques might affect the results.
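For clarity, the relative comparison we rely on can be summarized with the standard information-theoretic formulation (the symbols below are generic and not tied to a specific model in our study). For candidate models $i = 1, \dots, R$:

\[
\Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min}, \qquad
w_i = \frac{\exp(-\Delta_i/2)}{\sum_{r=1}^{R} \exp(-\Delta_r/2)}, \qquad
\mathrm{ER}_i = \frac{w_{\mathrm{best}}}{w_i} = \exp(\Delta_i/2),
\]

where $\Delta_i$ is the AIC difference with respect to the best (minimum-AIC) model, $w_i$ is the Akaike weight, and $\mathrm{ER}_i$ is the evidence ratio of the best model against model $i$. Only these relative quantities, not the raw AIC values, are interpreted.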

7 Related works

Here, we introduce related work and compare our results to it. We start with code-related technical debt and self-admitted technical debt. Then, we compare our work to prior studies on SATD and source code metrics and on TD issue classification.

7.1 Code technical debt

Code technical debt has been investigated from the point of view of approaches and strategies (Seaman & Guo, 2011; Zazworka et al., 2013; Guo et al., 2016; Lenarduzzi et al., 2019a), and of how to measure it (Tollin et al., 2017; Digkas et al., 2018; Saarimäki et al., 2019). Code technical debt is detectable by different automated static analysis tools (Avgeriou et al., 2021), such as SonarQube, CAST, Sonargraph, or NDepend. SonarQube is one of the automated static analysis tools most widely adopted by developers in industry (Vassallo et al., 2019; Lenarduzzi et al., 2020; Avgeriou et al., 2021).

However, only a few works have estimated technical debt based on SonarQube rules, focusing on change- and fault-proneness (Falessi et al., 2017; Tollin et al., 2017; Lenarduzzi et al., 2020). Other works have investigated the diffuseness of technical debt measured by SonarQube (Digkas et al., 2017, 2018; Saarimäki et al., 2019).

The largest percentage of technical debt repayment comes from a small subset of issue types (Digkas et al., 2018), and the most frequently introduced technical debt items are related to low-level coding issues (Saarimäki et al., 2019).

SonarQube technical debt items detected at the class level have a negative influence, increasing change-proneness (Tollin et al., 2017; Lenarduzzi et al., 2020).

Considering the different types and severity assigned by SonarQube to the technical debt items, there is no significant difference between the clean and infected classes (Lenarduzzi et al., 2020). All the technical debt items have a statistically significant but very small effect on change-proneness (Lenarduzzi et al., 2020). However, all the technical debt items classified as Code Smell affect change-proneness, even if their impact on the change-proneness is very low (Lenarduzzi et al., 2020).

Considering fault-proneness, there is likewise no significant difference. Among the technical debt items that SonarQube claims to increase fault-proneness (classified as Bugs), only one out of 36 has an effect, and a very limited one at that (Lenarduzzi et al., 2020), while 26 hardly ever led to failure. Unexpectedly, all the remaining Bugs resulted in a slight increase in change-proneness instead (Lenarduzzi et al., 2020).

Moreover, by removing technical debt items, developers can prevent 20% of the faults in the source code (Falessi et al., 2017).

However, when comparing the effort developers actually need to repay technical debt with the effort estimated by SonarQube, the tool generally overestimates the remediation time needed to patch technical debt items. The most accurate estimations relate to code smells, while the least accurate relate to Bugs (Saarimaki et al., 2019; Baldassarre et al., 2020).

7.2 Self-admitted technical debt

Self-admitted technical debt (SATD) has been investigated by several researchers, especially in recent years. Research has concentrated on the presence and removal of SATD; the vast majority of SATD research has focused on detection, aimed at improving SATD comprehension, and on its repayment (Sierra et al., 2019).

Some works have investigated SATD introduction from different points of view (Potdar & Shihab, 2014; Bavota & Russo, 2016). SATD is generally introduced by senior developers, appears in less than 30% of the code, and less than 63% of it is removed. Moreover, SATD increases over time and tends to survive for a long time in the system (Bavota & Russo, 2016), and in almost 60% of the cases it refers to design flaws (Xavier et al., 2020).

Looking at removal, SATD is removed unintentionally in 20–50% of the cases, but only 8% of the removals are reported in commit messages (Zampetti et al., 2018). Moreover, SATD is removed over a period ranging between 18 and 172 days (Maldonado et al., 2017), or after 872.3 h (Li et al., 2020). The removal is performed by the same person responsible for the introduction (Maldonado et al., 2017; Li et al., 2020).

Research also focused on SATD detection approaches based on mining the source code comments (de Freitas Farias et al., 2015; Huang et al., 2017) or natural language processing (NLP) (Maldonado & Shihab, 2015). Other approaches adopted machine learning models for automating SATD detection (Liu et al., 2018; Flisar & Podgorelec, 2019; Ren et al., 2019). These studies revealed promising results. For example, convolutional neural networks can detect SATD with an average accuracy of 73% (Zampetti et al., 2020).

Despite the research attention on SATD admitted by developers in source code comments, SATD in issue trackers remains largely unexplored (Li et al., 2020).

7.3 Comparison to prior works

Here, we look at prior work relating to SATD and source code metrics, and to TD issue classification, and compare our findings to it.

SATD and source code metrics

A study by Zampetti et al. (2017) examined a machine learning tool for SATD recommendations across 9 different projects. They found that, for within-project predictions, the top features for all projects included source code metrics such as readability and lines of code. Our study ties the appearance of SATD to metrics related to code smells.

A recent study by Iammarino et al. (2021) investigated how refactoring actions co-occur with SATD removal. The aim was to investigate whether SATD is removed when changes intended to improve code quality, in this case refactoring actions, are made to projects. They built a logistic regression mixed-effect generalized linear model (GLM) and used SATD removal as the dependent variable. Lines of code and other quality metrics, such as depth of inheritance tree, coupling between objects, and lack of cohesion of methods, were used as independent variables. The results show that increases in the coupling between objects metric and in lines of code had a positive relationship with SATD removal. Our study confirms the latter part, as lines of code had a positive relationship with SATD removal in our model.
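Schematically, such a logistic mixed-effects model can be written as follows (a generic formulation with illustrative predictors, not the exact specification used by Iammarino et al. or by us):

\[
\log \frac{P(\mathrm{removal}_{ij} = 1)}{1 - P(\mathrm{removal}_{ij} = 1)}
= \beta_0 + \beta_1\,\mathrm{LOC}_{ij} + \beta_2\,\mathrm{CBO}_{ij} + \dots + u_j,
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2),
\]

where $i$ indexes changes, $j$ indexes projects, the $\beta$ coefficients are the fixed effects of the metrics, and $u_j$ is a project-level random intercept that accounts for the different sizes and characteristics of the projects.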

Wehaibi et al. (2016) examined SATD’s impact on software quality. Their results show that the impact of this technical debt is not related to defects. Our results confirm this finding, as most of the issues were of the code smell type and none were labeled as bugs. Further, all issues with a positive relationship towards KL-SATD introduction or removal were of the code smell type.

Technical debt issue classification

Our classification of issues into different categories was similar to the work done by Tan et al. (2020), but with a couple of notable differences. First, they excluded issues labeled as minor in severity, claiming that these issues are too trivial and therefore have low impact. They also note that, for this reason, developers might not treat these issues as technical debt. We do not exclude minor issues in our study, as we deemed them too important to be excluded. First, excluding minor code issues contradicts previous research: Li et al. (2015) criticized in their mapping study one research paper that performed a similar exclusion, noting that this goes against the existing technical debt literature. Second, the severity categorizations for the rules listed by SonarQube can be questioned. Take as an example the rule “Lines should not be too long,” which appears in the work by Tan et al. (2020). SonarQube’s documentation explains this major smell in the following way: “Having to scroll horizontally makes it harder to get a quick overview and understanding of any piece of code.” Whether or not this is a major code smell is debatable, as the same severity class also contains issues relating to complex classes and long methods. For these reasons, we did not exclude any code smells based on their severity listing in SonarQube.

A further difference is that Tan et al. (2020) analyzed projects written in Python, whereas we analyzed projects written in Java. SonarQube lists different rules for each programming language, so some of the issues listed for one language might not exist for others. Still, there is an overlap between our issues and the issues listed by Tan et al. (2020); the overlapping issues are as follows:

  • Statements should be on separate lines

  • Collapsible “if” statements should be merged

  • Nested blocks of code should not be left empty

In light of this, we can deduce that certain types of SATD transcend language barriers. These are related to code debt, more specifically to the coding conventions and bloat categories. Rather than being tied to a specific programming language, each of these rules can be applied to several languages, respecting the conventions of each.
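To make these rules concrete, the following minimal Java sketch (hypothetical code, not taken from the studied projects) shows a violation of each of the three overlapping rules:

    public class OverlapExample {

        // "Statements should be on separate lines": several statements on one line.
        void separateLines(int x) {
            int a = x; int b = a + 1; System.out.println(a + b);
        }

        // "Collapsible 'if' statements should be merged": the two conditions
        // could be combined into a single if (x > 0 && y > 0).
        void collapsibleIf(int x, int y) {
            if (x > 0) {
                if (y > 0) {
                    System.out.println("both positive");
                }
            }
        }

        // "Nested blocks of code should not be left empty": an empty block,
        // e.g., left behind after old code was deleted.
        void emptyBlock(int x) {
            if (x > 0) {
            }
        }
    }

Each violation concerns layout or leftover structure rather than language-specific semantics, which is why equivalent rules exist for both Java and Python.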

SATD, issues, and context

Our definition of when KL-SATD comments lie in the context of an issue heavily affects the results of the qualitative analysis. We considered a KL-SATD comment to be in the context of an issue when it was either inside a multi-line issue or directly before or after either a multi-line or single-line issue. Previous research that has also considered this question of context between SATD and SonarQube issues is the work done by de Lima et al. (2022). They define issues located in the same context as a SATD comment to be associated with it. This context is defined further by requiring that the issue and the SATD comment lie within the same code block, including nested code blocks. Figure 2 shows the definition of context as defined by de Lima et al. (2022). Every context is marked with a different color.

Fig. 2  Context as defined by de Lima et al. (2022)

Within Fig. 2, we have two SATD comments, which are located on lines 23 and 33. According to de Lima et al. (2022), the context for the SATD comment on line 33 is Code Block 1.1.1 spanning lines 30 to 34, and the issues within those code lines are associated with that SATD comment. The context for the SATD comment on line 23 is the whole Code Block 1 including all the nested Code Blocks 1.1, 1.1.1, and 1.2. This means that all the issues within lines 21 to 41 are associated with the SATD comment on line 23.

This definition of the context of issues is crucially different from ours, and we will now discuss the key differences. Looking at the KL-SATD comment on line 33, we would only consider the single-line issue related to Instruction 1.1.1.2, or multi-line issues that span the whole of Code Blocks 1.1.1, 1.1, or 1, as the KL-SATD comment would then be counted as lying within that multi-line issue. The rules by de Lima et al. (2022) would instead connect all issues within Code Blocks 1.1.1, 1.1, or 1 to this KL-SATD comment. The key difference is that if there is a single-line issue, e.g., a naming issue on line 27, we would not consider the KL-SATD comment to be in the context of that issue, while de Lima et al. (2022) would consider it to fall in that issue’s context.

Looking at the KL-SATD comment on line 23, the difference in contexts is even greater. With our definition of context, we would only consider single-line issues related to Instructions 1.1 and 1.2, and multi-line issues that start on line 21 or 22 and continue at least until line 24. With the definition by de Lima et al. (2022), all the issues present in all of the presented Code Blocks are considered to be in the context of the KL-SATD comment. This means that issues present in Code Blocks 1.1, 1.1.1, and 1.2 are also considered, even when they most likely have nothing to do with the KL-SATD comment on line 23.

We wanted to avoid this problem, and therefore our definition of when a KL-SATD comment is in the context of an issue is deliberately very strict.
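As a hypothetical illustration of this strict definition (the code and the issues mentioned in its comments are invented for illustration and are unrelated to Fig. 2 or the studied projects):

    import java.util.List;

    public class ContextExample {

        public int sum(List<Integer> values) {
            // FIXME: handle null elements properly   <-- KL-SATD comment
            int total = 0;        // a single-line issue reported on this line would be
                                  // "in context", because the comment is directly before it
            for (Integer v : values) {
                total += v;
            }
            return total;         // an issue reported only on this line would NOT be in
                                  // context, unless it is a multi-line issue that also
                                  // spans the KL-SATD comment
        }
    }

Under the definition by de Lima et al. (2022), by contrast, every issue anywhere inside the enclosing code block would be associated with the comment.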

8 Conclusions

In our research, we sought answers to the research questions about the impact of KL-SATD introduction and removal in source code. Our work has the following main contributions:

  • Discovering commit-level metrics related to KL-SATD introduction and removal

  • Discovering, at the file level, the types and severity of issues linked to KL-SATD introduction and removal

  • Performing a qualitative analysis on whether the KL-SATD comments are in the context of a SonarQube issue and whether they address these issues directly

Our analyses show that KL-SATD introduction is mainly related to code smells. This is evident from the relationship the sqale index has with both KL-SATD introduction and removal, as well as from the types of issues related to KL-SATD introduction and removal. The sqale index measures the effort of fixing the code smells in a project, and it had a positive relationship with KL-SATD introduction and a negative one with KL-SATD removal. Thus, we can conclude that the introduction of KL-SATD comments is connected to a worsening of the maintainability of a project, while removing KL-SATD comments improves the project’s maintainability rating. In addition, removal had a positive relationship with the reliability remediation effort, connecting the removal of KL-SATD with the fixing of bugs.

Secondly, nearly all of the issues related to KL-SATD were of the code smell type. For KL-SATD introduction, only one predictor was of the vulnerability type, while KL-SATD removal had two vulnerability-class issues as predictors. There were no bug-classified predictors, which again points to the fact that KL-SATD comments are mainly related to the maintainability of a project.

Thirdly, Tables 10 and 12 list the issues and their relationships with either KL-SATD introduction or removal. While not claiming causality, developers should be aware of the issues with positive relationships to these actions. They represent the rules that are violated during KL-SATD introduction or removal and therefore point to possible future problems in the development. In particular, the issues introduced (positive relationship) while removing KL-SATD comments can act as a generic guide on what to avoid when removing KL-SATD comments.

The qualitative analysis revealed that nearly 36% of KL-SATD comments were in the context of a SonarQube issue, while 15% addressed at least one of these issues. The addressing comments were generally short and provided actionable guidance on what should be done. The comments that did not address any issue instead dealt with more complicated matters in the code, future plans, or other matters that go beyond what can be captured using static analysis tools.

For future work, we aim to look in more detail at how KL-SATD introduction and removal differ from one another. This will be done by looking more closely at the appearance and disappearance of KL-SATD comments, the related changes performed in the code, and the metrics pertaining to these changes. There were also several code smells with a positive relationship with KL-SATD removal, which is also worth investigating in the future, as it might show that not all KL-SATD removal activities have purely positive effects. Lastly, expanding the qualitative analysis into a more extensive and in-depth one would shed even more light on the complex relationship between KL-SATD comments and code metrics. What are the reasons why some issues are commented on and others are not? Is there a way to capture the essence of KL-SATD comments, and could we use it to create a manually annotated oracle on the topic for machine-learning purposes? These are all avenues for future research.