Wait for it: identifying “On-Hold” self-admitted technical debt

Self-admitted technical debt refers to situations where a software developer knows that their current implementation is not optimal and indicates this using a source code comment. In this work, we hypothesize that it is possible to develop automated techniques to understand a subset of these comments in more detail, and to propose tool support that can help developers manage self-admitted technical debt more effectively. Based on a qualitative study of 333 comments indicating self-admitted technical debt, we first identify one particular class of debt amenable to automated management: on-hold self-admitted technical debt (on-hold SATD), i.e., debt which contains a condition to indicate that a developer is waiting for a certain event or an updated functionality having been implemented elsewhere. We then design and evaluate an automated classifier which can identify these on-hold instances with an area under the receiver operating characteristic curve (AUC) of 0.98 as well as detect the specific conditions that developers are waiting for. Our work presents a first step towards automated tool support that is able to indicate when certain instances of self-admitted technical debt are ready to be addressed.


I. INTRODUCTION AND MOTIVATION
The metaphor of technical debt is used to describe the tradeoff many software developers face when developing software: how to balance near-term value with long-term quality [1]. Practitioners use the term technical debt as a synonym for "shortcut for expediency" [2] as well as to refer to bad code and inadequate refactoring [3]. Technical debt is widespread in the software domain and can cause increased software maintenance costs as well as decreased software quality [4].
In many cases, developers know when they are about to cause technical debt, and they leave documentation to indicate its presence [5]. This documentation often comes in the form of source code comments, such as "TODO: This method is too complex, lets break it up" and "TODO no methods yet for getClassname" (examples from ArgoUML and Apache Ant, respectively; see [5]). Previous work [6] has explored the use of visualization to support the discovery and removal of self-admitted technical debt, incorporating gamification mechanisms to motivate developers to contribute to the debt removal. Current research is largely focused on the detection and classification of self-admitted technical debt, but falls short in addressing it.
Previous work [5] has developed an approach based on natural language processing to automatically detect self-admitted technical debt comments and to classify them into either design or requirement debt. Self-admitted design debt encompasses comments that indicate problems with the design of the code, while self-admitted requirement debt includes all comments that convey the opinion of a developer suggesting that the implementation of a requirement is incomplete. In general terms, design debt can be resolved by refactoring, whereas requirement debt indicates the need for new code.
In this work, we hypothesize that it is possible to use automated techniques based on natural language processing to understand a subset of the technical debt categories identified in previous work in more detail, and to propose tool support that can help developers manage self-admitted technical debt more effectively. We make three contributions:
• A qualitative study of the removal of self-admitted technical debt. To understand what kinds of technical debt could be addressed or managed automatically, we annotated a statistically representative sample of instances of self-admitted technical debt removal from the data set made available by the authors of previous work [7]. While the focus of our annotators was on the identification of instances of self-admitted technical debt that could be automatically addressed, as part of this annotation, we also performed a partial replication of recent work by Zampetti et al. [8], who found that a large percentage of self-admitted technical debt removals occur accidentally. We were able to confirm this finding: in 58% of the cases in our sample, the self-admitted technical debt was not actually addressed, but the admission was simply removed.
• The definition of "on-hold" self-admitted technical debt. Our annotation revealed one particular class of self-admitted technical debt amenable to automated management: "on-hold" self-admitted technical debt. We define "on-hold" self-admitted technical debt as self-admitted technical debt which contains a condition to indicate that a developer is waiting for a certain event or an updated functionality having been implemented elsewhere. Figure 1 shows an example of "on-hold" self-admitted technical debt from the Apache Camel project. The developer is waiting for an external event (the visibility of doParse() changing or an external bug being resolved) and the comment admitting the debt is therefore "on hold".
• The design and evaluation of a classifier for self-admitted technical debt. Since software developers must keep track of many events and updates in any software ecosystem, it is unrealistic to assume that developers will be able to keep track of all self-admitted technical debt, and of events that signal that certain self-admitted technical debt is now ready to be addressed. To support developers in managing self-admitted technical debt, we designed a classifier which can automatically identify those instances of self-admitted technical debt which are "on hold", and detect the specific events that developers are waiting for. Our classifier achieves a precision of 0.81 (F1-score: 0.72) for the identification, and 84% of the specific conditions are detected correctly.

    // TODO the following code is copied from AbstractSimpleBeanDefinitionParser
    // it can be removed if ever the doParse() method is not final!
    // or the Spring bug http://jira.springframework.org/browse/SPR-4599 is resolved
Fig. 1. Motivating Example, cf. https://github.com/apache/camel/blob/53177d55053a42f6fd33434895c60615713f4b78/components/camel-spring/src/main/java/org/apache/camel/spring/handler/BeanDefinitionParser.java

This is a first step towards automated tool support that can recommend to developers when certain instances of self-admitted technical debt are ready to be addressed. The remainder of this paper is structured as follows: In Section II, we present our research questions and the methods that we used for collecting and analyzing data for the qualitative study. The findings from this qualitative study are presented in Section III. Section IV describes the design of our classifier to identify "on-hold" self-admitted technical debt, and we present the results of our evaluation of the classifier in Section V. Section VI discusses the implications of this work, before Section VII highlights the threats to validity and Section VIII summarizes related work. Section IX outlines the conclusions and highlights opportunities for future work.

II. RESEARCH METHODOLOGY
In this section, we detail our research questions as well as the methods for data collection and analysis used in our qualitative study. The methods for designing and evaluating our classifier are detailed in Sections IV and V. We also describe the data provided in our online appendix.

A. Research Questions
Our research questions focus on identifying how self-admitted technical debt is typically removed and whether the fixes applied to this debt could be applied to address similar debt in other projects. To guide our work, we first ask about the different kinds of self-admitted technical debt that can be found in our data (RQ1.1), whether the commits which remove the corresponding comments actually fix the debt (RQ1.2), and if so, what kind of fix has been applied (RQ1.3). To understand the removal in more detail, we also investigate whether the removal was the primary reason for the commit (RQ1.4), before investigating the subset of self-admitted technical debt that could be managed automatically (RQ1.5). Based on the definition of "on-hold" self-admitted technical debt which emerged from our qualitative study to answer these questions, we then investigate its prevalence (RQ1.6) and the accuracy of automated classifiers to identify this particular class of self-admitted technical debt (RQ2.1) and its specific sub-conditions (RQ2.2).

B. Data Collection
To answer our research questions, we used the data made available by Maldonado et al. [7] in an online appendix (see footnote 2), which contains 2,599 instances of a commit removing self-admitted technical debt, after removing duplicates. The first two columns of Table I show the number of commits for each of the five projects available in this data set. Based on this data set of commits which removed a comment indicating self-admitted technical debt, we created a statistically representative and random sample (confidence level 95%, confidence interval 5) of 335 commits. The last column of Table I shows the number of commits from each project in our sample.
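The text does not state which formula was used to derive the sample size. As a minimal sketch, assuming Cochran's formula with a finite population correction, a z-score of 1.96 for the 95% confidence level, and maximum variability (p = 0.5), the reported sample size can be reproduced as follows. The function name and its parameters are illustrative and not taken from the study.

```python
import math


def sample_size(population: int, z: float = 1.96,
                margin: float = 0.05, p: float = 0.5) -> int:
    """Sample size via Cochran's formula with finite population correction.

    Assumptions (not stated in the paper): z = 1.96 for a 95% confidence
    level, margin = 0.05 for a confidence interval of 5, and p = 0.5.
    """
    # Sample size for an infinitely large population.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Correct for the finite population of commits.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(sample_size(2599))  # 335
```

Under these assumptions, a population of 2,599 commits yields 335, which matches the sample size reported above.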

C. Data Analysis
To answer our first research question "How do developers remove self-admitted technical debt?" and its sub-questions, we performed a qualitative study on the sample of 335 commits which had removed self-admitted technical debt according to the data provided by Maldonado et al. [7].
In the first step, the second and third author of this paper independently analyzed twenty commits from the sample to determine appropriate questions to be asked during the qualitative study, aiming to obtain insights into how developers remove self-admitted technical debt and to identify the kinds of debt that could be addressed or managed automatically. After several iterations and meetings, the second and third author agreed on seven questions that should be answered for each of the 335 commits during the qualitative study. These questions along with their motivation and answer ranges are shown in Table II.
The first author annotated all 335 commits following this annotation schema, and the second and third author annotated 50% of the data each, ensuring that each commit was annotated according to all seven questions by two researchers. Note that not all questions applied to all commits. For example, all instances which we classified as not representing self-admitted technical debt were not considered for further questions, and all commits which we classified as not fixing self-admitted technical debt were not considered for questions such as "Could the same fix be applied to similar Self-Admitted Technical Debt in a different project?".
After the annotation, the first three authors conducted multiple meetings in which they determined consistent coding schemes for the two questions which allowed for open answers and collaboratively resolved all disagreements in the annotation until reaching consensus on all ratings. We report the initial agreement for each question before the resolution of disagreements as part of our findings in the next section.
Fig. 2. Distribution of answers to "Does the comment represent Self-Admitted Technical Debt?". Initial agreement among the annotators before resolving disagreements: 94.33% across 335 comments.
2 http://das.encs.concordia.ca/uploads/2017/07/maldonado icsme2017.zip

D. Online Appendix
Our online appendix contains descriptive information on the 335 commits which were labeled as removing self-admitted technical debt according to Maldonado et al. [7] along with our qualitative annotations in response to the seven questions. The appendix is available at https://tinyurl.com/onholdSATD.

III. QUALITATIVE FINDINGS
In this section, we describe the findings derived from our qualitative study, separately for each sub-question for RQ1.

A. Initial Analysis
As shown in Figure 2, we found that not all commits which were automatically classified as removing self-admitted technical debt by the work of Maldonado et al. [7] actually removed a comment indicating debt. In some cases (9%, indicated as N/A in Figure 2), the comment was not removed but only edited, and in other cases (6%), the comment had been incorrectly tagged as self-admitted technical debt, e.g., in the case of "It is always a good idea to call this method when exiting an application".

B. RQ1.1 What kinds of self-admitted technical debt do developers indicate?
Our first research question explores the different kinds of self-admitted technical debt found in our sample. Figure 3 shows the final result of our coding after consolidating the coding schema. The two most common kinds of debt in our sample are requirements debt (44%, coded as "functionality needed") and design debt (17%, coded as "refactoring needed"). An example for the former is the comment "TODO handle known multi-value headers" while "XXX move message resources in this package" is an example for the latter. We also identified a number of clarification requests (15%), such as "TODO: why not use millis instead of nano?".
We coded self-admitted technical debt comments that explicitly stated that they were temporary as workaround (8%), e.g., "TODO this should subtract resource just assigned TEMPROARY". We identified some comments which indicated that the developer was waiting for something (5%), such as "TODO remove these methods if/when they are available in the base class!!!". We will focus our discussion on these comments in the next section. Finally, some comments which indicated technical debt describe bugs (4%, e.g., "TODO this causes errors on shutdown...") or focus on explaining the code (2%, e.g., "some OS such as Windows can have problem doing rename IO operations so we may need to retry a couple of times to let it work"). Note that for this annotation, we assigned each comment to exactly one category.

TABLE II (partial). Questions answered for each commit, with answer ranges and motivation.
(Row preceding RQ1.5; question and answer range not recoverable.) Motivation: Observation that even for those commits which addressed self-admitted technical debt, this was not necessarily their main purpose.
RQ1.5 Could the same fix be applied to similar self-admitted technical debt in a different project? Answer range: possibly/no. Motivation: Identifying fixes that could potentially be applied automatically.
RQ1.6 Does the self-admitted technical debt comment include a condition? Motivation: Exploring the phenomenon of "on-hold" self-admitted technical debt (which emerged from answering the previous question) in more detail.

C. RQ1.2 Do commits which remove the comments indicating self-admitted technical debt actually fix the debt?
For the majority of commits (58%) which removed the comment indicating technical debt, the commit did not actually fix the problem described in the comment, see Figure 4. Instead, these commits often removed the comment along with the surrounding code. These findings are in line with recent work by Zampetti et al. [8] who found that between 20% and [...]

D. RQ1.3 What kind of fix has been applied?
In the cases where the commit fixed the self-admitted technical debt, we also coded the kind of fix that was applied. Figure 5 shows the results of this coding: debt was either fixed by implementing new code (58%), by refactoring existing code (15%), by removing code (12%), by uncommenting code that had been previously commented out (7%), or by removing a workaround (4%).

TABLE III
TYPES OF SELF-ADMITTED TECHNICAL DEBT AND THE CORRESPONDING FIXES
Kind of debt            implementation  refactoring  removing code  uncommenting code  removing workaround  other  not fixed
functionality needed    56              1            0              0                  0                    0      69
refactoring needed      2               16           1              0                  0                    1      29
clarification request   5               0            4              0                  0                    1      33
workaround              2               0            2              3                  5                    0      12
wait                    2               0            2              3                  0                    0      6

Table III shows the relationship between the two coding schemes that emerged from our qualitative data analysis: one for the kinds of technical debt indicated in developer comments, and one for the kinds of fixes applied to this debt. Unsurprisingly, many instances where new functionality was needed were addressed by the implementation of said functionality, and cases where refactoring was needed were addressed by refactoring. Interestingly, all comments of developers explaining technical debt were removed without addressing the debt, and waits could sometimes be addressed by uncommenting code that had been written in anticipation of the fix. A large number of comments indicating debt were not addressed. For example, out of 43 comments which we coded as clarification request, 33 (77%) were "resolved" by simply deleting the comment.
E. RQ1.4 Is the removal of self-admitted technical debt the primary reason for the commits which remove the corresponding comments?
The removal of technical debt was often not the primary reason for commits which removed self-admitted debt, see Figure 6. This is in line with findings reported by Zampetti et al. [8] who found that only 8% of the technical debt removal is acknowledged in commit messages. We did not attempt to resolve disagreements between annotators for this question as the concept of "primary reason" can be ambiguous. Instead, instances where annotators disagreed are shown as "unclear" in Figure 6.
An example of a commit which removed self-admitted technical debt even though it was not the main purpose of the commit is Apache Camel commit f47adf. The commit removed the following comment: "TODO: Support ordering of interceptors", but this was part of a much larger refactoring as described in the commit message: "Overhaul of JMX". On the other hand, the commit message of commit 88ca35 from the same project, "Added onException support to DefaultErrorHandler", is very similar to the self-admitted technical debt comment that was removed in this commit, "TODO: in the future support onException", which suggests that removing the debt was the primary reason for this commit.
F. RQ1.5 Could the fixes applied to address self-admitted technical debt be applied to address similar debt in other projects?
We identified two kinds of self-admitted technical debt that could possibly be handled automatically. The first kind are comments which are fairly specific, e.g., "TODO gotta catch RejectedExecutionException and properly handle it". Automated tool support could be built to at least catch the exception based on this description. The second kind are comments which indicate that a developer is waiting for something, which we will discuss further in the next subsection. Figure 7 shows the ratio of fixes that could possibly be automated and applied in other settings, which is one third of all fixes. This finding supports Zampetti et al. [8] who found that most changes addressing self-admitted technical debt require complex source code changes. Note that we counted all those comments as "possibly" that were rated as "possibly" by at least one annotator.
G. RQ1.6 How many of the comments indicating self-admitted technical debt contain a condition to specify that a developer is waiting for a certain event or an updated functionality having been implemented elsewhere?
A theme that emerged from answering the previous research question is the concept of self-admitted technical debt comments which include a condition to indicate that a developer is waiting for a certain event or an updated functionality having been implemented elsewhere. We refer to this kind of debt as "on-hold" self-admitted technical debt: the comment is "on hold" until the condition is met (see Figure 1 for an example). In our sample, we identified 27 such comments, see Figure 8.

H. Summary
In summary, we were able to confirm the findings of previous work which indicate that the main categories of technical debt are related to requirements and design [5]. In the majority of cases, a commit which removes a comment admitting technical debt does not actually fix the debt, and even in the remaining cases, fixing the debt is often not the primary reason for the commit. When self-admitted technical debt is fixed, this is usually done through the implementation of new functionality, but we also identified cases where debt could be addressed by either removing or uncommenting code. We identified a particular sub-class of self-admitted technical debt comments which might be amenable to automated tool support, i.e., "on-hold" self-admitted technical debt. Next, we will describe the classifier we built to detect this sub-class of self-admitted technical debt automatically.

IV. CLASSIFIER DESIGN
Figure 9 shows the overview of our classifier for "on-hold" self-admitted technical debt identification and the detection of the specific conditions that developers are waiting for. Given self-admitted technical debt comments, two preprocessing steps are applied before classifying them into on-hold or not, namely, term abstraction and n-gram feature extraction. Within identified "on-hold" self-admitted technical debt comments, specific conditions are detected.

A. Term Abstraction
Similar to a previous text classification study [9], we perform abstraction as a preprocessing step. We target the following specific terms: date expression, version, bug id, URL, and product name. Each term is abstracted into a different string: @abstractdate, @abstractversion, @abstractbugid, @abstracturl, and @abstractproduct. We apply this process because we are more interested in the existence of these types than in the actual terms, which do not appear frequently. For example, considering the comment "TODO: CAMEL-1475 should fix this", CAMEL-1475 will be changed to the string "@abstractproduct @abstractbugid". Table IV summarizes the regular expressions we used for identifying targeted terms. Replacements using the regular expressions are conducted from top to bottom in the table. Subsequently, URLs linking to specific ids of bugs are abstracted to "@abstracturl @abstractbugid". For stop word removal, we use the regular expression [^A-Za-z0-9]+ to ignore symbols which do not represent words.
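The abstraction step can be illustrated with a minimal sketch. The regular expressions below are hypothetical stand-ins and are not the patterns from Table IV; in particular, product names are only recognized here as the prefix of a JIRA-style bug id, and the rule order is an assumption. The sketch only aims to show ordered replacement followed by symbol removal.

```python
import re

# Hypothetical patterns (NOT the expressions from Table IV), applied in order.
RULES = [
    # URL that ends in a JIRA-style issue key, e.g. .../browse/SPR-4599
    (re.compile(r"https?://\S*/[A-Z][A-Z0-9]*-\d+\b\S*"),
     "@abstracturl @abstractbugid"),
    # Any other URL
    (re.compile(r"https?://\S+"), "@abstracturl"),
    # ISO-style date, e.g. 2019-01-31
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "@abstractdate"),
    # JIRA-style issue key: product prefix plus numeric id, e.g. CAMEL-1475
    (re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b"), "@abstractproduct @abstractbugid"),
    # Dotted version number, e.g. 3.0.1 or v2.1
    (re.compile(r"\bv?\d+(?:\.\d+)+\b"), "@abstractversion"),
]

# Keep placeholders and alphanumeric words; everything else is treated as a
# symbol, mirroring a split on [^A-Za-z0-9]+ while protecting placeholders.
TOKEN = re.compile(r"@abstract[a-z]+|[A-Za-z0-9]+")


def abstract_terms(comment: str) -> str:
    """Replace targeted terms by placeholders, then drop symbols."""
    for pattern, placeholder in RULES:
        comment = pattern.sub(placeholder, comment)
    return " ".join(TOKEN.findall(comment))


print(abstract_terms("TODO: CAMEL-1475 should fix this"))
# TODO @abstractproduct @abstractbugid should fix this
```

Such naive patterns will produce false matches (for example, a token like "UTF-8" would be treated as a bug id), so they should be read as an illustration of the mechanism rather than as a usable rule set.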

B. Feature Extraction using N-gram IDF
We extract features by applying N-gram IDF [10], [11]. Inverse Document Frequency (IDF) has been widely used in many applications because of its simplicity and robustness; however, IDF cannot handle phrases that are composed of more than one term. Because IDF gives more weight to terms occurring in fewer documents, rare phrases are assigned more weight than good phrases that would be useful in text classification. N-gram IDF is a theoretical extension of IDF for handling multiple terms and phrases by bridging the theoretical gap between term weighting and multi-word expression extraction [10], [11].
Terdchanakul et al. reported that for classifying bug reports into bugs or non-bugs, classification models using features from N-gram IDF outperform models using topic modeling features [12]. In addition to this, we consider that n-gram word features are more beneficial for comment classification than topic modeling because source code comments are generally short and contain only a few words.
In this study, we use an N-gram Weighting Scheme tool [13], which uses an enhanced suffix array [14] to enumerate valid n-grams. We obtain a list of all valid n-gram terms in the self-admitted technical debt comments. Although previous work applied feature selection methods to decrease the number of n-gram terms used in classification models [12], we use all n-gram terms in this study. Compared to more than ten thousand n-gram terms derived from bug reports [12], we obtained about two thousand n-gram terms from our self-admitted technical debt comments.

C. Classifier Learning
Given the set of n-gram term features from the previous step, we build a classifier that can identify "on-hold" self-admitted technical debt by classifying self-admitted technical debt comments into on-hold or not. We prepare two classifiers for our evaluation, namely, random forests and automated machine learning. We use the random forest classifiers to compare the performance of n-gram term features with one-term features (as a benchmark), and to investigate important features for both classifiers. Automated machine learning aims to optimize choosing a good algorithm and feature preprocessing steps [15]. To obtain the best performance (RQ2.1), we apply the auto-sklearn tool [15].
For classifier learning, we prepare feature vectors with N-gram TF-IDF scores of all n-gram words. The score is calculated with the following formula:
\[ \text{N-gram TF-IDF} = \log\left(\frac{|D|}{sdf}\right) \cdot gtf \]
where |D| is the total number of comments, sdf is the document frequency of a set of terms composing an n-gram, and gtf is the global term frequency.
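To make the three quantities concrete, the following sketch computes a score of the form log(|D| / sdf) · gtf for a single n-gram over tokenized comments. This form is an assumption that is consistent with the definitions of |D|, sdf, and gtf above. The natural logarithm, the interpretation of gtf as the number of contiguous occurrences of the n-gram across all comments, and the function names are further assumptions; the tool used in this study [13] enumerates valid n-grams differently.

```python
import math
from typing import List, Sequence


def count_ngram(tokens: Sequence[str], ngram: Sequence[str]) -> int:
    """Count contiguous occurrences of ngram in tokens."""
    n = len(ngram)
    target = tuple(ngram)
    return sum(1 for i in range(len(tokens) - n + 1)
               if tuple(tokens[i:i + n]) == target)


def ngram_tf_idf(comments: List[List[str]], ngram: Sequence[str]) -> float:
    """Score an n-gram as log(|D| / sdf) * gtf (assumed form).

    sdf: number of comments containing every term of the n-gram,
         regardless of order or adjacency.
    gtf: total number of contiguous occurrences over all comments.
    """
    terms = set(ngram)
    sdf = sum(1 for tokens in comments if terms.issubset(tokens))
    if sdf == 0:
        return 0.0  # the n-gram's terms never co-occur in any comment
    gtf = sum(count_ngram(tokens, ngram) for tokens in comments)
    return math.log(len(comments) / sdf) * gtf


# Toy corpus of already tokenized comments (illustrative only).
comments = [
    ["remove", "once", "fixed"],
    ["remove", "once", "released"],
    ["todo", "cleanup"],
    ["fix", "later"],
]
print(ngram_tf_idf(comments, ("remove", "once")))
```

In this toy corpus, the terms of ("remove", "once") co-occur in two of four comments and the phrase itself occurs twice, so the score is 2 · log(4 / 2).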

D. On-hold Condition Detection
After "on-hold" self-admitted technical debt comments are identified, we try to detect their on-hold conditions. During our annotation, we found that specific bug IDs, product versions, and dates tend to form these conditions. As we have already replaced these terms with specific keywords shown in Table IV, we can derive conditions by recovering the original terms. The following is our detection process.
1) Extract the keywords @abstractdate, @abstractversion, @abstractbugid, and @abstractproduct, preserving their order of appearance in the identified "on-hold" self-admitted technical debt comments.
2) Group keywords to make valid conditions. Only the following sets of keywords are considered to be valid conditions; other keywords that do not match the following orders are ignored.
• {@abstractproduct, @abstractversion, ...}: a product name followed by one or more version expressions, to indicate specific versions of the product.
• {@abstractproduct, @abstractbugid, ...}: a product name followed by one or more bug ID expressions, to indicate specific bugs of the product.
Identifying these keywords as conditions is not trivial, because they also frequently appear in comments that do not indicate "on-hold" self-admitted technical debt. Since we limit this detection to the identified on-hold comments, we expect that this simple process can work. The results for this step are described in Section V-E.
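The grouping in step 2 can be sketched as a single pass over the extracted keywords. The sketch assumes that step 1 yields a list of (keyword, original term) pairs in order of appearance; this input format, the output dictionaries, and the function name are illustrative choices. Only the two keyword orders listed above are handled, and all other keywords are skipped.

```python
from typing import Dict, List, Tuple

Keyword = Tuple[str, str]  # (placeholder, original term), in order of appearance

FOLLOWERS = ("@abstractversion", "@abstractbugid")


def group_conditions(keywords: List[Keyword]) -> List[Dict[str, object]]:
    """Group a product name with the run of versions or bug ids that follows it."""
    conditions = []
    i = 0
    while i < len(keywords):
        kind, text = keywords[i]
        if kind == "@abstractproduct" and i + 1 < len(keywords):
            follow_kind = keywords[i + 1][0]
            if follow_kind in FOLLOWERS:
                # Collect one or more consecutive keywords of the same kind.
                j = i + 1
                values = []
                while j < len(keywords) and keywords[j][0] == follow_kind:
                    values.append(keywords[j][1])
                    j += 1
                conditions.append(
                    {"product": text, "kind": follow_kind, "values": values})
                i = j
                continue
        i += 1  # keyword does not start a valid condition; ignore it
    return conditions


print(group_conditions([("@abstractproduct", "CAMEL"),
                        ("@abstractbugid", "1475")]))
```

For the comment "TODO: CAMEL-1475 should fix this", this yields one condition that pairs the product CAMEL with bug id 1475.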

V. EVALUATION
A. Data Preparation and Annotation
As shown in Figure 8, we found fewer than 30 "on-hold" self-admitted technical debt comments in the sample of 335 comments. Since it is difficult to train classifiers on such a small number of instances, we investigated all 2,599 comments again to prepare data for our classification. Among all 2,599 comments, we found 92 duplicate comments (i.e., the exact same comment appeared in more than one location). After removing duplicate comments, the first and third author separately annotated the remaining comments in terms of (i) whether the comments represented self-admitted technical debt (similar to Figure 2) and (ii) whether the self-admitted technical debt comments included a condition (similar to Figure 8). All conflicts in this annotation were resolved by the second author. Table V shows the result of this data preparation. From 2,599 comments, 92 duplicate comments and 277 comments that do not represent self-admitted technical debt are excluded. We obtained 161 on-hold comments and 2,069 other comments, which are used for our classification.

B. Evaluation Metrics
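We measure the classification performance in terms of precision, recall, and F1-score:
\[ \mathit{precision} = \frac{tp}{tp + fp}, \qquad \mathit{recall} = \frac{tp}{tp + fn}, \qquad F_1 = \frac{2 \cdot \mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}} \]
where tp is the number of true positives, fp is the number of false positives, and fn is the number of false negatives.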
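As a small worked example of these standard definitions, the sketch below computes the three metrics from raw counts. The counts in the usage line are made up for illustration and are not results from this study.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# precision=0.80 recall=0.67 F1=0.73
```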

C. N-gram TF-IDF vs. (1-gram) TF-IDF
To assess the effectiveness of n-gram features in classifying "on-hold" self-admitted technical debt comments, we compare the performances of classifiers using N-gram TF-IDF and traditional TF-IDF [16]. Except for feature extraction, the two classifiers are prepared using the same settings including term abstraction. As stated in Section IV-C, random forest classifiers are used to compare the performances. Because of the relatively small size of classification data shown in Table V, we adopt leave-one-out cross-validation, which uses one instance for testing and all others for training, and repeats this division so that every instance serves as the test instance once.
Performance results for both methods are shown in Table VI, and the top 20 useful features for random forest classifiers are presented in Table VII. The usefulness can be estimated from feature importance with forests of trees. For the (1-gram) TF-IDF classifier, although the precision is high, recall is low, and as a result, the F1-score is lower. Considering the useful features in Table VII, this performance can be explained by the fact that the TF-IDF classifier can identify on-hold comments which contain explicit terms such as "once", "when", "until", and abstraction terms (@abstractbugid, @abstractversion, and @abstractproduct), but it misses on-hold comments that do not contain such terms. As seen in Table VII, the N-gram TF-IDF classifier learned a variety of phrases. There are no intuitive time-related terms in the top 20 features for N-gram TF-IDF, and abstraction terms only appear in the 35th position. By making use of n-gram terms (n ≥ 1), the N-gram TF-IDF classifier improved recall compared to the TF-IDF result and achieved an F1-score of 0.70. Therefore, we conclude that N-gram TF-IDF outperforms traditional TF-IDF in this scenario.
D. RQ2.1 What is the best performance of a classifier to automatically identify "on-hold" self-admitted technical debt?
We use the automated machine learning tool auto-sklearn [15] to assess the best performance a classifier can achieve when classifying "on-hold" self-admitted technical debt comments. In machine learning, two problems are known: (1) no single machine learning method performs best on all data sets, and (2) some machine learning methods rely heavily on hyperparameter optimization. Auto-sklearn addresses these problems as a joint optimization problem [15]. Auto-sklearn includes 15 base classification algorithms, and produces results from an ensemble of classifiers derived by Bayesian optimization [15].
Since automated machine learning takes a lot of time and we achieved a better performance with N-gram TF-IDF compared to TF-IDF (see Section V-C), we only evaluate the automated machine learning performance using N-gram TF-IDF. For the same reason, we use ten-fold cross-validation instead of leave-one-out cross-validation. Ten-fold cross-validation divides the data into ten sets, and every set is used as the test set once while the others are used for training. Due to the imbalance between the number of positive and negative instances, we use the StratifiedShuffleSplit cross-validator of scikit-learn, which aims to preserve the percentage of samples from each class. Because of this process, some instances can appear multiple times in different sets. Therefore, we report the mean values of the evaluation metrics across all ten runs as the performance. As shown in Table VIII, our classifier achieved a mean precision of 0.81, a mean recall of 0.67, and a mean F1-score of 0.72. Compared to the previous result with random forests only, shown in Table VI, recall improved substantially while precision remained high, yielding a higher F1-score. We consider that precision is more important than recall for the kind of recommendation system that this classifier enables since false positives (i.e., unwarranted recommendations) will annoy developers more than false negatives (i.e., recommendations that the system could have made but did not).
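The following pure-Python sketch illustrates why a stratified shuffle split preserves the class ratio and why instances can appear in more than one test set. It is a simplified stand-in and not scikit-learn's implementation; the function name, the test fraction of 0.1, and the seed are assumptions made for illustration. The class sizes in the usage example are the 161 on-hold and 2,069 other comments reported in Section V-A.

```python
import random
from collections import Counter, defaultdict
from typing import Iterator, List, Sequence, Tuple


def stratified_shuffle_split(
    labels: Sequence[int], n_splits: int = 10,
    test_fraction: float = 0.1, seed: int = 0,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each test set keeps the class ratio.

    Every split is drawn independently, so the same instance may land in
    the test set of several splits (unlike classic k-fold partitioning).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for _ in range(n_splits):
        test: List[int] = []
        for indices in by_class.values():
            # Sample the same fraction from every class.
            k = max(1, round(len(indices) * test_fraction))
            test.extend(rng.sample(indices, k))
        test_set = set(test)
        train = [i for i in range(len(labels)) if i not in test_set]
        yield train, sorted(test)


labels = [1] * 161 + [0] * 2069  # 1 = on-hold, 0 = other
appearances = Counter()
for train, test in stratified_shuffle_split(labels):
    appearances.update(test)
repeated = sum(1 for count in appearances.values() if count > 1)
print(f"instances appearing in more than one test set: {repeated}")
```

With these settings, each test set contains 16 on-hold and 207 other comments, which keeps the roughly 7% share of on-hold comments stable across splits.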
E. RQ2.2 How well can we automatically identify the specific conditions in "on-hold" self-admitted technical debt?
Because of our treatment of imbalanced data (see Section V-D), some comments can appear multiple times in the test set. We consider that an on-hold comment is correctly identified only if it has been classified correctly in all cases where it was part of the test set. Our classifier was able to identify 121 on-hold comments correctly. Among them, 66 comments contain abstraction keywords which indicate a specific condition, and all those instances were confirmed to be specific conditions by manual investigation. Some comments do not mention specific conditions, such as "This crap is required to work around a bug in hibernate". Among the 19 false positives (incorrectly identified comments), 12 comments contain abstraction keywords, but these keywords are used for references and not for conditions that a developer is waiting for. In summary, 84% (66/(12+66)) of the detected specific conditions are correct, and for 55% (66/121) of the on-hold comments, we were able to identify the specific condition that a developer was waiting for.

VI. IMPLICATIONS
The ultimate goal of our work is to enable the automated management of self-admitted technical debt. Previous work [8] has found that most changes which address self-admitted technical debt require complex code changes. As such, it is unrealistic to assume that automated tool support could handle all kinds of requirement debt and design debt that developers admit in source code comments. Thus, in this work we set out first to identify a sub-class of self-admitted technical debt amenable to automated management and second to develop a classifier which can reliably identify this sub-class of debt.
Our qualitative study revealed one particular class of self-admitted technical debt potentially amenable to automated tooling: "on-hold" self-admitted technical debt, i.e., comments in which developers express that they are waiting for a certain external event or updated functionality from an external library before they can address the debt that is expressed in the comment. In other words, the comment is "on hold" until the condition has been met.
Based on the data set made available by Maldonado et al. [7], we identified a total of 121 comments which indicate "on-hold" self-admitted technical debt, confirming that this phenomenon is prevalent and exists in different projects. Our classifier to identify "on-hold" self-admitted technical debt was able to reach a precision of 0.81 in identifying comments that belong to this sub-class. In addition, we were able to identify specific conditions contained within these comments (84% of conditions are detected correctly).
Given all the events and new releases that happen in a software project at any given point in time, it is unrealistic to assume that developers will be able to stay on top of all instances of technical debt that are ready to be addressed once a condition has been met. Instead, there is a risk that developers forget to go back to these comments and debt instances even when the event they were originally waiting for has occurred. This work builds a first step towards the design of automated tools that can support developers in addressing self-admitted technical debt. In particular, based on the classifier introduced in this work, it is now possible to build tool support which can monitor the specific external events we have identified in this work (e.g., certain bug fixes or the release of new versions of external libraries) and notify developers as soon as a particular debt is ready to be addressed.
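As a rough illustration of what such monitoring could look like, the sketch below matches issue references in on-hold comments against a set of resolved issues. Everything here is hypothetical: the JIRA-style key pattern, the function name `ready_to_address`, and the sample comments and issue keys are assumptions for illustration, not part of the classifier evaluated in this work.

```python
import re

# Assumed JIRA-style issue key such as "PROJ-123". This crude pattern would
# also match unrelated tokens like "UTF-8", so a real tool would need more care.
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def ready_to_address(comments, resolved_issues):
    """Return (location, issue_keys) for comments whose referenced issues are all resolved.

    comments: iterable of (location, comment_text) pairs.
    resolved_issues: set of issue keys known to be resolved.
    """
    ready = []
    for location, text in comments:
        keys = set(ISSUE_KEY.findall(text))
        # Comments without a detectable condition are skipped, not reported.
        if keys and keys <= resolved_issues:
            ready.append((location, sorted(keys)))
    return ready


# Hypothetical on-hold comments and resolved issues.
comments = [
    ("Foo.java:10", "TODO: remove this workaround once PROJ-123 is fixed"),
    ("Bar.java:5", "TODO: waiting for PROJ-456 and PROJ-789"),
    ("Baz.java:1", "This crap is required to work around a bug in hibernate"),
]
notifications = ready_to_address(comments, {"PROJ-123", "PROJ-456"})
```

Here only the first comment would trigger a notification: the second still depends on an unresolved issue, and the third names no specific condition.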

VII. THREATS TO VALIDITY
In terms of construct validity, we use precision, recall, and F1-score as our evaluation metrics, similar to previous work that requires classification [7], [12]. Due to limited time and computational resources, we cannot provide a comparison between N-gram TF-IDF and traditional TF-IDF on automated machine learning. However, our comparison between N-gram TF-IDF and traditional TF-IDF using random forests showed that N-gram TF-IDF outperforms traditional TF-IDF.
Regarding threats to internal validity, it is possible that we introduced bias through our manual annotation. While we generally achieved high agreement regarding the annotation questions listed in Table II, the initial agreement regarding RQ1.1 was low, which is explained by the nature of the open-ended question. We resolved all disagreements through multiple co-located coding sessions with the first three authors of this paper. Note that we do not use the results of RQ1.1 as an input for our classifier.
For external validity, while we analyzed a statistically representative sample of commits for RQ1 and the entire data set made available by Maldonado et al. [7] for RQ2, we cannot claim generalizability beyond the five projects contained in this data set. The limited data set allowed us to perform an in-depth qualitative analysis, and future work will need to investigate the applicability of our results to other projects.

VIII. RELATED WORK
Self-admitted technical debt has been a popular research topic in the software engineering community in recent years. In this section, we introduce key research related to our study.
Maldonado et al. [7] studied the removal of self-admitted technical debt by applying NLP to self-admitted technical debt. They found that (i) the majority of self-admitted technical debt was removed, (ii) self-admitted technical debt was often removed by the person who introduced it, and (iii) self-admitted technical debt lasts between 18 and 172 days (median). Using a survey, the authors also found that developers mostly use self-admitted technical debt to track bugs and code that requires improvement. Developers mostly remove self-admitted technical debt when they are fixing bugs or adding new features.
Zampetti et al. [8] conducted an in-depth quantitative and qualitative study of self-admitted technical debt. They found that (i) 20% to 50% of the corresponding comments were accidentally removed when entire methods or classes were dropped, (ii) 8% of self-admitted technical debt removals were indicated in the commit messages, and (iii) most of the self-admitted technical debt requires complex changes, often changing method calls or conditionals.
Potdar and Shihab [17] tried to identify self-admitted technical debt by looking into source code comments in four open source projects (i.e., Eclipse, Chromium OS, Apache HTTP Server, and ArgoUML). Their study showed that (i) the amount of debt in these projects ranged between 2.4% and 31% of all files, (ii) debt was created mostly by developers with more experience, and time pressures and code complexity did not correlate with the amount of self-admitted technical debt, and (iii) only 26.3% to 63.5% of self-admitted technical debt comments were removed.
Wehaibi et al. [18] studied the relation between self-admitted technical debt and software quality based on five open source projects (i.e., Hadoop, Chromium, Cassandra, Spark, and Tomcat). Their results showed that (i) there is no clear evidence that files with self-admitted technical debt had more defects than other files, (ii) compared with self-admitted technical debt changes, non-debt changes had a higher chance of introducing other debt, but (iii) changes related to self-admitted technical debt were more difficult to achieve.
Mensah et al. [19] introduced a prioritization scheme. After running this scheme on four open source projects, they found four causes of self-admitted technical debt, namely code smells (23.2%), complicated and complex tasks (22.0%), inadequate code testing (21.2%), and unexpected code performance (17.4%). The results also showed that self-admitted technical design debt was prone to software bugs, and that for highly prioritized self-admitted technical debt tasks, more than ten lines of code were required to address the debt.
Other studies have classified self-admitted technical debt comments. Maldonado et al. identified design-related and requirement-related self-admitted technical debt using a maximum entropy classifier [5]. Huang et al. classified comments according to whether they contained self-admitted technical debt, and reported that their proposal outperformed the baseline method [20]. Since these studies used 1-gram terms as features, our proposal of using n-gram term features may improve the classification performance of these approaches.
In our study we use the same data set as previous research [7], [8].

IX. CONCLUSIONS AND FUTURE WORK
Self-admitted technical debt refers to situations in which software developers explicitly admit to introducing technical debt in source code comments, arguably to make sure that this debt is not forgotten and that somebody will be able to go back later to address this debt. In this work, we hypothesize that it is possible to develop automated techniques to manage a subset of self-admitted technical debt.
As a first step towards automating a part of the management of self-admitted technical debt, in this paper, we contribute (i) a qualitative study on the removal of self-admitted technical debt in which we annotated a statistically representative sample of 335 technical debt comments using seven questions that emerged as part of the qualitative analysis; (ii) the definition of "on-hold" self-admitted technical debt (debt which contains a condition to indicate that a developer is waiting for a certain event or an updated functionality having been implemented elsewhere) which emerged from this qualitative analysis as a particular class of self-admitted technical debt that can potentially be managed automatically; and (iii) the design and evaluation of a classifier for self-admitted technical debt which can detect "on-hold" debt with a precision of 0.81 as well as identify the specific conditions that developers are waiting for.
Building on these contributions, in our future work we intend to build the tool support that our classifier enables: a recommender system which can indicate for a subset of self-admitted technical debt in a project when it is ready to be addressed. We found that self-admitted technical debt is sometimes addressed by uncommenting source code that has already been written in anticipation of the debt removal. As another step towards the automation of technical debt removal, in future work, we will explore whether it is possible to address such debt automatically.