1 Introduction

Development teams aim to deliver software projects of high quality, on time, and within budget. In real-world scenarios, however, quality is sometimes traded off in order to deliver the software on time. Developers tend to apply quick fixes or temporary implementations that are not necessarily ideal in the long term, which can lead to technical debt (Brown et al. 2010; Tom et al. 2013).

Technical Debt (TD) is a term introduced by Ward Cunningham (Cunningham 1993) to describe the situation where accomplishing short-term goals is chosen over long-term code quality. Just like financial debt, TD can accumulate interest if it is not dealt with quickly. Poor coding practices are a common source of TD. Previous studies have highlighted the widespread occurrence of TD and its impact on software quality, complexity, and maintenance. The presence of TD makes changes to the system more frequent (Zazworka et al. 2011) and harder to implement (Wehaibi et al. 2016). Developers recognise that TD is unavoidable and, therefore, in need of careful management (Lim et al. 2012). Repaying TD comes in the form of restructuring and refactoring the software (Huang et al. 2018).

Despite the importance of TD management, especially in its early stages, the identification of TD in itself is a challenge. Moreover, TD needs to be well-understood once it is identified in order to be properly managed. For example, if a developer implements a workaround in the code and wraps it within a conditional statement (e.g. if-statement), other developers in the team may not know that this is a TD-carrying statement. To make it visible, the developer writes a comment that describes the TD. Such comments flag Self-Admitted Technical Debt (SATD) (Potdar and Shihab 2014).

The majority of existing work (Potdar and Shihab 2014; Maldonado and Shihab 2015; de Freitas Farias et al. 2015; da Silva Maldonado et al. 2017; Huang et al. 2018; Yan et al. 2018; Wattanakriengkrai et al. 2018; Maipradit et al. 2020) has focused on developing tool support for detecting whether a code comment flags SATD. These approaches presume the existence of such comments attached to code fragments that contain technical debt. However, this presumption does not always hold in practice. There are many instances where TD in code is not explicitly acknowledged in the form of a comment. In these cases, there is a need for automated machinery which can: (i) determine if a given code fragment introduces technical debt; and if so, (ii) generate the appropriate (self-admitting) comment that can be attached to the code. However, this kind of support is currently missing in the literature.

In this paper, we propose a framework that provides two levels of support: SATD recommendation and SATD description. Given a code fragment as input, our Self-Admitted Technical Debt Identification and Description framework (SATDID) determines whether technical debt should be self-admitted for this code fragment (level 1), and then automatically generates a comment admitting and describing the detected technical debt instance (level 2). SATDID can be used on-the-fly to recommend and describe potential SATD as the developer writes the code. This way, TD can be prevented before it occurs. We explore and evaluate the capabilities of different machine/deep learning approaches in implementing SATDID and providing these levels of support.

Although our approach can generally be applied to code fragments of any size and type, our focus in this study is on conditional statements. Conditionals are prominent in the context of technical debt. Kruchten et al. (2019) place “quick-and-dirty” conditional statements at the top of their list of TD examples for managing technical debt. In addition, previous studies (e.g. Zampetti et al. 2020, 2018) found that SATD comments were often associated with conditional statements.

Our contribution in this paper is as follows:

  1. A dual-layered framework for SATD management: To the best of our knowledge, SATDID is the first to provide two levels of support for SATD management, i.e. SATD recommendation and SATD comment generation.

  2. Leveraging machine/deep learning for SATD management: SATDID consists of different machine/deep learning components that are carefully developed and improve the results over all the baselines.

  3. We build and publish the first dataset of labelled SATD and non-SATD code-comment pairs consisting of conditional statements and their associated comments. We also make our code-base and experiment reports publicly available (Footnote 1).

The rest of the paper is organised as follows. Sect. 2 gives a motivating example. Sect. 3 illustrates the architecture of our framework (SATDID). The following sections describe the key components of SATDID in detail. Sect. 4 describes the first two modules, namely Data Processing and Data Vectorisation. Sect. 5 describes the SATD Identification module. Sect. 6 describes the SATD Comment Generation module. After that, we explain the model training processes in Sect. 7. We evaluate and discuss our approach in Sects. 8 and 9. We present the related work in Sect. 10 before we conclude our study in Sect. 11.

2 Motivating example

Previous research (Wehaibi et al. 2016) studied the impact of Technical Debt (TD) on software complexity and changeability. TD needs to be addressed, and the ultimate goal is to remove (repay) TD instances from the software. In order to manage and eventually repay TD, we first need to identify its occurrences in the codebase. This is a challenging task, especially in large software projects where codebases can grow to millions of lines of code. Figure 1 depicts several scenarios to illustrate various challenges of this problem.

Fig. 1

In both Scenarios 1.a and 1.b, the conditional statement is accompanied by a comment, while there is no comment accompanying the conditional statement in Scenarios 2.a and 2.b. The conditional statements in Scenarios 1.a and 2.a (highlighted with light green) are TD-free. The conditional statement in Scenario 1.b (highlighted with light yellow) contains TD, which is noted by the preceding SATD comment. The conditional statement in Scenario 2.b (highlighted with light red) also contains TD, but a SATD comment is missing

Scenario 1.a in Fig. 1 depicts the case of a TD-free conditional statement with an associated comment that describes what the if-statement does. However, in Scenario 1.b, there is TD in the conditional statement, which is explicitly acknowledged in the associated comment that suggests code revision. The challenge in Scenarios 1.a and 1.b is to detect which comment is a SATD comment and which one is not. Existing work in SATD (e.g. Potdar and Shihab 2014; da Silva Maldonado et al. 2017; Huang et al. 2018) focuses on addressing this challenge.

The existing approaches rely on the comments provided with the code to detect SATD. However, there are many cases where technical debt is not explicitly admitted in the form of a comment. Scenarios 2.a and 2.b depict conditional statements that are not accompanied by comments. While the code in Scenario 2.a does not contain TD, the one in Scenario 2.b does.

Scenario 2.b shows a conditional statement extracted from the infrastructure of the Openflexo project (Footnote 2). In short, the if-statement was written to forcibly re-validate invalid expressions in the project. This is a temporary workaround. Ideally, the code should be written in a way that assures only valid expressions are produced, rather than letting it potentially produce invalid expressions and re-validating them. It is important to note that this TD is not self-admitted in a comment.

Automated support is thus needed to assist software engineers in scenarios similar to 2.a and 2.b. If a comment is not provided, the automated support analyses the source code to determine if it contains TD that should be self-admitted, and brings this to the software developer’s attention. Upon receiving the developer’s confirmation, the automated support generates an appropriate SATD comment and attaches it to the code fragment. When used on-the-fly, the automated support raises a warning when a SATD comment is needed so that the developer can decide either to modify the code or to let the tool generate the appropriate SATD comment. In Sect. 3, we illustrate the architectural design of our proposed framework that provides this kind of automated support.

Fig. 2

SATDID architectural design

3 Architectural design

We propose an automated framework called SATDID that addresses the problems introduced by scenarios like the ones in Sect. 2. There are two main technical challenges facing SATDID, namely (i) recommending when hidden instances of technical debt in code should be self-admitted, and (ii) generating SATD comments describing the hidden TD in the identified code fragments. SATDID’s design consists of multiple components that are distributed across four processing modules (see Fig. 2) to address these challenges. The four processing modules are Data Processing, Data Vectorisation, SATD Identification, and SATD Comment Generation. The first two modules prepare the data for, and thereby facilitate, addressing the technical challenges, while the last two modules handle them directly. These framework components operate sequentially to achieve the main objective of providing SATD recommendation and description services.

In the Data Processing module, the input source code passes through the component responsible for code processing (e.g. parsing a conditional statement and building its Abstract Syntax Tree). For the next two modules, the user can choose between two configurations: either a traditional machine learning classifier (e.g. Multinomial Naive Bayes (MNB), Support Vector Machines (SVM), and Random Forest (RF)) or a deep learning classifier (e.g. Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN)). There are two main reasons for this configuration setup. The first is to study and report the differences between the performances of the two configurations. The second is to give the end user the ability to choose their preferred model based on their available data and machinery when using our framework. Sect. 9.1.1 discusses this point further. Each configuration has its own data vectorisation technique. In the Data Vectorisation module, the Word Embedding component creates an embedding (i.e. a vector) for each token in the processed source code and passes it to the deep learning component in the next module. The Vector Space Model (VSM) component creates a vector that represents the entire processed source code and passes it to the traditional machine learning component in the next module. The SATD Identification module contains the machine/deep learning components responsible for identifying whether there is TD in the code that should be self-admitted. If this is the case, SATDID flags it and passes its vector representation to the SATD Comment Generation module. The SATD Comment Generation module contains a deep learning component which generates the appropriate SATD comment that can be attached to the input code fragment. More details on our framework modules are presented in Sects. 4, 5, and 6.

4 Data processing and vectorisation

In this section, we present our implementation of the first two modules of SATDID, namely the Data Processing and Data Vectorisation modules. These modules are responsible for transforming an input code fragment into the format required by the machine learning components in the subsequent modules. Sects. 4.1 and 4.2 describe the Data Processing and Data Vectorisation modules, respectively, in detail.

4.1 Data processing

The Data Processing module consists of a source code processing component and a comment processing component. Note that we process the SATD comments in the training data used for generating SATD comments. We do not use them for SATD identification since our approach caters for cases where those comments do not exist. Hence, we describe here the code processing component. We discuss the comment processing component in Sect. 8.1.2, where we describe our pre-processing of the data used for training our models.

4.1.1 Source code processing

First, we create the Abstract Syntax Tree (AST) of the input source code fragment. Second, we create a sequential representation of the tree using a method proposed by Hu et al. (2018) called AST with Structure-Based Traversal (SBT). Classical traversal methods that convert ASTs to sequences (e.g. pre-order traversal) can be ambiguous in that different code fragments may produce the same sequence representation. We adapt the SBT representation proposed by Hu et al. (2018) to ensure unique sequence representations for different code inputs.

Suppose we have an AST with only three nodes: a parent node and two child nodes. Let us call the parent ‘1’, the left-child ‘2’, and the right-child ‘3’. The SBT method will represent the tree with the following sequence: (1(2)2(3)3)1.
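To make the construction concrete, the following Python sketch (our models are implemented in Python) shows an SBT-style traversal over a toy tree of (label, children) tuples; the tree and its labels are the hypothetical three-node example above, not a real JavaParser AST:

```python
# Minimal sketch of Structure-Based Traversal (SBT) over a toy tree of
# (label, children) tuples; every subtree is wrapped as "(" ... ")label",
# so different trees cannot collapse to the same sequence.
def sbt(node):
    label, children = node
    inner = "".join(sbt(child) for child in children)
    return "(" + label + inner + ")" + label

tree = ("1", [("2", []), ("3", [])])   # parent '1' with children '2' and '3'
print(sbt(tree))                       # (1(2)2(3)3)1
```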

In ASTs, non-terminal nodes represent the structural information of the code, and they have a “type”. Terminal nodes have a “type” and a “value”, where “value” is the concrete source code token and “type” is its type. In SBT, non-terminal nodes are represented by their types, while terminal nodes are represented by their types and values. Figure 3 shows how the Source Code Processing component in the Data Processing module parses the conditional statement in Scenario 2.b in Fig. 1 to its AST then applies the SBT method to produce the sequence representation of the tree.

Fig. 3

The Abstract Syntax Tree (AST) of the conditional statement in Scenario 2.b in Figure 1 is on the left-hand side. The sequence representation of the AST using the Structure-Based Traversal (SBT) method is on the right-hand side

4.2 Data vectorisation

The Data Vectorisation module receives the processed code fragment from the Data Processing module. It is responsible for creating vector representations of code fragments, on/from which the learning components of the framework train/predict. This module contains two components, namely the Word Embedding component (see Sect. 4.2.1), which is linked to the deep learning component in the next module, and the Vector Space Model (VSM) component (see Sect. 4.2.2), which is linked to the traditional machine learning component in the next module.

4.2.1 Word embedding

Text data are typically highly sparse (Bingham and Mannila 2001) and of high dimensionality (Aggarwal and Zhai 2012). If we used one-hot encoding to create the vector representation of a word, we would end up with a vector the size of the entire vocabulary, with all 0s except a 1 at the word’s position, and that for each word in the vocabulary of our dataset. Note that “word” here refers to either an AST token (in source code) or a textual word (in comments). To alleviate this problem, we use a technique called word embedding (Gal and Ghahramani 2016), which represents each word in the vocabulary as a fixed-length continuous vector (also called an embedding). The values in these embeddings are learnt and adjusted during model training. Word embedding captures semantic relations between words/tokens through the values inside their embeddings: embeddings of words with related semantics tend to cluster together.

Central to this is an embedding matrix \(\mathcal {M}\in \mathcal {R}^{|\mathcal {V}|\times d}\) where \(\varvec{d}\) is the embedding size and \(\varvec{|\mathcal {V}|}\) is the number of words in our vocabulary \(\varvec{\mathcal {V}}\). The embedding matrix \(\varvec{\mathcal {M}}\) acts as a lookup table where each row is the vector representation of a word in our vocabulary. Each word has an index, and the word with index \(\varvec{i}\) has its vector representation (i.e. embedding) at the \(\varvec{i^{th}}\) row of the matrix \(\varvec{\mathcal {M}}\). Machine/deep learning models only deal with these indices and their vector representations and do not have access to the actual words/tokens. SATDID generates indices that are then converted back to their associated words from our vocabulary in order to display the generated comment sentences.
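As a minimal illustration of this lookup-table view, the sketch below maps token indices to rows of a toy embedding matrix; the vocabulary, sizes, and values are placeholders (in SATDID the values are learnt during training):

```python
import numpy as np

# Toy embedding matrix: one row per vocabulary word, d columns.
vocab = ["<UNKN/PAD>", "IfStatement", "ReturnStatement_return", "SimpleName"]
d = 4
M = np.random.default_rng(0).normal(size=(len(vocab), d))   # |V| x d

index = {word: i for i, word in enumerate(vocab)}

def embed(tokens):
    # Unknown tokens fall back to the <UNKN/PAD> row (index 0).
    return np.stack([M[index.get(t, 0)] for t in tokens])

print(embed(["IfStatement", "SimpleName"]).shape)   # (2, 4)
```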

4.2.2 Vector space model

The Vector Space Model (VSM) (Salton et al. 1975) is the vectorisation technique compatible with traditional machine learning. While every word/token is represented as a vector in word embedding (Sect. 4.2.1), the entire source code fragment is represented by a single vector in VSM. Let us call a source code fragment a document \(\varvec{d}\), and a code token a term \(\varvec{t}\). Every document \(\varvec{d}\) is represented by a vector, every vector is a data point, and every term \(\varvec{t}\) is a dimension in these vectors. We calculate the weights of these terms using a scheme called Term Frequency-Inverse Document Frequency (TF-IDF). This scheme determines the “importance” of a term \(\varvec{t}\) in a document \(\varvec{d}\). A term’s importance to a document increases with two factors: its frequency in that document and its rarity in the entire document set. TF-IDF is computed as follows:

$$\begin{aligned}&idf(t) = \log (\frac{|\mathcal {D}|}{df(t)}) + 1 \end{aligned}$$
(1)
$$\begin{aligned}&tfidf(t, d) = tf(t, d) \times idf(t) \end{aligned}$$
(2)

where \(\varvec{|\mathcal {D}|}\) is the total number of documents, \(\varvec{df(t)}\) is the number of documents the term \(\varvec{t}\) appears in, and \(\varvec{tf(t, d)}\) is the number of times the term \(\varvec{t}\) appears in document \(\varvec{d}\).
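A hedged scikit-learn sketch of this component is shown below; with smooth_idf=False, TfidfVectorizer uses idf(t) = ln(|D|/df(t)) + 1, which matches Eq. (1) (by default it applies additional smoothing). The example documents are made-up SBT token strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each processed code fragment (its SBT tokens joined into a string) becomes
# one TF-IDF weighted vector; rows are documents, columns are terms.
documents = [
    "IfStatement SimpleName_isValid ReturnStatement_return",
    "IfStatement MethodInvocation_revalidate BlockStatement",
]
vectoriser = TfidfVectorizer(lowercase=False, token_pattern=r"\S+",
                             smooth_idf=False)
X = vectoriser.fit_transform(documents)
print(X.shape)           # (|D|, number of unique terms)
```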

5 SATD identification

This module handles the vectors produced by the Data Vectorisation module (Sect. 4.2). It contains the learning components for identifying source code fragments which contain TD that should be self-admitted. There are two configurations here: using either a deep learning (Sect. 5.1) or traditional machine learning (Sect. 5.2) component. Later, we will evaluate and discuss the implications of these configurations in Sects. 8 and 9.

5.1 Deep learning detector

Our implementation of the deep learning detector (as well as the deep learning generator in Sect. 6) is based on Long Short-Term Memory (LSTM) models (Hochreiter and Schmidhuber 1997). Our deep learning detector has an embedding layer (see Sect. 4.2.1) as its first layer. Let \(\varvec{W_1},...,\varvec{W_n}\) be an input sequence produced by the source code processing component (see Sect. 4.1.1). The embedding layer converts the elements of the input sequence into their word embeddings (vectors) \(\varvec{V_1},...,\varvec{V_n}\). The layer following the embedding layer is an LSTM layer (or a stack of LSTM layers, i.e. an LSTM network). An LSTM layer consists of a sequence of LSTM units. All of these units share the same model parameters since LSTMs are Recurrent Neural Networks (RNNs). At a time step \(\varvec{t}\), an LSTM unit reads an input vector \(\varvec{V_t}\) and the output state from the previous LSTM unit \(\varvec{S_{t-1}}\), and returns the current output state \(\varvec{S_t}\). The output state \(\varvec{S_t}\) is then passed in two directions: to the next layer (whether it is an LSTM layer or a Dense layer) and to the LSTM unit at the next time step \(\varvec{t+1}\). The bottom two layers in Fig. 4 depict the job of the embedding and LSTM layers.

There are several variations of our implementation of the deep learning detector which produce different results. The variations include three different down-sampling techniques (Sect. 5.1.1) and different hyper-parameter settings (explained in Sect. 8.2 and stated in Sect. 8.4).

Fig. 4

The structure of our deep learning detector (single-layered). The number of layers is determined by the number of stacked LSTM layers. An output closer to 1.0 recommends a SATD comment, while an output closer to 0.0 does not recommend a SATD comment

5.1.1 Pooling

While the deep learning model produces as many output state vectors as the number of LSTM units in the last layer of the LSTM network, the Dense layer accepts only one vector as an input. To down-sample the network’s output to one vector, we examine the detector’s performance with and without using a “pooling” technique. With pooling, we experiment with max-pooling and mean-pooling. Without pooling, we only consider the last output state vector \(\varvec{S_t}\) at time step \(\varvec{t}\) since it holds information from all the previous output states \(\varvec{S_1...S_{t-1}}\), thanks to LSTM dynamics, and pass it to the Dense layer.

Pooling is a down-sampling method that reduces multiple inputs to the desired size (in our case, the size of the output state vector). At a time step \(\varvec{t}\), max-pooling considers only the maximum values in every element position in the output state vectors \(\varvec{S_1...S_{t}}\), while mean-pooling averages them. The resulting vector is then passed to the Dense layer.

Let us illustrate the three techniques using the following example. Suppose we have an input sequence of three items, \([W_1 \quad W_2 \quad W_3]\), and an output state of size 2. Suppose that the following are the final output states of these three items:

$$\begin{aligned}&\quad S_1 = S(W_1) = [5.2 \quad 3.3]\\&\quad S_2 = S(W_2) = [4.7 \quad 7.5]\\&\quad S_3 = S(W_3) = [9.1 \quad 0.6] \end{aligned}$$

We want to create one vector that captures information from all these three output states and use it as an input to the Dense layer. If no pooling technique is used, we consider the last output state \(\varvec{S_3}\). In max-pooling, we pool the elements of the same position and consider only the maximum value. In mean-pooling, we pool the elements of the same position and take the average:

$$\begin{aligned}&\quad NoPool(S_1, S_2, S_3)\;\;\;\;\,=[9.10 \quad 0.60]\\&\quad MaxPool(S_1, S_2, S_3)\;\,=[9.10 \quad 7.50]\\&\quad MeanPool(S_1, S_2, S_3)=[6.33 \quad 3.80] \end{aligned}$$
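The same three operations can be expressed in NumPy, reproducing the toy numbers above:

```python
import numpy as np

S = np.array([[5.2, 3.3],    # S1
              [4.7, 7.5],    # S2
              [9.1, 0.6]])   # S3

no_pool   = S[-1]            # last output state      -> [9.1, 0.6]
max_pool  = S.max(axis=0)    # element-wise maximum   -> [9.1, 7.5]
mean_pool = S.mean(axis=0)   # element-wise average   -> [6.33, 3.8]
```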

5.1.2 Sigmoid activation

As explained earlier, every LSTM unit is assigned to process one input item, starting from the first item in the input sequence through to the last. The vector resulting from down-sampling is then passed to the Dense layer. The Dense layer has a sigmoid activation function of the following formula:

$$\begin{aligned} S(x) = \frac{1}{1+e^{-x}}=\frac{e^x}{e^x+1} \end{aligned}$$
(3)

The sigmoid activation function returns a value between 0 and 1. This value represents what the detector “thinks” regarding the input sequence under investigation. If the value is closer to 1, it means that the detector leans towards deciding that there is hidden TD in the input code fragment which should be self-admitted. If the value is closer to 0, it means the detector votes for the opposite.
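Putting the pieces together, the following Keras sketch assembles a single-layer version of the detector (embedding, LSTM, max-pooling, and a sigmoid Dense output). The sizes shown are illustrative rather than the exact tuned configuration; the 20% dropout rate follows Sect. 7:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Single-layer sketch of the deep learning detector:
# Embedding -> LSTM -> max-pooling -> Dense(sigmoid).
vocab_size, latent_dim = 105_671, 64        # illustrative sizes

detector = keras.Sequential([
    layers.Embedding(vocab_size, latent_dim),
    layers.LSTM(latent_dim, return_sequences=True, dropout=0.2),
    layers.GlobalMaxPooling1D(),             # max-pooling over all output states
    layers.Dense(1, activation="sigmoid"),   # close to 1.0 => recommend SATD comment
])
detector.compile(optimizer="adam", loss="binary_crossentropy",
                 metrics=[keras.metrics.Precision(), keras.metrics.Recall()])
```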

5.2 Traditional machine learning detector

An alternative to the deep learning implementation is to use traditional machine learning algorithms. We feed the TF-IDF vectors prepared by the previous module (see Sect. 4.2.2) to our traditional machine learning detector. Based on our experimentation with many machine learning algorithms, Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB) provide the performances most comparable to deep learning for SATD identification. We also experimented with Random Forest (RF) as part of replicating the benchmark’s approach (see Sect. 8.4.1).

SVM is a machine learning algorithm that maximises the margin between the class-separating hyperplane and the closest data points of the dataset’s classes (Joachims 2001). MNB implements the naive Bayes algorithm for multinomially distributed data (Rennie et al. 2003). RF (Breiman 2001) is an ensemble of Decision Trees, where each tree depends on an independent random vector and all the trees share the same distribution. SVM is a leading approach for text categorisation problems, as suggested by Kibriya et al. (2004). In addition, McCallum et al. (1998) argue that MNB proves effective with large vocabulary sizes. We refer the reader to (Kibriya et al. 2004; Joachims 1998; Xu et al. 2012) for further details on SVM, MNB, and RF for text classification.
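A scikit-learn sketch of this configuration is given below, chaining the TF-IDF vectorisation from Sect. 4.2.2 with each classifier. The hyper-parameters (and the linear SVM kernel) are assumptions; the paper does not prescribe them here:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Each detector is a pipeline: SBT token strings -> TF-IDF vectors -> classifier.
def make_detector(classifier):
    return make_pipeline(
        TfidfVectorizer(lowercase=False, token_pattern=r"\S+", smooth_idf=False),
        classifier)

svm_detector = make_detector(LinearSVC())
mnb_detector = make_detector(MultinomialNB())
rf_detector  = make_detector(RandomForestClassifier())

# Usage (labels: 1 = SATD should be self-admitted, 0 = otherwise):
# svm_detector.fit(train_sbt_strings, train_labels)
# predictions = svm_detector.predict(test_sbt_strings)
```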

6 SATD comment generation

The purpose of the SATD Comment Generation module is to generate an appropriate SATD comment that describes the TD in a code fragment. We implement the comment generator in this module using the deep learning encoder-decoder model which employs the sequence-to-sequence (seq2seq) learning method (Cho et al. 2014; Sutskever et al. 2014). The encoder and decoder are two LSTM-based networks. We also incorporate the Attention mechanism (see Sect. 6.1) and Beam search (see Sect. 6.3) into our generator. We examine the generator’s performance in multiple hyper-parameter settings (explained in Sect. 8.5). An internal view of our SATD comment generator is depicted in Fig. 5 (adapted from Luong et al. 2015).

The embedding and LSTM layers of our comment generator operate in the same way as those in our deep learning detector (described in Sect. 5.1). The difference between the generator and the detector (other than the generator being a composite of two LSTM-based networks) lies in the top layers. At every time step \(\varvec{t}\), the generator passes the output state \(\varvec{S_t}\) to the Attention layer (see Sect. 6.1) instead of a Dense layer. In addition, the encoder passes its last output state \(\varvec{S_n}\) to the first LSTM unit in the decoder to accompany the embedding of the pre-first target comment word \(\varvec{W_{out\_0}}\) (which we define as <sos>). The LSTM layer(s) in the decoder produces the first output state \(\varvec{S_1}\). \(\varvec{S_1}\) is passed to the next LSTM unit as well as to the Attention layer. The vector resulting from the Attention layer is passed to the Dense layer in order to predict the first target word \(\varvec{W_{out\_1}}\). For predicting every target word \(\varvec{W_{out\_t}}\), the decoder is fed with the previous target word \(\varvec{W_{out\_t-1}}\). This training technique is called teacher forcing [35], where the decoder is trained to generate the target sequence offset by one time step.

Fig. 5

The structure of our deep learning generator (single-layered encoder and decoder). The number of layers is determined by the number of stacked LSTM layers

6.1 Attention mechanism

The Attention Mechanism has demonstrated remarkable improvements in Neural Machine Translation (NMT) tasks (Bahdanau et al. 2014). We add an Attention layer to our SATD comment generator to align certain items in the input and output sequences. When predicting an output comment word \(\varvec{W_{out\_t}}\) at time step \(\varvec{t}\), the Attention layer determines how much each token in the input sequence \(\varvec{W_{in\_1}},...,\varvec{W_{in\_n}}\) contributes to generating the current output word \(\varvec{W_{out\_t}}\). Without the Attention layer, all the input tokens \(\varvec{W_{in\_1}},...,\varvec{W_{in\_n}}\) would have the same weight when predicting \(\varvec{W_{out\_t}}\), which is less effective since certain input tokens can map more closely than others to the current output comment word. The Attention layer adjusts the weight mappings between input and output sequences gradually during training.
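The sketch below pulls the generator together in Keras: encoder, decoder with teacher forcing, a Luong-style keras.layers.Attention layer, and the softmax output described next. The sizes and wiring details are illustrative, not the exact architecture or tuned setting used in the paper:

```python
from tensorflow import keras
from tensorflow.keras import layers

src_vocab, tgt_vocab, latent = 105_671, 9_058, 256   # illustrative sizes

# Encoder: SBT token indices -> sequence of output states (plus final state).
enc_in = keras.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, latent)(enc_in)
enc_seq, enc_h, enc_c = layers.LSTM(latent, return_sequences=True,
                                    return_state=True)(enc_emb)

# Decoder: previous comment words (teacher forcing) -> next-word distributions.
dec_in = keras.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, latent)(dec_in)
dec_seq = layers.LSTM(latent, return_sequences=True)(
    dec_emb, initial_state=[enc_h, enc_c])

# Attention: weight the encoder states for every decoder time step.
context = layers.Attention()([dec_seq, enc_seq])
merged = layers.Concatenate()([dec_seq, context])
outputs = layers.Dense(tgt_vocab, activation="softmax")(merged)

generator = keras.Model([enc_in, dec_in], outputs)
generator.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
```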

6.2 Softmax activation

The last layer in the decoder is a Dense layer with a softmax activation function:

$$\begin{aligned} \sigma (x)_i = \frac{e^{x_i}}{\sum _{v=1}^{|\mathcal {V}|} e^{x_v}} \end{aligned}$$
(4)

If we have \(|\mathcal {V}|\) words in our target vocabulary (in this case, the comment vocabulary), the softmax activation function gives a probability value between 0 and 1 to each word in the vocabulary for the prediction at the current time \(\varvec{t}\), where the sum of all these values is 1. The model then nominates the word with the highest probability value to be the predicted output word \(\varvec{W_{out\_t}}\) for the current position \(\varvec{t}\) in the comment sentence.

6.3 Beam search

By default, the decoder uses greedy search to predict the next word in the output sequence. Although this approach is often effective, it is suboptimal in some cases. In beam search, all the possible output words for the next step are generated, and the algorithm keeps track of the \(\varvec{k}\) most likely candidate sequences (in our case, partial comment sentences up to the next output word). \(\varvec{k}\) is also known as the beam width. Greedy search is therefore a special case of beam search where \(\varvec{k} = 1\). Increasing the number of kept candidates \(\varvec{k}\) typically increases the chance of finding the best candidate output sequence at the expense of a potentially drastic decrease in decoding speed (Yoav and Graeme 2017; Russell and Norvig 2002; Freitag and Al-Onaizan 2017). We incorporate beam search in the comment generation process, which allows our model to generate multiple candidate comments for every code fragment that requires a SATD comment (more details in Sect. 9.2.4).
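A minimal, framework-agnostic sketch of the decoding loop is shown below. `next_word_probs` is a hypothetical stand-in for one decoder step returning a {word: probability} map; with k = 1 the loop reduces to greedy search:

```python
import math

def beam_search(next_word_probs, k=3, max_len=150):
    # Each candidate is (cumulative log-probability, word sequence).
    beams = [(0.0, ["<sos>"])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "<eos>":          # finished sentence: keep as-is
                candidates.append((score, seq))
                continue
            for word, p in next_word_probs(seq).items():
                candidates.append((score + math.log(p), seq + [word]))
        # Keep only the k most likely candidate sequences (the beam width).
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(seq[-1] == "<eos>" for _, seq in beams):
            break
    return beams   # k candidate SATD comments, best first
```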

7 Model training

The training data is fed to our deep learning components in batches. For each batch, the neural network performs two training tasks: the feed-forward task and then the back-propagation task. In the feed-forward task, the model processes the input batch and calculates the predictions. In the back-propagation task, the model measures the error distance between the actual outputs (i.e. the ground truth) and the predicted outputs of the current batch, and tweaks its parameters accordingly. By performing the two training tasks, the model completes one training step. Feeding the data in batches to the model has multiple benefits. Firstly, it accelerates the training process compared with feeding the model only one example at a time. Secondly, it introduces noise to the model, which helps prevent over-fitting to the training data. Nonetheless, large batch sizes can be computationally expensive and reduce the prediction accuracy. The batch size is one of the model’s hyper-parameters that we consider during hyper-parameter tuning (Sect. 8.2.1).

We apply an over-fitting prevention strategy called dropout (Srivastava et al. 2014). At every training step, this strategy selects a random proportion of the neural network’s nodes and stops them from processing the batch’s examples. This is useful because some nodes tend to dominate the training weights. By setting a dropout rate (we set ours to 20%), we let the remaining network nodes (80%) process the current batch, which helps avoid the weight dominance issue.

In the SATD Identification module, the deep learning detector’s objective is to maximise the likelihood of predicting the target label (i.e. 1 if an input code fragment contains TD, and 0 otherwise). Suppose that the target label for a data point is \(\varvec{y}\) and the model prediction is \(\varvec{p}\). We already know the true value of \(\varvec{y}\) from the ground truth. The model uses this information to learn its weights. We measure the accuracy of our prediction \(\varvec{p}\) by calculating the log-loss (also known as the cross entropy) between \(\varvec{y}\) and \(\varvec{p}\):

$$\begin{aligned} -(y\log (p)+(1-y)\log (1-p)) \end{aligned}$$
(5)

In the SATD Comment Generation module, the generator’s objective is to maximise the likelihood of predicting the next target comment word. The target word at each time step is a word from our vocabulary \(\mathcal {V}\). We can look at this situation as a multi-class classification problem. Let \(\varvec{M}\) be the number of words in our target vocabulary (\(M=|\mathcal {V}|\)). We treat the problem as if we have \(\varvec{M}\) different classes. Suppose that at the current time step we are trying to predict the word todo, which is the fifth word in the vocabulary (\(\varvec{c}=5\)), and that we have only ten words in our vocabulary (\(\varvec{M}=10\)). Since it is a multi-class classification problem, we will have 10 different binary indicators (\(\varvec{y_{c,o}}\)) for the current prediction. The indicator \(\varvec{y_{c,o}}\) is 1 only if the observation \(\varvec{o}\) is the same as the actual class (in this case, todo for both \(\varvec{c}\) and \(\varvec{o}\)); it is 0 for the remaining 9 binary indicators (e.g. \(\varvec{c}\) is todo and \(\varvec{o}\) is hack). Therefore, the cross entropy for the current word prediction is calculated as follows:

$$\begin{aligned} -\sum _{c=1}^{M} y_{c,o}\log (p_{c,o}) \end{aligned}$$
(6)

Once the cross entropy is computed, a model optimiser is used to update the model parameters in the opposite direction of the gradient of the log-loss. We use the Adam optimiser (Kingma and Ba 2014) in our deep learning detector and the RMSprop optimiser (Choetkiertikul et al. 2018) in our generator to obtain the best model weights (i.e. model parameters) possible.
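For concreteness, the two losses in Eqs. (5) and (6) can be computed directly; the values below are toy numbers, not model outputs:

```python
import numpy as np

# Detector loss (Eq. 5): binary log-loss for ground truth y and prediction p.
y, p = 1, 0.9
detector_loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))      # ~0.105

# Generator loss (Eq. 6) for one time step: target word "todo" is class c = 5
# out of a toy vocabulary of M = 10 words.
y_onehot = np.zeros(10)
y_onehot[4] = 1.0                        # index 4 holds the fifth word
p_softmax = np.full(10, 0.05)
p_softmax[4] = 0.55                      # softmax output; the 10 values sum to 1
generator_loss = -np.sum(y_onehot * np.log(p_softmax))          # ~0.598
```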

8 Evaluation

We implement SATDID using scikit-learn (Footnote 3), a machine learning library in Python, and Keras (Footnote 4), a Python deep learning library running on top of TensorFlow (Abadi et al. 2015), a machine learning platform. We also utilise JavaParser (Footnote 5) to help build the ASTs of source code fragments. To contribute to the software engineering research community, we have made our source code, dataset, and results publicly accessible (Footnote 6).

We explain our dataset collection and pre-processing in Sect. 8.1. We describe our experimental setup in Sect. 8.2. The evaluation metrics used for our study are presented in Sect. 8.3. We present the results of our experiments for SATD Identification and SATD Comment Generation in Sects. 8.4 and 8.5, respectively.

8.1 Dataset

In Sect. 8.1.1, we explain the criteria and the procedures of our dataset collection. In Sect. 8.1.2, we describe the steps taken to pre-process our dataset in order to have it in a framework-ready state.

8.1.1 Data collection

To train our framework, we need to prepare a dataset of code-comment pairs, where some of the pairs are SATD pairs and some are not. Since conditional statements are said to be error-prone program elements (Martinez and Monperrus 2015; Xuan et al. 2017) [46], this study focuses on pairs of conditional statements and comments.

Code comments were collected with the same procedure as in a previous study (Hata et al. 2019), which had targeted active software development repositories on GitHub. We targeted repositories written in Java. Active software development repositories were selected from the MySQL database dump 2018-04-01 of the GHTorrent datasets (Gousios 2013) with the following criteria (Hata et al. 2019): (i) more than 500 commits (the same threshold used in previous work (Aniche et al. 2018)), and (ii) at least 100 commits in the most active two years (to remove long-term less active projects and short-term repositories, which may not be software development projects (Munaiah et al. 2017)).

From the collected 4,995 Java repositories, single comments and the conditional statements immediately following them were collected as code-comment pairs. By analysing the AST of each source file with an ANTLR4-based Java parser, “outermost” if-statements were identified. We ignored inner conditional statements enclosed in another if-statement. A sequence of else-if branches (e.g. if-else-if-else-if ...else) is regarded as a single if-statement. An if-statement is linked to a comment if the comment satisfies the following two conditions: (i) it appears between the if keyword and its previous non-comment token, and (ii) its character position in the line is the same as that of the if-statement. If multiple comments were linked to an if-statement, we removed such pairs from our dataset.

From the extracted comments, we prepare SATD and non-SATD comments using the following keywords from a previous study (Huang et al. 2018); a sketch of this labelling rule is given after the list.

  • SATD comments: including at least one of the common 14 single keywords of todo, fixme, hack, workaround, yuck, ugly, stupid, nuke, kludge, retarded, barf, crap, silly, and kaboom.

  • Non-SATD comments: excluding all the above 14 keywords and other frequently appearing 22 keywords of implement, fix, ineffici, xxx, broken, ill, should, need, here, better, why, method, could, work, probabl, not, move, more, make, code, but, and author.
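The keyword lists above translate into the following labelling sketch; the substring-matching behaviour (which also covers stems such as probabl and ineffici) is our assumption about how the rule is applied:

```python
SATD_KEYWORDS = {
    "todo", "fixme", "hack", "workaround", "yuck", "ugly", "stupid",
    "nuke", "kludge", "retarded", "barf", "crap", "silly", "kaboom"}
EXCLUDED_KEYWORDS = SATD_KEYWORDS | {
    "implement", "fix", "ineffici", "xxx", "broken", "ill", "should", "need",
    "here", "better", "why", "method", "could", "work", "probabl", "not",
    "move", "more", "make", "code", "but", "author"}

def label_comment(comment):
    """Return 'SATD', 'non-SATD', or None if the comment is left unlabelled."""
    text = comment.lower()
    if any(keyword in text for keyword in SATD_KEYWORDS):
        return "SATD"
    if not any(keyword in text for keyword in EXCLUDED_KEYWORDS):
        return "non-SATD"
    return None
```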

We obtained 5,313 SATD code-comment pairs and 839,431 non-SATD code-comment pairs. In the collected 5,313 SATD pairs, there are 2,851 distinct comment contents.

To understand the characteristics of the collected SATD code-comment pairs, a statistically representative sample of the distinct SATD comments was analysed. The required sample size was calculated so that the observed proportions would generalise to all comments with a confidence level of 95% and a confidence interval of 5, and we obtained a sample of 339 code-comment pairs (Footnote 7).

Three authors independently investigated the same 20 pairs to determine whether (i) the comment represents technical debt in the code, and (ii) the conditional statements are single or multiple (Footnote 8). The Kappa agreement levels were (i) 0.90 and (ii) 0.88, which indicate “almost perfect” agreement (Viera and Garrett 2005). Based on this encouraging result, the remaining data was investigated by a single author. In the statistically representative sample of 339 code-comment pairs, we found that (i) 298 (88%) are actually SATD pairs. Within the 298 SATD pairs, 270 (91%) code segments are single if-statements. We consider this result promising for our experiments, as it indicates that the collected SATD code-comment pairs contain little noise and that the obtained conditional statements are not overly complex, which is beneficial for learning SATD code-comment patterns.

8.1.2 Data pre-processing

To avoid Out Of Memory (OOM) and data noise issues, we set the maximum lengths for input sequences and comment sentences to 1500 tokens and 150 words, respectively. We do not truncate AST sequences and comments. Truncation could be useful in accelerating classification tasks (e.g. SATD Identification). However, it could harm the SATD Comment Generation task since truncated words in the output sentence could map to tokens in the input sequence and vice versa. Thus, data points longer than the maximum lengths are ignored. We reserve an <UNKN/PAD> token for (i) padding during model training and (ii) replacing, at model validation/testing time, input tokens that have not been seen during model training.

For processing the comments, we ignore numbers, non-English text, and special characters. A start-of-sentence token, <sos>, is added to the beginning of every comment, and an end-of-sentence token, <eos>, is added to the end of every comment. As can be seen in Fig. 5, we need the start-of-sentence token to signal to the deep learning model to generate the actual first word in the comment, and we need the end-of-sentence token as a signal for the model to stop the comment generation process.

To ensure that there is no bias towards a subset of the dataset, and to avoid data leakage to the training set (Kaufman et al. 2012), we enforce a strict rule that removes all duplicate instances so that every data point in the dataset is unique. Additionally, we apply a data randomisation procedure using the Mersenne Twister pseudorandom number generator (Matsumoto and Nishimura 1998) to avoid order biases that may have occurred at data collection time.
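These pre-processing steps can be sketched as follows. The cleaning regex and the random seed are assumptions; the paper only states that numbers, non-English text, and special characters are ignored, and CPython's random module provides the Mersenne Twister generator used for shuffling:

```python
import random
import re

MAX_CODE_LEN, MAX_COMMENT_LEN = 1500, 150

def preprocess_comment(comment):
    # Keep alphabetic words only, then add the sentence boundary tokens.
    words = re.findall(r"[A-Za-z]+", comment.lower())
    return ["<sos>"] + words + ["<eos>"]

def prepare(pairs):
    # pairs: iterable of (sbt_tokens, raw_comment) tuples.
    kept = []
    for code_tokens, comment in pairs:
        words = preprocess_comment(comment)
        if len(code_tokens) <= MAX_CODE_LEN and len(words) <= MAX_COMMENT_LEN:
            kept.append((tuple(code_tokens), tuple(words)))
    kept = list(dict.fromkeys(kept))        # drop exact duplicates
    random.Random(42).shuffle(kept)         # CPython's random = Mersenne Twister
    return kept
```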

After performing the pre-processing steps, the number of SATD code-comment pairs shrinks from 5,313 to 3,022. Previous studies (e.g. Potdar and Shihab 2014; Zampetti et al. 2017; Huang et al. 2018) suggest that the percentage of SATD code in software projects ranges from about 0.5% to 31%. For the SATD Identification experiment, we followed Potdar and Shihab (2014), who report that the average percentage of SATD in software projects is 10.4%. Therefore, we applied down-sampling to the non-SATD class so that the ratio of SATD to non-SATD pairs in our dataset is around 1.4:8.6. We use both SATD and non-SATD pairs in order to teach the intelligent detectors to distinguish between the characteristics of SATD and non-SATD pairs. We use the rest of the non-SATD pairs (those not used to train the detectors) in the pre-training experiment (Sect. 8.4.2). The dataset has 105,671 unique input tokens and 9,058 unique comment words.

8.2 Experimental setup

We perform and report the results of the 10-fold Cross Validation (CV) of the intelligent components in our framework. For the deep learning components, we perform a Hyper-Parameter Tuning step first to find the appropriate hyper-parameter settings for the CV step.

8.2.1 Hyper-parameter tuning

This step is performed to search for the optimal set of model hyper-parameters. It requires a separate validation set that is not used for testing during CV. We refer to this set as the “tuning” set instead of the validation set to avoid confusion with cross validation. The tuning set is a stratified 10% proportion of the entire dataset. In the hyper-parameter tuning step, we train our deep learning models on the remaining 90% of the dataset multiple times while tuning the hyper-parameters each time. The best hyper-parameter settings are chosen according to the models’ performance results on the tuning set. The nominated hyper-parameter settings are then used in the main CV step (Sect. 8.2.2).

We experiment with a set of four hyper-parameters: the batch size, the number of layers, the layer size, and the embedding size. The batch size is discussed in Sect. 7. The number of layers determines how many layers our LSTM network has. We experiment with one, two, and three layers. We combine the last two hyper-parameters (i.e. layer size and embedding size) into one super hyper-parameter that we call the latent dimensionality. When we experiment with one and two layers, the sizes of the embedding and LSTM layers remain the same as the latent dimension. When we experiment with three layers, for the detector, the sizes of the embedding layer and the second LSTM layer are the same as the latent dimension, while the first LSTM layer is double the latent dimension and the last LSTM layer is half of it. For the generator, all the layers are equal to the size of the latent dimension.

8.2.2 10-Fold cross validation

We perform 10-fold cross validation (CV) [55] on the entire dataset except the tuning set introduced in Sect. 8.2.1. By that, we guarantee that every data point outside the tuning set is tested against exactly once. During CV, we use the tuning set for training but not for testing. In other words, the tuning set is included in the training set of each of the folds in the 10-fold CV. This gives the model more observations to learn from, as we put the tuning set to use instead of neglecting it. For the detectors, CV is stratified. CV is not stratified for the generator since stratification is not applicable there. The 10-fold CV is the main step whose performance will be evaluated and discussed next.
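A scikit-learn sketch of this setup is shown below with placeholder data; the essential point is that the tuning split is carved out once and then re-joined to every training fold, never to a test fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))             # placeholder feature vectors
y = rng.integers(0, 2, size=200)           # placeholder SATD labels

# Stratified 10% tuning split (Sect. 8.2.1); the rest is used for CV.
cv_idx, tune_idx = train_test_split(np.arange(len(y)), test_size=0.1,
                                    stratify=y, random_state=42)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X[cv_idx], y[cv_idx]):
    train = np.concatenate([cv_idx[train_idx], tune_idx])   # tuning set joins training
    test = cv_idx[test_idx]
    # detector.fit(X[train], y[train]); detector.predict(X[test]) ...
```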

8.3 Evaluation metrics

8.3.1 Precision, recall, and F-1 scores

We treat SATD Identification as a classification problem. Thus, we use Precision, Recall, and F1-Score to evaluate our explored approaches and compare their performances against the benchmarks. Precision indicates how often the classifier is correct when it claims that an instance is SATD. Recall indicates the proportion of actual SATD instances that the classifier is able to catch. Depending on the requirements of the project/situation, if practitioners do not care about identifying all SATD observations as much as about the correctness of the identified ones, models with higher Precision should be considered. On the other hand, if they aim to identify as many SATD observations as possible and care less about the correctness of the identified ones, models with higher Recall should be adopted. The F1-Score is a measure that combines Precision and Recall and is ideal for situations where the two metrics are equally important.
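A toy computation of the three scores (with label 1 denoting SATD) using scikit-learn:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # ground truth (1 = SATD)
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]   # detector output
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)
# precision = 2/3, recall = 2/4, f1 = 2*P*R / (P + R) ~ 0.571
```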

8.3.2 Bleu-n score

We treat SATD comment generation as a translation problem. We use variations of the cumulative Bleu score (Papineni et al. 2002) to evaluate our approach. It has become standard practice to use the Bleu score to evaluate the performance of Neural Machine Translation (NMT). Bleu measures the similarity between the generated comments (the candidates) and the original comments from the ground truth in the dataset that were written by the developers (the references). The Bleu score produces a value between 0 and 1, inclusive, indicating how close the candidates are to the references: the higher the score, the closer the match. For example, if a candidate is identical to a reference, the Bleu score is 1. Bleu-n calculates the cumulative similarity of n-grams of text. For example, Bleu-4 calculates the similarity of 1-grams, 2-grams, 3-grams, and 4-grams, and then computes their weighted geometric mean. We report the results of Bleu-1, Bleu-2, Bleu-3, and Bleu-4 as percentages.
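For illustration, the cumulative Bleu-1 to Bleu-4 scores for a single candidate/reference pair can be computed with NLTK as below; the two comments are made-up examples, and in the actual evaluation the scores are computed over the whole test fold:

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = "todo remove this workaround once the parser is fixed".split()
candidate = "todo remove this hack once the parser is fixed".split()

smooth = SmoothingFunction().method1      # avoids zero scores for missing n-grams
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)        # cumulative weights (weighted geometric mean)
    score = sentence_bleu([reference], candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"Bleu-{n}: {100 * score:.2f}")
```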

8.3.3 Acceptability and understandability

We also perform a human evaluation, carried out by two of the authors, on the comments generated by SATDID using two criteria, namely Acceptability and Understandability (Oda et al. 2015), to evaluate whether the generated comments are easy to understand, especially for inexperienced programmers. We assigned a 5-level score (from 1 to 5) to indicate the acceptability of the generated comments, and a 6-level score (from 0 to 5) to indicate how well the annotators understood the generated comments.

8.4 SATD identification

We experimented with the following hyper-parameter sets (Footnote 9):

  • Latent Dimensionality: (8, 16, 32, 64, 128, 256)

  • Number of Layers: (1, 2, 3)

  • Batch Size: (8, 16, 32, 64, 128, 256, 512)

We also experimented with mean-pooling, max-pooling, and last-vector (i.e. no-pooling). As pooling shows consistent performance improvements, we nominate the three best performing hyper-parameter settings with max-pooling and mean-pooling for the 10-fold CV step. Table 1 lists the average Precisions, Recalls, and F1-Scores of the 10-fold CV step. The results are ordered according to the F1-Score.

Generally, adopting deep learning for this problem produces higher scores. Nonetheless, traditional machine learning algorithms provide comparable results. For the deep learning detector, we can see that, of the best six hyper-parameter settings presented in Table 1, three have their Latent Dimensionality set to 32 and Batch Size set to 256. Furthermore, none of the best six has 3 LSTM layers. For the traditional machine learning detector, MNB provides a higher F1-Score than SVM. However, SVM provides the highest Precision score (41.5%) amongst all the tested models. The highest Recall (29.8%) and F1-Score (31.1%) amongst all the tested models were achieved by the LSTM model with {[64, 1, 64], max} for the {[Latent Dimensionality, Number of Layers, Batch Size], pooling technique}.

Table 1 Average Precisions (P), Recalls (R), and F1-Scores (F1) of stratified 10-fold cross validation of our approach

8.4.1 Benchmarks

We benchmark against a machine learning tool (TEDIOuS) and a static analysis tool (SonarQube) for SATD Identification. We evaluate our approach by replicating/applying these benchmarks, running them on our dataset, and comparing their results with SATDID’s.

TEDIOuS: Zampetti et al. (2017) developed a Random-Forest-based approach called TEDIOuS. When a developer writes a new piece of code, TEDIOuS recommends whether they should self-admit “design” technical debt. To the best of our knowledge, this is the only existing work in the SATD field that analyses the source code instead of the comment. Unlike our approach, they only focus on design debt, and they build the feature space using source code metrics instead of the concrete source code. More details of Zampetti et al.’s approach can be found in Sect. 10. We replicate their approach and use it as a benchmark. Table 2 lists TEDIOuS’s results alongside the other experiments, ordered by F1-Score, highest first.

SonarQube: Static Analysis Tools (SATs) are prominently used as a means to improve code quality by revealing recurrent code violations without incurring the costs of running the program (Marcilio et al. 2019). One of the most popular SATs is SonarQube (Footnote 10). SonarQube is an automatic code review tool that detects bugs, vulnerabilities, and code smells in the code. We use SonarQube as another benchmark in order to compare our approach with SATs in recommending SATD. To conduct this experiment, we leverage SonarQube’s code smell analysis capability. Table 2 provides a comparison of the results of using SonarQube alongside the other experiments.

Table 2 shows the results of SATDID implemented using both deep learning and traditional machine learning. We also experimented with two pre-training styles and without pre-training (see Sect. 8.4.2 for pre-training details). The highest F1-Score (31.1%) was achieved by our LSTM model with {[64, 1, 64], max}. This provides 31.78 and 475.93% improvements over TEDIOuS and SonarQube, respectively. The highest Precision score (41.5%) was achieved by our SVM (see Table 1), with 23.88 and 21.35% improvements over TEDIOuS and SonarQube. The highest Recall score (29.8%) was also achieved by our LSTM model with {[64, 1, 64], max}. This provides 59.36 and 1,046.15% improvements over TEDIOuS and SonarQube. Therefore, our approach outperforms the two benchmarks in all the evaluation metrics. We attribute this result to SATDID’s efficient feature extraction (executed by the Data Vectorisation module) and learning capabilities (provided by the SATD Identification module).

8.4.2 Pre-training

The purpose of this experiment is to see if pre-training can provide better initial model weights for the main training than random weight initialisation. The negligible difference between the deep learning and traditional machine learning results shown in Table 1 further motivated us to attempt pre-training. We tried two pre-training methods: end2end and embedding pre-training with traditional machine learning.

In end2end, we train an LSTM model on predicting the next token in the input sequence. This results in pre-trained embedding and LSTM layers. When we then train the model for SATD Identification, we use the pre-trained layers, whose weights are no longer randomly initialised, to see if this provides improved results.

In embedding pre-training with traditional machine learning, we also train an LSTM model on predicting the next token in the input sequence. However, when we train the model for SATD Identification, we only use the embedding layer of the pre-trained model. We extract the vector representation of each token from the pre-trained embedding layer. Then, for every input sequence in our dataset, we take the mean-pooling of the embeddings (i.e. vectors) of its tokens in order to represent it as one vector. After that, the resulting vectors are fed to a traditional machine learning model. We tried different traditional machine learners for this experiment and found that SVM is the best performing one.

Table 2 lists the results with and without pre-training alongside the other experiments, ordered by F1-Score, highest first. Between the two pre-training styles, end2end DLD achieved a higher F1-Score (30.8%) and Recall score (29.3%), while embeddings with TMLD achieved a higher Precision score (34.1%). Contrary to our expectation, pre-training did not improve SATDID’s performance. In terms of F1-Score, end2end DLD and embeddings with TMLD show -0.96 and -4.78% performance declines relative to our deep learning and TML detectors, respectively. However, the pre-trained models still outperform the benchmarks. end2end DLD provides 30.51 and 470.37% improvements over TEDIOuS and SonarQube, respectively. embeddings with TMLD provides 1.27 and 342.59% improvements over TEDIOuS and SonarQube.

Table 2 Average results of Stratified 10-fold Cross Validation in comparison with two pre-training methods and two benchmarks

8.5 SATD comment generation

The experimental setup of our generator differs slightly from the detectors’ for two reasons. First, the time and space complexity of training the generator is much higher than that of the detectors. Second, the generator produces noticeably different results every time we tune the hyper-parameters. We experiment with the following hyper-parameter sets:

  • Latent Dimensionality: (512, 1024, 2048)

  • Number of Layers: (1, 2)

  • Batch Size: (32, 64)

8.5.1 Ground-truth evaluation

Table 3 lists the results from both the hyper-parameter tuning and 10-fold CV steps, ordered by the Bleu-4 score, highest first. The highest Bleu-n scores in the hyper-parameter tuning step were achieved by the LSTM model with [1024, 1, 64] for [Latent Dimensionality, Number of Layers, Batch Size]. Therefore, this hyper-parameter setting was nominated for the 10-fold CV step.

During the hyper-parameter tuning step, we started by setting the Latent Dimensionality to 512 and gradually increased it. The generator’s behaviour clearly showed that setting the Latent Dimensionality to 1024 produces higher Bleu-n scores, while 512 and 2048 decreased the scores. Increasing the Number of Layers to 2 gives the lowest Bleu-n scores, so we kept experimenting with 1. This suggests that it is sometimes not ideal to over-complicate the model, as that may lead to over-fitting to the training set and suppress the model’s ability to generalise. We experimented with 64 and 32 for the Batch Size and found that 64 trains faster and produces higher scores. Increasing the Batch Size beyond 64 caused OOM issues.

Table 4 shows some SATD comments generated by SATDID in comparison with human-written SATD comments from the ground-truth. The first example shows a generated comment that is identical to the comment written by the human developer. The second example shows minor differences, while the third example shows a comment that is totally different from the human-written one.

Table 3 Bleu-n scores of the hyper-parameter tuning step followed by the average Bleu-n scores of the 10-fold cross validation step
Table 4 Sample model-generated SATD comments compared with human-written comments from the ground-truth

8.5.2 Generic comment generation

We have also compared our approach against existing techniques for generating comments from code. One of the prominent approaches was proposed by Hu et al. (2018). Hence, we have replicated their approach and run it on our dataset in order to benchmark our approach against it. For a detailed explanation of the differences between our approach and (Hu et al. 2018)’s, refer to Sects. 9.2.1, 9.2.2, 9.2.3, and 9.2.4.

Table 3 shows the Bleu-n scores of SATDID and the benchmark (Hu et al.). The improvements our framework provides over the benchmark are clear: SATDID achieves 88.54, 232.56, 400, and 583.33% improvements in terms of Bleu-1, Bleu-2, Bleu-3, and Bleu-4, respectively. We attribute the performance improvements of our framework to our focus on SATD code-comment pairs as well as to our approach to the following four aspects (discussed in detail in Sect. 9.2): dealing with long sequences, dealing with Out of Vocabulary (OOV) tokens, model hyper-parameters, and including beam search.

Table 5 Performance of human evaluation between two evaluators according to Acceptability and Understandability

8.5.3 Human evaluation

The third and fourth authors independently evaluated 337 randomly selected samples. The results of the evaluation, presented in Table 5, show that the comments generated by SATDID are relatively acceptable and understandable by humans. This is indicated by the mean scores for Acceptability and Understandability of 3.128 and 3.172, respectively. The assessments of the individual evaluators are quite similar, as can be seen in the high positive correlation values: 0.791 for Acceptability and 0.783 for Understandability.

9 Discussion

In this section, we discuss some implications and lessons learned based on the results from our evaluation of the SATD Identification (Sect. 9.1) and SATD Comment Generation (Sect. 9.2) experiments. We also discuss the threats to the validity of our approach (Sect. 9.3).

9.1 SATD identification

9.1.1 Deep learning versus traditional machine learning

Our experiments show that the deep learning detector performed only slightly better than the traditional machine learning detector. We therefore recommend using the traditional machine learning detector due to its significantly shorter training time (a few seconds compared to \(\sim\)4 hours for training the deep learning detector). If one still prefers deep learning, we recommend adopting a pooling technique, especially max-pooling, as it consistently improves the model's performance.
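
As an illustration of the pooling recommendation, the sketch below adds max-pooling over the per-token outputs of a recurrent SATD detector; it is a simplified example with placeholder sizes, not our exact architecture.

# Sketch of max-pooling over the recurrent outputs of a deep-learning
# SATD detector. Sizes are hypothetical placeholders.
from tensorflow import keras
from tensorflow.keras import layers

VOCAB, EMBED_DIM, UNITS = 30_000, 256, 128  # placeholders

inputs = layers.Input(shape=(None,), name="code_tokens")
x = layers.Embedding(VOCAB, EMBED_DIM)(inputs)
x = layers.LSTM(UNITS, return_sequences=True)(x)    # keep per-token outputs
x = layers.GlobalMaxPooling1D()(x)                  # max-pool across the sequence
outputs = layers.Dense(1, activation="sigmoid")(x)  # SATD vs. non-SATD

detector = keras.Model(inputs, outputs)
detector.compile(optimizer="adam", loss="binary_crossentropy",
                 metrics=[keras.metrics.Precision(), keras.metrics.Recall()])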

9.1.2 Pre-training

While pre-training is commonly used in practice, we do not recommend it for SATD Identification. Pre-training is an alternative to random weight initialisation. However, we found that, in our setting, pre-training did not improve the model's performance despite the substantial amount of time it took to complete. We attribute this behaviour to the size of our problem and dataset. Our dataset is relatively small and focuses on SATD code-comment pairs. Therefore, our models are able to learn during training and do not need the extra layer of complexity introduced by pre-training.

9.1.3 Comparison with the benchmarks

A key difference between our approach and TEDIOuS in SATD Identification is the input format to the model. TEDIOuS uses source code metrics as input features. Our approach, on the other hand, vectorises the concrete source code tokens and is hence able to capture (through learning) their syntactic and semantic structures. The reported performance improvement of our framework over the benchmarks suggests the effectiveness of our input format.

We also note the limitations of using static analysis tools for SATD identification. Although SonarQube achieves comparable (but not higher) Precision in our experiments, it falls short on Recall and F1. We attribute this to SonarQube specialising in detecting “code smells”, whereas our framework is concerned with broader technical debt patterns.

9.2 SATD comment generation

Our study focuses on TD-affected source code. Our model examines the code to determine when a SATD comment should be associated with it. If a code fragment is determined to be TD-affected, the model generates the appropriate SATD comment. Generating a comment for every code fragment, whether TD-affected or not, is beyond our scope, and beyond the scope of Self-Admitted Technical Debt for that matter.

Sections 9.2.1, 9.2.2, 9.2.3, and 9.2.4 describe the differences between our approach and the benchmark’s in comment generation. Section 9.2.5 discusses our evaluation process of the generated SATD comments, and Sect. 9.2.6 discusses three sample SATD comments generated by our model.

9.2.1 Dealing with long sequences

Long sequences cause memory issues. We ignore long sequences in our approach, while Hu et al. truncate them. When a sequence is truncated, target words can lose the input tokens they correspond to (and vice versa), and the model is forced to map only the non-truncated elements of the source and target sequence pairs onto each other. This can result in incorrect mappings. To avoid that, we discard sequences that exceed the thresholds instead of forcing incorrect mappings. When we replicated the benchmark, we increased the maximum lengths for code and comments from 400 and 30 to 1500 and 150, respectively, to give it a chance of better performance and to match the maximum lengths used in our approach. Despite that, the incorrect mappings still had a negative impact, as shown in the reported results.
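
The sketch below contrasts the two strategies, assuming (code, comment) pairs of token lists and the maximum lengths mentioned above; discarding keeps only pairs that fit within the thresholds, whereas truncating keeps every pair but cuts its sequences.

# Sketch contrasting the two strategies for over-long code/comment pairs:
# discarding them (our approach) versus truncating them (the benchmark).
MAX_CODE_LEN, MAX_COMMENT_LEN = 1500, 150

def discard_long_pairs(pairs):
    """Keep only pairs whose code and comment fit within the thresholds."""
    return [(code, comment) for code, comment in pairs
            if len(code) <= MAX_CODE_LEN and len(comment) <= MAX_COMMENT_LEN]

def truncate_long_pairs(pairs):
    """Cut each sequence at its threshold; the surviving tokens may no longer
    have their counterparts on the other side, inviting incorrect mappings."""
    return [(code[:MAX_CODE_LEN], comment[:MAX_COMMENT_LEN])
            for code, comment in pairs]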

9.2.2 Dealing with out of vocabulary tokens

A large vocabulary size contributes to the time and space complexity of the problem. Hu et al. apply a threshold on the vocabulary size, while we decided to retain the entire vocabulary. Not only were we able to retain the entire vocabulary, our models also trained faster. Thus, the time and space savings gained by the benchmark's approach of limiting the vocabulary size are not necessary in our approach, and we benefit from the increased accuracy of retaining the full vocabulary. To replicate Hu et al.'s approach, only the most frequent 3.78% of the source code tokens in our dataset are included. Out of Vocabulary (OOV) tokens are represented by their types alone. For example, in Fig. 3, choosing is an OOV token, so it is represented by its type (SimpleName) alone, while return is represented by both its type (ReturnStatement) and its value (return), since it is among the most frequent 3.78% of tokens.
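
The following sketch illustrates the benchmark's vocabulary thresholding as we replicated it, assuming token streams of (AST type, value) pairs; the 3.78% ratio follows the text above, while the helper names and encoding format are hypothetical.

# Sketch of vocabulary thresholding with type-only OOV representation.
# Tokens are (ast_type, value) pairs; helper names are hypothetical.
from collections import Counter

VOCAB_RATIO = 0.0378  # most frequent 3.78% of source code token values

def build_vocabulary(token_streams):
    counts = Counter(value for stream in token_streams for _, value in stream)
    keep = max(1, int(len(counts) * VOCAB_RATIO))
    return {value for value, _ in counts.most_common(keep)}

def encode(stream, vocabulary):
    # In-vocabulary tokens keep type and value; OOV tokens keep the type only.
    return [f"{ast_type}_{value}" if value in vocabulary else ast_type
            for ast_type, value in stream]

# e.g. ("ReturnStatement", "return") stays as "ReturnStatement_return",
# while the OOV ("SimpleName", "choosing") is encoded as "SimpleName".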

9.2.3 Model hyper-parameters

Table 3 illustrates the differences between our approach and the benchmark's in the choice of three hyper-parameters. Hu et al.'s model hyper-parameters are [512, 2, 100] for [Latent Dimensionality, Number of Layers, Batch Size], while ours are [1024, 1, 64]. A fourth difference is the Number of Iterations used when training each fold of the 10-fold CV. We decreased it to 40 epochs (i.e. iterations), while the benchmark trains for 50 iterations. Our hyper-parameter choices for the 10-fold CV step were made during the hyper-parameter tuning step. The reasoning behind our choices for Latent Dimensionality, Number of Layers, and Batch Size is explained in Sect. 8.5.1. For the Number of Iterations, our experiments show that our model converges within 40 epochs at most. Even so, we achieved higher results with fewer iterations.

9.2.4 Beam search

We added beam search to our model (see Sect. 6.3) with \(\varvec{k} = 10\), which generates 10 candidate SATD comments for every input code fragment. Note that including beam search increased the training time from \(\sim\)14 hours (greedy search) to \(\sim\)26 hours, which is still comparable with the time taken by Hu et al.'s approach (\(\sim\)23 hours). Although the model's best candidate comment is usually the first one, beam search gives the model multiple attempts to generate better comments when the first candidates are not ideal. For example, one of the target comments in our dataset is:

[figure a: a target SATD comment from our dataset]

Our model was able to generate the exact same comment in the fourth attempt (candidate 4), which had a positive impact on the Bleu-n scores.
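
For illustration, the following is a minimal beam-search sketch with k = 10; it assumes a hypothetical step function next_token_log_probs(prefix) that returns the log-probability of each possible next comment token, and it is not our exact decoder.

# Minimal beam-search sketch (k = 10). `next_token_log_probs(prefix)` is a
# hypothetical step function returning {token: log_probability} for the next
# comment token given the decoded prefix.
def beam_search(next_token_log_probs, k=10, max_len=150,
                start="<s>", end="</s>"):
    beams = [([start], 0.0)]          # (token prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in next_token_log_probs(prefix).items():
                candidates.append((prefix + [token], score + logp))
        # Keep only the k most probable partial comments.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        finished.extend(b for b in beams if b[0][-1] == end)
        beams = [b for b in beams if b[0][-1] != end]
        if not beams:
            break
    finished.extend(beams)            # beams that never emitted the end token
    # Return up to k candidate comments, best first.
    return sorted(finished, key=lambda c: c[1], reverse=True)[:k]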

9.2.5 On the evaluation of the generated SATD comments

One might consider the Bleu-n scores achieved by our generator to be low. This is not necessarily the case, as previous work (e.g. Liu et al. 2018; Wan et al. 2018) reported similar or lower Bleu-n scores. A limitation of the Bleu-n score is that it does not consider semantic similarity. For example, if the model generates a synonym of a word in the reference comment, a human may accept it, but the Bleu-n score does not increase. The human evaluation of the generated SATD comments in this work (Sects. 8.3.3 and 8.5.3) was conducted to address this issue.
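
The sketch below, using NLTK's sentence-level Bleu with hypothetical comments, makes this limitation concrete: replacing one word with a synonym leaves the comment acceptable to a human reader but lowers the score.

# Sketch of the synonym limitation: a human may accept "fix" in place of
# "workaround", but the Bleu score drops. The comments are hypothetical.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["todo", "this", "is", "a", "temporary", "workaround"]]
exact     =  ["todo", "this", "is", "a", "temporary", "workaround"]
synonym   =  ["todo", "this", "is", "a", "temporary", "fix"]

smooth = SmoothingFunction().method1
print(sentence_bleu(reference, exact,   smoothing_function=smooth))   # 1.0
print(sentence_bleu(reference, synonym, smoothing_function=smooth))   # < 1.0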

9.2.6 Samples from the generated SATD comments

Here we discuss three samples from the generated SATD comments and compare them to the corresponding SATD comments from the ground-truth in the dataset. Sample 1 shows a generated comment that is identical to the ground-truth comment. Sample 2 shows a comment that is completely different from the ground-truth comment, while Sample 3 shows a comment that is only slightly different from the ground-truth comment.

[figure b: Sample 1, a generated comment identical to the ground-truth comment]

The model had not seen the input code of this comment before, but it had seen a similar code fragment. The model learned the characteristics of that fragment and its associated comment and was therefore able to generate the perfect SATD comment for the input code. This promising result demonstrates the applicability of SATDID in generating useful comments.

[figure c: Sample 2, a generated comment completely different from the ground-truth comment]

Although it is hard to determine why the model generated this particular comment, we attribute the failure in this case, as well as in a few similar instances, to the fairly small size of our SATD code-comment pair dataset. Despite the small dataset, our generator achieves an overall promising performance. We believe that the generator's performance will improve once a dedicated effort is made to collect a large corpus of SATD code-comment pairs, and we urge the community to consider this moving forward in this research area.

[figure d: Sample 3, a generated comment slightly different from the ground-truth comment]

This is one of the cases where the Bleu score decreases regardless of the semantic acceptability of the generated comment.

[figure e] is a generic tag used to replace other tags such as [figure f] and [figure g]. Our model successfully generated a descriptive comment in this sample. With a larger dataset and longer training, we expect the model to be capable of generating specific tags for different code fragments.

9.3 Threats to validity

Threats to internal validity relate to biases in data labelling as well as in the implementation and training of the framework components. Regarding data labelling, a threat relates to the potential existence of actual SATD comments that do not pass our inclusion test and are thus labelled non-SATD. To mitigate this threat, we conducted an investigation of the collected dataset (Sect. 8.1.1), which ensured that the SATD comments represent technical debt in source code. Regarding the implementation and training of the framework components, we performed extensive hyper-parameter tuning experiments for all of the framework components. In the SATD Identification experiments, we also tested multiple hyper-parameter settings in the cross-validation step instead of only the best-performing setting from the hyper-parameter tuning step.

Threats to external validity relate to the quality and quantity of our dataset as well as to our human evaluation procedure. Regarding the dataset, we collected code-comment pairs from 4,995 active software development repositories, from which we obtained 5,313 SATD code-comment pairs and 839,431 non-SATD pairs. We therefore consider these threats minimal. Regarding the human evaluation procedure, we acknowledge the potential bias of having it conducted by the third and fourth authors.

Since all the data is collected from open-source projects and repositories, there is an open question about the generalisability of our approach to company projects. However, developers in companies are less prone to self-admitting the technical debt they introduce (Huang et al. 2018; Ren et al. 2019), which lessens the threat to our approach's generalisability. Furthermore, all of the analysed projects and repositories are implemented in Java. Future work will involve exploring our approach with other programming languages and with larger, more diverse datasets.

10 Related work

Potdar and Shihab (2014) proposed the concept of Self-Admitted Technical Debt (SATD) to describe technical debt that is intentionally introduced and documented in the comments that accompany source code. They manually inspected 100k comments and extracted 62 patterns that have been actively used in subsequent studies to detect SATD. Research on Self-Admitted Technical Debt has since expanded (Maldonado and Shihab 2015; Wehaibi et al. 2016; Bavota and Russo 2016; de Freitas Farias et al. 2015, 2016; Maldonado et al. 2017; Zampetti et al. 2017, 2018; Yan et al. 2018; da Silva Maldonado et al. 2017; Huang et al. 2018; Ren et al. 2019). Maldonado and Shihab (2015) examined 33k source code comments and identified five types of SATD: design debt, defect debt, documentation debt, requirement debt, and test debt. They found that design debt is the most common. de Freitas Farias et al. (2015, 2016) developed a Contextualized Vocabulary Model (CVM-TD) to identify SATD in code comments by analysing code tags and words' parts of speech.

Zampetti et al. (2017) developed a Random-Forest-based approach called TEDIOuS, which recommends to developers when they should self-admit design debt. Zampetti et al. use static analysis tools to calculate method-level metrics that form the model's input features, comprising structural metrics, readability metrics, and warnings raised by static analysis tools. SATDID is inspired by Zampetti et al.'s work, and we use their approach as a benchmark for our SATD Identification components. However, there are key differences between Zampetti et al.'s approach and ours. Firstly, they calculate source code metrics and use them as features for the machine learning model to learn from; we instead vectorise the source code tokens themselves rather than calculating metrics to create the feature space. Secondly, their work specialises in detecting “design” debt, whereas ours attempts to detect all kinds of technical debt. Thirdly, their work operates at the method level, whereas ours can operate at any level (conditional-statement level, method level, class level, etc.) as long as an Abstract Syntax Tree can be parsed from the target source code fragment. Lastly, we propose a comprehensive framework that also performs SATD Comment Generation and leverages deep learning as well as traditional machine learning.

da Silva Maldonado et al. (2017) propose a Natural Language Processing (NLP) approach to identify the two most common types of SATD: design debt and requirement debt. They employ a maximum entropy classifier to identify the SATD comments. Huang et al. (2018) used text mining techniques to detect all types of SATD comments. They utilise feature selection to select useful features for classifier training, and their work outperformed the NLP approach. Note that a limitation of Maldonado et al.'s work is that it covers only two SATD types rather than all types. In their recent work, Ren et al. (2019) proposed a deep-learning-based approach using Convolutional Neural Networks (CNN) for SATD comment identification. They argue that leveraging CNN improves the detection task over other existing approaches. Furthermore, exploiting the computational structure of CNN improves the explainability of the results by identifying key patterns of SATD comments. The main difference between these approaches and ours is that our approach deals with the case when the comment is not provided with the source code, which is a key limitation of the previous work. SATDID detects source code fragments that require SATD comments and, on top of that, generates the appropriate SATD comments for them, another contribution that is missing from previous work.

11 Conclusions and future work

In this paper, we have presented our framework, SATDID, for Self-Admitted Technical Debt recommendation and comment generation. To the best of our knowledge, we are the first to propose such a comprehensive solution. SATDID analyses source code to determine if it contains technical debt that should be self-admitted. Unlike existing work, our approach does not assume the existence of SATD comments. In addition, if TD is found in a code fragment, SATDID generates an appropriate SATD comment that describes the TD and can be attached to the code fragment. For replicability purposes and to support further research in this area, we have made our code, dataset, and result reports publicly available.

We have evaluated SATDID's performance on a dataset of code-comment pairs from 4,995 active software development repositories. Our approach provides at least 21.35%, 59.36%, 31.78%, and 583.33% improvements over all the replicated/tested benchmarks in terms of Precision, Recall, F1, and Bleu-4 scores, respectively. In addition, our approach achieves overall mean scores of 3.128 and 3.172 for the Acceptability and Understandability of the generated comments, respectively. The results demonstrate the effectiveness of our approach.

Future work involves investigating larger and more diverse datasets. This includes studying other programming languages, types of code fragments other than conditional statements (e.g. method calls), and commercial and closed-source software projects when accessible. In addition, we plan to develop an automated, open-source plugin for SATD Identification and Comment Generation for Integrated Development Environments (IDEs). Furthermore, we plan to adopt other deep learning techniques in order to further validate their use over traditional machine learning. Future work will also involve inviting external evaluators to extensively examine the SATD comments generated by our model. Finally, we plan to extend this work to study SATD repayment (i.e. removal).