An empirical study of issue-link algorithms: which issue-link algorithms should we use?

The accuracy of the SZZ algorithm is pivotal for just-in-time defect prediction because most prior studies have used the SZZ algorithm to detect defect-inducing commits to construct and evaluate their defect prediction models. The SZZ algorithm has two phases to detect defect-inducing commits: (1) linking issue reports in an issue-tracking system to possible defect-fixing commits in a version control system by using an issue-link algorithm (ILA); and (2) tracing the modifications of defect-fixing commits back to possible defect-inducing commits. Researchers and practitioners can address the second phase by using existing solutions such as a tool called cregit. In contrast, although various ILAs have been proposed for the first phase, no large-scale studies exist in which such ILAs are evaluated under the same experimental conditions. Hence, we still have no conclusions regarding the best-performing ILA for the first phase. In this paper, we compare 10 ILAs collected from our systematic literature study with regard to the accuracy of detecting defect-fixing commits. In addition, we compare the defect prediction performance of ILAs and their combinations that can detect defect-fixing commits accurately. We conducted experiments on five open-source software projects. We found that all ILAs and their combinations prevented the defect prediction model from being affected by missing defect-fixing commits. In particular, the combination of a natural language text similarity approach, Phantom heuristics, a random forest approach, and a support vector machine approach performed best: it statistically significantly reduced the absolute differences from the ground-truth defect prediction performance. We summarize guidelines for using ILAs as our recommendations.

In this paper, we compared all criteria that were used by the previous ILAs, as our ILAs, and their combinations on the same dataset. Because we decomposed the previous ILAs into their constituent criteria, our comparison covers not only the previous ILAs but also other combinations of criteria. More specifically, we compared 10 criteria as our ILAs (i.e., time filtering, natural language text similarity, natural language text similarity with word association, message generation from source code, loners heuristics, phantom heuristics, modified text files, PU learning, random forest, and support vector machine) in terms of the accuracy of detecting defect-fixing commits. To collect these ILAs, we conducted a systematic literature study with the snowballing approach (Wohlin 2014). In addition, we investigated the impact of the ILAs and their combinations on defect prediction performance in terms of the absolute differences from the ground-truth defect prediction performance. The ground-truth defect prediction performance is measured on a dataset in which almost all defect-fixing commits are already detected accurately. The details of such datasets and defect-fixing commits are discussed in Section 5. Our ultimate goal is to clarify which ILA or combination of ILAs detects the most defect-fixing commits and prevents the defect prediction model from being affected by missing defect-fixing commits compared with the baseline ILA, the keyword extraction approach (i.e., the ILA used by the SZZ algorithm). To achieve this goal, we investigated the following two research questions.

RQ1: Which issue-link algorithm is the best to detect defect-fixing commits?
Motivation: Many prior studies have proposed ILAs to detect defect-fixing commits accurately. However, no studies have conducted a large empirical comparison across ILAs. In this RQ, we compared 10 ILAs. Our goal is to identify ILAs that detect defect-fixing commits accurately.
Results: The time filtering approach and the natural language text similarity approach recovered the statistically significantly largest number of missing defect-fixing commits compared with the other ILAs in different projects. The random forest approach achieved the statistically significantly highest precision in 22 out of 25 results.

RQ2: Which issue-link algorithm is the best to prevent a defect prediction model from being affected by missing defect-fixing commits in defect prediction?
Motivation: Researchers and practitioners should carefully select an ILA if ILAs prevent a defect prediction model from being affected by missing defect-fixing commits. In this RQ, we studied how ILAs and their combinations affect defect prediction performance.
Results: All ILAs, including the combinations of ILAs that detect defect-fixing commits accurately, result in a statistically significant reduction in the impact on defect prediction performance compared with the baseline ILA, the keyword extraction approach. These ILAs are robust to datasets that include missing defect-fixing commits. In particular, the combination of the natural language text similarity, Phantom heuristics, random forest, and support vector machine approaches is the best method to prevent the defect prediction performance from being affected by missing defect-fixing commits.
Our results provide researchers and practitioners who study or use defect prediction, or who investigate defect-fixing commits, with guidelines to choose the best ILA for their purpose. We recommend using the combination of the natural language text similarity, Phantom heuristics, random forest, and support vector machine approaches to remove the bias of missing defect-fixing commits on defect prediction performance. If researchers and practitioners want to investigate defect-fixing commits on a dataset in which no false-positive defect-fixing commits exist, we recommend using the random forest approach. If researchers and practitioners need more defect-fixing commits to investigate while allowing false-positive defect-fixing commits, we recommend using the time filtering or natural language text similarity approach. In addition, before using any ILAs, we recommend using the dates of the commit and the issue report to remove noise in the defect-fixing commits for defect prediction.
The four main contributions of this paper are as follows:
- We have conducted the first large-scale empirical study to evaluate the ILAs on the same experimental setup.
- We have proposed guidelines for the use of ILAs according to the purpose of each study.
- We have implemented all the studied ILAs that were collected by our systematic literature study (Kondo 2021b).
- We have conducted a systematic literature study of the ILAs.
We summarized our ILAs and the validation technique in defect prediction as Python packages (Kondo 2021b, c). In addition, we made the replication package (Kondo 2021a). These packages can be used to replicate/update our experiment.
The organization of our paper is as follows. Section 2 presents a motivating example. Section 3 introduces related work and contextualizes our research. Section 4 presents the experimental design. Section 5 presents our methodology. We also explain our studied ILAs in this section. Section 6 presents the results of our experiment. Section 7 discusses these results. Section 8 describes the threats to the validity of our findings. Section 9 presents the conclusion.

Defect Prediction and ILAs
Firstly, an overview of just-in-time defect prediction and ILAs is depicted in Fig. 1. For more details, prior studies such as the study by Kamei and Shihab (2016) may be referred to. Defect prediction mainly consists of three phases: the data preparation phase, the model construction phase, and the evaluation phase.
Data Preparation: This phase prepares the data for defect prediction. The data are (1) software entities (e.g., commits) that are the target of the prediction, (2) the label that indicates whether entities include defects (i.e., defect-inducing entities), and (3) the metrics that measure the characteristics of entities such as the change metrics (Kamei et al. 2013). To prepare the label, researchers use two techniques: ILAs and commit-link algorithms (CLAs). ILAs link issue reports to software entities, whereas CLAs find entities that induce defects from entities that are linked to issue reports related to defects (e.g., the SZZ algorithm (Śliwerski et al. 2005)). All data are collected from two data sources: the issue tracking system (e.g., JIRA) and the version control system (e.g., GitHub).

Model Construction: This phase constructs defect prediction models based on the data that are prepared in the previous phase. To construct the prediction model, researchers need to select (4) modeling techniques (e.g., logistic regression), (5) preprocessing techniques (e.g., z-score), and (6) model validation techniques (e.g., bootstrap sampling).
First, researchers need to select modeling techniques for defect prediction. Based on the selected modeling techniques, the preprocessing techniques must be decided. Usually, the z-score approach (Kondo et al. 2019) is utilized. However, according to the requirements of the selected modeling techniques, we might choose another preprocessing technique such as the min-max scaling approach (Kondo et al. 2019).
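The two preprocessing techniques named above can be illustrated with a minimal sketch. The function names and the sample metric values are hypothetical, and the use of the population standard deviation is an assumption made for illustration only.

```python
from statistics import mean, pstdev


def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant metric carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical change metric (e.g., lines added per commit).
lines_added = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score(lines_added))
print(min_max(lines_added))
```

The z-score keeps the shape of the distribution but is unbounded, whereas min-max scaling bounds every value, which is one reason a modeling technique may call for one or the other.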
The model validation technique divides the data into the training data and the test data to improve the validity of the evaluation of the prediction models. One technique needs to be selected from the various existing model validation techniques (e.g., bootstrap sampling).
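As a rough illustration of how bootstrap sampling can divide data into training and test sets, the sketch below draws a training sample with replacement and tests on the rows that were never drawn. This out-of-sample variant and the function name `bootstrap_split` are assumptions for illustration; the text above names bootstrap sampling only as an example of a validation technique.

```python
import random


def bootstrap_split(n_rows, rng):
    """Return (train, test) row indices for one bootstrap iteration.

    train: n_rows indices drawn with replacement.
    test:  the indices that were never drawn (out-of-sample rows).
    """
    train = [rng.randrange(n_rows) for _ in range(n_rows)]
    drawn = set(train)
    test = [i for i in range(n_rows) if i not in drawn]
    return train, test


rng = random.Random(0)  # fixed seed so the split is reproducible
train_idx, test_idx = bootstrap_split(10, rng)
print(train_idx, test_idx)
```

Repeating the split with different seeds yields a distribution of performance values rather than a single estimate.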
Finally, we construct defect prediction models based on the selected modeling techniques, preprocessing techniques, and model validation techniques.
Evaluation: This final phase evaluates the constructed defect prediction model. As with validation techniques, various (7) evaluation measures exist. Researchers usually evaluate the prediction performance using measures such as precision, recall, and F1-score. In addition, cost-aware evaluation measures are utilized (e.g., Norm(P_opt)). To evaluate the applicability of defect prediction models to practical scenarios, the execution time might also be evaluated. To evaluate the difference across prediction models, (8) a statistical test (e.g., the Scott-Knott ESD test (Tantithamthavorn et al. 2017)) and an effect size (e.g., Cohen's d (Cohen 2013)) are computed.
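The measures mentioned above can be made concrete with a small sketch. It computes precision, recall, and F1-score from binary labels, and Cohen's d with a pooled standard deviation. The function names and the sample data are hypothetical, and the pooled-variance form of Cohen's d is the textbook definition rather than a detail taken from this paper.

```python
from math import sqrt
from statistics import mean, variance


def precision_recall_f1(actual, predicted):
    """Compute precision, recall, and F1 for binary labels (1 = defective)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def cohens_d(xs, ys):
    """Cohen's d using the pooled sample standard deviation.

    Assumes both groups have at least two values and nonzero pooled variance.
    """
    nx, ny = len(xs), len(ys)
    pooled = sqrt(((nx - 1) * variance(xs) + (ny - 1) * variance(ys)) / (nx + ny - 2))
    return (mean(xs) - mean(ys)) / pooled


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
print(precision_recall_f1(actual, predicted))
print(cohens_d([2, 4, 6], [1, 3, 5]))
```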
ILAs studied in this paper are utilized in the data preparation phase. In particular, ILAs link the issue reports extracted from the issue tracking system to the commits/changes extracted from the version control system to prepare the label. This indicates that ILAs are applied as the first step in defect prediction. Hence, ILAs are important because the accuracy of the links (i.e., the label) affects all the phases. Our study will support the improvement of the accuracy of links and improve the reliability of the defect prediction research.

Do ILAs Affect Defect Prediction?
In Section 2.1, we explained that the accuracy of ILAs is important for defect prediction. Two questions follow: are existing ILAs inaccurate, and do such ILAs affect defect prediction? If so, that is the motivation for our study. In this section, to answer these questions, we introduce false-positive and false-negative defect-fixing commits induced by the most popular ILA, and present a simple survey that clarifies whether ILAs affect defect prediction.
False-positive defect-fixing commits are commits that are linked with unrelated issue reports, while false-negative defect-fixing commits are commits that should be linked with issue reports but are not. Prior studies usually use an ILA called the keyword extraction approach, which uses regular expressions to identify defect-fixing commits whose commit messages include issue ids. However, this approach induces false-positive/negative defect-fixing commits.
For example, the commit of cf3318e1b in the Tez project, which is a studied project in this paper, includes two issue ids, TEZ-8 and TEZ-1594. TEZ-8 corresponds to a defect-fixing process while TEZ-1594 does not. The keyword extraction approach links this commit and these two issue reports and refers to this commit as a defect-fixing commit. However, TEZ-8 is not directly related to this commit. Hence, the commit of cf3318e1b is a possible false-positive defect-fixing commit.
Also, the commit message of the commit 0b74bd5e in the Avro project, which is also a studied project, does not include any issue ids. Hence, the keyword extraction approach does not refer to this commit as a defect-fixing commit. However, a file changed by this commit includes an issue id (AVRO-2033) that corresponds to a defect-fixing process. Hence, this commit is a possible false-negative defect-fixing commit. Therefore, even the most popular existing ILA may induce false-positive/negative defect-fixing commits.
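The two failure modes described above can be reproduced with a minimal sketch of keyword extraction. Everything here is hypothetical: the regular expression is a simple JIRA-style pattern chosen for illustration and is not the expression used in this study, and the commit hashes, messages, and the DEMO project key are invented.

```python
import re

# Assumed JIRA-style id pattern (e.g., DEMO-8); not the paper's expression.
ISSUE_ID = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def keyword_extraction(commits, bug_issue_ids):
    """Link each commit to the Bug issue ids mentioned in its message.

    commits: dict mapping commit hash -> commit message.
    Returns a dict mapping commit hash -> set of linked Bug issue ids.
    """
    links = {}
    for sha, message in commits.items():
        ids = set(ISSUE_ID.findall(message)) & bug_issue_ids
        if ids:
            links[sha] = ids
    return links


bug_issue_ids = {"DEMO-8", "DEMO-2033"}
commits = {
    # Mentions a bug id only in passing: a possible false positive.
    "c1": "DEMO-1594: add feature, see also DEMO-8",
    # Fixes a bug but names no issue id: a possible false negative.
    "c2": "Fix null handling in the parser",
}
print(keyword_extraction(commits, bug_issue_ids))  # {'c1': {'DEMO-8'}}
```

The approach only sees the commit message, so it cannot tell a passing mention from an actual fix, and it cannot see ids that appear only in the changed files.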
Next, we present a simple survey that clarifies whether ILAs affect defect prediction. We conducted a preliminary survey of the kind that has been used by prior defect prediction studies (Yedida and Menzies 2021; Fu et al. 2016). The procedure of our survey is as follows:
1. We search for studies that use the keyword "defect prediction" and are published in the top venues using Google Scholar.
2. We read the title and the abstract and exclude non-defect prediction studies (e.g., issue report studies) and studies that do not have any PDF links. The remaining studies are considered defect prediction studies. We call this set of studies Group A.
3. We read the papers and collect the ILAs that are explicitly written in the papers. Also, we exclude studies that are not change/commit-level defect prediction (a.k.a. just-in-time defect prediction) studies. The remaining studies are considered just-in-time defect prediction studies. We call this set of studies Group B.
The studies collected into Group A and Group B can be found in our Google sheet. Our survey collects 112 studies in Group A. Only 16.1% (18/112) of these studies explicitly use datasets that were generated using ILAs other than the keyword extraction approach. Moreover, 83.3% of them (15/18) reuse the ReLink dataset, which is generated by an ILA called ReLink. Since the ReLink dataset was publicly available, several prior defect prediction studies used it. This implies that almost all prior defect prediction studies do not consider ILAs but simply use the publicly available dataset or the most popular ILA, the keyword extraction approach.
However, we have another question: how many commits can the keyword extraction approach link with issue reports? If the number of linked commits is high and such links are accurate, we do not need to use any other ILAs. Since we have already discussed that the keyword extraction approach induces false-positive/negative defect-fixing commits, we counted the number of linked commits. To answer this question, we investigated all the projects that were used as the target projects in Group B except for the unclear or unreachable projects (e.g., "Mozilla" is used as a target project by prior studies, but Mozilla is an organization with several projects, not a project). There are 24 studies in Group B. We used Group B because such linked commits are used in just-in-time defect prediction studies. We applied regular expressions, modified from those of the original SZZ (Śliwerski et al. 2005), to all commit messages and computed the proportion of commits whose commit messages include at least one issue id candidate (i.e., the proportion of linked commits). In addition, for the Apache projects, we considered the issue ids that are used in the Apache projects such as CAMEL-{{issue id}} in the Camel project. If the proportion is high, the number of commits linked by the keyword extraction approach is high. Table 1 lists the proportion of commits in which we can find issue id candidates with the regular expressions in the studied projects in Group B. The number of studied projects without duplication is 58. The gold cells indicate a proportion of over 80%. We highlight this threshold because, in this paper, we used projects in which over 80% of commits include at least one issue id candidate as our studied projects. The blue cells indicate a proportion of under 50%. We observe that only 5 of the 58 projects are over 80%. In contrast, almost all projects (49/58) are under 50%.
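The proportion of linked commits described above can be computed along the following lines. This is a sketch under stated assumptions: the pattern matches only the Apache-style form `KEY-number` for a given project key, it is not the modified SZZ expression used in the study, and the commit messages are invented.

```python
import re


def linked_proportion(messages, project_key):
    """Fraction of commit messages that contain at least one issue id candidate."""
    if not messages:
        return 0.0
    pattern = re.compile(rf"\b{re.escape(project_key)}-\d+\b")
    hits = sum(1 for message in messages if pattern.search(message))
    return hits / len(messages)


def bucket(proportion):
    """Classify a proportion using the 80% and 50% thresholds from the text."""
    if proportion > 0.8:
        return "over 80%"
    if proportion < 0.5:
        return "under 50%"
    return "between 50% and 80%"


# Hypothetical commit messages for a project with the key CAMEL.
messages = [
    "CAMEL-101: fix route startup",
    "Update docs",
    "CAMEL-7 and CAMEL-9: polish endpoint",
    "Bump version",
    "Polish",
]
p = linked_proportion(messages, "CAMEL")
print(p, bucket(p))  # 0.4 under 50%
```

A commit counts once no matter how many ids its message contains, which matches the "at least one issue id candidate" definition.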
In summary, these results show that we need ILAs to improve the accuracy and number of links between commits and issue reports. Otherwise, many commits remain that do not correspond to any issue reports. Such commits could potentially be defect-fixing commits that are not detected (false-negative defect-fixing commits). In addition, false-positive defect-fixing commits also exist. Such commits affect defect prediction. Hence, in this study, we investigated the impact of ILAs on defect prediction.

Related Work
Locating defect-fixing and defect-inducing commits by using the issue ids in commit/log messages is a common practice in software engineering (Čubranić and Murphy 2003; Fischer et al. 2003b, 2003a; Śliwerski et al. 2005; Bernstein 2009a, 2010; Bird et al. 2010; Sureka et al. 2011; Wu et al. 2011; Nguyen et al. 2012; Bissyandé et al. 2013; Le et al. 2015; Schermann et al. 2015; Sun et al. 2016, 2017a; Xie et al. 2019; Tu and Menzies 2020). For example, Fischer et al. (2003a) applied regular expressions to log messages to retrieve issue ids. In defect prediction, the SZZ algorithm (Śliwerski et al. 2005) is the de facto standard approach to detect both defect-fixing and defect-inducing commits by using the issue ids in commit/log messages. This algorithm uses two data sources (i.e., a version control system and an issue-tracking system) and links these data sources to detect defect-fixing commits. Defect-inducing commits are tracked based on the modifications in such defect-fixing commits.
Multiple-purpose commits: A defect-fixing commit could include modifications that accomplish other purposes apart from fixing defects. Commits that have multiple purposes are called tangled changes. Such changes affect the SZZ algorithm when detecting defect-inducing commits from defect-fixing commits. Kawrykow and Robillard (2011) found that up to 15.5% of method updates are due to non-essential modifications only. Kim et al. (2006) modified the SZZ algorithm to handle not only defect-fixing hunks but also other-purpose hunks in a defect-fixing commit. The modified SZZ algorithm improved the accuracy of detecting defect-inducing commits compared with the original SZZ algorithm. Herbold et al. (2019) found that half of the defect-fixing commits that were detected by the SZZ algorithm are not actual defect-fixing commits.
A small number of detected commits: The SZZ algorithm uses an issue-tracking system to detect defect-fixing commits; however, this approach can only detect a fraction of the defect-fixing commits (Bird et al. 2009;Bernstein 2009b, 2010;Ayari et al. 2007). For example, Bachmann and Bernstein (2009b) reported the rate of fixed issue reports that are linked with commits. They found that the rate for the Apache HTTPD project is 43.43%, the Eclipse project is 33.05%, the GNOME project is 38.99%, the NetBeans project is 54.60%, the OpenOffice project is 7.43%, and the BSZKB project is 37.31%. Ayari et al. (2007) reported that the heuristic is not sufficient to find links between issue reports and changes. Indeed, our motivating example (Section 2) also reported that only a fraction of commits include issue id candidates. Hence, the SZZ algorithm needs to detect defect-fixing commits based on incomplete information.
In summary, the SZZ algorithm has two challenges: (1) detecting defect-inducing commits based on multiple-purpose defect-fixing commits; (2) detecting defect-fixing commits based on incomplete information.

Detecting Defect-Inducing Commits Based on Multiple-Purpose Defect-Fixing Commits
To address the first challenge that a defect-fixing commit intends to accomplish multiple purposes or is not related to defects (Mockus and Votta 2000; Kim et al. 2008; Pan et al. 2009; Kawrykow and Robillard 2011; Nguyen et al. 2013; Mills et al. 2018), prior studies have proposed several solutions (Mockus and Votta 2000; Rosen et al. 2015; Jung et al. 2009; Nguyen et al. 2013; Neto et al. 2018, 2019). Jung et al. (2009) excluded non-fixing hunks from a defect-fixing commit. They identified 11 non-fixing hunk patterns, which can be divided into two categories: syntactically detectable patterns and semantically detectable patterns. For example, renaming is a non-fixing hunk pattern in the syntactically detectable patterns category. Pan et al. (2009) also summarized code patterns in defect-fixing hunks. They found 27 code patterns; the defect-fixing hunks that include one of them account for around 50%. Nguyen et al. (2013) called such commits mixed-purpose fixing commits (MFCs). They proposed a tool, Cardo, which achieved an average of 93% precision and 61% recall in detecting MFCs. Neto et al. (2018) tried to remove refactoring changes by modifying the SZZ algorithm and called this approach the refactoring-aware SZZ implementation (RA-SZZ). They reported that RA-SZZ removed 20.8% of the lines that were identified as defective lines by another state-of-the-art SZZ implementation.
cregit (German et al. 2019) has been utilized to detect defect-inducing commits from defect-fixing commits including non-source code changes (e.g., style changes). This tool converts a Git repository into a view repository in which specified types of files (e.g., Java files) are converted into token per line files. Each token also has an AST type. Hence, we can track the modification at the token level and easily ignore all non-source code modifications (e.g., comments, blanks, and format changes).
However, even if researchers use these previous solutions to detect defect-inducing commits, they need to detect defect-fixing commits first. If such detected defect-fixing commits are inaccurate, any of the previous solutions induces false defect-inducing commits. Hence, detecting defect-fixing commits accurately is important to take full advantage of the previous solutions when detecting defect-inducing commits. Therefore, in this paper, we focus on the second challenge, detecting defect-fixing commits based on incomplete information.

Issue-Link Algorithm: Detecting Defect-Fixing Commits Based on Incomplete Information
To address the second challenge, many studies have attempted to improve the accuracy of detecting defect-fixing commits (Fischer et al. 2003a; Śliwerski et al. 2005; Bernstein 2009a, 2010; Bird et al. 2010; Sureka et al. 2011; Wu et al. 2011; Nguyen et al. 2012; Bissyandé et al. 2013; Le et al. 2015; Schermann et al. 2015; Sun et al. 2016, 2017a; Xie et al. 2019; Tu and Menzies 2020). More specifically, ILAs (issue-link algorithms) have been proposed to link issue reports to commits. For example, Fischer et al. (2003a) proposed an ILA that extracts issue ids from log messages to link issue reports and commits.
As we described in Section 1, prior studies that proposed ILAs have two challenges: the data inconsistency and the small comparisons. Table 2 provides an overview of the studied projects (in the "Studied Projects/Organizations" column) and ILAs (in the "Baseline ILAs name" column) that were compared with the proposed ILA (in the "Proposed ILAs name" column) in prior studies. "NaN" indicates that no information was available or that the authors did not provide any names for their ILAs (e.g., heuristics). To collect these ILAs, we conducted a systematic literature study with the snowballing approach (Wohlin 2014). We chose this approach because we want to collect prior studies that proposed ILAs regardless of their venues and years. We observe that prior studies used different studied projects (data inconsistency) and compared their proposed ILAs with only a few ILAs (small comparison). As a result, it is difficult to compare their results and draw conclusions about the best-performing ILA in terms of the accuracy of detecting defect-fixing commits and improving defect prediction performance.

Defect Data Quality in Defect Prediction Research
If an ILA induces false-positive/negative defect-fixing commits, the ground-truth data that is used to train and evaluate defect prediction models would be biased. Indeed, prior studies have investigated the importance of data quality (Nguyen et al. 2010;Bird et al. 2009). Nguyen et al. (2010) investigated the impact of missing links for a commercial project. They found that even a commercial project, which adheres to strict rules, also provides a biased dataset. Bird et al. (2009) reported that the defect-fixing commits that were detected by using an issue-tracking system are not accurate and affect defect prediction performance.
In addition, prior studies have investigated the impact of noisy data on defect prediction performance (Rahman et al. 2013; Ramler and Himmelbauer 2013; Tantithamthavorn et al. 2015). Kim et al. (2011) reported the impact of the false-positive/negative rate of defects detected by an ILA on defect prediction performance. They found that false-positive/negative rates over a certain threshold (e.g., 20%) had a significant effect on defect prediction performance. Ramler and Himmelbauer (2013) studied the noise in a defect dataset. They reported that the prediction performance is not significantly affected by 20% noise. Rahman et al. (2013) compared the impact of the bias and sample size on defect prediction performance, reporting that the sample size is more important than the bias. They found that researchers need to focus more on collecting samples than on the bias. A further prior study reported that bug reports are frequently misclassified (33.8% of bug reports). In addition, it found that, on average, 39% of files that are labeled as defective are not defective. It also showed that such misclassified data potentially decrease the defect prediction performance. Tantithamthavorn et al. (2015) evaluated the impact of mislabeled data on defect prediction. They found that such mislabeled data rarely affect precision values; in contrast, they do affect recall values.
Table 2 Data inconsistency and small comparisons in prior studies for ILAs. The numbers in parentheses in the "Studied Projects/Organizations" column indicate the numbers of retrieved projects from the Apache Software Foundation or studied projects
To remedy such biased ground-truth data, we need to improve ILAs. To the best of the authors' knowledge, no studies have conducted large-scale empirical comparisons across ILAs, though many prior studies have proposed various ILAs (as described in Table 2). Hence, we conducted a large-scale empirical comparison across ILAs and evaluated the impact on defect prediction performance. Note that detecting defect-inducing commits based on defect-fixing commits is beyond the scope of this paper. We use a basic approach to detect defect-inducing commits to evaluate the impact on defect prediction performance.

Experimental Design
In this section, we give an overview of our experimental design. Figure 2 shows the steps of our experiments. In the following, we describe these steps in detail.

1. Extract explicit links by the keyword extraction. The keyword extraction approach uses issue ids in the commit messages to make links between issue reports and commits. We regard commits that are linked with studied issue reports labeled Bug as defect-fixing commits and use them as the ground-truth defect-fixing commits. We call the links of such ground-truth defect-fixing commits explicit links.
2. Randomly delete X% links. We randomly deleted X% explicit links on our studied datasets (Section 5.1). By randomly deleting explicit links and regarding defect-fixing commits that are only linked with such deleted links as missing defect-fixing commits, we can simulate and evaluate a scenario in which datasets have low link proportions. Figure 3 shows an example. Let us assume we have three explicit links (Commit A and Issue A, Commit A and Issue B, and Commit B and Issue A) and delete 66% of the links. We might delete two links: Commit A and Issue A, and Commit B and Issue A. We regard Commit B as a missing defect-fixing commit; Commit A is still a defect-fixing commit because Commit A is still linked with Issue B. We describe the studied delete rates (i.e., X%) in our results section (Sections 6.1 and 6.2).
3. Preprocess data. We executed the preprocessing for the ILAs. The missing defect-fixing commits should not have any issue ids in their commit messages because we assume that the keyword extraction approach overlooks such commits. Hence, we removed issue ids from the commit messages of the missing defect-fixing commits to conduct a fair comparison when using the commit messages. In addition, we applied a basic restriction. We describe the details of this restriction in Section 5.2. The details of the preprocessing for each ILA are given in Appendix A.

4. Extract links by the ILAs. We applied the ILAs to the preprocessed commits and issue reports for each delete rate. When using a delete rate greater than 0%, ILAs are trained on the explicit links without the deleted links if such ILAs need to be trained.

5. Execute defect prediction based on extracted links. We executed the defect prediction on the extracted links. We first used the extracted links to identify defect-fixing commits. We used such defect-fixing commits to identify defect-inducing commits by using the commit-link algorithm. We describe the details of our commit-link algorithm implementation in Section 5.3. Based on the defect-inducing commits, we trained the defect prediction model and evaluated the performance across different ILAs.

6. Repetitions. To relieve data selection bias on the deleted links, we repeated steps 2-5. We repeated steps 1-4 100 times while we repeated step 5 20 times. We used the 100 results of step 4 as the RQ1 results and the 20 results of step 5 as the RQ2 results. We employed different numbers of repetitions because the execution time of step 5 would be too long to conduct 100 repetitions. We discuss the details of the execution time in Section 6.2.
Fig. 3 An example to delete 66% of the explicit links from three of them
A running example. Let us describe these steps with an example: the Avro project. In particular, we utilize the commit a439bf9. In step 1, the keyword extraction approach forms the links. The commit message of the commit a439bf9 includes a studied issue id of AVRO-2741 that is labeled Bug. Consequently, the keyword extraction approach links this commit to the issue report of AVRO-2741 and refers to this commit as a defect-fixing commit. Also, this link is an explicit link. For all the commits in the Avro project (2,728), the keyword extraction approach links 778 commits to issue reports labeled Bug. These commits are also defect-fixing commits, and all the links are explicit links.
In step 2, we delete X% of the links. If X is zero, no links are deleted. However, if X is not zero, X% of the links are deleted. For example, if X is 50, half of the links in the Avro project are randomly deleted. We refer to all commits whose links are all deleted as missing defect-fixing commits. For example, if the link for the commit a439bf9 is deleted, the commit a439bf9 is a missing defect-fixing commit.
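The link deletion in step 2 can be sketched as follows, using the three-link example from the description of that step (Commit A and Issue A, Commit A and Issue B, Commit B and Issue A). The function names are hypothetical, and rounding the number of deleted links to the nearest integer is an assumption chosen so that 66% of three links gives two deletions, as in that example.

```python
import random


def missing_commits(kept_links, deleted_links):
    """Commits that appear only in deleted links, i.e., have no surviving link."""
    still_linked = {commit for commit, _ in kept_links}
    return {commit for commit, _ in deleted_links} - still_linked


def delete_links(explicit_links, rate_percent, rng):
    """Randomly delete rate_percent% of the (unique) explicit links.

    Returns (kept, deleted). The rounding rule is an assumption.
    """
    n_delete = round(len(explicit_links) * rate_percent / 100)
    deleted = rng.sample(explicit_links, n_delete)
    deleted_set = set(deleted)
    kept = [link for link in explicit_links if link not in deleted_set]
    return kept, deleted


links = [("Commit A", "Issue A"), ("Commit A", "Issue B"), ("Commit B", "Issue A")]

# The deletion described in the text: Commit A keeps its link to Issue B.
paper_deleted = [("Commit A", "Issue A"), ("Commit B", "Issue A")]
paper_kept = [("Commit A", "Issue B")]
print(missing_commits(paper_kept, paper_deleted))  # {'Commit B'}

# One random draw; which links are deleted depends on the seed.
kept, deleted = delete_links(links, 66, random.Random(0))
print(kept, deleted, missing_commits(kept, deleted))
```

A commit becomes missing only when every one of its links is deleted, which is why Commit A survives in the worked example.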
In step 3, if the commit a439bf9 is a missing defect-fixing commit, the issue id (i.e., AVRO-2741) is excluded from the commit message. We apply this exclusion process to all missing defect-fixing commits. In addition, we also apply the basic restriction to all remaining links to exclude the false-positive links (Section 5.2). The link of the commit a439bf9 is not a false-positive link; therefore, it is not a target of the basic restriction.
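The exclusion of issue ids in step 3 might look like the sketch below. The id pattern is an assumed JIRA-style expression, the cleanup of leftover punctuation is an illustrative choice, and the commit hashes and messages are invented; none of these details are taken from the study's implementation.

```python
import re

# Assumed JIRA-style id pattern (e.g., DEMO-2741); not the paper's expression.
ISSUE_ID = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def strip_issue_ids(message):
    """Remove issue ids, then tidy whitespace and leading punctuation."""
    without_ids = ISSUE_ID.sub(" ", message)
    collapsed = " ".join(without_ids.split())
    return collapsed.lstrip(":.- ")


def preprocess(messages, missing):
    """Strip issue ids only from the messages of missing defect-fixing commits."""
    return {
        sha: strip_issue_ids(message) if sha in missing else message
        for sha, message in messages.items()
    }


messages = {
    "c1": "DEMO-2741: handle empty input",  # link deleted in step 2
    "c2": "DEMO-5: keep me",                # link still present
}
print(preprocess(messages, missing={"c1"}))
```

Leaving the id in place would let any message-based ILA trivially recover the deleted link, which is why the text treats this removal as necessary for a fair comparison.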
In step 4, ILAs are applied to all commits and issue reports. If the commit a439bf9 is a missing defect-fixing commit, ILAs may recover the link between the commit and the issue report of AVRO-2741. However, ILAs may form links between the commit and other issue reports as false-positive links. Similarly, ILAs may recover links between any commits and any issue reports.
In step 5, we conduct defect prediction. First, to find the corresponding defect-inducing commits, we use all the defect-fixing commits that are not missing, together with all the commits that are not defect-fixing but are linked by ILAs to issue reports labeled Bug. For the commit a439bf9, if either the link is not deleted in step 2 or the link is recovered in step 4, this commit is referred to as a defect-fixing commit and it is utilized to find the corresponding defect-inducing commits. Otherwise, this commit is not referred to as a defect-fixing commit even if it is an actual defect-fixing commit. We build a defect prediction model based on the defect-inducing commits.
In step 6, steps 2-5 are repeated. This repetition allows us to study the impact of various deleted links and false-positive links (i.e., the combination of defect-fixing commits, false-positive defect-fixing commits, and missing defect-fixing commits). For the commit a439bf9, we study both cases where the commit a439bf9 is a missing defect-fixing commit and a defect-fixing commit.
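The link deletion in step 2 can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `delete_links`, the rounding rule, and all commit hashes and issue ids other than a439bf9 and AVRO-2741 are hypothetical.

```python
import random

def delete_links(links, rate, seed):
    """Randomly delete a fraction `rate` (0 to 1) of the explicit links.

    links: list of (commit_hash, issue_id) pairs.
    Returns (kept_links, missing_commits); a commit is missing when all of
    its links were deleted.
    """
    rng = random.Random(seed)
    n_delete = round(len(links) * rate)  # rounding rule is an assumption
    deleted = set(rng.sample(range(len(links)), n_delete))
    kept = [link for i, link in enumerate(links) if i not in deleted]
    kept_commits = {commit for commit, _ in kept}
    missing = {commit for commit, _ in links} - kept_commits
    return kept, missing

# Toy data: deleting 66% of three links removes two of them.
toy_links = [("a439bf9", "AVRO-2741"), ("c1", "AVRO-1"), ("c2", "AVRO-2")]
kept, missing = delete_links(toy_links, 0.66, seed=0)
print(len(kept), len(missing))
```

Repeating the call with different seeds corresponds to step 6, where the same commit can be missing in one repetition and retained in another.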

Methodology
In this section, we describe our methodology. In particular, we discuss our studied datasets, ILAs, a commit-link algorithm, defect prediction models, evaluation measures, preprocessing steps, a resampling approach, and validation schemes. The tools, data, and operations in Fig. 2 correspond to each method.

Studied Datasets
We used five open-source software projects (Avro (Apache Software Foundation 2009b), Tez (Apache Software Foundation 2014), ZooKeeper (Apache Software Foundation 2008), Chukwa (Apache Software Foundation 2009), and Knox (Apache Software Foundation 2013)) from the Apache Software Foundation as our studied datasets. Table 3 describes the basic information of the projects. Avro is a data serialization system. Developers can use Avro to transform raw data into rich binary data. Tez is a framework on Hadoop that allows developers to process data. ZooKeeper is a centralized service for managing distributed systems. Chukwa is a monitoring system for distributed systems. Knox provides developers with an application gateway on Hadoop. As a result, our study covers two domains (data serialization and distributed systems). We extracted Git repositories on GitHub and issue reports on JIRA for these five projects. The studied data include over 10k linked commits and 5k issue reports.
We chose these five projects because almost all commit messages of the repositories include issue ids on JIRA. The proportion of linked commits (i.e., including issue ids on commit messages) for the Avro project is 81.1%, for the Tez project is 96.3%, for the ZooKeeper project is 84.6%, for the Chukwa project is 82.4%, and for the Knox project is 82.7%; the proportion of defect-fixing commits that are detected by our keyword extraction approach (we describe the details in Section 5.2) for the Avro project is 28.5%, for the Tez project is 53.0%, for the ZooKeeper project is 44.7%, for the Chukwa project is 36.8%, and for the Knox project is 35.1%.
We considered these defect-fixing commits as the ground-truth defect-fixing commits. This practice is commonly used when evaluating ILAs (Bachmann and Bernstein 2009b; Sureka et al. 2011; Sun et al. 2016, 2017a; Xie et al. 2019) because prior studies have validated it (Bachmann and Bernstein 2009b; Sureka et al. 2011; Sun et al. 2017a). Prior studies (Bissyandé et al. 2013; Sun et al. 2017a; Sureka et al. 2011) executed a manual inspection to validate the accuracy of their data. Similarly, to validate the accuracy of our ground-truth data, two of the authors manually inspected both false-positive and false-negative defect-fixing commits. We first randomly extracted, from all projects, 361 defect-fixing commits to count the false-positive defect-fixing commits and 367 non-defect-fixing commits to count the false-negative defect-fixing commits. These sample sizes correspond to a confidence level of 95% and a confidence interval of 5%. Two of the authors labeled these commits as false-positive/negative defect-fixing commits. The kappa coefficients (scikit-learn developers 2020d) of this labeling process are 1.000 and 0.971, respectively. When the two authors disagreed, they discussed the commit carefully and decided the final label. Given this manual inspection, we found that the accuracy of the defect-fixing commits and non-defect-fixing commits is 99.7% (360/361) and 89.1% (327/367), respectively, which are high accuracy values. Consequently, the ground-truth data is reliable. We discuss the false-positive and false-negative defect-fixing commits in Section 7.5 and the threats of this manual inspection in Section 8.3.
We studied the Java source code in the studied projects, though the Avro project also provides developers with implementations in multiple languages. Note that we removed merge commits from the studied commits. This is because merge commits only merge existing changes and do not add or modify any code. In addition, we studied issue reports that are labeled Bug and whose status is either Resolved or Closed. Note that there exists an issue report whose resolution date is missing. Hence, in our experiment, we used the closed date as a proxy for the resolution date if the resolution date is missing.
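One common way to derive a sample size from a 95% confidence level and a 5% interval is Cochran's formula with a finite population correction. The sketch below is an assumption about how such numbers can be obtained, not a description of the authors' calculation; the population size of 10,000 is hypothetical.

```python
import math

def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's sample size with finite population correction.

    z=1.96 corresponds to a 95% confidence level; p=0.5 is the most
    conservative choice for the unknown proportion.
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(10_000))  # hypothetical population of 10,000 commits
```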

Studied ILAs
We first collected prior studies that propose ILAs. To prevent overlooking such prior studies, we used the snowballing approach (Wohlin 2014) that we described in Section 3.3. In particular, when we find a paper that proposes ILAs, we also collect all studies that cite this paper and are cited by it. This process allows us to collect studies regardless of their venues and years. Also, we used the result of our literature survey that we described in Section 2.2. Finally, we found 16 prior studies (Table 2).
Prior studies combined several criteria (e.g., text similarity) in their ILAs (Fischer et al. 2003a; Śliwerski et al. 2005; Bachmann and Bernstein 2009a; Bachmann et al. 2010; Bird et al. 2010; Sureka et al. 2011; Wu et al. 2011; Nguyen et al. 2012; Bissyandé et al. 2013; Le et al. 2015; Schermann et al. 2015; Sun et al. 2016, 2017a; Xie et al. 2019; Tu and Menzies 2020). In this paper, we retrieved each of the criteria from the previous ILAs and call them and their combinations ILAs. This is because such criteria are the finest-grained algorithms when linking issue reports to commits, and we can cover not only previous ILAs, but also other combinations of criteria. Note that we exclude the studies that used manual analysis (Bird et al. 2010; Tu and Menzies 2020). Table 4 lists all the studied ILAs. We studied 10 ILAs including the baseline ILA (the keyword extraction approach). Note that because we used two models for the machine learning approach, the actual number of studied ILAs is 11.
Table 4 Overview of studied ILAs
Before applying these ILAs, we applied an essential restriction:
- a linked issue report is created before the date on which its linked commits are committed; and
- such a linked issue report is resolved after the date on which its linked commits are committed.
All ILAs include this restriction. We call this restriction the basic restriction. This restriction reduces the number of false-positive defect-fixing commits. We discuss the details in Section 7.5. All the implementations of ILAs used in this paper are available as a Python package (Kondo 2021b).
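The basic restriction can be expressed as a simple date check. The following is a minimal sketch, not the implementation in the authors' package: the function name is hypothetical, the inclusive handling of equal timestamps is an assumption, and the fallback to the closed date mirrors the proxy described for issue reports without a resolution date.

```python
from datetime import datetime

def satisfies_basic_restriction(commit_date, created, resolved, closed=None):
    """True if the issue was created before and resolved after the commit."""
    end = resolved if resolved is not None else closed  # closed date as a proxy
    if end is None:
        return False
    return created <= commit_date <= end  # inclusive bounds are an assumption

created = datetime(2020, 1, 1)
committed = datetime(2020, 1, 5)
resolved = datetime(2020, 1, 10)
print(satisfies_basic_restriction(committed, created, resolved))
```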
In the following, we give a brief overview of the ideas behind the ILAs. We describe their details in Appendix A.
- Keyword Extraction (KE): This is a de facto standard approach that identifies defect-fixing commits by extracting issue ids from commit messages with regular expressions. As described previously, we used the output of this ILA as the ground-truth defect-fixing commits. However, even if we use the projects in which almost all commits include issue ids, linking commits and issue reports is a difficult process. We describe this threat in Section 8.
- Message Generation from Source Code: The prior study (Le et al. 2015) used a comment generation technique. They used javadoc comments as the supervised data to train the technique, and therefore, we used the javadoc comments instead of using code comment generation techniques to ensure that clean information is used. The procedure is the same as that in the natural language text similarity approach.
- PU Learning (PU) (Elkan and Noto 2008): As there might exist many unlabeled links between issue reports and commits, a prior study (Sun et al. 2017a) used PU learning to predict positive links based on such unclear data. To predict positive links, we provided the PU learning approach with five features: the time difference, the time difference type, the cosine similarity of text, the proportion of modified source files, and the number of modified source files. Further details of the features are described in Appendix A.
- Machine Learning (ML): This approach applies machine learning models to predict links. Although the PU learning approach predicts positive links based on positive and unlabeled links, this approach predicts positive links based on positive links. We used two machine learning models: a random forest model (scikit-learn developers 2020a) and a support vector machine model (scikit-learn developers 2020c). To predict positive links, we provided the machine learning approach with the same five features that are used in the PU learning approach.
Note that, in this paper, we decided not to use the following four ILAs that have been proposed previously. The file filtering and code similarity approaches need information on modified files (patches). However, the Apache JIRA prohibits such information from being retrieved. The developer filtering approach is a common practice; however, we cannot use developer information because of GDPR (EU 2016). The deep learning approach was proposed by Xie et al. (2019). However, many settings are not clear in the paper, such as the details of the deep learning architectures. Hence, we decided not to use these approaches in this paper.

Commit-Link Algorithm
After detecting defect-fixing commits, we need to detect defect-inducing commits. We call this process the commit-link algorithm (CLA). We used a basic procedure as our CLA:
1. Apply cregit (German et al. 2019) to the target repository. As described in Section 3, cregit converts a Git repository into a view repository in which specified types of files (i.e., Java files) are converted into token-per-line files. Each token also has an AST type. Hence, we can easily ignore redundant tokens (e.g., comments).
2. Extract commit hash lists from the target repository.
3. Remove the first commit hash. This is because the first commit is not related to source code in the Avro and ZooKeeper projects and it is difficult to track individual modifications in the Tez, Chukwa, and Knox projects.
4. Extract the deleted lines for all Java files for each commit, but ignore the added lines. This is because the added lines are newly introduced in the defect-fixing commit. Hence, such lines do not have any information to detect defect-inducing commits.
5. Remove the deleted lines in non-source code (i.e., comments).
6. Apply the git blame command to the remaining deleted lines to identify the commits where the deleted lines were added. We regard the extracted commits as defect-inducing commits.
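The extraction of deleted lines can be illustrated with a small parser for unified diffs. This is a sketch under stated assumptions rather than the authors' implementation: it handles the diff of a single file, ignores the cregit token view and the comment filtering, and uses a hypothetical file name and diff content.

```python
import re

# Hunk header of a unified diff, e.g. "@@ -10,4 +10,3 @@".
HUNK = re.compile(r"^@@ -(\d+)(?:,\d+)? \+\d+(?:,\d+)? @@")

def deleted_line_numbers(diff_text):
    """Return old-file line numbers of the lines deleted in a single-file unified diff."""
    deleted, old_line = [], None
    for line in diff_text.splitlines():
        match = HUNK.match(line)
        if match:
            old_line = int(match.group(1))  # first old-file line of this hunk
        elif old_line is None:
            continue  # file headers before the first hunk
        elif line.startswith("-"):
            deleted.append(old_line)  # candidate line for git blame
            old_line += 1
        elif line.startswith("+") or line.startswith("\\"):
            continue  # added lines and "\ No newline" markers do not advance the old file
        else:
            old_line += 1  # context line
    return deleted

toy_diff = "\n".join([
    "--- a/Foo.java",
    "+++ b/Foo.java",
    "@@ -10,4 +10,3 @@",
    " int a = 1;",
    "-int b = 2;",
    "+int b = 3;",
    " int c = 4;",
    "-int d = 5;",
])
print(deleted_line_numbers(toy_diff))  # [11, 13]
```

Each returned line number n could then be traced with a command of the form `git blame -L n,n <fix-commit>^ -- <path>`, which reports the commit that last touched that line before the fix.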

Studied Defect Prediction Model
We used the logistic regression model as our defect prediction model. As our goal is not to construct an accurate defect prediction model, but to reduce the difference in defect prediction performance from the ground-truth defect prediction performance by using ILAs, we only chose logistic regression. The logistic regression model is frequently used for constructing defect prediction models (Kamei et al. 2013; Basili et al. 1996; Gyimóthy et al. 2005). This model learns the relationship between a dependent variable and independent variables. In defect prediction, the dependent variable is the flag that indicates whether a commit is defective or clean; the independent variables are the features of commits.
To construct a logistic regression model, we used the scikit-learn implementation (scikit-learn developers 2020b). Because it is important to optimize the hyper-parameters of defect prediction models (Tantithamthavorn et al. 2016), we optimized the hyper-parameters of the logistic regression model. The scikit-learn implementation has two hyper-parameters that can be optimized: the regularization strength C and the norm of the penalty. We optimized these two hyper-parameters in the following ranges: 0 to 10 for C, and l1 and l2 for the norm of the penalty. Other hyper-parameters could also be optimized; however, because of the long execution time, we only used these two hyper-parameters. We describe the execution time of defect prediction in Section 6.2. From empirical and theoretical viewpoints, the random search is one of the best optimization methods (Bergstra and Bengio 2012). Hence, we used the random search to optimize the hyper-parameters of the logistic regression.

Evaluation Measures
We evaluated two tasks: the accuracy of detecting defect-fixing commits by the ILAs and the accuracy of detecting defect-inducing commits by the defect prediction model. As each task has different outputs, we used different sets of evaluation measures. In addition, we used a statistical test. In the following, we explain the evaluation measures for each task and the statistical test.

Detecting Defect-Fixing Commits
We used four evaluation measures: precision, recall, F1, and true-positive (TP) rate. The precision indicates the proportion of true defect-fixing commits among all the commits that an ILA identifies as defect-fixing; the recall indicates the proportion of true defect-fixing commits that are identified by an ILA among all the true defect-fixing commits. Here, the true defect-fixing commits are the commits that are identified by the explicit links. Let us assume that we have three true defect-fixing commits and two clean commits, and an ILA detects one true defect-fixing commit and one clean commit as defect-fixing commits. In this case, the precision value would be 0.500 (1/2), and the recall value would be 0.333 (1/3) (Fig. 4).
The TP rate is used on this task only. In this task, we deleted X% of the links and have the ILAs recover missing defect-fixing commits. The precision, recall, and F1 values were computed on all the true defect-fixing commits; however, the TP rate was computed on the missing defect-fixing commits only. This is because we want the TP rate to evaluate the accuracy of the ILAs on the missing defect-fixing commits. Let us assume that we have five missing defect-fixing commits. If an ILA detects two missing defect-fixing commits as defect-fixing commits, the TP rate would be 0.400.
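The two worked examples above can be reproduced with a few set operations. This is a minimal sketch; the function name `ila_measures` and the commit labels are hypothetical.

```python
def ila_measures(true_fixing, detected, missing):
    """Precision, recall, and F1 over all true defect-fixing commits; TP rate over missing ones."""
    tp = len(true_fixing & detected)
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(true_fixing) if true_fixing else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    tp_rate = len(missing & detected) / len(missing) if missing else 0.0
    return precision, recall, f1, tp_rate

# Three true defect-fixing commits; the ILA flags one of them and one clean commit.
true_fixing = {"fix1", "fix2", "fix3"}
detected = {"fix1", "clean1"}
precision, recall, f1_value, _ = ila_measures(true_fixing, detected, missing=set())
print(round(precision, 3), round(recall, 3), round(f1_value, 3))  # 0.5 0.333 0.4

# Five missing defect-fixing commits; the ILA recovers two of them.
missing = {"m1", "m2", "m3", "m4", "m5"}
_, _, _, tp_rate = ila_measures(missing, {"m1", "m2"}, missing)
print(round(tp_rate, 3))  # 0.4
```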

Detecting Defect-Inducing Commits
We used six evaluation measures: area under the receiver operating characteristic curve (AUC), precision, recall, F1, Matthews correlation coefficient (MCC), and Brier score. AUC and Brier score are threshold-independent measures, whereas precision, recall, and F1 are threshold-dependent measures (we used 0.5 as the threshold). We used threshold-independent measures because Tantithamthavorn and Hassan (2018) suggested using them: threshold-dependent measures may lead to different conclusions under different thresholds. However, we also used threshold-dependent measures because such measures show us various viewpoints on the results. We also used a threshold-dependent measure, MCC, because prior studies reported that MCC is robust to the skewness of defect data (Boughorbel et al. 2017; Zhang et al. 2016).
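To make the distinction concrete, the sketch below computes MCC from a confusion matrix at the 0.5 threshold and the Brier score directly from predicted probabilities. The helper names and the toy probabilities are assumptions; treating a probability equal to the threshold as defective is also an assumption.

```python
import math

def confusion(probs, labels, threshold=0.5):
    """Count TP, FP, TN, FN after thresholding predicted probabilities."""
    tp = fp = tn = fn = 0
    for prob, label in zip(probs, labels):
        predicted = 1 if prob >= threshold else 0
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def brier_score(probs, labels):
    """Mean squared difference between predicted probabilities and labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

probs = [0.9, 0.6, 0.4, 0.1]   # toy predicted probabilities of being defective
labels = [1, 0, 1, 0]          # 1 = defect-inducing, 0 = clean
print(mcc(*confusion(probs, labels)), round(brier_score(probs, labels), 3))  # 0.0 0.185
```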

The Scott-Knott ESD test
We used the Scott-Knott ESD test (Tantithamthavorn et al. 2017) as our statistical test to compare the evaluation measures across ILAs (at the 5% significance level, i.e., a 95% confidence level). The Scott-Knott ESD test is an extended version of the Scott-Knott test. The Scott-Knott test is a clustering algorithm that ranks the distributions. If distributions are not statistically significantly different, these distributions are placed in the same rank. The Scott-Knott ESD test ranks the distributions using not only statistically significant differences but also Cohen's d effect size (Cohen 2013). The distributions that are not statistically significantly different or that differ with a negligible effect size are placed in the same rank.
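The effect size part of this test can be illustrated by computing Cohen's d for two distributions. This is only a sketch of the effect size calculation with a pooled standard deviation, not of the full Scott-Knott ESD clustering; the threshold of 0.2 for a negligible effect is the conventional one and is stated here as an assumption, and the input values are toy data.

```python
import math
import statistics

def cohens_d(a, b):
    """Cohen's d with a pooled sample standard deviation (assumes non-zero variance)."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(
        ((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b))
        / (na + nb - 2)
    )
    return (statistics.mean(a) - statistics.mean(b)) / pooled

def is_negligible(d, threshold=0.2):
    """Conventional cut-off for a negligible effect (assumption)."""
    return abs(d) < threshold

d = cohens_d([1, 2, 3], [2, 3, 4])
print(d, is_negligible(d))
```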

Preprocessing for Predicting Defect-Inducing Commits
To predict defect-inducing commits, we used the defect prediction model. Thus, we need to transform a commit into a numerical vector representation. The most common representation in commit-level defect prediction (a.k.a. just-in-time defect prediction) is a metrics-based one, such as the change metrics (Kamei et al. 2013; Kim et al. 2008; Mockus and Votta 2000; Kondo et al. 2020).
In this paper, we used the change metrics (Kamei et al. 2013; Kondo et al. 2020) to transform a commit into a numerical vector representation and evaluate the ILAs. We used Commit Guru (Rosen et al. 2015) to calculate the change metrics. We transformed the change metrics to remove correlated features and normalize the features following a previous study (Kondo et al. 2020):
- Exclude ND and REXP because they are strongly correlated with NF and EXP.
Finally, we apply the z-score (Kondo et al. 2019) to the processed change metrics. Note that we decided not to apply z-score to FIX because FIX is a binary metric.
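The normalization step can be sketched as follows. The function names and metric values are hypothetical, and the use of the population standard deviation is an assumption; the binary FIX metric is passed through unchanged.

```python
import statistics

def z_score(values):
    """Standardize values to zero mean and unit standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD; an assumption
    return [(v - mean) / sd if sd else 0.0 for v in values]

def normalize(metrics, skip=("FIX",)):
    """Apply the z-score to every metric except those listed in `skip`."""
    return {
        name: list(column) if name in skip else z_score(column)
        for name, column in metrics.items()
    }

metrics = {"NF": [2, 4, 4, 4, 5, 5, 7, 9], "FIX": [0, 1, 0, 0, 1, 0, 0, 1]}  # toy data
normalized = normalize(metrics)
print(normalized["NF"][0], normalized["NF"][-1])  # -1.5 2.0
```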

Resampling Approach
When learning the model, the learning performance might be affected by imbalanced data (Tan et al. 2015). Prior studies (Bennin et al. 2017; Agrawal and Menzies 2018) recommend using the following resampling approaches: random under-sampling, SMOTUNED, and MAHAKIL. In particular, SMOTUNED and MAHAKIL are state-of-the-art approaches.
To mitigate the effect of imbalanced data, we compared the three approaches in defect prediction and selected one of them for our study. Because we used the resampling approach in RQ2, we evaluated the impact of these approaches on the defect prediction performance in the same experimental setting as RQ2 except for the number of repetitions. Owing to the long execution time, we used 10 repetitions. Given the result, we found that SMOTUNED is the best resampling approach in our study. Hence, we employed SMOTUNED. We only applied SMOTUNED to training data because we must use raw test data for evaluation.

Validation Schemes
We need to relieve data selection bias on the deleted links. If we used only a single set of deleted links, our results might depend on which links happened to be deleted. Therefore, we repeated the process of deleting links 100 times for each delete rate. For each repetition, we computed the evaluation measures of detecting defect-fixing commits in RQ1. Also, we used 20 of them in RQ2 to compute the evaluation measures of detecting defect-inducing commits. Finally, we computed the median evaluation measures across 100 repetitions in RQ1 and 20 repetitions in RQ2. When applying the Scott-Knott ESD test, we considered the values of an evaluation measure over the 100 (RQ1) or 20 (RQ2) repetitions as a distribution for each ILA.
When evaluating the accuracy of detecting defect-inducing commits (just-in-time defect prediction), we also need to relieve data selection bias on the training data and test data. Cross-validation techniques or bootstrap-sampling techniques (Tantithamthavorn et al. 2017) are frequently used. However, just-in-time defect prediction is studied on sequential data. We must use past commits/changes to train the model without any information from the future commits/changes. Thus, we used online change classification, which satisfies this restriction. Online change classification was originally proposed by Tan et al. (2015), and Kondo et al. (2020) formalized the parameters. We provide a Python package of the online change classification (Kondo 2021c). Table 5 lists the parameter settings of the online change classification. We used the same process as prior work (Kondo et al. 2020).

RQ1: Which Issue-Link Algorithm is the Best to Detect Defect-Fixing Commits?
Motivation and Approach: In recent years, several prior studies (Sureka et al. 2011; Bissyandé et al. 2013; Schermann et al. 2015; Sun et al. 2016, 2017a; Xie et al. 2019) focused on recovering missing links rather than detecting missing defect-fixing commits. A missing link indicates a link between a commit and an issue report that is not detected by the KE approach. A missing defect-fixing commit is a commit that fixes a defect but is not detected by the KE approach.
Our main aim is to evaluate the ability of the ILAs in terms of detecting missing defect-fixing commits rather than detecting missing links. This is because we want to contribute to defect prediction rather than to recovering missing links.
In this experiment, we deleted 10% to 50% of the explicit links in steps of 10%. We considered the deleted explicit links as missing links and commits that are only linked with such missing links as missing defect-fixing commits. We evaluated how many missing defect-fixing commits are detected by the ILAs. Table 6 shows the median values of the evaluation measures for the 100 repetitions; each row indicates an ILA, and each column indicates an evaluation measure. The cells show not only the median values of the evaluation measures, but also, in parentheses, the ranks that were computed by the Scott-Knott ESD test across the ILAs. The gold cells indicate the highest rank (= 1). Owing to space limitations, we only show the delete rates of 50% and 10%.

Observation 1) The TF approach generally statistically significantly outperformed the other ILAs in the case where the delete rate is 50%. Tables 6a, c, e, and g list the results on the datasets with the delete rate of 50% in the Avro, Tez, ZooKeeper, and Chukwa projects. The TF approach achieved the highest rank on recall, F1, and TP rate at the delete rate of 50% except for F1 in the Chukwa project, where it ranked second. This result implies that the TF approach recovers the largest number of missing defect-fixing commits (recall and TP rate) in almost all projects, whereas the number of false-positive defect-fixing commits (not defect-fixing commits, but identified by the ILA) is moderate (F1). However, the TF approach ranked fourth or fifth in terms of precision in these four projects. Hence, even if the TF approach achieved the highest F1 rank, we need to be aware of false-positive defect-fixing commits when using the TF approach. Finally, the TF approach did not achieve high ranks in the Knox project on the three evaluation measures (Table 6i). Hence, the TF approach generally statistically significantly outperformed the other ILAs, although there are projects in which the TF approach does not work well.

Observation 2) The TF approach statistically significantly outperformed the other ILAs in terms of the TP rate for all delete rates except for the Knox project. Except for the Knox project, all the results in Table 6 show that the TF approach achieved the highest rank in terms of the TP rate. We observed the same results in the other delete rates as well. This result implies that the TF approach can recover the most missing defect-fixing commits for not only the delete rate of 50% but for all the delete rates in many projects.

Observation 3) The TS approach statistically significantly outperformed the other ILAs in terms of the TP rate for all delete rates in the Knox project. We observed that the TF approach is the best approach in the Avro, Tez, ZooKeeper, and Chukwa projects in terms of the TP rate. However, in the Knox project, the TS approach achieved the highest rank in all the delete rates. Hence, the TS approach may recover the most missing defect-fixing commits in certain projects.

Observation 4) The PH achieved the highest rank (statistically significantly outperforming the ILAs that are placed at lower ranks) in terms of recall or F1 in 32 out of 40 cases between the delete rates of 10% and 40%. Tables 6b, d, f, h, and j list the results on the datasets with the delete rate of 10%. The PH achieved the highest rank on recall and F1 in all the projects. We observed similar results between the delete rates of 10% and 40% (32 out of 40 cases). This result implies that the PH detects many defect-fixing commits (recall) while keeping the number of false-positive defect-fixing commits moderate (F1). However, compared with the TF or TS approach, the PH achieved statistically significantly lower recall and F1 in the datasets with the delete rate of 50% except for one case, and a lower TP rate in all delete rates. Hence, the PH potentially overlooks missing defect-fixing commits compared with the TF or TS approach.

Observation 5) The RF approach achieved the highest rank in terms of the precision in 22 out of 25 cases. Except for the Tez project with delete rates of 10%, 20%, and 50%, the RF approach achieved the highest rank on precision. This result implies that the RF approach prevents false-positive defect-fixing commits. Indeed, the median precision values are over 0.900. Hence, the RF approach could recover missing defect-fixing commits accurately. However, the recall values and ranks are low. Hence, the RF approach may overlook many defect-fixing commits. Note that the SVM approach achieved the highest or second highest rank in 19 out of 25 cases. Hence, the SVM approach could also recover missing defect-fixing commits accurately.

RQ2: Which Issue-Link Algorithm is the Best to Prevent a Defect Prediction Model From Being Affected by Missing Defect-Fixing Commits in Defect Prediction?
Motivation and Approach: From the RQ1 results, we found that the ILAs can detect missing defect-fixing commits. In particular, the following ILAs performed well:
- the time filtering approach;
- the natural language text similarity approach;
- the Phantom heuristics;
- the random forest approach;
- the support vector machine approach.
We hypothesize that such ILAs can improve the reliability of defect prediction performance on a low-quality dataset by reducing the number of missing defect-fixing commits. A low-quality dataset indicates a dataset that has many missing defect-fixing commits. If there exist many missing defect-fixing commits, a defect prediction model may not learn from a sufficient number of defect-inducing commits, because too few defect-fixing commits are available to detect them.
We prepared six possible scenarios: we randomly deleted 0% to 50% of links in steps of 10% as described in Section 4. In particular, we refer to the scenario where 0% of links are deleted as a high-quality dataset scenario; we refer to the other scenarios (where 10%, 20%, 30%, 40%, and 50% of links are deleted) as low-quality dataset scenarios.
We refer to the defect prediction performance on the high-quality dataset where the KE approach is used to detect defect-fixing commits as the ground-truth defect prediction performance. We computed the difference between the ground-truth defect prediction performance and the defect prediction performance on the low-quality datasets where we use any ILAs. If such ILAs detect missing defect-fixing commits accurately and sufficiently, the difference would be smaller than when using the KE approach only on the low-quality dataset scenarios. For example, if a defect prediction model on a low-quality dataset where an ILA is used matches the ground-truth defect prediction performance, such an ILA may detect all the missing defect-fixing commits. Note that we investigated not only individual ILAs, but also all combinations of the ILAs that performed well in RQ1. Figure 5 shows the procedure of the RQ2 approach for a studied project. We describe the steps in the following. The details are described in Section 5.

1. Execute defect prediction based on the explicit links. We used the explicit links (detected by the KE approach on the high-quality dataset) to detect defect-fixing commits and compute the ground-truth defect prediction result in terms of six evaluation measures (AUC, precision, recall, F1, MCC, and Brier score).
2. Randomly delete X% of the links. We randomly delete X% of the links (10% to 50% in steps of 10%) from the explicit links and prepare a low-quality dataset.
3. Apply ILA. We apply an ILA to the dataset to detect missing defect-fixing commits.
4. Execute defect prediction based on the dataset that was processed by ILA. We execute defect prediction based on the dataset. We repeat steps 2-4 20 times to relieve the data selection bias of deleted links. Eventually, we have 20 ILA defect prediction results for each evaluation measure.
5. Compute the absolute difference between the ground-truth result and ILA results. We compute the absolute difference between the ground-truth defect prediction result and the ILA defect prediction results. As we have 20 ILA defect prediction results, this process results in 20 differences for each evaluation measure.
6. Construct the distribution of differences for each evaluation measure. As each of the evaluation measures has 20 results, we consider these 20 results as a distribution of an evaluation measure. We repeat this process for each ILA for each delete rate (10% to 50%). Eventually, each ILA has a distribution for each evaluation measure for each delete rate.
7. Apply the Scott-Knott ESD test to compare ILAs for each evaluation measure for each delete rate. To identify the ILA that achieves the smallest differences, we apply the Scott-Knott ESD test to the distributions of all ILAs for each evaluation measure for each delete rate.
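The difference computation in the steps above can be sketched for one ILA, one evaluation measure, and one delete rate. The function name and all AUC values are toy assumptions; only three repetitions are shown instead of 20.

```python
import statistics

def absolute_differences(ground_truth, ila_results):
    """Absolute difference between the ground-truth value and each repetition's value."""
    return [abs(ground_truth - result) for result in ila_results]

ground_truth_auc = 0.80                # toy ground-truth value
ila_auc_results = [0.78, 0.75, 0.81]   # toy results of three repetitions
differences = absolute_differences(ground_truth_auc, ila_auc_results)
median_difference = statistics.median(differences)
print(round(median_difference, 3))  # 0.02
```

The list of differences plays the role of the distribution that is passed to the statistical test, and the median is the kind of value reported per cell.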
In these steps, the execution time of defect prediction (Step 4) is remarkably long. In this paper, we repeated this step 20 times for 31 ILAs (all combinations of six ILAs), 5 studied projects, and 6 dataset scenarios. The execution time of one repetition for an ILA, a studied project, and a dataset scenario is about 761 seconds on a computational resource that consists of 8 CPUs and 32 GB of memory with parallel execution. Hence, if we repeated this process 100 times as in RQ1, the total expected execution time would be 31 × 5 × 6 × 100 × 761 seconds ≈ 819 days. To reduce this execution time, we only repeat this process 20 times in this RQ.
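The estimate can be checked with a few lines of arithmetic, using only the figures stated above (31 ILAs, 5 projects, 6 scenarios, about 761 seconds per repetition). The variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

ilas, projects, scenarios, seconds_per_run = 31, 5, 6, 761

def total_days(repetitions):
    """Total sequential execution time in days for a given number of repetitions."""
    return ilas * projects * scenarios * repetitions * seconds_per_run / SECONDS_PER_DAY

print(round(total_days(100)))  # about 819 days with 100 repetitions
print(round(total_days(20)))   # about 164 days with 20 repetitions
```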
Results: Observation 6) The combination of the TS, PH, RF, and SVM approaches most often achieved the highest rank or statistically significantly reduced the absolute differences compared with the KE approach. To make this observation easier to understand, we first describe the result in one experimental setting. Table 7 lists the median absolute differences between the ground-truth result and the ILA results in the Avro project with the delete rate of 50%. The values in parentheses show the ranks that were computed by the Scott-Knott ESD test across the ILAs. The gold cells indicate the cases where the rank is the highest (rank 1) across the ILAs. The cyan cells indicate the cases where the rank is higher than the rank of the KE approach. The COUNT column indicates the numbers of gold and cyan cells for each row; the values in parentheses indicate the number of gold cells only. We observed that 18 of the ILAs statistically significantly reduce the absolute differences for all the evaluation measures (i.e., the values in the COUNT column were six). Hence, these ILAs work well at reducing the absolute differences in this experimental setting. Table 8 lists the summation of all the COUNT values for each ILA. As we used six evaluation measures in the five projects with five delete rates, the maximum summation value is 150. We observed that the combination of the TS, PH, RF, and SVM approaches achieved 111, which is the highest value. This result implies that this combination statistically significantly reduces the absolute differences compared with the KE approach or at least achieves the highest rank.
Observation 7) All ILAs statistically significantly reduced the absolute differences compared with the KE approach. Table 8 indicates that the KE approach, which is the baseline, achieved 56, which is the smallest value. Hence, the ILAs statistically significantly reduced the absolute differences compared with the KE approach.

Observation 8) The combination of the TS, PH, RF, and SVM approaches achieves better results in the lower-quality dataset scenarios, while it may achieve worse results in the higher-quality dataset scenarios. Table 9 lists all the median absolute differences of the combination of the TS, PH, RF, and SVM approaches with the Scott-Knott ESD test results. In the Chukwa project, the numbers of cyan and gold cells with the delete rate of 10% were zero. In addition, the numbers in the Tez project with the delete rate of 20% were one and one; the numbers in the ZooKeeper project with the delete rate of 10% were two and one. This result implies that the best combination of ILAs may be more suitable for lower-quality datasets, while it may achieve worse results in certain projects with higher-quality datasets.

Can the RF Approach Detect Missing Defect-Fixing Commits in the High-Quality Dataset?
From the RQ1 result, the RF approach achieves the highest precision (e.g., 1.000); hence, we suppose that the RF approach can identify new defect-fixing commits that are not detected by the KE approach in the high-quality dataset scenario. Table 10 lists the links of newly identified defect-fixing commits and issue ids in the high-quality dataset. The non-green cells are actual links that were confirmed manually by two of the authors.

Which Time Interval is the Best to Detect Defect-Fixing Commits?
Observation 11) The 10-minute time interval is the best setting to detect defect-fixing commits in our studied projects. Table 11 indicates the performance of the TF approach in terms of detecting defect-fixing commits in different time intervals. The gold cells indicate values over 0.7. The values in the parentheses show the ranks that were computed by the Scott-Knott ESD test for each evaluation measure across five time intervals. The delete rate is 50%. We observed two findings: the smaller the time interval, the better the performance in terms of precision; the larger the time interval, the better the performance in terms of recall and the TP rate (deleted). This is because these evaluation measures are in a trade-off relationship. Hence, we focused on F1, the harmonic mean of precision and recall. As the TF approach does not work well in the Knox project (RQ1), we only studied the other four projects.
The time intervals of 10 and 30 minutes achieved rank 1 once in the Scott-Knott ESD test. The time interval of 5 minutes achieved rank 1 twice. However, the time interval of 5 minutes achieved rank 4 in the Tez project. Although the time interval of 10 minutes achieved rank 1 once, it achieved rank 2 in the other projects. Therefore, we concluded that the time interval of 10 minutes is well balanced. This result implies that the 10-minute time interval detects many defect-fixing commits while keeping the number of false-positive defect-fixing commits low.
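The trade-off between precision and recall across time intervals can be illustrated with a small sketch. It assumes, for illustration only, that a link is predicted when the absolute time difference between the commit date and the issue resolved date is within the interval; the exact TF rule is defined elsewhere in the paper. The candidate pairs are hypothetical toy data and do not reproduce Table 11.

```python
def scores_for_interval(pairs, interval_minutes):
    """pairs: list of (time_difference_seconds, is_true_link) tuples."""
    limit = interval_minutes * 60
    tp = sum(1 for diff, true in pairs if abs(diff) <= limit and true)
    fp = sum(1 for diff, true in pairs if abs(diff) <= limit and not true)
    fn = sum(1 for diff, true in pairs if abs(diff) > limit and true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical candidate links: (time difference in seconds, true link?).
pairs = [(60, True), (240, True), (500, True), (550, False),
         (1500, True), (1700, False), (3000, False), (4000, True)]

results = {m: scores_for_interval(pairs, m) for m in (5, 10, 30)}
for minutes, (p, r, f1) in results.items():
    print(minutes, round(p, 3), round(r, 3), round(f1, 3))
```

In this toy data, precision falls and recall rises as the interval grows, which is why a combined measure such as F1 is needed to pick one interval.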

Do ILAs Affect the Effort-Aware Defect Prediction Performance Measures?
Motivation and Approach: Just-in-time defect prediction models help in identifying whether a commit is likely to be defective. If such a commit is identified as defective, developers spend their test effort inspecting this commit to fix the defect; however, their test effort is limited. Hence, considering the test effort is also important to evaluate defect prediction models.
IFA measures the number of commits that need to be inspected before the first defect-inducing commit is identified. A smaller IFA implies that defect prediction models identify defect-inducing commits at an early time. PII@L and CostEffort@L measure the number of commits that need to be inspected and the number of identified defect-inducing commits, respectively, when developers can inspect L lines of code. We used the same L as the prior study (Ni et al. 2020): 20%, 1000, and 2000. Norm(Popt) indicates the similarity between the prediction result and the optimized case where defect prediction models perfectly predict defect-inducing commits according to the number of lines of code (Kamei et al. 2013). The range is between 0 and 1; the higher the value, the better the prediction result.
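The following sketch shows one way to compute IFA and the two counts behind PII@L and CostEffort@L as they are described above. It is an illustration under our own assumptions: commits are ranked by a predicted risk score only, the budget L is given in lines of code (for L = 20%, pass 20% of the total lines), and the field names and data are hypothetical. Prior work may rank and normalize these measures differently.

```python
def effort_aware_counts(commits, budget_loc):
    """commits: list of dicts with 'score', 'loc', and 'defective' keys.

    Returns (ifa, inspected, found) for an inspection budget in lines of code.
    """
    ranked = sorted(commits, key=lambda c: c["score"], reverse=True)

    # IFA: commits inspected before the first truly defect-inducing commit.
    ifa = 0
    for commit in ranked:
        if commit["defective"]:
            break
        ifa += 1

    # Walk down the ranking until the next commit would exceed the budget.
    inspected = found = spent = 0
    for commit in ranked:
        if spent + commit["loc"] > budget_loc:
            break
        spent += commit["loc"]
        inspected += 1
        found += int(commit["defective"])
    return ifa, inspected, found

# Hypothetical prediction results.
commits = [
    {"score": 0.9, "loc": 300, "defective": False},
    {"score": 0.8, "loc": 200, "defective": True},
    {"score": 0.6, "loc": 400, "defective": True},
    {"score": 0.4, "loc": 500, "defective": False},
    {"score": 0.2, "loc": 100, "defective": True},
]
counts_1000 = effort_aware_counts(commits, budget_loc=1000)
print(counts_1000)
```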
Results: Observation 12) The KE approach is the best approach in terms of the effort-aware evaluation measures. In RQ2, the KE approach is the worst approach. However, no ILAs achieved a higher sum of the COUNT values than the KE approach in terms of the effort-aware evaluation measures (Table 12). As a result, the ILAs may not work well in terms of the effort-aware evaluation measures. However, the difference between the largest sum of the COUNT values and the smallest one is 32 in this analysis, while that in RQ2 is 55 (111-56). Also, because we used eight effort-aware evaluation measures (PII@L and CostEffort@L have three variants), the maximum value is 200, while that in RQ2 is 150. If we consider this difference, the difference ratio between the largest value and the smallest value in RQ2 is about two times larger than that in this analysis. Hence, the difference between the best approach and the worst approach in terms of the effort-aware evaluation measures may be small. Future studies are necessary to investigate the relationship between ILAs and effort-aware evaluation measures.
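The comparison of the two difference ratios can be checked with the numbers stated above. The variable names are ours.

```python
# RQ2: six measures x five projects x five delete rates = 150.
rq2_ratio = (111 - 56) / (6 * 5 * 5)

# This analysis: eight effort-aware measures x five projects x five delete rates = 200.
effort_ratio = 32 / (8 * 5 * 5)

ratio_of_ratios = rq2_ratio / effort_ratio
print(round(rq2_ratio, 3), round(effort_ratio, 3), round(ratio_of_ratios, 2))
```

The normalized spread in RQ2 is a little more than twice the spread observed for the effort-aware measures.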

The False-positive/negative Defect-fixing Commits in the Ground Truth Data
As we described in Section 5.1, we found that our ground truth data (defect-fixing commits) are accurate through manual inspection. However, there exist a few false-positive/negative defect-fixing commits. For example, the commit cf3318e1b in the Tez project is labeled a defect-fixing commit. However, this is a false-positive defect-fixing commit. The commit message includes two issue ids: TEZ-1594 labeled Sub-task and TEZ-8 labeled Bug. As the KE approach links this commit to these two issue reports, this commit is referred to as a defect-fixing commit; however, the link to TEZ-8 is a false-positive link because TEZ-8 is not directly related to this commit. Hence, this commit is a false-positive defect-fixing commit.
The commit 0b74bd5e in the Avro project is an example of a false-negative defect-fixing commit. This commit does not include any issue ids in its commit message. However, CHANGES.txt, which is a changed file, includes an issue id labeled Bug. Hence, this commit should be a defect-fixing commit.
Finally, this manual inspection provides us with an interesting suggestion. Our basic restriction (Section 5.2) may exclude defect-fixing commits that do not fix source code (i.e., noise). For example, the commit message of the commit c89e352e0 in the Tez project includes an issue id, TEZ-2885 labeled Bug. Hence, prior KE approaches may link this commit and the issue report. However, this commit only modifies CHANGES.txt while the actual defect-fixing commit is the commit 6eb2cb551. This may occur if developers forget to modify CHANGES.txt. This kind of defect-fixing commit should be excluded from the defect prediction research. Because our basic restriction excludes these commits, we suggest researchers and practitioners use the basic restriction at least. This kind of restriction is employed by prior work (McIntosh and Kamei 2018).
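A restriction of this kind can be sketched as a simple filter. The sketch assumes that a commit is kept only if it modifies at least one source file and that source files are Java files; the actual basic restriction is defined in Section 5.2 and may differ. The commit ids and file lists below are hypothetical.

```python
def modifies_source(changed_files, source_suffixes=(".java",)):
    """True if at least one changed file looks like a source file."""
    return any(path.endswith(source_suffixes) for path in changed_files)

# Hypothetical linked commits and their changed files.
linked_commits = {
    "aaa1111": ["CHANGES.txt"],
    "bbb2222": ["CHANGES.txt", "src/main/java/Example.java"],
}

kept = [cid for cid, files in linked_commits.items() if modifies_source(files)]
print(kept)
```

A commit that only touches CHANGES.txt is dropped, while the commit that changes source code stays in the dataset.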

Answer to This Paper: Which ILAs Should We Use?
In summary, researchers and practitioners need to select the ILAs according to their particular purpose. If researchers and practitioners want to evaluate the defect prediction models in the low-quality dataset scenario, we recommend using the best ILA in terms of the absolute differences with the ground-truth defect prediction performance: the combination of the TS, PH, RF, and SVM approaches. This is because, in the low-quality dataset scenarios, this ILA can reduce the absolute differences of defect prediction performance from that in the high-quality dataset scenario with the KE approach (RQ2).
If researchers and practitioners investigate the defect-fixing commits, we recommend using the ILA that achieves the highest precision: the RF approach because researchers and practitioners do not need to worry about false-positive defect-fixing commits (RQ1).
If researchers and practitioners want to identify almost all missing defect-fixing commits, we recommend using the TF approach or the TS approach because these ILAs achieve the highest TP rate in different projects (RQ1) while being very simple approaches. However, their precision is lower than that of the other accurate ILAs. Researchers and practitioners need to consider false-positive defect-fixing commits.
Finally, in defect prediction, we recommend using the basic restriction to exclude noise of defect-fixing commits. In particular, considering the dates of the commit and the issue report is a simple but effective approach to detect defect-fixing commits.

Construct Validity
The reliability of the issue reports in the studied issue-tracking system (i.e., JIRA) is a threat in this study. Ramler and Himmelbauer (2013) described this challenge. A prior study reported that 39% of files that are labeled as defective are not defective on average. Bachmann et al. (2010) reported that some defects are only reported on the mailing list. In addition, such defects are very important because the core developers in the Apache projects use the mailing list. Future studies are necessary to investigate the quality of issue-tracking systems to improve the reliability of our findings.
Defect-fixing commits could include addition/modification/deletion that is not related to defect fixing. For example, Mills et al. (2018) reported that around 63.1% (848/1,344) of modified files in defect-fixing commits are not related to defect fixing. If we removed such files from the defect-fixing commits, this might result in different defect prediction performance for each ILA. Future studies are necessary to investigate whether our results are consistent with removing such files.
We used the results of the keyword extraction approach as our ground-truth data. The keyword extraction approach uses a regular expression to extract issue ids from commit messages. Unfortunately, this process may induce false-positive/negative defect-fixing commits. Such commits would affect our experimental results, though we manually inspected the accuracy of ground truth data and found the accuracy is high.
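A keyword extraction step of this kind can be sketched with a regular expression. The pattern below is our assumption and is not the expression used in the study; it looks for the JIRA project key followed by a number. The commit message is hypothetical.

```python
import re

def extract_issue_ids(message, project_key):
    """Return issue ids such as 'TEZ-8' found in a commit message."""
    pattern = re.compile(rf"\b{re.escape(project_key)}-\d+\b")
    return pattern.findall(message)

# Hypothetical commit message.
message = "TEZ-1594. Handle UTF-8 paths correctly (TEZ-8)"
ids = extract_issue_ids(message, "TEZ")
print(ids)
```

Restricting the pattern to the project key avoids matching tokens such as UTF-8, but it cannot tell whether a mentioned issue is actually fixed by the commit, which is one source of the false-positive links discussed above.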
To execute the natural language text similarity approach, we removed the issue ids from commit messages on the missing defect-fixing commits to make our experimental setting closer to a practical situation. However, commit messages also frequently include issue report titles because of their commit rule (Apache Software Foundation 2020). This JIRA title may make our experimental setting artificial and unfair. We kept the JIRA title because we assume that developers forget to add issue ids only.
In this paper, we mainly focus on the ILAs while defect prediction includes several factors such as the process of detecting defect-inducing commits. Hence, the results of our study are restricted by our experimental setting in defect prediction. Future studies are necessary to investigate the relationship between ILAs and the other factors in defect prediction.

External Validity
To generalize our results, we applied our experiments to five open-source software projects on the Apache Software Foundation. These studied projects contain two domains. However, all the projects are Apache projects and have high-quality commit messages. Future studies are necessary to investigate whether our results generalize to other projects.
We carefully chose our studied ILAs from prior studies that were collected by our systematic literature study with the snowballing approach. However, we decided not to use a few ILAs and we may have overlooked a few prior studies that proposed ILAs. Future studies are necessary to investigate such ILAs with software projects in which we can use all necessary information.

Internal Validity
We summarized our validation technique and ILAs as Python packages (Kondo 2021b, c). In addition, we made the replication package (Kondo 2021a). Researchers and practitioners may easily repeat our experiments.
We manually investigated the correctness of the identified links in Section 7.1. This investigation was conducted by two of the authors, and we double-checked the result. However, the result may include mistakes. In addition, we manually inspected the accuracy of the ground truth data. This manual inspection has been carefully done. However, it may include mistakes.
To remove the merge commits, we used the --no-merges option of the git log command (Git community 2020). Hence, the accuracy of identifying the merge commits depends on this option.
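For illustration, the option can be used from a script as follows. This is a sketch, not the code of the replication package: the helper names are ours, and the output format (one full commit hash per line) is chosen only to keep the example short.

```python
import subprocess

def non_merge_log_command(repo_path):
    """Build a git log command that skips merge commits."""
    # -C runs git as if it were started in repo_path; %H prints the full hash.
    return ["git", "-C", repo_path, "log", "--no-merges", "--pretty=format:%H"]

def list_non_merge_commits(repo_path):
    """Return the hashes of all non-merge commits reachable from HEAD."""
    result = subprocess.run(
        non_merge_log_command(repo_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

command = non_merge_log_command("path/to/repository")
print(" ".join(command))
```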

Conclusion
The impact of false-positive/negative defect-inducing commits on the defect prediction performance is important when evaluating defect prediction models. To reduce the number of false-positive/negative defect-inducing commits, many prior studies have proposed ILAs to detect defect-fixing commits accurately (Fischer et al. 2003a; Śliwerski et al. 2005; Bernstein 2009a, 2010; Bird et al. 2010; Sureka et al. 2011; Wu et al. 2011; Nguyen et al. 2012; Bissyandé et al. 2013; Le et al. 2015; Schermann et al. 2015; Sun et al. 2016, 2017a; Xie et al. 2019; Tu and Menzies 2020). However, challenges still exist, such as dataset inconsistency and small comparisons. Our work is the first large-scale study to evaluate the ILAs on the same experimental settings. In addition, we summarized the prior ILAs as our related work through our systematic literature study.
In the following, we summarize the main recommendations. We recommend selecting ILAs according to the particular purpose.
The proportion of modified source files is the proportion of modified Java files in a commit. The number of modified source files is the number of modified Java files in a commit. The time difference is the time difference of a commit date and an issue resolved date in seconds. The time difference type is a binary value; if an issue resolved date is after a commit date, it would be one; otherwise, it would be zero. The cosine similarity of text is the cosine similarity values that are computed on the natural language text similarity approach.
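These features can be computed for one candidate commit-issue pair as sketched below. The function and key names are ours. The sketch assumes that the time difference is an absolute value (its direction being captured by the time difference type) and takes the cosine similarity as an already computed input.

```python
from datetime import datetime

def link_features(changed_files, commit_date, issue_resolved_date, cosine_similarity):
    """Build the feature values for one candidate commit-issue pair."""
    n_source = sum(1 for path in changed_files if path.endswith(".java"))
    proportion = n_source / len(changed_files) if changed_files else 0.0
    delta = (issue_resolved_date - commit_date).total_seconds()
    return {
        "proportion_modified_source_files": proportion,
        "number_modified_source_files": n_source,
        "time_difference": abs(delta),  # seconds; absolute value is our assumption
        "time_difference_type": 1 if issue_resolved_date > commit_date else 0,
        "cosine_similarity": cosine_similarity,
    }

# Hypothetical candidate pair.
features = link_features(
    changed_files=["src/A.java", "src/B.java", "CHANGES.txt", "pom.xml"],
    commit_date=datetime(2020, 1, 1, 12, 0, 0),
    issue_resolved_date=datetime(2020, 1, 1, 12, 10, 0),
    cosine_similarity=0.42,
)
print(features)
```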

A.10 Machine Learning (ML)
The PU learning approach used a PU model to predict links. However, we can also apply other machine learning models to this task. We used two machine learning models instead of a PU model: a random forest model (scikit-learn developers 2020a) and a support vector machine model (scikit-learn developers 2020c). The procedure is the same as the PU learning approach. The only difference is whether a PU model or one of these machine learning models is used. We call this the machine learning approach.
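As a rough illustration, the two models can be trained on link features with scikit-learn as follows. This is a sketch under our own assumptions: the two features, the toy data, the hyperparameters, and the scaling step before the support vector machine are ours and are not taken from the study.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical training pairs: [cosine similarity, time difference type].
X_train = [[0.90, 1], [0.80, 1], [0.85, 1], [0.10, 0], [0.20, 0], [0.15, 0]]
y_train = [1, 1, 1, 0, 0, 0]  # 1: true link, 0: no link

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    # Scaling before the SVM is our addition, not a step described in the paper.
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Hypothetical unlabeled candidate pairs to classify.
candidates = [[0.95, 1], [0.05, 0]]

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(candidates).tolist()
print(predictions)
```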