1 Introduction

The volume of defects in software, and the manner in which they are introduced into the code, typically exceed the capability and resources of most development teams (Tantithamthavorn et al. 2016a; Nam et al. 2017; Kamei et al. 2012; Ghotra et al. 2017; Kondo et al. 2019). Defect prediction models aim to identify software artifacts that are likely to be defective (Ohlsson and Alberg 1996; Menzies et al. 2010; Turhan et al. 2009; Weyuker et al. 2010; Ostrand and Weyuker 2004; Ostrand et al. 2005). The main purpose of defect prediction is to reduce the cost of testing, analysis, and code reviews by prioritizing developers' effort on specific artifacts such as commits, methods, or classes.

In this work, we study defect prediction that aims to support the testing phase, i.e., predictions that are performed after the coding phase and before the deployment phase. During the testing phase, developers work to identify and eventually fix defects in the code before these defects can reach the deployment phase and, hence, become production defects. The focus of our paper is not on Just in Time (JIT) defect prediction, which is usually performed during the development phase (Herbold 2019, 2020; McIntosh and Kamei 2018; Pascarella et al. 2019; Kondo et al. 2020; Huang et al. 2019; Fan et al. 2019; Tu et al. 2020; Rodriguezperez et al. 2020).

In the defect prediction area, researchers have proposed the use of different models to predict defects, leveraging, for example, product metrics (Basili et al. 1996; Gyimóthy et al. 2005; Khoshgoftaar et al. 1996; Nagappan and Ball 2005; Hassan 2009), process metrics (Moser et al. 2008), knowledge of where previous defects occurred (Ostrand et al. 2005; Kim et al. 2007), information about change-inducing fixes (Kim et al. 2008; Fukushima et al. 2014) and, recently, deep learning techniques to automatically engineer features from source code elements (Wang et al. 2016). While many studies investigated how to predict the defectiveness of commits (Herbold et al. 2019, 2020; McIntosh and Kamei 2018; Pascarella et al. 2019, 2020; Kondo et al. 2020; Huang et al. 2019; Fan et al. 2019; Tu et al. 2020; Rodriguezperez et al. 2020; Giger et al. 2012) or classes (Kamei et al. 2016; Tantithamthavorn et al. 2016b, 2019, 2020; Bennin et al. 2018, 2019; Herbold et al. 2017, 2018, 2019; Hosseini et al. 2019; Yan et al. 2017; Liu et al. 2017; Chi et al. 2017; Jing et al. 2017; Di Nucci et al. 2018; Palomba et al. 2019; Song et al. 2019; Zhang et al. 2016, 2017; Lee et al. 2016; Yu et al. 2019a; Peters et al. 2019; Qu et al. 2021; Shepperd et al. 2018; Amasaki 2020; Bangash et al. 2020; Kondo et al. 2019; Morasca and Lavazza 2020; Mori and Uchihira 2019; Tian et al. 2015; Jiarpakdee et al. 2020; Chen et al.; Dalla Palma et al. 2021) independently, no study other than Pascarella et al. (2019) investigated how these predictions differ or can benefit each other. Specifically, every commit touches one or more methods in one or more classes; hence, the likelihood that a method, or a class, is defective depends on the likelihood that the commits touching it are defective. Thus, it is reasonable to conjecture that method-defectiveness-predictions (MDP) and class-defectiveness-predictions (CDP) may become more accurate if we leverage commit-defectiveness-predictions (JIT). Furthermore, before code is shipped to production, developers may decide whether to test methods or classes. To the best of our knowledge, no study has investigated whether it is more useful for developers to rank methods or classes. Pascarella et al. (2019) report a fine-grained just-in-time defect prediction technique that seems to enhance the performance of Kamei et al. (2012). We took inspiration from Pascarella et al. (2019) in that they leverage CDP to refine JIT; we, instead, investigate the use of JIT to improve the accuracy of MDP and CDP. Moreover, compared to Pascarella et al. (2019), we use a more advanced SZZ algorithm (RA-SZZ), and we assume a different scenario: the end of a release versus after a commit.

To better explain the purpose of our paper, we elaborate on an example. Let us consider a defect of the OPENJPA project that is documented and tracked via the OPENJPA-2414 ticket, which was opened on 24/07/13 and closed on 29/10/13. The commit that fixes this defect has the ID ee6f4acc3ff9ac43ea4e98579b478e55767aef24 and was committed on 23/08/2013 in release 2.3.0. Figure 1 shows the log of this commit, which reports the defect ID. The commit containing the defect ID is defined as the fix-commit of the defect. Figure 1 also shows a fragment of the code statements touched (in green). The commit touched 1,494 LOC and removed 8 LOC over the following five classes:

  • openjpa-jdbc/src/main/java/org/apache/openjpa/jdbc/kernel/FinderCacheImpl.java

  • openjpa-kernel/src/main/java/org/apache/openjpa/kernel/DelegatingFetchConfiguration.java

  • openjpa-kernel/src/main/java/org/apache/openjpa/kernel/FetchConfiguration.java

  • openjpa-kernel/src/main/java/org/apache/openjpa/kernel/FetchConfigurationImpl.java

  • openjpa-persistence-jdbc/src/test/java/org/apache/openjpa/persistence/fetchgroups/TestFetchGroups.java

Fig. 1 Example of fix commit

By applying RA-SZZ (Neto et al. 2019) to commit ee6f4acc3ff9ac43ea4e98579b478e55767aef24, we retrieve a list of commits that may have caused the defect (i.e., the defect-inducing commits). Specifically, RA-SZZ retrieved two defect-inducing commits, which touched 773 LOC and 139 classes. These two defect-inducing commits have the IDs 1fede626e2cad16f7bb4d77dd9fc3270a8b6b331 and 979d2340e93eaaa9f273a100dbe78e42ea9ed400, respectively (both in release 0.9.0). However, not all statements in these two defect-inducing commits are defective. Specifically, only the class openjpa-kernel/src/main/java/org/apache/openjpa/kernel/FetchConfigurationImpl.java is deemed defective among the classes retrieved by RA-SZZ; i.e., the two defect-inducing commits touched the same class. Consequently, the class openjpa-kernel/src/main/java/org/apache/openjpa/kernel/FetchConfigurationImpl.java is labeled as defective from release 0.9.0 to release 2.3.0 (excluded). Since we have two defect-inducing commits that may have contributed to the defectiveness of class openjpa-kernel/src/main/java/org/apache/openjpa/kernel/FetchConfigurationImpl.java, we believe that the information embedded in these commits can contribute to both method-defectiveness-predictions and class-defectiveness-predictions.

The contribution of our paper is fourfold:

  1. We compare methods and classes in terms of defectiveness.

  2. We compare the accuracy of MDP versus CDP in terms of effort-aware metrics, thus supporting the decision of whether to perform MDP or CDP.

  3. We propose and evaluate the increase in MDP accuracy by leveraging JIT information.

  4. We propose and evaluate the increase in CDP accuracy by leveraging JIT information.

Our analysis features two types of accuracy metrics (threshold-independent and effort-aware), nine machine learning defect prediction classifiers, and 1,860 defects related to 35 releases of nine open-source projects from the Apache ecosystem. MDP, CDP, and JIT predictions are performed using state-of-the-art approaches and rely on a total of 15 features from commits, 15 features from methods, and 16 features from classes. Our results rely on a ground truth featuring a total of 269,004 data points.

Our results reveal that MDP is substantially more accurate than CDP (+ 5% AUC and + 62% PofB10). These results indicate that it is more rewarding to predict and rank defective methods instead of defective classes. Moreover, leveraging JIT information by using a simple median approach increases the accuracy of MDP by an average of 17% AUC and 46% PofB10, and increases the accuracy of CDP by an average of 31% AUC and 38% PofB20.

The remainder of this paper is structured as follows: Section 2 reports the design of our research questions. Section 3 describes the results of our study. Section 4 provides a discussion of our results. Section 5 reports on the threats to the study validity. Section 6 discusses the related literature, focusing in particular on MDP and CDP. Finally, Section 7 concludes the paper and outlines directions for future work.

2 Study Design

In this section, we explain the design of our study, which includes the procedure to choose our subject projects, our research questions (RQs), and our approaches to answering them.

2.1 Subject Projects

We describe in the following how we chose the nine projects (i.e., datasets) we used in this study.

  1. We first retrieved the JIRA and Git URLs of all existing Apache projects. We focused on Apache projects instead of GitHub projects because Apache projects have a higher quality of defect annotation and are unlikely to be toy datasets (Munaiah et al. 2017).

  2. We filtered out projects that are not tracked in JIRA or not versioned in Git.

  3. We filtered out projects that are small and therefore not representative of a medium-size industrial project. Specifically, we filtered out projects having: less than 20 releases, less than 20 commits per release, less than 20 linked and fixed defects per release, or less than 50% of commits in the Java programming language. We define a Java commit as a commit touching at least one Java file.

  4. We filtered out two projects, Openmeetings and Camel, because, although hosted on GitHub, their Git history did not cover a significant portion of their lifetime (which was, instead, covered in JIRA).

  5. We selected the nine projects having the highest linkage rate, i.e., the highest proportion of defects with linked commits. A defect is considered linked if it can be associated with at least one commit in the source code commit log. We selected nine projects due to the manual validation effort required to perform the study.

Table 1 reports the details of the nine selected projects in terms of commits, percent of defective commits, methods, percent of defective methods, classes, percent of defective classes, defects, and releases.

Table 1 Details of the datasets used in terms of commits, percent of defective commits, methods, percent of defective methods, classes, percent of defective classes, defects, linkage, releases, and manually validated commits

In the remainder of this section, we report the design of each research question.

2.2 RQ1: Do Methods and Classes vary in Defectiveness?

Many studies have evaluated techniques to perform MDP and CDP. However, there is a lack of studies that compare defectiveness across granularity levels (e.g., methods vs. classes). Studying the defectiveness of methods and classes on the same releases can help us understand whether defectiveness at a certain granularity is harder or easier to predict than at other granularities. Specifically, although we know that, by definition, defective classes are more frequent than defective methods, we do not know whether defective classes are statistically more frequent than defective methods. In this study, we compare methods and classes in terms of defectiveness.

In this research question, we test the following null hypotheses:

  • H10: the number of defective methods is equivalent to the number of defective classes.

  • H20: the proportion of defective methods is equivalent to the proportion of defective classes.

Independent Variables

The independent variable is the type of defective entity, either methods or classes.

Dependent Variables

The dependent variables are the number and the proportion of defective entities. The proportion of defective entities of a type is computed as the number of defective entities of that type divided by the total number of entities of that type.
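In compact form (the symbols below are introduced here for illustration and do not appear in the original instrumentation), for an entity type t (method or class) in a given release:

```latex
\mathit{Proportion}_t \;=\; \frac{D_t}{N_t}, \qquad t \in \{\text{method},\ \text{class}\},
```

where D_t is the number of defective entities of type t and N_t is the total number of entities of that type.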

An Extended RA-SZZ

In order to label commits, methods, and classes as defective, we performed the following steps, as detailed in Fig. 2:

  1. Defect identification: For each project, we find defect tickets on JIRA where type == "defect" AND (status == "Closed" OR status == "Resolved") AND Resolution == "Fixed".

  2. Labelling statements: We run RA-SZZ, which provides us with the defect-introducing statements.

  3. Labelling commits: From the defect-introducing statements we retrieved their defect-introducing commits and, hence, the defect-introducing releases.

  4. Labelling methods: To identify which methods have been impacted by the defect-introducing statements (as retrieved by RA-SZZ), we developed a tool that uses the javaparser library to parse the Java files implicated by the defect-introducing statements. Once the Java files are parsed, the tool knows the start and end statements of each method within the Java files. Then, our tool identifies the methods that contain the previously retrieved defect-introducing statements (a sketch of this mapping appears after this list). In doing so, we are able to identify the defective methods of a set of Java classes. Thus, in this study, we use an extended version of RA-SZZ, which provides a ground truth at the method granularity (in addition to the code statement and commit granularities).

  5. Labelling classes: We labelled classes as defective if they contained at least one defective method.
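To illustrate the mapping of step 4, the following minimal Python sketch shows how defect-inducing statements can be attributed to methods once a parser has provided the start and end lines of each method. The data structures and names are illustrative; the actual tool is Java-based and relies on the javaparser library.

```python
# Hedged sketch of step 4: attributing defect-inducing statements to methods.
# `method_ranges` stands in for the output of the javaparser-based tool
# (start/end line of each method per Java file); all names are illustrative.

def label_defective_methods(defect_inducing_statements, method_ranges):
    """
    defect_inducing_statements: iterable of (file_path, line_number) pairs
        produced by RA-SZZ.
    method_ranges: dict mapping file_path -> list of
        (method_signature, start_line, end_line) tuples from the parser.
    Returns the set of (file_path, method_signature) pairs labeled defective.
    """
    defective_methods = set()
    for file_path, line in defect_inducing_statements:
        for signature, start, end in method_ranges.get(file_path, []):
            if start <= line <= end:  # the statement falls inside the method body
                defective_methods.add((file_path, signature))
    return defective_methods

# Step 5 then labels a class as defective if it contains at least one defective method.
```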

Fig. 2 Steps of RQ1

Note that, in this study, an element at any granularity (i.e., a class, a method, or a statement) is labeled as defective in all releases in which it contains a defective statement.

Manual Validation

We manually validated the defect-inducing commits used in this work. We used a 95% confidence level and a 20% margin of error; the resulting number of manually analyzed commits per project is reported in column 11 of Table 1. The third and fourth authors performed the manual validation of defect-inducing commits independently; a discussion resolved the disagreements between the two authors. The two authors agreed in 95% of the cases (Cohen's Kappa = 0.69), and the tool agreed with the two authors in 93% of the cases. The replication package reports the following information for each validated commit: bug ID, defect-fixing commit SHA, defect-inducing commit SHA, the label of the third author, the label of the fourth author, and the tool label.
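The paper does not report the exact sample-size formula; the sketch below assumes Cochran's formula with finite-population correction, which yields sample sizes consistent with a 95% confidence level and a 20% margin of error. The population value is illustrative.

```python
# Hedged sketch: number of defect-inducing commits to validate per project,
# assuming Cochran's formula with finite-population correction (an assumption;
# the paper does not state which formula was used).
import math

def sample_size(population, z=1.96, margin=0.20, p=0.5):
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)          # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))   # finite-population correction

print(sample_size(population=500))  # ~23 commits for a hypothetical project with 500 defect-inducing commits
```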

Hypothesis Testing

To test our hypotheses, we used the paired Wilcoxon signed-rank test (Rey and Neuhäuser 2011), a non-parametric test (i.e., one making no assumptions regarding the underlying distribution) that checks whether two paired distributions are significantly different. We chose a non-parametric test because our metrics (e.g., the number and proportion of defective entities) do not follow a normal distribution (as we noted when performing Shapiro–Wilk tests (Shapiro and Wilk 1965)). Therefore, our approach complies with the suggestion to avoid using ScottKnottESD in case of non-normal distributions (Herbold 2017). We use the standard alpha value of α = 0.05. To account for the chance of errors due to multiple comparisons, in all research questions we perform a Holm-Bonferroni correction of our p-values (Holm 1979). We performed effect size analysis by using Cohen's d (Herbold 2017), which is interpreted as in Table 2.
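The following minimal Python sketch illustrates the statistical procedure (paired Wilcoxon signed-rank test, Holm-Bonferroni correction, and Cohen's d) using scipy and statsmodels; the data values and variable names are purely illustrative.

```python
# Hedged sketch of the hypothesis-testing procedure; the numbers are illustrative.
import numpy as np
from scipy.stats import shapiro, wilcoxon
from statsmodels.stats.multitest import multipletests

def cohens_d(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

# One paired observation per project (e.g., number of defective methods vs. classes).
defective_methods = [120, 80, 45, 200, 33, 60, 90, 150, 70]
defective_classes = [100, 95, 60, 180, 50, 75, 85, 160, 90]

print(shapiro(defective_methods))                          # normality check motivating the non-parametric test
stat, p = wilcoxon(defective_methods, defective_classes)   # paired Wilcoxon signed-rank test
print(p, cohens_d(defective_methods, defective_classes))

# Holm-Bonferroni correction across the p-values of all tested hypotheses.
reject, p_adjusted, _, _ = multipletests([p, 0.01, 0.04], alpha=0.05, method="holm")
```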

Table 2 Interpretation of Cohen's d

2.3 RQ2: Does Leveraging JIT Information Increase the Accuracy of MDP?

Every commit touches one or more methods; hence, the likelihood that a method is defective depends on the likelihood that the touched commits are defective. While many studies investigated techniques for MDP, no previous study investigated whether MDP may become more accurate if JIT information is used. In this paper, we propose and evaluate techniques to increase MDP accuracy by leveraging JIT.

In this research question, we test the following null hypothesis:

  • H30: leveraging JIT does not improve the accuracy of MDP.

Independent Variables

The independent variable is MDP with JIT information versus without JIT information. Specifically, we investigate the following MDP approaches:

  • Single (referred to as Direct in the figures and in the measurement procedure): It uses a state-of-the-art approach for MDP (Giger et al. 2012; Pascarella et al. 2020). Specifically, we used the following set of features as input to a machine learning classifier:

    • size: LOC of a method.

    • methodHistories: number of times a method was changed.

    • authors: number of distinct authors that changed a method.

    • stmtAdded: sum of all source code statements added to a method body over all method histories.

    • maxStmtAdded: maximum number of source code statements added to a method body throughout the method’s change history.

    • avgStmtAdded: average number of source code statements added to a method body per change to the method.

    • stmtDeleted: sum of all source code statements deleted from a method body over all method histories.

    • maxStmtDeleted: maximum number of source code statements deleted from a method body for all method histories.

    • avgStmtDeleted: Average number of source code statements deleted from a method body per method history.

    • churn: sum of stmtAdded plus stmtDeleted over all method histories.

    • maxChurn: maximum churn for all method histories.

    • avgChurn: average churn per method history.

    • cond: number of condition expression changes in a method body over all revisions.

    • elseAdded: number of added else-parts in a method body over all revisions.

    • elseDeleted: number of deleted else-parts from a method body over all revisions.

  • Combined: It takes the median to combine the previously mentioned Single approach with two other scores:

    • SumC is the sum of defectiveness probabilities of the commits touching the method. The rationale is that the probability that a method is defective is related to the sum of probabilities of the commits touching the method.

    • MaxC is the max of defectiveness probabilities of the commits touching the method. The rationale is that the probability that a method is defective is related to the max of probabilities of the commits touching the method.

    The rationale of the Combined approach is that a defective commit induces defective methods (i.e., the methods that are touched by the commit). We use the median as the combination mechanism because it is a simple way to combine several probabilities (see the sketch after the feature list below).

    We use a standard JIT approach (Kamei et al. 2012) to obtain probabilities of defectiveness of the commits that touch the methods. Specifically, we used the following set of features as input to a machine learning classifier:

    • Size: lines of code modified.

    • Number of modified subsystems (NS): changes modifying many subsystems are more likely to be defect-prone.

    • Number of modified directories (ND): changes that modify many directories are more likely to be defect-prone.

    • Number of modified files (NF): changes touching many files are more likely to be defect-prone.

    • Distribution of modified code across each file (Entropy): changes with high entropy are more likely to be defect-prone, because a developer will have to recall and track higher numbers of scattered changes across each file.

    • Lines of code added (LA): the more lines of code added the more likely a defect is introduced.

    • Lines of code deleted (LD): the more lines of code deleted the higher the chance of a defect to occur.

    • Lines of code in a file before the change (LT): the larger a file the more likely a change might introduce a defect.

    • Whether or not the change is a defect fix (FIX): fixing a defect means that an error was made in an earlier implementation, therefore it may indicate an area where errors are more likely.

    • Number of developers that changed the modified files (NDEV): the larger the NDEV, the more likely a defect is introduced because files revised by many developers often contain different thoughts and coding styles.

    • Average time interval between the last and the current change (AGE): the lower the AGE, the more likely a defect will be introduced.

    • Number of unique changes to the modified files (NUC): the larger the NUC, the more likely a defect is introduced, because a developer will have to recall and track many previous changes.

    • Developer experience (EXP): more experienced developers are less likely to introduce a defect.

    • Recent developer experience (REXP): developers that have often modified the files in recent months are less likely to introduce a defect because they will be more familiar with the recent developments in the system.

    • Developer experience on a subsystem (SEXP): developers that are familiar with the subsystems modified by a change are less likely to introduce a defect.
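As a concrete illustration of the Combined approach, the following Python sketch computes SumC, MaxC, and their median combination with the Direct score for a single method; the function and variable names are illustrative.

```python
# Hedged sketch of the Combined approach for MDP: combine the Direct
# (method-level) probability with SumC and MaxC derived from the JIT
# (commit-level) probabilities of the commits touching the method.
import statistics

def combined_score(direct_prob, commit_probs):
    sum_c = sum(commit_probs)                # SumC: sum of commit probabilities
    max_c = max(commit_probs, default=0.0)   # MaxC: maximum commit probability
    return statistics.median([direct_prob, sum_c, max_c])  # Combined = median of the three scores

# Example: a method touched by two commits predicted as defective with 0.7 and 0.2.
print(combined_score(direct_prob=0.4, commit_probs=[0.7, 0.2]))  # -> 0.7
```

Note that SumC is not bounded by 1; the median is therefore used as a ranking score rather than a calibrated probability.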

Dependent Variables

The main dependent variable is the accuracy of MDP. As performance indicators of defect prediction we used the following metrics:

  • AUC: the Area Under the Receiver Operating Characteristic Curve (Powers 2007) is the area under the curve of the true positive rate versus the false positive rate, which is defined by setting multiple thresholds. A positive instance is a defective entity, whereas a negative instance is a defect-free entity. AUC has the advantage of being threshold-independent and, therefore, it is recommended for evaluating defect prediction techniques (Lessmann et al. 2008). We decided to avoid metrics such as Precision, Recall, and F1, since they are threshold-dependent.

  • PofBx: as the effort-aware metric, we used PofBx (Chen et al. 2017; Wang et al. 2020; Xia et al. 2016; Tu et al. 2020). PofBx is defined as the proportion of defective entities identified by analyzing the first x% of the code base (a sketch of the computation appears after this list). For instance, a PofB10 of 30% signifies that 30% of defective entities have been found by analyzing 10% of the code base. We explored PofBx with x in the range [10, 50]. While previous studies only focused on x = 20, we investigated a wider range to obtain more informative results. Note that PofB differs from Popt (Kamei et al. 2010, 2013; Mende and Koschke 2010) in two aspects: normalization and the range of x. Regarding normalization, while Popt normalizes the value against a random approach, PofB does not perform such a normalization, which aligns with our goals for two reasons:

    1. The comparison against a random approach is already provided by AUC, since an AUC higher than 0.5 indicates that a classifier performed better than a random classifier.

    2. In our study, we are interested in comparing classifiers that rank entities at different levels of granularity. Specifically, since methods and classes have a different defectiveness proportion (see Table 1), a random ranking would perform differently across methods and classes. Regarding the value of x, Popt represents an average over the complete spectrum of x, but we decided to neglect high values of x, as we believe that high values of x would be unrealistic for practitioners when indicating which code should be inspected during testing. Specifically, the lower the amount of code tested, the higher the impact of the ranking approach; i.e., if 100% of the code needs to be inspected, the ranking approach is effectively useless. Thus, we envisioned a metric that expresses the return of investing a specific amount of time in testing x% of the code as suggested by the ranking from a classifier. For all these reasons, PofB is a better match to our needs than Popt.
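The following Python sketch shows one way to compute PofBx from a ranked list of entities; the data structure is illustrative and is not the paper's implementation.

```python
# Hedged sketch of PofBx: rank entities by predicted probability (descending),
# then measure the proportion of defective entities found within the first x% of LOC.

def pofb(entities, x):
    """
    entities: list of (predicted_probability, loc, is_defective) tuples.
    x: percentage of the code base to inspect (e.g., 10 for PofB10).
    Returns the percentage of defective entities found within that budget.
    """
    ranked = sorted(entities, key=lambda e: e[0], reverse=True)
    total_loc = sum(loc for _, loc, _ in ranked)
    total_defective = sum(1 for _, _, defective in ranked if defective)
    budget = total_loc * x / 100.0

    inspected_loc, found = 0, 0
    for _, loc, defective in ranked:
        if inspected_loc + loc > budget:
            break
        inspected_loc += loc
        found += int(defective)
    return 100.0 * found / total_defective if total_defective else 0.0
```

For instance, a return value of 30.0 with x = 10 corresponds to a PofB10 of 30%.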

As an additional dependent variable, we measured the proportion of times features related to JIT are chosen to predict MDP. We performed this feature selection analysis to complement the accuracy analysis, since it is important to know whether our approach of using the median to leverage JIT information is beneficial. For example, if JIT information is selected among the features for MDP but does not improve the accuracy of MDP, this indicates that using the median may not be the right approach to leverage JIT information.

Measurement Procedure

In this section, we describe the steps we performed to compute the accuracy metrics related to RQ2. As described in Fig. 3, for each project, we:

  1. Compute the above-mentioned features for each commit.

  2. Label commits as defective or not by using RA-SZZ (as described in RQ1).

  3. Create a commit dataset by combining features and defectiveness labels.

  4. Perform preprocessing (a sketch of steps 4 to 6 appears after this list):

    • Normalization: we normalize the data of all features with log10, as performed in many similar studies (Jiang et al. 2008; Tantithamthavorn et al. 2019).

    • Feature Selection: we filter the independent variables described above by using correlation-based feature subset selection (Hall 1998; Ghotra et al. 2017; Kondo et al. 2019). The approach evaluates the worth of a subset of features by considering the individual predictive capability of each feature, as well as the degree of redundancy between different features. It searches the space of feature subsets by greedy hill-climbing augmented with a backtracking facility, starting with an empty set of features and performing a forward search that considers all possible single-feature additions and deletions at a given point.

    • Balancing: we apply SMOTE (Chawla et al. 2002; Agrawal and Menzies 2018) so that each dataset is perfectly balanced. As suggested (Witten et al. 2011), we apply feature selection and balancing to the training set only.

  5. Create the Train and Test commit datasets by splitting the preprocessed dataset into about 66% of releases as the training set and about 33% of releases as the testing set, while preserving the order of data (Falessi et al. 2020). We chose this split proportion because it is suggested in the ML literature (Witten et al. 2011) and because it resembles the split proportion of the bootstrap approach (Falessi et al. 2020). Note that, since the split is at the level of the release, and since different releases have different numbers of commits, methods, and classes, the specific proportion of the split into training set and testing set varies across datasets and types of entities. Commits are assigned to the different releases according to their timestamp. The status of methods and classes in a release takes into account the commits of that release.

  6. Compute the predicted probability of defectiveness for each commit by using each of the nine classifiers.

  7. Compute the above-mentioned features for each method.

  8. Label methods as defective or not by using RA-SZZ (as described in RQ1).

  9. Create a dataset of methods by combining features and defectiveness labels.

  10. Perform preprocessing.

  11. Create the Train and Test method datasets by splitting the preprocessed dataset into 66% as the training set and 33% as the testing set, while preserving the order of data (Falessi et al. 2020).

  12. Compute the Direct predicted probability of defectiveness for each method by using each of the nine classifiers.

  13. Compute the accuracy metrics of Direct.

  14. Find the commits related to each method. For the set of commits related to a method, compute MaxC and SumC given the commit predictions performed above.

  15. Perform feature selection, with the same technique described in step 4, over the following features: Direct, MaxC, and SumC.

  16. Compute Combined as the median of Direct, MaxC, and SumC.

  17. Compute the accuracy metrics of Combined.
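The following Python sketch approximates steps 4 to 6 (preprocessing and commit-level prediction). Column and function names are illustrative, CFS is replaced by a simple correlation-based filter, and a single classifier is shown in place of the nine used in the study; scikit-learn and imbalanced-learn are used as stand-ins for the actual tooling.

```python
# Hedged sketch of steps 4-6: preprocessing and JIT (commit-level) prediction.
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

def preprocess_and_predict(train_df, test_df, features, label="defective"):
    # Step 4a: log10 normalization of all features (shifted by 1 to handle zeros).
    X_train = np.log10(train_df[features] + 1)
    X_test = np.log10(test_df[features] + 1)
    y_train = train_df[label].astype(int)

    # Step 4b: feature selection. The study uses correlation-based feature subset
    # selection (CFS); here a simple correlation filter is used as a stand-in.
    corr = X_train.corrwith(y_train).abs()
    selected = corr[corr > 0.1].index.tolist() or features

    # Step 4c: balancing with SMOTE, applied to the training set only.
    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train[selected], y_train)

    # Step 6: predicted probability of defectiveness for each test commit
    # (one classifier shown; the study repeats this with nine classifiers).
    clf = RandomForestClassifier(random_state=42).fit(X_bal, y_bal)
    return clf.predict_proba(X_test[selected])[:, 1]
```

The same flow is then repeated on the method (and, in RQ3, class) datasets to obtain the Direct probabilities.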

Fig. 3 Measurement procedure of RQ2

In Fig. 3, the dataset is split twice and some actions are repeated twice because one flow is for the data about commits and the other is for the data about the entities we want to predict (i.e., methods or classes).

In this paper, we used the following set of classifiers, since they have been successfully adopted in a previous study (Falessi et al. 2020):

  • Random Forest: It generates a number of separate, randomized decision trees and provides as classification the mode of their individual classifications. It has proven to be highly accurate and robust against noise (Breiman 2001). However, it can be computationally expensive as it requires building several trees.

  • Logistic Regression: It estimates the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables. The estimation is performed through the logistic distribution function (Cessie and Houwelingen 1992).

  • Naïve Bayes: It uses the Bayes theorem, i.e., it assumes that the contribution of an individual feature towards deciding the probability of a particular class is independent of other features in that dataset instance (Mccallum and Nigam 2001).

  • HyperPipes: It simply constructs a hyper-rectangle for each label, recording the bounds of each numeric feature and the values that occur for nominal features. When the classifier is applied, the chosen label is the one whose hyper-rectangle most contains the instance (i.e., the one for which the highest number of the test instance's feature values fall within the corresponding bounds of the hyper-rectangle).

  • IBK: Also known as the k-nearest neighbors algorithm (k-NN), a non-parametric method. The classification is based on the majority vote of the neighbors, with the object being assigned to the class most common among its k nearest neighbors (Altman 1992).

  • IB1: It is a special case of IBK with K = 1, i.e., it uses the closest neighbor (Altman 1992).

  • J48: Builds decision trees from a set of training data. It extends the Iterative Dichotomiser 3 classifier (Quinlan 1986) by accounting for missing values, decision trees pruning, continuous feature value ranges and the derivation of rules.

  • VFI: Also known as voting feature intervals (Demiröz and Güvenir 1997). A set of feature intervals represents a concept on each feature dimension separately. Afterwards, each feature is used by distributing votes among classes. The predicted class is the class receiving the highest vote (Demiröz and Güvenir 1997).

  • Voted Perceptron: It creates a new perceptron every time an example is wrongly classified, initializing the weights vector with the final weights of the last perceptron. Each perceptron is also given another weight corresponding to how many examples it correctly classifies before wrongly classifying one; at the end, the output is a weighted vote over all perceptrons (Freund and Schapire 1999).

Hypothesis Testing

As in RQ1, we use the paired Wilcoxon signed-rank test (Rey and Neuhäuser 2011) to test our hypothesis, H30.

2.4 RQ3: Does Leveraging JIT Information increase the Accuracy of CDP?

Similar to RQ2, as no previous work investigated whether CDP may become more accurate by using JIT information, we propose and evaluate a technique to increase CDP accuracy by leveraging JIT information.

In this research question, we test the following null hypothesis:

  • H40: leveraging JIT does not improve the accuracy of CDP.

Independent Variables

The independent variable is the use of JIT information in CDP. As in RQ2, we investigate the following CDP approaches:

  • Single (referred to as Direct below): It uses a state-of-the-art approach for CDP (Falessi et al. 2021). Specifically, we used the following set of features as input to a machine learning classifier:

    • Size (LOC): lines of code.

    • LOC Touched: sum over revisions of LOC added and deleted

    • NR: number of revisions.

    • Nfix: number of defect fixes.

    • Nauth: number of authors.

    • LOC Added: sum over revisions of LOC added.

    • Max LOC Added: maximum over revisions of LOC added.

    • Average LOC Added: average LOC added per revision.

    • Churn: sum over revisions of added and deleted LOC.

    • Max Churn: maximum churn over revisions.

    • Average Churn: average churn over revisions.

    • Change Set Size: number of files committed together.

    • Max Change Set: maximum change set size over revisions.

    • Average Change Set: average change set size over revisions.

    • Age: age of release.

    • Weighted Age: age of release weighted by LOC touched.

  • Combined: Similar to RQ2, it takes the median to combine the previously described Direct approach with JIT information.

Dependent Variables

As in RQ2, the main dependent variable is the accuracy of CDP. Again, as an additional dependent variable, we measured the proportion of times features related to JIT are chosen to predict CDP.

Measurement Procedure

In this section, we describe the steps we performed to compute the accuracy metrics related to RQ3. We performed the exact same set of steps as in RQ2, with the only difference being that we used the above-mentioned features related to classes rather than features related to methods. Therefore, in RQ3, the median is computed among the Direct probability of a class being defective, MaxC, and SumC.

Hypothesis Testing

As in RQ1 and RQ2, we use the paired Wilcoxon signed-rank test (Rey and Neuhäuser 2011) to test our hypothesis, H40.

2.5 RQ4: Are we more Accurate in MDP or CDP?

When using defect predictions, developers may choose to inspect methods or classes during testing (i.e., before code is shipped to production). However, no previous study has investigated which prediction granularity (methods or classes) is more advantageous in terms of accuracy and effort. In this study, we compare the accuracy of MDP against CDP, also using effort-aware metrics.

In this research question, we test the following null hypothesis:

  • H50: MDP is as accurate as CDP.

Independent Variables

The independent variable is the granularity of the entity that is subject to the defectiveness prediction. The independent variable can have the following values: methods or classes.

Dependent Variables

The dependent variable is the accuracy of MDP and CDP as measured by the same performance metrics we used in RQ2 and RQ3.

Measurement Procedure

We used the data already used in RQ2 and RQ3 related to the Combined approach.

Hypothesis Testing

As in our previous three research questions, we use the paired Wilcoxon signed-rank test (Rey and Neuhäuser 2011) to test our hypothesis, H50.

2.6 Replication Package

Interested researchers can replicate the present study using the replication package available online. The replication package provides the scripts used to measure the data and the data itself. Both scripts and data are organized by research question.

3 Study Results

3.1 RQ1: Do Methods and Classes vary in Defectiveness?

Figure 4 reports the number of defective entities (x-axis) in specific projects (y-axis) across different granularities of entities (color). According to Fig. 4, no granularity of entity is consistently more defective than the others across all projects. Moreover, the number of defective entities varies across projects.

Fig. 4 Number of defective entities (x-axis) in specific projects (y-axis) across different granularities of entities (color)

Figure 5 reports the proportion of defective entities (x-axis) in specific datasets (y-axis) across different types of entities (color). According to Fig. 5, in all nine projects, the proportion of defective classes is higher than the proportion of defective methods. Therefore, it is more likely to find a defective class than a defective method by chance. Moreover, the proportion of defective entities varies across datasets.

Fig. 5 Proportion of defective entities (x-axis) in specific datasets (y-axis) across different types of entities (color)

Table 3 reports the statistical results (p-value) comparing the number of defective classes versus the number of defective methods. According to Table 3, there is a statistical difference between the number of defective methods and the number of defective classes; therefore, we can reject H10. Moreover, the effect size is large.

Table 3 Statistical result (p-value) and Cohen’s d effect size, comparing the number of defective methods and the number of defective classes in our projects

Table 4 reports the statistical results (p-value) comparing the proportion of defective methods against the proportion of defective classes. According to Table 4, methods are statistically less frequently defective than classes. Therefore, we can reject H20. Moreover, the effect size is medium.

Table 4 Statistical result (p-value) and Cohen’s d effect size, comparing the proportion of defective methods against the proportion of defective classes

3.2 RQ2: Does Leveraging JIT Information Increase the Accuracy of MDP?

Accuracy

Figure 6 reports the distribution of AUC values across classifiers (y-axis) achieved by our proposed approaches (i.e., Combined and Direct) for MDP in the subject projects (x-axis). According to Fig. 6, the Combined approach is more accurate than Direct in all projects except DIRSERVER. Moreover, the distribution of AUC values across classifiers is narrower for Combined than for Direct.

Fig. 6 Distributions across classifiers of AUC values (y-axis) achieved by Combined and Direct (colors) for MDP in our subject projects (x-axis)

Figure 7 reports the distribution, across classifiers, of PofB values (x-axis) achieved by the Combined and Direct approaches (colors) for MDP in our subject projects (quadrant). According to Fig. 7:

  • Combined is better than Direct in all PofBs in Groovy, MNG, NUTCH, OPENJPA, QPID and TIKA.

  • Combined is worse than Direct in all PofBs in DIRSERVER.

  • Similar to the previous results related to AUC, the distribution of values from Combined is substantially narrower than the distribution of values from Direct. This result indicates that the choice of classifier is not as important when using Combined.

Fig. 7 Distribution across classifiers of PofB values (x-axis) achieved by the Direct and Combined approaches (colors) for MDP in our studied projects (quadrant)

Figure 8 reports the mean relative gain in MDP obtained by leveraging JIT, across classifiers and projects. According to Fig. 8, leveraging JIT increases the accuracy of MDP by an average of 17% in AUC and 46% in PofB10. It is interesting to note that the relative gain is inversely correlated with the x of PofBx; this is because the margin for improvement, and hence the relative gain, is reduced when considering a larger portion of the code base.

Fig. 8 Mean relative gain in MDP by leveraging JIT across classifiers and datasets

Feature Selection

Figure 9 reports the percent (x-axis) of times, across the nine classifiers, that a given feature (color) has been chosen in a given project (y-axis). According to Fig. 9:

  • MaxC or SumC have been selected in all projects, except Zookeeper.

  • Direct has been selected in seven out of the nine projects.

  • MaxC or SumC have been selected more than Direct in five out of the nine projects.

Fig. 9 Percent (x-axis) of times, across the nine classifiers, that a given feature (color) has been chosen in a given project (y-axis), in MDP

Table 5 reports the statistical results (p-value) comparing the MDP accuracy of the Combined versus the Direct approach. An asterisk identifies cases where the p-value is lower than alpha according to the Holm-Bonferroni correction and hence we can reject the null hypothesis. We note that MDP is statistically more accurate when leveraging JIT in AUC and in seven out of nine PofB metrics. Therefore, we can reject H30 and claim that leveraging JIT statistically significantly improves the accuracy of MDP. Moreover, the effect size is at least medium for most of the metrics.

Table 5 Statistical result (p-value) and Cohen’s d effect size, comparing the MDP accuracy of the combined versus direct approach

3.3 RQ3: Does Leveraging JIT Information Increase the Accuracy of CDP?

Accuracy

Figure 10 reports the distribution of AUC values across classifiers (y-axis) achieved by our approaches (i.e., Combined and Direct) for CDP in our subject projects (x-axis). According to Fig. 10:

  • The median AUC of Combined is higher than that of Direct in all nine projects.

  • As in RQ2, the distribution of values across classifiers is much narrower for Combined than for Direct. Therefore, the choice of classifier is not as important when using the Combined approach.

Fig. 10 Distribution of the AUC values across classifiers (y-axis) achieved by different approaches (Combined and Direct) for CDP in our subject projects (x-axis)

Figure 11 reports the distribution across classifiers of PofB values (x-axis) achieved by different approaches (Combined and Direct) for CDP in our subject projects (quadrant). According to Fig. 11:

  • Combined is better than Direct in all PofB values in seven out of nine projects: ARTEMIS, DIRSERVER, MNG, NUTCH, OPENJPA, QPID and TIKA.

  • There is no dataset where Combined is worse than Direct in all PofB values.

  • Similar to the results for AUC, the distribution of values across classifiers is much narrower for Combined than for Direct. Therefore, the choice of classifier is not as important when using Combined.

Fig. 11 Distribution across classifiers of PofB values (x-axis) achieved by different approaches (Combined and Direct) for CDP in our subject projects (quadrant)

Figure 12 reports the mean relative gain in CDP obtained by leveraging JIT, across classifiers and projects. According to Fig. 12, leveraging JIT increases the accuracy of CDP by an average of 31% in AUC and 38% in PofB20. It is interesting to note that, unlike in RQ2, the relative gain is not inversely correlated with the x of PofBx.

Fig. 12 Mean relative gain in CDP by leveraging JIT across classifiers and datasets

Feature Selection

Figure 13 reports, for each dataset, the percent of times each feature has been selected for CDP. According to Fig. 13:

  • MaxC or SumC have been selected in all nine projects.

  • Direct has not been selected in two out of the nine projects.

  • MaxC or SumC were selected more than Direct in five out of the nine projects.

Fig. 13 Distribution across datasets of feature selection for CDP

Table 6 reports the statistical results (p-value) comparing the CDP accuracy of the Combined versus the Direct approach. An asterisk identifies cases where the p-value is lower than alpha according to the Holm-Bonferroni correction and hence we can reject the null hypothesis. We note that CDP is statistically more accurate when leveraging JIT in AUC and in seven out of nine PofB metrics. Therefore, we can reject H40 and claim that leveraging JIT statistically significantly improves the accuracy of CDP. Moreover, the effect size is at least medium for most of the metrics.

Table 6 Statistical result (p-value) and Cohen’s d effect size, comparing the CDP accuracy of the combined versus direct approach

3.4 RQ4: Are we more Accurate in MDP or CDP?

Figure 14 reports the distribution across classifiers of AUC values (y-axis) achieved for MDP or CDP (colors) in our subject projects (x-axis). According to Fig. 14:

  • MDP is more accurate than CDP in all nine projects.

  • The distribution of values across classifiers is much narrower for MDP than for CDP. Therefore, the choice of classifier is less important in MDP than it is in CDP.

Fig. 14 Distribution across classifiers of AUC values (y-axis) achieved for MDP or CDP (colors) in our subject projects (x-axis)

The results in Fig. 14 are particularly interesting considering that defective methods are harder to find by chance than defective classes. Figure 15 reports the distribution across projects of AUC values (y-axis) achieved for MDP or CDP (colors) by our classifiers (x-axis). According to Fig. 15:

  • MDP is more accurate than CDP for all nine classifiers.

  • The distribution of values across projects is much narrower for MDP than for CDP. Thus, classifiers are more stable in MDP than in CDP.

Fig. 15 Distribution across projects of AUC values (y-axis) achieved for MDP or CDP (colors) by our classifiers (x-axis)

Figure 16 reports the mean across classifiers of PofB values (x-axis) achieved by MDP or CDP (colors) in our subject projects (quadrant). We reported the results in bar charts instead of box plots (as done in RQ2 and RQ3) because the distributions are very narrow, which hinders the visualization of the results in this case. These distributions can also be seen in previous figures (Fig. 7 for MDP and Fig. 11 for CDP). According to Fig. 16:

  • MDP is better than CDP in all PofB values in four projects.

  • MDP is worse than CDP in all PofB values only in the Groovy project.

Fig. 16 Mean across classifiers of PofB values (x-axis) achieved for MDP or CDP (colors) in our subject projects (quadrant)

Figure 17 reports the mean relative gain of performing MDP over CDP across classifiers and projects. According to Fig. 17, MDP is more accurate than CDP by an average of 5% in AUC and 62% in PofB10. It is worth noting that, again, the relative gain is inversely correlated with the x of PofBx.

Fig. 17 Mean of the relative gain in performing MDP over CDP across classifiers and projects

Table 7 reports the statistical results (p-value) comparing the accuracy of MDP versus CDP. An asterisk identifies the eight cases out of ten where the p-value is lower than alpha according to the Holm-Bonferroni correction and hence we can reject the null hypothesis. We note that MDP is statistically more accurate than CDP in AUC and in seven out of nine PofB metrics. Therefore, we can reject H50 and claim that MDP is more accurate than CDP. Moreover, the effect size is at least medium in four out of ten metrics.

Table 7 Statistical result (p-value) and Cohen’s d effect size, comparing the MDP versus CDP accuracy

4 Discussion

This section discusses our main results, offering possible explanations, implications, and guidelines for practitioners and researchers.

4.1 Main Results and Possible Explanations

The main result of RQ1 is that defective methods are significantly less frequent than defective classes. This means that it is harder to find defective entities by chance if they are methods rather than classes. This result confirms the common wisdom that MDP is more challenging than CDP because defective methods are rare. It was surprising at first to observe that in three out of nine projects the number of defective classes is higher than the number of defective methods. We triple-checked the results and found no mistakes. Our investigation revealed that this surprising result was due to defects pertaining only to attributes, and that such defects are particularly numerous in those three projects. Let us take DIRSERVER-1019 as an example defect, which affects the core/src/main/java/org/apache/directory/server/core/jndi/ServerDirContext.java class. By observing the content of the DIRSERVER-1019 fix commit, which has ID 09cc2c065fb36662ebf9f56486af28d87ad09d4c, we realize that developers removed the static keyword from the attribute "static final FilterParserImpl filterParser = new FilterParserImpl();". Such an attribute, and hence the DIRSERVER-1019 defect, was inserted in commit b392e8f69e2c6f30116459152db612e414f18724. The fact that defective classes can outnumber defective methods supports the need for CDP, as it can identify defects that MDP cannot.

The main result of RQ4 is that MDP is substantially more accurate than CDP (a mean of + 5% in AUC and + 62% in PofB10). The higher accuracy of MDP in comparison with CDP is visible in all datasets in terms of AUC. This might be due to the coarser granularity of classes, which, by definition, are larger than methods and, hence, harder to rank when they are only partially defective. Another possible reason is that defective methods are less prevalent than defective classes and, hence, a better ranking is more visible in effort-aware metrics like PofB.

The main result of RQ2 is that leveraging JIT increases the accuracy of MDP by an average of 17% in AUC and 46% in PofB10. Similarly, in RQ3, we observe that leveraging JIT increases the accuracy of CDP by an average of 31% in AUC and 38% in PofB20. Moreover, our statistical tests reveal that leveraging JIT increases the accuracy of MDP and CDP in AUC and in PofB (from PofB20 to PofB50). We note that leveraging JIT increases the accuracy of MDP and CDP in all datasets in terms of AUC.

Regarding PofB, in some datasets and for some PofBx values, leveraging JIT decreases the accuracy of MDP and CDP. For instance, in Zookeeper, JIT decreases the accuracy of CDP from PofB10 to PofB25.

Regarding ZOOKEEPER, it was surprising at first to observe that, according to Fig. 6, Combined is better than Direct, whereas for the same project in Fig. 9 MaxC and SumC have never been selected. We triple-checked the results and found no mistakes. If MaxC and SumC had not been selected, Combined could not be better than Direct. While this reasoning is correct, the datasets upon which the feature selection is applied differ between Figs. 6 and 9. To compute the results in Fig. 6, the feature selection is applied to the training set, as it aims at supporting the prediction on the testing set. To compute the results in Fig. 9, the feature selection is applied to the entire dataset, as it aims at characterizing the dataset. Thus, there is no inconsistency between the results in Figs. 6 and 9.

The Role of Partially Defective Commits

We note that leveraging JIT increases the accuracy of MDP or CDP under two conditions: 1) the JIT prediction is accurate, and 2) the entities touched by a defective commit are defective. This last point is important, since defective commits usually have only a small proportion of statements that are defect-inducing (Pascarella et al. 2019). Thus, the methods and classes touched by the non-defect-inducing statements of a defective commit are actually not defective. In this study, when we leverage JIT, the defectiveness of a commit is cascaded over all touched entities; therefore, we also cascade this defectiveness over the entities touched by the non-defect-inducing statements of the commit. In other words, while our JIT prediction is performed at the commit level, our ground truth is computed at the level of defect-inducing statements. Thus, the positive impact of leveraging JIT to support MDP or CDP is correlated with the percent of defective entities (classes or methods) touched by defective commits.

To investigate whether the effect of partially defective commits is substantial in our results, we report in Fig. 18 the mean percent, across commits, of defective entities touched by defective commits. A low value of this percentage indicates that defective commits were only partially defective and, hence, they had non-defect-inducing statements touching several entities.
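A minimal Python sketch of how this percentage can be computed is shown below; the data structures are illustrative.

```python
# Hedged sketch of the Fig. 18 metric: for each defective commit, the percent of
# touched entities (methods or classes) that are actually defective, averaged
# across all defective commits. Data structures are illustrative.

def mean_percent_defective_touched(defective_commits, defective_entities):
    """
    defective_commits: dict mapping commit_id -> set of touched entity ids.
    defective_entities: set of entity ids labeled as defective by RA-SZZ.
    """
    percents = []
    for commit_id, touched in defective_commits.items():
        if touched:
            hit = len(touched & defective_entities)
            percents.append(100.0 * hit / len(touched))
    return sum(percents) / len(percents) if percents else 0.0
```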

Fig. 18 Mean percent across commits of defective entities that are touched by a defective commit

Regarding RQ2, according to Fig. 18, Artemis, Dirserver and Zookeeper have the lowest proportion of actually defective methods touched by a defective commit; this is in line with Figs. 7 and 9 as, in Zookeeper, neither MaxC nor SumC has been selected in MDP. Moreover, according to Fig. 18, Tika has the highest proportion of actually defective methods touched by a defective commit; this is in line with Fig. 9 as, in Tika, Direct has never been selected in MDP.

Regarding RQ3, according to Fig. 18, Zookeeper has the lowest proportion of actually defective classes; this is in line with Fig. 11, since in this project Combined has lower PofB10 to PofB25 values than Direct. Moreover, according to Fig. 18, Nutch and Openjpa have the highest proportion of actually defective classes touched by a defective commit; this is in line with Fig. 13 as, in these projects, Direct has never been selected in CDP.

Comparing RQ2 to RQ3, we observe that JIT helped CDP more than MDP. This can be explained by Fig. 18, since the percent of entities that are actually defective when touched by a defective commit is about five times higher for classes than for methods. This means that leveraging a finer-grained defect prediction than JIT, i.e., statement-defectiveness prediction (Pornprasit and Tantithamthavorn 2021), would likely benefit MDP more than CDP.

The Narrower Distributions

Other important results from RQ2 and RQ3 are that the distributions of accuracy values are much narrower in both MDP and CDP when leveraging JIT. This indicates that the choice of classifier does not impact accuracy as much and, hence, leveraging JIT not only increases the accuracy of both MDP and CDP, but also makes them much more stable across a set of different classifiers. In order to analyze possible reasons as to why the distributions become narrower, Figs. 19 and 20 report the standard deviation (STDV) of Combined, Direct, MaxC and SumC in each specific dataset for MDP and CDP, respectively. According to Figs. 19 and 20, Combined has a lower STDV than Direct in all datasets in both MDP and CDP. However, the difference in STDV between Combined and Direct is not big enough to explain the significant difference in the resulting accuracy metrics.

Fig. 19 Distribution across classifiers of standard deviation achieved by different features (colors) for MDP in our projects

Fig. 20 Distribution across classifiers of standard deviation achieved by different features (colors) for CDP in our projects

We do not see any correlation between the ratio of defective commits and the benefits of Combined for MDP or CDP. Specifically, Combined is better than Direct also in Groovy (see Figs. 6 and 7 for MDP and Figs. 10 and 11 for CDP); i.e., leveraging JIT increases the accuracy of MDP and CDP even in a project with only 2% of defective commits. Moreover, we note that large projects might bias results that are aggregated across projects, e.g., Figs. 8 and 12.

4.2 Implications

The main implication of RQ1 is that if practitioners do not use any support system, they should prefer testing classes rather than methods. Conversely, RQ4 demonstrates that, given the same amount of effort, practitioners would find a much higher percentage of defective entities using a ranked list of methods instead of a ranked list of classes. Specifically, analyzing the results across datasets, the lowest accuracy achieved by MDP is higher than the highest accuracy achieved by CDP. Thus, the main implication of RQ4 for practitioners is that it is better to predict and rank defective methods rather than defective classes. The main implication of RQ2 and RQ3 for practitioners is that, overall, the ranking and classification of both methods and classes should leverage JIT information.

Regarding implications for researchers, we found a very limited number of related papers that compared MDP to CDP. For instance, in an ongoing systematic mapping study, we found that, in the last five years, no existing work investigated the accuracy of MDP in three of the major software engineering journals (IEEE Transactions on Software Engineering, ACM Transactions on Software Engineering and Methodology, and Empirical Software Engineering). Hence, there is high potential for the research community in improving MDP instead of focusing only on CDP. Moreover, given the observed low percent of defective methods touched by defective commits, i.e., the fact that defective commits are only partially defective, there is high potential in leveraging statement-level defectiveness (Pornprasit and Tantithamthavorn 2021) for MDP.

5 Threats to Validity

In this section, we report the threats to validity of our study. The section is organized by threat type, i.e., Conclusion, Internal, Construct, and External.

5.1 Conclusion

Conclusion validity concerns issues that affect the ability to draw accurate conclusions regarding the observed relationships between the independent and dependent variables (Wohlin et al. 2012).

We tested all hypotheses with a non-parametric test (i.e., the paired Wilcoxon signed-rank test), which is prone to type-2 error, i.e., not rejecting a false hypothesis. We have been able to reject the hypotheses in most of the cases; therefore, the likelihood of a type-2 error is low. Moreover, the alternative would have been using parametric tests (e.g., ANOVA), which are prone to type-1 error, i.e., rejecting a true hypothesis, which, in our context, is less desirable than a type-2 error. Also, we acknowledge that our proposed method (i.e., the median) to combine JIT with MDP and CDP is a simple baseline to start with, which nonetheless proved effective (as demonstrated by our results).

There is a gap in the literature between defect prediction accuracy and its actual value for software quality. To face this threat, we adopted nine effort-aware metrics; these metrics relate to the effort savings provided by adopting defect prediction models.
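As an example of such an effort-aware metric, the sketch below computes PofB-style scores under the common interpretation that PofBk is the percentage of defects found when inspecting the top-ranked entities accounting for k% of the lines of code. This interpretation, and all function and column names, are assumptions for illustration, not necessarily the exact definition used in Section 2.

```python
import pandas as pd

def pofb(df: pd.DataFrame, effort_pct: float = 20.0) -> float:
    """Percentage of defects found when inspecting entities, ranked by
    predicted defectiveness, up to `effort_pct` percent of the total LOC.

    df needs the (hypothetical) columns: 'p_defective', 'loc', 'is_defective'.
    """
    ranked = df.sort_values("p_defective", ascending=False)
    budget = ranked["loc"].sum() * effort_pct / 100.0
    inspected = ranked[ranked["loc"].cumsum() <= budget]
    total_defects = df["is_defective"].sum()
    return 100.0 * inspected["is_defective"].sum() / max(total_defects, 1)

# pofb(predictions, 10) -> PofB10; pofb(predictions, 20) -> PofB20
```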

Since AUC is sensitive to imbalanced data, and since our datasets are highly unbalanced, AUC results must be interpreted with care. The remaining nine performance metrics are insensitive to imbalanced data; thus, our overall results are not impacted by the imbalanced nature of our data.

5.2 Internal

Internal validity is concerned with influences that can affect the independent variables with respect to causality (Wohlin et al. 2012). A threat to internal validity is the lack of ground truth for commit, method, and class defectiveness. In other words, the used RA-SZZ is not perfectly accurate. Nevertheless, we would argue that this is a common threat in most empirical research in software engineering (Kamei and Shihab 2016). Moreover, to face this threat, we manually analyzed many commits.

Still regarding the lack of ground truth for commit, method, and class defectiveness, one relevant threat to validity is the possibility that non-linked tickets are related to defect fixes or injections. To face this threat, we tried our best to select projects with the highest linkage. Moreover, we reported the linkage proportion of each project in Table 1 to allow the reader to reason about the validity of the results for specific projects. Finally, we believe that the presence of non-linked commits could have inhibited the observed positive effects of using JIT to support MDP or CDP. Specifically, the inaccuracy in commit labeling caused by non-linked commits reasonably inhibits the accuracy of the commit defectiveness prediction and hence of its use.

The execution of a defect prediction study entails many, often subjective, design decisions, such as validation technique, balancing, normalization, tuning, and more, which might influence the prediction results. We do not expect our design choices to coincide with the choices of all readers; our intent is to use state-of-the-art techniques. We documented all our design choices in Section 2; moreover, we made our replication package available to researchers. Regarding tuning, many studies suggest tuning hyperparameters (Fu et al. 2016; Tantithamthavorn et al. 2019); however, in the present study, we use default hyperparameters due to resource constraints and due to the static time-ordering design of our evaluation. In the future, we plan to evaluate the interaction between hyperparameter tuning and the benefits of using JIT for MDP and CDP.
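A minimal sketch of such a time-ordered design, training on earlier releases and testing on later ones with default hyperparameters, is shown below; the column names, release identifiers, and classifier choice are illustrative assumptions rather than our exact pipeline.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def time_ordered_split(df: pd.DataFrame, train_releases, test_releases):
    """Split by release so that no future data leaks into the training set."""
    train = df[df["release"].isin(train_releases)]
    test = df[df["release"].isin(test_releases)]
    return train, test

def evaluate(df: pd.DataFrame, features, train_releases, test_releases) -> float:
    train, test = time_ordered_split(df, train_releases, test_releases)
    clf = RandomForestClassifier(random_state=0)  # default hyperparameters
    clf.fit(train[features], train["is_defective"])
    probs = clf.predict_proba(test[features])[:, 1]
    return roc_auc_score(test["is_defective"], probs)
```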

In this work, as in many other similar ones (Tantithamthavorn et al. 2016b, 2019; Falessi et al. 2020, 2021; Turhan et al. 2009; Fukushima et al. 2014; Vandehei et al. 2021), we do not differentiate among the severities of defects. If, on the one hand, the severity of defects is important and practical, on the other hand, to the best of our knowledge, there is no study suggesting that the severity of defects impacts defect prediction accuracy. Therefore, in the future, we plan to extend this work by analyzing the sensitivity of the current results to the severity or priority of the considered defects.

5.3 Construct

Construct validity is concerned with the degree to which our measurements indeed reflect what we claim to measure (Wohlin et al. 2012).

To prevent dormant defects from impacting our ground truth, we neglected the last 90% of the releases. This gives us confidence that snoring is only about 1% in our datasets (Falessi et al. 2021; Ahluwalia et al. 2019).

Our results could also be impacted by our specific design choices, including classifiers, features, and accuracy metrics. To face these threats, we based our choices on past studies.

As discussed in Section 5.2, one relevant threat to validity is the possibility that non-linked tickets are related to defect fixes or injections; this would bias our ground truth. We faced this threat by selecting projects with the highest linkage and by reporting the linkage proportion of each project in Table 1. Moreover, we believe that the presence of non-linked commits could have inhibited, rather than inflated, the observed positive effects of using JIT to support MDP or CDP, since the resulting inaccuracy in commit labeling reasonably reduces the accuracy of the commit defectiveness prediction and hence of its use.

5.4 External

External validity is concerned with the extent to which the research elements (subjects, artifacts, etc.) are representative of actual elements (Wohlin et al. 2012).

This study used a large set of datasets and, hence, can be deemed highly generalizable compared to similar studies. Of course, our results cannot be generalized to projects that significantly differ from the settings used in the present study. Moreover, since we focused on open-source projects, due to their high availability, we recommend care in generalizing these findings to industrial projects. Please note that considering only mature projects is a threat to external generalizability, as results might not generalize to immature projects. One could argue that, since only nine nontrivial Apache projects have an issue linkage rate above 50%, no other Apache project might reasonably benefit from the proposed method. We note that the linkage impacts the labeling mechanism, which might have no impact on our results. In practice, our independent variables are orthogonal to the labeling mechanism. In other words, if in the future we are able to label the defectiveness of commits, classes, and methods with a linkage-agnostic mechanism, then we believe that our results will still reasonably hold.

Finally, in order to promote reproducible research, all datasets, results, and scripts for this paper are available in our replication package.

6 Related Work

6.1 Combining Heterogeneous Predictions

While countless studies investigated how to predict the defectiveness of commits (Herbold et al. 2020; McIntosh and Kamei 2018; Pascarella et al. 2019; Herbold 2019; Kondo et al. 2020; Huang et al. 2019; Fan et al. 2019; Tu et al. 2020; Rodriguezperez et al. 2020; Pascarella et al. 2020; Giger et al. 2012), or classes (Kamei et al. 2016; Tantithamthavorn et al. 2016b, 2019, 2020; Bennin et al. 2018, 2019; Herbold et al. 2017, 2018, 2019; Hosseini et al. 2019; Yan et al. 2017; Liu et al. 2017; Chi et al. 2017; Jing et al. 2017; Di Nucci et al. 2018; Palomba et al. 2019; Song et al. 2019; Zhang et al. 2016, 2017; Lee et al. 2016; Yu et al. 2019a; Peters et al. 2019; Qu et al. 2021; Shepperd et al. 2018; Amasaki 2020; Bangash et al. 2020; Kondo et al. 2019; Morasca and Lavazza 2020; Mori and Uchihira 2019; Tian et al. 2015; Jiarpakdee et al. 2020; Chen et al. 5555; Dalla Palma et al. 2021) in a separate fashion, to the best of our knowledge, no study other than Pascarella et al. (2019) investigated how heterogeneous predictions can benefit one another.

Another family of studies that combines heterogeneous information is the ensemble model, which has been used in the context of defect prediction, as a way to combine the prediction of several classifiers (Laradji et al. 2015; Petric et al. 2016; Tosun et al. 2008; Yang et al. 2017).

6.2 Method Defectiveness Prediction

The first to propose lowering the granularity of defect prediction were Menzies et al. (2007) and Tosun et al. (2010). Giger et al. (2012) were the first to perform an MDP study. Specifically, Giger et al. (2012) defined a set of product and process features and found that both product and process features support MDP (i.e., F-measure = 86%).

Our paper has been highly inspired by Pascarella et al. (2020). Specifically, Pascarella et al. (2020) provide negative results regarding the performance of MDP. In other words, using the same design as Giger et al. (2012), they show that MDP is about as accurate as a random approach, i.e., the obtained AUC is about 0.51. We share several design decisions with them, including:

  • The use of process metrics as features for MDP. Specifically, “The addition of alternative features based on textual, code smells, and developer-related factors improve the performance of the existing models only marginally, if at all.” (Pascarella et al. 2020)

  • The use of a realistic validation procedure. However, they performed a walk-forward procedure, whereas we performed a simple split that preserves the order of the data. Both procedures are realistic since they preserve the order of data (Falessi et al. 2020).

The differences in design include:

  • We use an advanced SZZ implementation (RA-SZZ) whereas they use Relink (Wu et al. 2011).

  • The use of a different definition of a defective entity. In our research, an entity is defective from when the defect was injected until the last release before the defect was fixed.

  • We use a different set of classifiers.

  • We use effort-aware metrics such as PofB.

  • We selected a different set of projects from which we derived the datasets. The change was due to the fact that we needed the same dataset to produce commit, method, and class data.

The differences in results include that the accuracy achieved by MDP, even without leveraging JIT, is much better than a random approach. Specifically, according to Fig. 6, the median AUC across classifiers and datasets is 0.81 without leveraging JIT and 0.96 when leveraging JIT. Moreover, the proportion of defective methods is about an order of magnitude lower in our datasets than in theirs. These differences are due to the set of changes in the design, and we prefer not to speculate on which specific change caused the difference in results.

To the best of our knowledge, there is no other study investigating MDP other than Pascarella et al. (2020) and Giger et al. (2012).

Defect prediction can also focus on finer-grained software entities other than methods, such as commits (JIT) and statements (Pornprasit and Tantithamthavorn 2021). However, these types of entities (commits and statements) seem more useful to rank at the moment of a commit than during the testing phase (which is the target phase of this paper).

7 Conclusion

In this study, we: (i) compare methods and classes in terms of defectiveness; (ii) compare methods and classes in terms of accuracy in defectiveness prediction; (iii) propose and evaluate a first, simple approach that leverages JIT information to increase MDP accuracy and (iv) CDP accuracy.

Our analysis features two types of accuracy metrics (threshold-independent and effort-aware), feature selection metrics, nine machine learning defect prediction classifiers, and 1,860 defects related to 35 releases of 9 open-source projects from the Apache ecosystem. Our results rely on a ground truth featuring a total of 269,004 data points and 46 features among commits, methods, and classes.

Our results reveal that:

  • MDP is significantly more accurate than CDP (+5% AUC and 62% PofB10). Thus, it is better to predict and rank defective methods rather than defective classes from a practitioner's perspective. From a researcher's perspective, given the scarce number of MDP studies, there is high potential for improving MDP accuracy.

  • Leveraging JIT by using a simple median approach increases the accuracy of MDP by an average of 17% in AUC and 46% in PofB10 and increases the accuracy of CDP by an average of 28% in AUC and 31% in PofB20. However, in a few cases, leveraging JIT decreased the accuracy of MDP and CDP.

  • Since many defective commits were only partially defective, only a small percent of methods touched by defective commits were actually defective. Therefore, we expect that leveraging statement-defectiveness-prediction (Pornprasit and Tantithamthavorn 2021) would better enhance MDP than JIT.

In conclusion, from a practitioner’s perspective, it is better to predict and rank defective methods than defective classes. From a researcher’s perspective, there is a high potential for leveraging statement-defectiveness-prediction (SDP) to aid MDP and CDP.

In the future we plan to:

  • Propose and evaluate new approaches to improve MDP by leveraging JIT. Specifically, instead of using a static approach like the median, we could use a machine learning approach to combine MDP with JIT information (see the sketch after this list).

  • Use smell information (Fowler 2018) to support MDP, as suggested by previous works (Khomh et al. 2012; Palomba et al. 2018, 2019; Taba et al. 2013).

  • Leverage statement level defect prediction (Pornprasit and Tantithamthavorn 2021) to augment MDP and CDP.

  • Investigate whether dormant defects (Ahluwalia et al. 2019; Falessi et al. 2021) or other types of noise in the datasets (Chen et al. 2014; Tantithamthavorn et al. 2015; Herzig et al. 2013; Rahman et al. 2013; Bird et al. 2009) have more impact on MDP or CDP.

  • Replicate the approach in the context of dependence (Cogo et al. 2019), performance (Chen et al. 2020) or security (Yu et al. 2019b) defects.

  • Use multi-level features in a single prediction model. While in this work we evaluated the benefits of combining two predictions (e.g., commits with methods), in the future, we plan to investigate the benefits of performing a single prediction that uses features at different levels, i.e., features at both the commit and method levels.
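Regarding the first item in the list above, one possible machine-learning alternative to the static median is a simple stacking scheme: a meta-classifier that takes the entity-level predicted probability and the JIT-derived aggregates as input. The sketch below is a minimal illustration; the choice of logistic regression as the meta-learner and all names are our assumptions, not an implemented approach of this study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_meta_combiner(p_entity: np.ndarray,
                      max_c: np.ndarray,
                      sum_c: np.ndarray,
                      y: np.ndarray) -> LogisticRegression:
    """Learn how to combine the MDP/CDP probability with JIT-derived signals,
    instead of using a fixed median. Inputs are per-entity vectors."""
    X = np.column_stack([p_entity, max_c, sum_c])
    meta = LogisticRegression(max_iter=1000)
    meta.fit(X, y)
    return meta

# At prediction time (same feature order as in training):
# combined_prob = meta.predict_proba(
#     np.column_stack([p_entity, max_c, sum_c]))[:, 1]
```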