On effort-aware metrics for defect prediction

Advances in defect prediction models, aka classifiers, have been validated via accuracy metrics. Effort-aware metrics (EAMs) relate to benefits provided by a classifier in accurately ranking defective entities such as classes or methods. PofB is an EAM that relates to a user that follows a ranking of the probability that an entity is defective, provided by the classifier. Despite the importance of EAMs, there is no study investigating EAMs trends and validity. The aim of this paper is twofold: 1) we reveal issues in EAMs usage, and 2) we propose and evaluate a normalization of PofBs (aka NPofBs), which is based on ranking defective entities by predicted defect density. We perform a systematic mapping study featuring 152 primary studies in major journals and an empirical study featuring 10 EAMs, 10 classifiers, two industrial, and 12 open-source projects. Our systematic mapping study reveals that most studies using EAMs use only a single EAM (e.g., PofB20) and that some studies mismatched EAMs names. The main result of our empirical study is that NPofBs are statistically and by orders of magnitude higher than PofBs. In conclusion, the proposed normalization of PofBs: (i) increases the realism of results as it relates to a better use of classifiers, and (ii) promotes the practical adoption of prediction models in industry as it shows higher benefits. Finally, we provide a tool to compute EAMs to support researchers in avoiding past issues in using EAMs.


Introduction
The manner in which defects are introduced into code, and the sheer volume of defects in software, are typically beyond the capability and resources of most development teams (Ghotra et al. 2017; Kamei et al. 2012; Tantithamthavorn et al. 2019). Defect prediction models aim to identify software artifacts that are likely to be defective (Ohlsson and Alberg 1996; Ostrand and Weyuker 2004; Ostrand et al. 2005; Turhan et al. 2009; Weyuker et al. 2010). The main purpose of defect prediction is to reduce the cost of testing, analysis, or code review by prioritizing developers' efforts on specific artifacts such as commits, methods, or classes.
Accuracy metrics are important to validate the extent to which classifiers are accurate and would support potential users. The aim of this paper is to focus on a particular family of accuracy metrics called effort-aware metrics (EAMs) (Jiang et al. 2013; Rahman et al. 2012). EAMs relate to the ranking, as provided by a classifier, of candidate defective entities (Mende and Koschke 2009). The better the classifier, the higher the number of defective entities a developer can identify within a given rank. The different EAMs vary in the thresholds used to stop analyzing the ranking and in the ranking criteria. For instance, PofBx is defined as the percentage of bugs that a developer can identify by inspecting the top x percent of lines of code (Chen et al. 2017). Table 1 reports an example of PofB20. The dataset in Table 1 is 1790 LOC, and it features seven entities, three of which are defective. Looking at PofB20, a user following the rank of classes predicted to have the highest probability of defectiveness, as provided by a classifier, would stop at 20% of the dataset and hence would analyze only up to 358 LOC. Thus, the user would analyze only the first two entities, finding one-third of the defective entities. Thus, in this example, PofB20 is 33%.
To better motivate the need to normalize the ranking according to the size of the ranked entities, we present here the same seven entities of Table 1 (Section 1) but with a normalized ranking. Specifically, Table 2, differently from Table 1, ranks the entities according to their predicted probability to be defective divided by their size rather than according to their predicted probability only. Looking at the normalization of PofB20 (Table 2), a user following the rank of classes based on the predicted probability to be defective divided by size, as provided by a classifier, would stop at 20% of the dataset and hence would analyze only up to 358 LOC. Thus, by following a normalized ranking, the user would analyze the first three entities, finding two-thirds of the defective entities. Thus, in this example of seven entities, the normalization increased PofB20 from 33% to 66%.
Despite the importance of EAMs, there is no study investigating EAMs trends and validity. The aim of this paper is twofold: 1) we reveal issues in EAMs usage, and 2) we propose and evaluate a normalization of PofB called NPofB. NPofB measures the ranking effectiveness when the ranking is normalized by the size of the ranked, and possibly defective, entities. In this paper we provide the following contributions:
1. We reveal trends in EAMs usage. Our systematic mapping study, featuring 152 primary studies in major software engineering journals, reveals a few issues, including that most studies using EAMs use only a single EAM (e.g., PofB20) and that some studies mismatched EAMs names.
2. We suggest normalizing the PofBs. The idea behind normalization is that the user will follow a ranking based on the probability that an entity is defective, as provided by the classifier, normalized (i.e., divided) by the size of this entity.
3. We validate the normalization of PofBs. By analyzing ten PofBs, ten classifiers, two industrial projects and 12 open-source projects, we show that the normalization increases the PofBs statistically and by orders of magnitude. This result means that: 1) studies reporting PofBs, rather than the proposed normalization (i.e., NPofB), underestimate the benefits of using classifiers for ranking defective classes; 2) the normalization increases the realism of PofBs due to a better use of classifiers.
4. We show that the proposed normalization changes the ranking of classifiers. Specifically, when considering the same dataset, in most cases the best classifier for a PofB differs from the best classifier for the normalized version of that PofB.
5. We show that multiple PofBs are needed to support a comprehensive understanding of classifiers' accuracy.
6. We provide a tool to compute EAMs to support researchers in: 1) avoiding extra effort in EAMs computation, as there is no available tool to compute EAMs, 2) increasing results reproducibility, 3) increasing results validity by avoiding EAMs misnaming, and 4) increasing results generalizability by avoiding single EAM usage.
The remainder of this paper is structured as follows. Section 2 discusses the related literature, focusing in particular on accuracy metrics for classifiers. Section 3 reports the design, Section 4 the results, and Section 5 a discussion of our study. Section 6 presents our ACUME tool. Section 7 provides the threats to validity of our investigation. Finally, Section 8 concludes the paper and outlines directions for future work.

Accuracy Metrics
Accuracy metrics evaluate the ability of a classifier to provide correct classifications. Examples of accuracy metrics include the following:
-True Positive (TP): The class is actually defective and is predicted to be defective.
-False Negative (FN): The class is actually defective and is predicted to be non-defective.
-True Negative (TN): The class is actually non-defective and is predicted to be non-defective.
-False Positive (FP): The class is actually non-defective and is predicted to be defective.
-Precision: $\frac{TP}{TP+FP}$.
-Recall: $\frac{TP}{TP+FN}$.
-F1-score: $\frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$.
-AUC (Area Under the Receiver Operating Characteristic Curve) (Powers 2007) is the area under the curve, of true positive rate versus false positive rate, that is defined by setting multiple thresholds. AUC has the advantage of being threshold-independent.
-MCC (Matthews Correlation Coefficient) is commonly used in assessing the performance of classifiers dealing with unbalanced data (Matthews 1975), and is defined as: $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$. Its interpretation is similar to correlation measures, i.e., MCC < 0.2 is considered to be low, 0.2 ≤ MCC < 0.4 fair, 0.4 ≤ MCC < 0.6 moderate, 0.6 ≤ MCC < 0.8 strong, and MCC ≥ 0.8 very strong.
-Gmeasure: $\frac{2 \cdot Recall \cdot (1-pf)}{Recall + (1-pf)}$ is the harmonic mean between recall and the complement of the probability of false alarm (pf), where pf denotes the ratio of the number of non-defective modules that are wrongly classified as defective to the total number of non-defective modules, i.e., $\frac{FP}{FP+TN}$.
A drawback of the metrics above is that they implicitly assume that the costs associated with testing activities are the same for each entity, which is not reasonable in practice. For example, costs for unit testing and code reviews are roughly proportional to the size of the entity under test.
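The threshold-dependent metrics listed above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `classification_metrics`, the convention of reporting an undefined ratio as 0, and the example counts are assumptions, not part of the study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return threshold-dependent accuracy metrics from confusion-matrix counts."""

    def ratio(num, den):
        # Assumed convention: an undefined ratio (zero denominator) is reported as 0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    pf = ratio(fp, fp + tn)  # probability of false alarm
    f1 = ratio(2 * precision * recall, precision + recall)
    # Harmonic mean of recall and (1 - pf).
    g_measure = ratio(2 * recall * (1 - pf), recall + (1 - pf))
    mcc = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "precision": precision,
        "recall": recall,
        "pf": pf,
        "f1": f1,
        "g_measure": g_measure,
        "mcc": mcc,
    }


# Hypothetical confusion matrix, for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=40, fn=10)
print(metrics)
```

Note that none of these values depends on the size of the classified entities, which is the limitation that motivates effort-aware metrics.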

Effort-Aware Metrics
The rationale behind EAMs is that they focus on the effort reduction gained by using classifiers (Mende and Koschke 2009).
In general, there are two types of EAMs: normalized by size or not normalized by size. The best-known not-normalized EAM is called PofB (Chen et al. 2017; Tu et al. 2020; Wang et al. 2020), which is defined as the proportion of defective entities identified by analyzing the first x% of the code base, where entities are ranked according to their probability to be defective as provided by the prediction model. The better the ranking, the higher the PofB, and the higher the support provided during testing. For instance, a method having a PofB10 of 30% means that 30% of defective entities have been found by analyzing 10% of the codebase by using the ranking provided by the method. Since the PofBx of a perfect ranking is still costly, it is interesting to compare the ranking provided by a prediction model with a perfect ranking; this helps in understanding how the prediction model performed compared to a perfect model. Therefore, Mende and Koschke (2009), as inspired by Arisholm et al. (2007), proposed Popt, which measures the ranking accuracy provided by a prediction model by taking into account how much worse it is than a perfect ranking and how much better it is than a random ranking. Popt is defined in terms of the area $\Delta_{opt}$ between the optimal model and the prediction model. In the optimal model, all instances are ordered by decreasing fault density, and in the predicted model, all instances are ordered by decreasing predicted defectiveness. The equation for computing Popt is shown below, where a larger Popt value means a smaller difference between the optimal and predicted model: $P_{opt} = 1 - \Delta_{opt}$.
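As an illustration of the PofBx definition, the following minimal Python sketch ranks entities by predicted probability and counts the defective entities found within the first x% of LOC. The function name `pofb`, the tuple layout, and the seven entities are hypothetical (they are not the data of Table 1). The sketch also assumes that inspection stops at the first entity that does not fit entirely within the LOC budget; other stopping rules are conceivable.

```python
def pofb(entities, x):
    """Percentage of defective entities found within the first x% of LOC.

    entities: list of (loc, predicted_probability, is_defective) tuples.
    The ranking is by decreasing predicted probability (not normalized by size).
    """
    total_loc = sum(loc for loc, _, _ in entities)
    total_defective = sum(1 for _, _, defective in entities if defective)
    if total_defective == 0:
        return 0.0
    budget = total_loc * x / 100.0
    ranked = sorted(entities, key=lambda e: e[1], reverse=True)
    inspected_loc = 0
    found = 0
    for loc, _, defective in ranked:
        # Assumption: stop as soon as the next entity exceeds the LOC budget.
        if inspected_loc + loc > budget:
            break
        inspected_loc += loc
        found += 1 if defective else 0
    return 100.0 * found / total_defective


# Hypothetical entities: (LOC, predicted probability, actually defective).
ENTITIES = [
    (60, 0.9, True),
    (80, 0.8, False),
    (300, 0.7, True),
    (50, 0.6, True),
    (200, 0.4, False),
    (150, 0.2, False),
    (160, 0.1, False),
]

print(round(pofb(ENTITIES, 20), 2))  # 33.33
```

In this hypothetical dataset of 1000 LOC, the 20% budget is 200 LOC: the first two ranked entities fit (140 LOC) while the third (300 LOC) does not, so one of the three defective entities is found.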
Popt and PofB are two different metrics describing two different aspects of the accuracy of a model. Popt and PofB rank entities in two different ways: Popt according to bug density (i.e., bug probability divided by entity size), PofB according to bug probability. Therefore, the ranking of classifiers provided by Popt and PofB might differ. Finally, Popt is more realistic than PofB as the ranking is based on density rather than probability. However, Popt is harder to interpret than PofB, as a classifier with double the Popt does not provide double the benefits to its user. Thus, in this paper, we try to bring together the best of PofB and Popt by proposing a new EAM that ranks entities as Popt does and is as interpretable as PofB.
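To make the area-based definition of Popt more concrete, the sketch below computes the area under the cumulative curve of defects found versus LOC inspected for two orderings, and takes Popt as one minus their difference. It is a simplified sketch under stated assumptions: both axes are scaled to [0, 1], areas are computed with the trapezoid rule, the predicted ordering uses a generic `score` field, and the names `effort_curve_area` and `popt` as well as the data are hypothetical. Published formulations may differ in details such as additional normalization.

```python
def effort_curve_area(ordered):
    """Area under the cumulative (fraction of LOC, fraction of defects) curve.

    ordered: list of (loc, defects) pairs in inspection order.
    """
    total_loc = sum(loc for loc, _ in ordered)
    total_defects = sum(defects for _, defects in ordered)
    area = 0.0
    x_prev = y_prev = 0.0
    cum_loc = cum_defects = 0
    for loc, defects in ordered:
        cum_loc += loc
        cum_defects += defects
        x = cum_loc / total_loc
        y = cum_defects / total_defects
        area += (x - x_prev) * (y + y_prev) / 2  # trapezoid rule
        x_prev, y_prev = x, y
    return area


def popt(entities):
    """Popt = 1 - (area of optimal ordering - area of predicted ordering).

    entities: list of (loc, defects, score) tuples, where `score` is the
    value used by the prediction model to rank the entity.
    """
    optimal = sorted(entities, key=lambda e: e[1] / e[0], reverse=True)
    predicted = sorted(entities, key=lambda e: e[2], reverse=True)
    delta_opt = effort_curve_area(
        [(loc, d) for loc, d, _ in optimal]
    ) - effort_curve_area([(loc, d) for loc, d, _ in predicted])
    return 1.0 - delta_opt


# Hypothetical data: the first model ranks exactly like the optimal model,
# the second one ranks in the opposite order.
perfect = [(100, 2, 0.9), (200, 1, 0.5), (100, 0, 0.1)]
reversed_scores = [(100, 2, 0.1), (200, 1, 0.5), (100, 0, 0.9)]
print(popt(perfect), popt(reversed_scores))
```

A model whose ordering coincides with the optimal one obtains a Popt of 1, and any other ordering obtains a lower value.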
In the following we provide a description of additional EAMs.
-PCI@20% and PMI@20%: represent the Proportion of Changes Inspected and the Proportion of Modules Inspected, respectively, when 20% of LOC are inspected. Note that these metrics are about the ranking of modules in general rather than about the ranking of defective modules. The idea behind these two similar metrics is that context switches shall be minimized to support effective testing. Specifically, a larger PMI@20% indicates that developers need to inspect more files under the same volume of LOC to inspect. Thus, bug prediction models should strive to reduce PMI@20% while trying to increase Popt (Qu et al. 2021b) at the same time.
-PFI@20%: has been introduced by Qu et al. (2021b) and it coincides with PMI@20% when the module is a file.
-IFA: "returns the number of initial false alarms encountered before the first real defective module is found". This effort-aware performance metric has been considerably influenced by previous work on automatic software fault localization (Kochhar et al. 2016). When IFA is high, there are many false positives before detecting the first defective module.
-Peffort: has been introduced by D'Ambros et al. (2012) and it is similar to our proposed NPofB. Peffort uses the LOC metric as a proxy for inspection effort. Peffort evaluates a ranking of entities based on the number of predicted defects divided by size, whereas our NPofB evaluates a ranking of entities based on the predicted defectiveness divided by size.
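Among the metrics above, IFA is the simplest to compute. A minimal sketch, assuming the ranking is given as a list of ground-truth labels in inspection order (the function name `ifa` and the example rankings are hypothetical):

```python
def ifa(ranked_labels):
    """Number of initial false alarms before the first defective module.

    ranked_labels: booleans in inspection order, True meaning the module
    is actually defective.
    """
    false_alarms = 0
    for defective in ranked_labels:
        if defective:
            return false_alarms
        false_alarms += 1
    # Assumption: if no defective module exists, every inspection is a false alarm.
    return false_alarms


print(ifa([False, False, True, False]))  # 2
print(ifa([True, False]))  # 0
```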

Evaluations
As EAMs drive and impact the results of prediction model evaluations, it is important to discuss studies about how to evaluate prediction models. The evaluation of prediction models performed in studies has been largely discussed, and many papers explicitly criticized specific empirical evaluations. For instance, Herbold (2017) criticized the use of the ScottKnottESD test in Tantithamthavorn et al. (2016c). Shepperd et al. (2014) found that the choice of classifier has less impact on results than the researcher group. Thus, they suggest conducting blind analyses, improving reporting protocols, and conducting more intergroup studies. Tantithamthavorn et al. (2016b) replied with a possible explanation for the results aside from researcher bias; however, after a few months Shepperd et al. (2018) concluded that the problem of researcher bias remains. Zhang and Zhang (2007) criticized Menzies et al. (2007b) because, due to the small percentage of defective modules, their results are not satisfactory for practical use. Zhang and Zhang (2007) suggest using accuracy metrics, such as Recall and Precision, instead of pd or pf. Menzies et al. (2007a) replied that it is often required to lower precision to achieve higher recall and that there are many domains where low precision is useful. Menzies et al. (2007a), in contrast to Zhang and Zhang (2007), advised researchers to avoid the use of the precision metric; they suggest the use of more stable metrics (i.e., recall (pd) and false alarm rates) for datasets with a large proportion of negative (i.e., not defective) instances. Falessi et al. (2020) report on the importance of preserving the order of data between the training and testing set. Afterward, the same issue was discussed in depth in Flint et al. (2021). Thus, results are unrealistic if the underlying evaluation does not preserve the order of data. Falessi et al. (2022) show that dormant defects impact classifiers' accuracy and hence their evaluation.
Specifically, an entity, such as a class or method used in the training/testing set, can be labeled in the ground truth as defective only after the contained defect is fixed. Since defects can sleep for months or years (Ahluwalia et al. 2019; Chen et al. 2014), the entity erroneously seems to be not defective until the defect it contains is fixed. Thus, Ahluwalia et al. (2019) suggest ignoring the most recent releases to prevent dormant defects from impacting classifiers' accuracy. Shepperd et al. (2013) commented on the low extent to which published analyses based on the NASA defect datasets are meaningful and comparable.
Very recently, Morasca and Lavazza (2020) proposed a new approach and a new performance metric (the Ratio of Relevant Areas) for assessing a defect proneness model by taking into account only parts of a ROC curve. They also show how their metric differs from the existing ones and how it is more reliable and less misleading.

Study Design
In this paper we investigate the following research questions:
-RQ1: Which EAMs are used in software engineering journal papers? In this research question we investigate the trends in EAMs usage, i.e., which and how many EAMs are used in past studies. We are also interested in understanding if the same study uses multiple EAMs and if the EAMs are consistently defined and computed across different studies.
-RQ2: Does the normalization improve PofBs? In this research question we investigate if the normalization of PofBs brings higher accuracy. Higher accuracy means that, when analyzing a given percentage of the lines of code of the possibly defective entities, we cover a higher number of defective entities by following a ranking that is based on both the entity's likelihood to be defective and its size, rather than a ranking that is based only on the entity's likelihood. If the normalization of PofBs brings higher accuracy, then studies reporting EAMs, unlike our normalized EAMs, underestimate the benefits of using a classifier for ranking defective classes. Moreover, the normalized EAMs shall be considered more realistic than EAMs since they relate to a better use of classifiers.
-RQ3: Does the ranking of classifiers change by normalizing PofBs? In this research question we are interested in understanding if the best classifier of a PofB is also the best classifier of the corresponding NPofB; i.e., if a classifier results as best in PofB10 then it might not be the best in NPofB10. If the normalization changes the ranking of classifiers, then past studies using PofB are misleading, i.e., past studies might not identify the classifier providing the highest benefit to the user in ranking defective classes.
-RQ4: Does the ranking of classifiers change across normalized PofBs? In this research question we are interested in understanding if multiple NPofBs are needed to support a comprehensive understanding of classifier accuracy. In other words, we want to know if different NPofBs rank classifiers in the same way. If different NPofBs rank classifiers differently, then results related to a single NPofB cannot be generalized to the overall ranking effectiveness provided by classifiers; i.e., if a classifier resulted as best in NPofB20 then it might not be the best in NPofB10.

RQ1: Which EAMs are Used in Software Engineering Journal Papers?
To investigate the trends in EAMs usage, we carried out a mapping study (MS) in the first semester of 2021 by following the Kitchenham and Charters guidelines (Kitchenham and Charters 2007). We performed the MS by applying the following query in the title:

(bug OR defect) AND (prediction OR estimation)
To make the MS feasible within our effort constraints, we focused on the top five journals in the software engineering area: IEEE Transactions on Software Engineering, ACM Transactions on Software Engineering and Methodology, Empirical Software Engineering and Measurement, Journal of Systems and Software, and Information and Software Technology.
We excluded conferences since they pose space constraints. Specifically, we wanted to be sure that a limited use of EAM was a deliberate design decision of the authors rather than a decision to meet the (conference) space constraints.
Our search provided us a set of about 179 papers. Then we applied the following exclusion criteria:
-Comments and answer to comments kind of papers.
-Systematic and mapping study kind of papers.
-Practitioners' opinions kind of papers.
-Studies about models predicting things other than defectiveness such as ticket resolution time.
After applying the exclusion criteria, we focused the remainder of the MS on 152 primary studies.
Once we applied the above-mentioned exclusion criteria, for each paper we checked the name of the EAMs used and their definition (i.e., how they were computed). Thus, we started from an empty list of EAMs and we extended the list as we analyzed the papers. The data extraction and synthesis of all papers have been performed by both authors independently after a period of training on a small set of papers. The results of the authors perfectly coincided.

RQ2: Does the Normalization Improve PofBs?
In general, EAMs try to measure the ranking effectiveness of prediction models. The rationale behind EAMs is to measure the effort required by testers to find a specific percentage of defects by following a ranked set of entities possibly containing defects. Since the testing effort varies according to the size of the entities under test, we had the intuition that the ranking of entities is more effective if it takes into consideration both the likelihood of the entity to be defective and its size. Therefore, in this paper we propose and validate a new EAM that measures the ranking effectiveness of prediction models when the ranking is normalized by the size of the ranked entities; i.e., it measures the effectiveness of an effort-aware ranking.
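The difference between the plain and the size-normalized ranking can be sketched as follows. This is an illustrative sketch, not the implementation used in the study: the function names, the tuple layout, and the seven entities are hypothetical, and the sketch assumes that inspection stops at the first entity that does not fit entirely within the LOC budget.

```python
def percent_defective_found(entities, x, ranking_key):
    """Percentage of defective entities found within the first x% of LOC,
    following the ranking induced by `ranking_key` (higher is inspected first).

    entities: list of (loc, predicted_probability, is_defective) tuples.
    """
    total_loc = sum(loc for loc, _, _ in entities)
    total_defective = sum(1 for _, _, defective in entities if defective)
    if total_defective == 0:
        return 0.0
    budget = total_loc * x / 100.0
    inspected_loc = 0
    found = 0
    for loc, _, defective in sorted(entities, key=ranking_key, reverse=True):
        # Assumption: stop as soon as the next entity exceeds the LOC budget.
        if inspected_loc + loc > budget:
            break
        inspected_loc += loc
        found += 1 if defective else 0
    return 100.0 * found / total_defective


def pofb(entities, x):
    # Ranking by predicted probability only.
    return percent_defective_found(entities, x, lambda e: e[1])


def npofb(entities, x):
    # Ranking by predicted probability divided by size (LOC).
    return percent_defective_found(entities, x, lambda e: e[1] / e[0])


# Hypothetical entities: (LOC, predicted probability, actually defective).
ENTITIES = [
    (60, 0.9, True),
    (80, 0.8, False),
    (300, 0.7, True),
    (50, 0.6, True),
    (200, 0.4, False),
    (150, 0.2, False),
    (160, 0.1, False),
]

print(round(pofb(ENTITIES, 20), 2))   # 33.33
print(round(npofb(ENTITIES, 20), 2))  # 66.67
```

With the same 200 LOC budget, the normalized ranking moves the small, likely defective entity of 50 LOC ahead of larger ones, so three entities fit in the budget and two of the three defective entities are found.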
To investigate if the normalization increases PofBs, we perform an empirical study based on within-project across-release class-level defect prediction. Specifically, we observe if the PofB of the same classifier on the same dataset increases after the normalization. As datasets we use the same two industry projects and 12 open-source projects we successfully used in a recent study (Falessi et al. 2020). The 12 open-source projects were originally used by Tantithamthavorn et al. (2016c), who in turn selected them from a set of 101 publicly-available defect datasets.
We refer to the recent study for details about the size and characteristics of the projects.

Independent Variable
The independent variable of this research question is the presence or absence of normalization in computing PofBs. In this study, we use the term, feature, to refer to the input (e.g., CHURN) of a classifier. Our independent variable is the normalization of the ranking by size as this is what we conjecture influences the ranking effectiveness. We note that in some studies, that are different from the present one, the features are the independent variables.

Dependent Variables
The dependent variable of this research question is the score of PofB with and without the normalization. As PofBs we considered the spectrum from 10 to 90 with a step of 10. We neglected PofB0 since this is always zero and PofB100 since this is always 100. We also considered the AveragePofB, computed as the average of the PofBs from 0 to 100 with a step of 10. Thus, we considered ten different PofBs.
In addition to comparing the two scores, with versus without the normalization, in this paper we observe the relative gain provided by the normalization, defined as $RelativeGain = \frac{NPofB - PofB}{PofB}$, where NPofB represents the normalized score of PofB.

Measurement Procedure
For each project, we:
1. Perform preprocessing:
-Normalization: we normalize the data with log10 as performed in related studies (Jiang et al. 2008; Tantithamthavorn et al. 2019).
-Feature Selection: we filter the features by using the correlation-based feature subset selection (Ghotra et al. 2017; Hall 1998). The approach evaluates the worth of a subset of features by considering the individual predictive capability of each feature, as well as the degree of redundancy between different features. The approach searches the space of feature subsets by a greedy hill-climbing augmented with a backtracking facility. The approach starts with an empty set of features and performs a forward search by considering all possible single feature additions and deletions at a given point.
-Balancing: we apply SMOTE (Agrawal and Menzies 2018; Chawla et al. 2002) so that each dataset is perfectly balanced.
2. Create the training and testing datasets by adopting the walk-forward validation technique suggested in a recent study (Falessi et al. 2020). Specifically, our context is the within-project across-release class-level defect prediction. In this technique, the project is first organized in releases. Afterwards, a loop runs from n = 2 up to n = the number of releases; at each iteration, the data of the first n-1 releases is used as the training set, and the data of release n is used as the testing set. This technique has the advantage of preserving the order of data and hence avoiding that data from the future is used to predict data in the past. Moreover, the technique is fully replicable as there is no random mechanism. The disadvantage is that it requires the project to have at least two releases. The random aspects in our classifiers, if any, are controlled by seeds that are used as a parameter, i.e., input, of the classifiers. Therefore, our classifiers are deterministic rather than stochastic, i.e., our results coincide over multiple runs on the same train-test pair. Thus, there is no need to perform a sensitivity analysis of our results. Our set of 14 projects, analyzed via a walk-forward technique, leads to a total of 71 datasets (i.e., 71 specific combinations of training and testing sets). For instance, since KeymindA consists of five releases, walk-forward on KeymindA leads to 4 datasets. Again, we refer the reader to the previous study for further details about the datasets (Falessi et al. 2020).
3. Compute the predicted probability of defectiveness of each class by using each of the ten classifiers.
4. Compute PofBs and NPofBs.
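The walk-forward loop of step 2 can be sketched as a small generator. The function name `walk_forward_splits` and the release labels are hypothetical; each release is represented here by a placeholder string rather than by an actual dataset.

```python
def walk_forward_splits(releases):
    """Yield (training_releases, testing_release) pairs in chronological order.

    releases: list of per-release datasets, oldest first.
    For n = 2 .. len(releases), the first n-1 releases form the training set
    and release n is the testing set, so no future data is used for training.
    """
    for n in range(2, len(releases) + 1):
        yield releases[: n - 1], releases[n - 1]


# Hypothetical project with five releases, which leads to four datasets.
splits = list(walk_forward_splits(["r1", "r2", "r3", "r4", "r5"]))
for training, testing in splits:
    print(training, "->", testing)
```

A project with a single release yields no split, which reflects the stated requirement of at least two releases.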
As classifiers we used the ones used in a previous study (Falessi et al. 2020):
-Decision Table: It consists of two major parts: the schema, i.e., the set of features included in the table, and the body, i.e., the labeled instances defined by the features in the schema. Given an unlabeled instance, it tries to match the instance to a record in the table (Kohavi 1995).
-IBk: Also known as the k-nearest neighbors algorithm (k-NN), which is a non-parametric method. The classification is based on the majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (Altman 1992). We run the k-nearest neighbors classifier with k = 1 (Aha and Kibler 1991).
-J48: Generates a pruned C4.5 decision tree (Quinlan 1993).
-KStar: Instance-based classifier using some similarity function. Uses an entropy-based distance function (Cleary and Trigg 1995).
-Naive Bayes: Classifies records using estimator classes and applying Bayes theorem (John and Langley 1995), i.e., it assumes that the contribution of an individual feature towards deciding the probability of a particular class is independent of other features in that project instance (McCallum and Nigam 1998).
-SMO: John Platt's sequential minimal optimization algorithm for training a support vector classifier (Platt 1998).
-Random Forest: Ensemble learning creating a collection of decision trees. Random forests correct for the overfitting of single decision trees (Breiman 2001).
-Logistic Regression: It estimates the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables. The estimation is performed through the logistic distribution function (Le Cessie and Van Houwelingen 1992).
-BayesNet: Bayesian networks (BNs), also known as belief networks (or Bayes nets for short), belong to the family of probabilistic graphical models (GMs). These graphical structures are used to represent knowledge about an uncertain domain. In particular, each node in the graph represents a random variable, while the edges between the nodes represent probabilistic dependencies among the corresponding random variables. These conditional dependencies in the graph are often estimated by using known statistical and computational methods. Hence, BNs combine principles from graph theory, probability theory, computer science, and statistics (Ben-Gal 2008).
-Bagging: Probably the most well-known sampling approach. Given a training set, bagging generates multiple bootstrapped training sets and calls the base model learning algorithm with each of them to yield a set of base models (Kotsiantis et al. 2005).

Analysis Procedure
We compare the value of PofBs of the same classifier on the same dataset, with versus without the normalization. Since our data strongly deviate from normality, the hypotheses of this research question are tested using the Wilcoxon signed-rank test (Wilcoxon 1945). The test is paired since the compared distributions, with versus without normalization, are related to the identical objects (i.e., the score of the same ten classifiers on the same 71 datasets). We also use the Cliff's delta (paired) to analyze the effect size (Grissom and Kim 2005). Table 3 presents the standard interpretation (Vargha and Delaney 2000) of Cliff's delta (paired) effect size.
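The two statistics involved in this comparison can be illustrated with a dependency-free sketch. It computes only the Wilcoxon signed-rank sums (not the p-value, for which a statistical library would normally be used) and the standard, all-pairs form of Cliff's delta; whether the study's "paired" variant of Cliff's delta matches this form is an assumption. All function names and scores are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with ties receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def signed_rank_sums(after, before):
    """Wilcoxon signed-rank sums (W+, W-) for paired samples.

    Zero differences are discarded, as in the classic formulation.
    """
    diffs = [a - b for a, b in zip(after, before) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus


def cliffs_delta(xs, ys):
    """Standard Cliff's delta: (#(x > y) - #(x < y)) / (len(xs) * len(ys))."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical scores of the same classifier-dataset pairs.
normalized = [3, 5, 8]
plain = [1, 6, 4]
print(signed_rank_sums(normalized, plain))
print(cliffs_delta(normalized, plain))
```

A W+ much larger than W- indicates that the normalized scores tend to exceed the plain ones, and Cliff's delta close to 1 indicates a large effect in the same direction.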

RQ3: Does the Ranking of Classifiers Change by Normalizing PofBs?
Since the normalization of PofBs results in higher accuracy (RQ2), it is interesting to understand the validity of past studies since they do not normalize PofBs. If the normalization changes the ranking of classifiers, then past studies using PofB might not identify the classifier providing the highest benefit to the user in ranking defective classes. In this research question, we leverage RQ2 results, i.e., the accuracy of 10 classifiers over 72 datasets grouped in 14 projects. To compare the rankings, we use the Spearman's rank correlation (Spearman 1904) between the ranking of classifiers provided by the same PofB, with versus without the normalization, in each of the 72 datasets. Table 4 presents the standard interpretation (Akoglu 2018) of Spearman's ρ.
We also compare, for each dataset and PofB, if the best classifier coincides after the normalization.
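The comparison of rankings described above can be sketched without external libraries. The sketch computes Spearman's ρ as the Pearson correlation of average ranks and checks whether the best classifier coincides. The function names and the scores of the four classifiers are hypothetical and serve only as an illustration.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with ties receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


def same_best(scores_a, scores_b):
    """True if the highest-scoring classifier is the same in both dicts."""
    return max(scores_a, key=scores_a.get) == max(scores_b, key=scores_b.get)


# Hypothetical scores of four classifiers on one dataset.
pofb20 = {"J48": 30.0, "RandomForest": 45.0, "NaiveBayes": 25.0, "Logistic": 40.0}
npofb20 = {"J48": 60.0, "RandomForest": 55.0, "NaiveBayes": 35.0, "Logistic": 50.0}

names = list(pofb20)
rho = spearman_rho([pofb20[n] for n in names], [npofb20[n] for n in names])
print(rho, same_best(pofb20, npofb20))
```

In this hypothetical case the two metrics agree only partially on the ordering of classifiers and disagree on which classifier is best.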

RQ4: Does the Ranking of Classifiers Change Across Normalized PofBs?
Since past studies used a very limited set of EAMs (RQ1), it is interesting to understand if it is a valid design decision to use a limited set of NPofBs. If different NPofBs rank classifiers differently, then the results related to a single NPofB cannot be generalized to the overall ranking effectiveness provided by prediction models; i.e., if a prediction model resulted as best in NPofB20 then it might not be the best in NPofB10. Thus, to understand if the use of multiple NPofBs is needed, we need to understand if there is a difference among NPofBs. We measure the difference among NPofBs as the difference among their rankings. To compare the rankings we use the Spearman's rank correlation (Spearman 1904) between the ranking of classifiers provided by each pair of NPofBx, with x in the range [10,90]. As in RQ3, in this research question we leverage RQ2 results. Specifically, each classifier, in each of the 72 datasets, has a ranking in the range [1,10] (as we used 10 classifiers) with a specific NPofBx. We compute the Spearman's values across each combination of NPofBx, with x in the range [10,90]. We also compare, for each dataset, the proportion of ten NPofBs sharing the same classifier as best.

Study Results
RQ1: Which EAMs are Used in Software Engineering Journal Papers?
Table 5 reports the EAMs used in software engineering journal papers. According to Table 5 the most used EAMs are AveragePopt and PofB20. Table 6 reports the number of EAMs used in past studies. According to Table 6 the majority of the studies used no EAM and hence did not validate the models according to their impact on effort. Moreover, the majority of studies using EAMs used a single EAM (i.e., 12 out of 20). Table 7 reports the number of studies correctly or incorrectly naming EAMs according to their original definitions (Chen et al. 2017; Mende and Koschke 2009). According to Table 7 seven out of 20 studies incorrectly named EAMs.

RQ2: Does the Normalization Improve PofBs?
Figure 1 reports ten PofBs, and their normalization, of 10 classifiers over the 72 datasets grouped in 14 projects. Figure 2 reports the gain achieved by normalizing a specific PofB metric. According to Fig. 2 the normalization increases the performance of the median classifier of all PofBs in all 14 projects. Table 8 reports the average gain, across datasets and classifiers, in normalizing a specific EAM. According to Table 8 the relative gain doubles when decreasing the PofB metric; e.g., the relative gain in PofB10 is double that in PofB20, which is double that in PofB30. Table 8 reports the statistical test results comparing a PofB before and after the normalization. According to Table 8, the normalization significantly improves all ten PofB metrics. Therefore, we can reject H10 for all ten EAMs. Moreover, the effect size resulted as large for all ten EAMs.

RQ3: Does the Ranking of Classifiers Change by Normalizing PofBs?
Table 9 reports the correlation between the same PofB before and after the normalization. According to Table 9 the correlation is only fair between the same PofB before and after the normalization in nine out of 10 PofBs. Figure 3 reports the proportion of times the same PofB metric, with and without normalization, identifies the same classifier as best. According to Fig. 3, this proportion changes across projects and PofBs. Specifically, in half of the datasets, the best classifier changes in all PofBs after the normalization. Table 10 summarizes Fig. 3 by reporting, on average per specific PofB, the proportion of times the same PofB metric, with and without normalization, identifies the same classifier as best. According to Table 10, in all ten PofBs this proportion is less than half. Thus, the best classifier is more likely to change than to coincide when considering PofB after the normalization.
Figure 4 reports the distribution of the correlations between each pair of NPofBs. Table 11 reports the frequency of interpretations of correlation values among pairs of NPofBs. According to Table 11, no pair of NPofBs is perfectly correlated. Moreover, only 30 out of 45 pairs of NPofBs are very strongly correlated. In conclusion, Table 11 shows that the rankings of classifiers are far from identical across different NPofBs. Figure 5 reports the distribution of the proportion of times the best classifier for a dataset coincides across NPofBs. According to Fig. 5, the best classifier coincides across the ten NPofBs in only five out of 41 datasets. Thus, in about 88% of the cases, the best classifier varies across NPofBs.

Discussion
This section discusses our main results, the possible explanations for the results, implications, and guidelines for practitioners and researchers.

RQ1: Which EAMs are Used in Software Engineering Journal Papers?
The main result of RQ1 is that EAMs are used in a minority of defect prediction studies, i.e., 20 out of 152 software engineering journal papers. One possible reason is that EAMs do not make much sense in just-in-time (JIT) prediction studies, i.e., in studies predicting the defectiveness of commits. As a matter of fact, in the JIT context, the user is envisioned to consider the defectiveness prediction just after each single commit, and hence a JIT classifier cannot help the user in ranking the possibly defective entities (i.e., commits). However, we observed many JIT studies using EAMs and many non-JIT studies, i.e., studies predicting the defectiveness of classes or methods, not using EAMs. One possible reason for the low EAMs usage in non-JIT studies is the absence of a tool for EAMs computation. A further possible reason for the low EAMs adoption could be the lack of awareness about the importance of EAMs to evaluate the realistic benefits of using prediction models.
Fig. 3 Proportion of times the same PofB metric, with and without normalization, identifies the same classifier as best
Another important result of RQ1 is that some EAMs are misnamed and that only one study used more than one EAM. Again, one possible reason for this result is the absence of a tool for EAMs computation.
The main implication of RQ1 is that a tool to automate EAMs computation would have supported a broader and more correct use of EAMs.

RQ2: Does the Normalization Improve PofBs?
The main result of RQ2 is that the proposed normalization increases PofBs with statistical significance and by orders of magnitude. While the improvement is statistically significant for all 10 PofBs, the normalization increased the different PofBs to different extents. Specifically, the relative gain provided by the normalization is perfectly inversely correlated with the percentage of analyzed LOC in the specific PofB; the highest gain was observed in PofB10. This result can be explained by the fact that ranking quality matters less as the number of analyzed entities grows. For instance, when considering PofB90, many rankings are equally beneficial to the user as long as they leave the same 10% of LOC unanalyzed. Thus, a better ranking is more visible in PofB10 than in PofB90. We also note that the relative gain was higher in some datasets, e.g., xerces, than in others, e.g., ar. The most probable reason is that the room for relative gain is large when the accuracy without the normalization is low. Specifically, the PofBs in ar are much higher than in xerces. In other words, the normalization has less room to improve the PofBs in ar since they are already high.
The main implication of RQ2 is that we need to use the normalized PofBs (aka NPofBs) rather than PofBs. The NPofBs are more realistic than PofBs as they relate to a better use of classifiers. Moreover, studies reporting PofBs, rather than their normalization, underestimate the benefits of using classifiers for ranking defective classes and hence might have hindered the practical adoption of defect classifiers.
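The difference between a PofB and its normalization can be illustrated with a minimal Python sketch. It assumes that PofBx is the proportion of defective entities found while inspecting the first x% of LOC, and that an entity counts only if it fits entirely within that budget; the function name, the toy data, and the stopping rule are illustrative assumptions, not the ACUME implementation.

```python
from typing import List, Tuple

# Each entity is (size_in_loc, predicted_probability, is_defective).
# Sizes are assumed to be strictly positive.
Entity = Tuple[int, float, bool]


def pofb(entities: List[Entity], percent: float, normalized: bool = False) -> float:
    """Proportion of defective entities found within the first `percent`% of LOC.

    normalized=False ranks by predicted probability (PofB);
    normalized=True ranks by predicted probability divided by size (NPofB).
    """
    def key(entity: Entity) -> float:
        size, prob, _ = entity
        return prob / size if normalized else prob

    ranked = sorted(entities, key=key, reverse=True)
    budget = sum(size for size, _, _ in entities) * percent / 100.0
    total_defective = sum(1 for _, _, defective in entities if defective)
    if total_defective == 0:
        return 0.0

    inspected_loc = 0
    found = 0
    for size, _, defective in ranked:
        # Assumption: stop at the first entity that exceeds the LOC budget.
        if inspected_loc + size > budget:
            break
        inspected_loc += size
        found += defective
    return found / total_defective


# Hypothetical toy data: one large and two small defective entities.
toy_entities = [
    (100, 0.9, True),
    (700, 0.8, True),
    (100, 0.7, True),
    (100, 0.2, False),
]
plain = pofb(toy_entities, 20)                        # ranks by probability
density = pofb(toy_entities, 20, normalized=True)     # ranks by probability / size
relative_gain = (density - plain) / plain
```

In this toy case the large entity ranked second by probability exhausts the 20% budget, so the plain ranking finds one of three defective entities while the density ranking finds two, i.e., a relative gain of 100%.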

RQ3: Does the Ranking of Classifiers Change by Normalizing PofBs?
The main result of RQ3 is that the normalization changes the ranking of classifiers. Specifically, the correlation is only fair between the same PofB before and after the normalization in nine out of 10 PofBs. Moreover, in more than half of the cases, a classifier resulting as best with a PofB is not best with its normalization. The main implication of RQ3 is that past studies using the non-normalized version of PofB likely highlighted a classifier as best even though another classifier brings the highest benefit to the user in ranking defective classes. Hence, RQ3 results call for replication of past studies using the non-normalized version of PofBs.
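As an illustration of this kind of comparison, the following Python sketch computes a rank correlation between the scores of the same classifiers before and after normalization and checks whether the best classifier coincides. Spearman's rho is used here as an assumption, since this section does not name the correlation coefficient; the classifier names and scores are made up.

```python
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from 1 (smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def best(scores: Dict[str, float]) -> str:
    """Name of the classifier with the highest score (first one on ties)."""
    return max(scores, key=scores.get)


# Hypothetical scores of four classifiers on one dataset.
pofb20 = {"RandomForest": 0.30, "NaiveBayes": 0.25, "Logistic": 0.20, "KNN": 0.10}
npofb20 = {"RandomForest": 0.55, "NaiveBayes": 0.70, "Logistic": 0.40, "KNN": 0.60}

names = list(pofb20)
rho = spearman([pofb20[n] for n in names], [npofb20[n] for n in names])
same_best = best(pofb20) == best(npofb20)
```

With these made-up numbers the two rankings are uncorrelated and the best classifier differs, which is the situation that would lead a study to recommend a different classifier after normalization.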

RQ4: Does the Ranking of Classifiers Change Across Normalized PofBs?
The main result of RQ4 is that no pair of NPofBs is perfectly correlated. Moreover, in 88% of the cases, the best classifier varies across the ten considered NPofBs. Thus, the main implication of RQ4 is that a single EAM does not exhaustively capture classifiers' ability to rank defective candidate classes. Therefore, researchers shall use multiple EAMs to support a comprehensive understanding of classifiers' accuracy. Past studies that validated classifiers via a single EAM shall be replicated to increase the generalizability of their results.
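The 88% figure follows from the five out of 41 datasets in which the best classifier coincides (5/41 is about 12%). A minimal Python sketch of that kind of count is shown below; the nested dictionary layout, the dataset names, and the classifier names are hypothetical.

```python
from typing import Dict

# scores[dataset][metric][classifier] -> metric value (hypothetical layout).
Scores = Dict[str, Dict[str, Dict[str, float]]]


def best_coincides(per_metric: Dict[str, Dict[str, float]]) -> bool:
    """True if every metric picks the same best classifier for one dataset.

    Ties within a metric are broken by dictionary order, which is an assumption.
    """
    winners = {max(s, key=s.get) for s in per_metric.values()}
    return len(winners) == 1


def share_of_stable_datasets(scores: Scores) -> float:
    """Fraction of datasets whose best classifier is the same for all metrics."""
    stable = sum(best_coincides(m) for m in scores.values())
    return stable / len(scores)


toy_scores: Scores = {
    "dataset_a": {
        "NPofB10": {"clf_x": 0.4, "clf_y": 0.2},
        "NPofB20": {"clf_x": 0.6, "clf_y": 0.5},
    },
    "dataset_b": {
        "NPofB10": {"clf_x": 0.4, "clf_y": 0.2},
        "NPofB20": {"clf_x": 0.5, "clf_y": 0.7},
    },
}
stable_share = share_of_stable_datasets(toy_scores)
```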

The ACUME Tool
In this paper we provide a new tool called ACUME (ACcUracy MEtrics) which can compute PofB, the new proposed NPofB, Popt, IFA, and the performance metrics reported in Section 2.
In order to run ACUME, you need to:
- Download the project files from GitHub;
- Place the csv files of the whole dataset in the data folder and, if needed, place your test files in the test folder;
- Update the configs.py file to your needs by following the instructions in the ReadMe.md file.
In order to run the code, you need Python installed; more details are presented in the Readme file. ACUME has been developed following clean code principles, the KISS principle, and functional/OOP⁴ programming. The script has linear complexity and minimal repetition of calculation, which makes it fast and light on memory at the cost of some readability. Within approximately 1000 lines of code, there are two data classes, DataEntity (no functions) and ProcessedDataEntity (10 functions/methods), six stand-alone models (not class-dependent), as well as many helper functions in each class. Figure 6 reports the steps of the process on which ACUME works:
Step 2: The research team provides the datasets to a ML engine, such as Weka. The ML engine applies one or more defect prediction models that vary in classifiers, feature selection, balancing, etc.
Step 3: The ML engine outputs the predicted file. The predicted file is designed to be as simple as possible; for each predicted entity (e.g., a class), the predicted file reports the ID, the size, the predicted probability to be defective, and the actual defectiveness. The different predicted files are identified simply with a name, e.g., KeymindA RandomForest withFeatureSelection.csv, ant RandomForest withFeatureSelection.csv, etc.
Step 4: The research team provides the predicted files to the ACUME tool.
Step 5: The ACUME tool outputs a single CSV file reporting, for each row, the performances of a predicted file in terms of accuracy metrics and EAMs.
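Steps 3 to 5 can be sketched in Python as follows. This is only an illustration of the data flow, not ACUME's code: the column names (id, size, probability, actual), the file name, the 0.5 threshold, and the choice of precision and recall as stand-ins for the reported metrics are all assumptions.

```python
import csv
import io
from typing import Dict, List


def read_predicted(text: str) -> List[dict]:
    """Parse one predicted file (hypothetical columns: id,size,probability,actual)."""
    rows = []
    for r in csv.DictReader(io.StringIO(text)):
        rows.append({
            "id": r["id"],
            "size": int(r["size"]),
            "probability": float(r["probability"]),
            "actual": r["actual"].strip().lower() == "true",
        })
    return rows


def summarize(rows: List[dict], threshold: float = 0.5) -> Dict[str, float]:
    """Compute two threshold-based accuracy metrics for one predicted file."""
    tp = sum(r["probability"] >= threshold and r["actual"] for r in rows)
    fp = sum(r["probability"] >= threshold and not r["actual"] for r in rows)
    fn = sum(r["probability"] < threshold and r["actual"] for r in rows)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


def write_summary(files: Dict[str, str]) -> str:
    """Return CSV text with one row per predicted file."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["file", "precision", "recall"], lineterminator="\n"
    )
    writer.writeheader()
    for name, text in files.items():
        writer.writerow({"file": name, **summarize(read_predicted(text))})
    return out.getvalue()


# Hypothetical predicted file with three entities.
example = (
    "id,size,probability,actual\n"
    "A.java,120,0.9,true\n"
    "B.java,300,0.6,false\n"
    "C.java,80,0.2,true\n"
)
summary_csv = write_summary({"demo_RandomForest.csv": example})
```

Keeping the size column in the predicted file is what makes effort-aware metrics computable afterwards, since they need the LOC of each ranked entity in addition to its predicted probability and actual label.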
We provide online material about ACUME.¹ ACUME has been used and validated by several researchers and students at the University of Rome Tor Vergata. We tested ACUME using a set of unit tests that compare the accuracy metrics computed by ACUME with expected values. To compute the expected values, we used a mixed approach depending on the metric under test. Specifically, for metrics available in WEKA, such as AUC, we computed the expected values via WEKA on the breast-cancer project, as natively provided in WEKA. For metrics not available in WEKA, such as EAMs, we computed the expected values via Excel formulas on the ten projects used to address the research questions in this paper. The validation was led by the first author and double-checked by the last author. During the validation process, we fixed a few bugs related to the measurement of the AUC metric. The validation folder in the replication package reports the validation artifacts.
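A unit test of this kind compares a computed metric with an independently derived expected value. The Python sketch below does so for AUC on a four-entity toy example whose expected value can be checked by hand; the function names are hypothetical, and the pairwise AUC implementation is a simple quadratic-time reference, not the one in ACUME.

```python
from typing import List


def auc(probabilities: List[float], labels: List[bool]) -> float:
    """AUC as the probability that a random defective entity is scored above
    a random non-defective one (ties count one half)."""
    pos = [p for p, is_defective in zip(probabilities, labels) if is_defective]
    neg = [p for p, is_defective in zip(probabilities, labels) if not is_defective]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def test_auc_matches_hand_computed_value() -> None:
    probs = [0.9, 0.8, 0.4, 0.3]
    labels = [True, False, True, False]
    # Pairs (defective, clean): (0.9, 0.8) win, (0.9, 0.3) win,
    # (0.4, 0.8) loss, (0.4, 0.3) win, so the expected AUC is 3/4.
    assert abs(auc(probs, labels) - 0.75) < 1e-12


test_auc_matches_hand_computed_value()
```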
Finally, since its code is open, we welcome bug reports and feature requests.

Threats to Validity
In this section, we report the threats to validity of our study. The section is organized by threat type, i.e., Conclusion, Internal, Construct, and External.

Conclusion
Conclusion validity concerns issues that affect the ability to draw accurate conclusions regarding the observed relationships between the independent and dependent variables (Wohlin et al. 2012). We tested all hypotheses with a non-parametric test (e.g., Wilcoxon Signed Rank) (Wilcoxon 1945) which is prone to type-2 error, i.e., not rejecting a false hypothesis. We have rejected the hypotheses in all cases; thus, the likelihood of a type-2 error is null. Moreover, the alternative would have been using parametric tests (e.g., ANOVA) that are prone to type-1 error, i.e., rejecting a true hypothesis, which is less desirable than type-2 error in our context.
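For readers unfamiliar with the Wilcoxon signed-rank test, the Python sketch below computes an exact one-sided p-value for paired observations, e.g., the same PofB before and after normalization on several datasets. It is a self-contained illustration under stated assumptions (zero differences are dropped, ties receive average ranks, and the alternative is that values increase); it is not the script used in this study, and the input values are made up.

```python
from typing import List, Tuple


def signed_rank_statistic(before: List[float], after: List[float]) -> Tuple[List[int], int]:
    """Return the doubled ranks of the non-zero |differences| and the doubled W+.

    Ranks are doubled so that average ranks of ties stay integers.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    doubled = [0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            doubled[order[k]] = i + j + 2  # twice the average rank (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(doubled, diffs) if d > 0)
    return doubled, w_plus


def one_sided_p_value(before: List[float], after: List[float]) -> float:
    """Exact P(W+ >= observed) under H0, where each sign is equally likely."""
    doubled, w_plus = signed_rank_statistic(before, after)
    counts = {0: 1}  # counts[s] = number of sign assignments with doubled W+ == s
    for r in doubled:
        nxt = dict(counts)  # rank r gets a negative sign: sum unchanged
        for s, c in counts.items():
            nxt[s + r] = nxt.get(s + r, 0) + c  # rank r gets a positive sign
        counts = nxt
    total = 2 ** len(doubled)
    return sum(c for s, c in counts.items() if s >= w_plus) / total


# Hypothetical paired values: every dataset improves after normalization.
before = [0.10, 0.12, 0.08, 0.15, 0.11]
after = [0.40, 0.35, 0.30, 0.50, 0.45]
p = one_sided_p_value(before, after)
```

With five pairs that all improve, only one of the 32 equally likely sign assignments is as extreme as the observed one, so the exact one-sided p-value is 1/32.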

Internal
Internal validity concerns the influences that can affect the independent variables concerning causality (Wohlin et al. 2012). A threat to internal validity is the lack of ground truth for class defectiveness, which could have been underestimated in our measurements. To avoid this threat, we used a set of projects already successfully used in our recent study (Falessi et al. 2020). Such datasets have been derived from a set of publicly available datasets which originated long before many issues were known, including mislabeling (Bird et al. 2009;Herzig et al. 2013;Kim et al. 2011;Rahman et al. 2013;Tantithamthavorn et al. 2015), snoring (Falessi et al. 2022), and wrong origin (Rodríguez-Pérez et al. 2018b). Thus, despite being publicly available and largely used, our datasets might be inaccurate.

Construct
Construct validity is concerned with the degree to which our measurements indeed reflect what we claim to measure (Wohlin et al. 2012).
In order to make our empirical investigation reliable, we used the walk-forward technique as suggested in our recent study (Falessi et al. 2020). It could be that our results are impacted by our specific design choices, including classifiers, features, and accuracy metrics. In order to face this threat, we based our choice on past studies.
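The walk-forward technique can be sketched in Python as follows, assuming its common formulation in which data are ordered in time, each part is used once as the test set, and only earlier parts are used for training. The function name and release labels are hypothetical; the exact setup used in this study is described in Falessi et al. (2020).

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def walk_forward_splits(releases: Sequence[T]) -> List[Tuple[List[T], T]]:
    """Train on all releases before position i, test on release i (time-ordered)."""
    return [(list(releases[:i]), releases[i]) for i in range(1, len(releases))]


splits = walk_forward_splits(["r1", "r2", "r3", "r4"])
```

Because the training set never contains data that follows the test set in time, this design avoids evaluating a classifier on information that would not have been available when the prediction was made.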
Although many studies suggest tuning hyperparameters (Fu et al. 2016;Tantithamthavorn et al. 2019), we used default hyperparameters due to resource constraints and to the static time-ordering design of our evaluation. Moreover, tuning might be relevant in studies aiming at improving the accuracy of classifiers, whereas in this study tuning might be irrelevant as we aim at measuring the accuracy of classifiers, regardless of their tuning status. Finally, as our paper suggests, we plan to validate hyperparameter tuning via multiple and normalized PofBs.
Finally, the labeling of entities as defective or not has been subject to significant effort, and we still do not know how to perfectly label entities (Vandehei et al. 2021). To avoid this type of noise in the data, we used data coming from the literature and largely used in the past (Falessi et al. 2020).

External
External validity is concerned with the extent to which the research elements (subjects, artifacts, etc.) are representative of actual elements (Wohlin et al. 2012).
This study used a large set of datasets and hence could be deemed highly generalizable compared to similar studies. Moreover, in this study, we used both open-source and industrial projects.
Finally, to promote reproducible research, all datasets, results and scripts for this paper are available online.¹

Conclusion
Despite the importance of EAMs, there is no study investigating EAMs usage trends and validity. Therefore, in this paper, we analyze trends in EAMs usage in the software engineering literature. Our systematic mapping study found 152 primary studies (referenced in the appendix, Section A.1) in major software engineering journals, and it shows that most studies using EAMs use only a single EAM (e.g., Popt or PofB20) and that some studies mismatched EAMs names.
To improve the internal validity of results provided by PofBs, we proposed a normalization of PofBs. The normalization is based on ranking the defective candidates by their probability of being defective divided by their size. We validated the normalization of PofBs by analyzing 10 PofBs, 10 classifiers, two industrial projects and 12 open-source projects. Our results show that the normalization increases PofBs with statistical significance and by orders of magnitude. Thus, past studies reporting PofBs underestimate the benefits of using classifiers for ranking defective classes and might have hindered the practical adoption of prediction models in the industry. The proposed normalization increases the realism of PofBs values as it relates to a better use of classifiers, and it promotes the practical adoption of prediction models in the industry.
We showed that when considering the same dataset, in most of the cases, the best classifier for a PofB changes when considering the normalized version of that PofB. Thus, past studies that used the non-normalized version of PofBs likely identified the wrong best classifier, i.e., past studies likely identified as best a classifier not providing the highest benefit to the user in ranking defective classes.
In this paper, we also showed that multiple PofBs are needed to support a comprehensive understanding of classifiers' accuracy. Thus, we provide a tool to compute EAMs automatically; this aims at supporting researchers in: 1) avoiding extra effort in EAMs computation, as there is no available tool to compute EAMs, 2) increasing results reproducibility, as the way to compute EAMs is shared across researchers, 3) increasing results validity by avoiding the observed EAMs misnaming, and 4) increasing results generalizability by avoiding the use of a single EAM.
Researchers and practitioners involved in validating defect prediction models shall always consider using several EAMs. Researchers shall try to propose and evaluate EAMs, or new normalizations, that more realistically measure the benefits provided by classifiers when ranking candidate defective entities.
In the future, we plan to replicate the validation of past defect prediction studies that considered a single EAM by including multiple EAMs. Hence, we want to check if the observed best classifier varies when considering multiple EAMs. We also plan to validate past studies by using the proposed normalization of PofBs rather than the original PofBs. Thus, we want to check if the observed best classifier varies when considering the proposed normalization. Finally, we plan to investigate the opinions of developers on EAMs and specifically on the validated new NPofB metric.