On the assessment of software defect prediction models via ROC curves

Software defect prediction models are classifiers often built by setting a threshold t on a defect proneness model, i.e., a scoring function. For instance, they classify a software module non-faulty if its defect proneness is below t and faulty otherwise. Different values of t may lead to different defect prediction models, possibly with very different performance levels. Receiver Operating Characteristic (ROC) curves provide an overall assessment of a defect proneness model, by taking into account all possible values of t and thus all defect prediction models that can be built based on it. However, using a defect proneness model with a value of t is sensible only if the resulting defect prediction model has a performance that is at least as good as some minimal performance level that depends on practitioners’ and researchers’ goals and needs. We introduce a new approach and a new performance metric (the Ratio of Relevant Areas) for assessing a defect proneness model by taking into account only the parts of a ROC curve corresponding to values of t for which the resulting defect prediction models have higher performance than some reference value. We provide the practical motivations and theoretical underpinnings for our approach, by: 1) showing how it addresses the shortcomings of existing performance metrics like the Area Under the Curve and Gini’s coefficient; 2) deriving reference values based on random defect prediction policies, in addition to deterministic ones; 3) showing how the approach works with several performance metrics (e.g., Precision and Recall) and their combinations; 4) studying misclassification costs and providing a general upper bound for the cost related to the use of any defect proneness model; 5) showing the relationships between misclassification costs and performance metrics.
We also carried out a comprehensive empirical study on real-life data from the SEACRAFT repository, to show how our metric differs from the existing ones and how much more reliable and less misleading it can be.


Introduction
Accurate estimation of which modules are faulty in a software system can be very useful to software practitioners and researchers. Practitioners can efficiently allocate scarce resources if they can predict which modules may need to undergo more extensive Verification and Validation than others. Researchers need to use quantitative, accurate module defect prediction techniques so they can assess and subsequently improve software development methods. In this paper, by the term "module," we denote any piece of software (e.g., routine, method, class, package, subsystem, system).
Several techniques have been proposed and applied in the literature for estimating whether a module is faulty (Beecham et al. 2010a; Hall et al. 2012; Malhotra 2015; Radjenović et al. 2013). We focus on those techniques that define defect prediction models (i.e., binary classifiers (Fawcett 2006)) by setting a threshold t on a defect proneness model (Huang et al. 2019), i.e., a scoring classifier that uses a set of independent variables. For instance, if the defect proneness model computes the probability that a module is faulty, a defect prediction model estimates a module to be faulty if its probability of being faulty is above or equal to t. The issue of defining the value of t has been addressed by several approaches in the literature (for instance, Alves et al. (2010), Erni and Lewerentz (1996), Morasca and Lavazza (2017), Schneidewind (2001), Shatnawi (2010), and Tosun and Bener (2009)).
The selection of t may greatly influence the estimates and the performance of the resulting defect prediction model. Thus, to evaluate a defect proneness model, one should evaluate the performance of the entire set of defect prediction models obtained with all possible values of t. Receiver Operating Characteristic (ROC) curves (see Section 4) have long been used to this end.
However, it is unlikely that all possible threshold values are used in practice. Suppose you have a defect proneness model for critical applications. It is unlikely that any sensible stakeholder selects a value of t corresponding to a high risk (e.g., to a 0.8 probability) of releasing a faulty module. Also, practitioners may not be able to confidently choose a "sharp" t value, corresponding to an exact probability value like 0.1. Instead, they may have enough information to know that the value of t should correspond to a risk level around 0.1, e.g., between 0.05 and 0.15. So, the evaluation should be restricted only to those defect prediction models that may be really used, depending on the goals and needs of individual practitioners.
In addition, some values of t are not useful because of two general reasons that do not depend on any specific practitioners' or researchers' goals and needs, hence they hold for every evaluation of defect prediction models.
First, suppose that a defect prediction model based on a set of independent variables and built with a given t does not perform better than a basic, reference defect prediction model that does not use any information from independent variables. Then, the defect prediction model should not be used, because one would be better off by using the simpler reference model. In other words, t is not an adequate threshold value for the defect proneness model. Second, the prediction models obtained as t varies have different performance, but one should use in practice only those that perform well enough, based on some definition of performance and its minimum acceptable level.
So, in general, it may be ineffective and even misleading to evaluate the defect prediction models built with all possible values of t.
The goal of this paper is to propose an approach to assessing a given defect proneness model. We show how to use ROC curves and reference models to identify the defect prediction models that are worth using because they perform well enough for practical use and outperform reference ones according to (1) standard performance metrics and (2) cost. Thus, we identify the values of t for which it is worthwhile to build and use defect prediction models. Our empirical validation shows the extent of the differences in the assessment of defect prediction models between our method and the traditional one.
Our approach helps practitioners compare defect prediction models and select those useful for their goals and needs. It allows researchers to assess software development techniques based only on those defect prediction models that should be used in practice, and not on irrelevant ones that may bias the results and lead to misleading conclusions.
Here are the main contributions of our proposal.
-We introduce a new performance metric, the Ratio of Relevant Areas (RRA). RRA can take into account only those parts of the ROC curve corresponding to thresholds for which it is worthwhile to build defect prediction models, i.e., the defect prediction models perform well enough according to some specified notion of performance. We also show how RRA can be customized to address the specific needs of different practitioners.
-We show that the Area Under the Curve (AUC), Gini's coefficient (G) (Gini 1912), and other proposals are special cases of RRA, which, however, account for parts of the ROC curve corresponding to thresholds for which it is not worthwhile to build defect prediction models.
-We show how cost can be taken into account. We also provide an inequality that should be satisfied by all defect prediction models, regardless of the way they are built and of the specific misclassification costs.
-We show that choosing a performance metric (like Precision, Recall, etc.) for the assessment of defect prediction models is not simply a theoretical decision, but it equates to choosing a specific cost model.
We would like to clarify upfront that, in this paper, we are not interested in building the best performing models possible. Metrics like AUC, G, and RRA are used to assess existing, given models. We build models simply because we need them to demonstrate how our proposal works. To this end, we use 67 datasets from the SEACRAFT repository (https://zenodo.org/communities/seacraft). These datasets contain data about a number of measures and the faultiness of the modules belonging to real-life projects. At any rate, in the empirical validation illustrated in Section 10, we build defect proneness models that use all of the available measures so as to maximize the use of the available information about the modules' characteristics and possibly model performance as well.
The remainder of this paper is organized as follows. Section 2 recalls basic concepts of Software Defect Prediction along with the performance metrics that we use. Section 3 introduces the reference policies against which defect prediction models are evaluated. Sections 4 and 5 summarize the fundamental concepts underlying ROC curves and a few relevant issues. We show how to delimit the values of t that should be in general taken into account and we introduce RRA in Section 6. We show how RRA can be used based on several performance metrics in Section 7 and based on cost in Section 8. Section 9 compares RRA to AUC and G. We empirically demonstrate our approach on datasets from real-life applications in Section 10 and also highlight the insights that RRA can provide and traditional metrics cannot. Section 11 illustrates the usefulness of our approach in Software Engineering practice. Threats to the validity of the empirical study are discussed in Section 12. Section 13 discusses related work. The conclusions and an outline for future research are in Section 14. Appendices A-F contain further details on some mathematical aspects of the paper.

Software Defect Prediction
The investigation of software defect prediction is carried out by first learning a defect prediction model (i.e., a binary classifier (Fawcett 2006)) on a set of data called the training set, and then evaluating its performance on a set of previously unseen data, called the test set. By borrowing from other disciplines, we use the labels "positive" for "faulty module" and "negative" for "non-faulty module". We denote by $z = z_1, z_2, \ldots$ the set of independent variables (i.e., features) used by a defect prediction model. Also, m will denote a module and we write 'm' for short, instead of writing "a module" or "a module m". We use defect prediction models fn(z,t) built by setting a threshold t on a defect proneness model fp(z), so that fn(z,t) = positive ⇔ t ≤ fp(z) and fn(z,t) = negative ⇔ fp(z) < t. For instance, we can use a Binary Logistic Regression (BLR) (Hardin and M 2002; Hosmer et al. 2013) model, which gives the probability that m is positive, to build a defect prediction model by setting a probability threshold t. Different values of t may lead to different classifiers, with different performance.
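The thresholding rule fn(z,t) = positive ⇔ t ≤ fp(z) can be sketched in a few lines. This is only an illustration: the function name `predict` and the score values are hypothetical, and the scores stand in for the output of any defect proneness model fp(z).

```python
from typing import List

def predict(scores: List[float], t: float) -> List[str]:
    """Label a module 'positive' iff its defect proneness is >= t."""
    return ["positive" if s >= t else "negative" for s in scores]

# Hypothetical defect proneness values fp(z) for five modules.
scores = [0.05, 0.20, 0.50, 0.50, 0.90]

# The same scores yield different classifiers for different thresholds.
labels_low = predict(scores, 0.2)
labels_high = predict(scores, 0.6)
```

With t = 0.2 four modules are estimated positive, while with t = 0.6 only one is, which is why the choice of t matters for the resulting performance.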
The performance of a defect prediction model can be assessed based on a confusion matrix, whose schema is shown in Table 1. Table 2 shows the performance metrics we use in this paper. They include some of the most used ones and provide a comprehensive idea about the performance of a defect prediction model. The first three columns of Table 2 provide the name, the definition formula, and a concise explanation of the purpose of a metric. The other two columns are explained in Section 3.1.
Specifically, Precision, Recall, and FM (i.e., the F-measure or F-score (van Rijsbergen 1979)) assess performance with respect to the positive class, while Negative Predictive Value (NPV), Specificity, and a new metric that we call "Negative-F-measure" (NM), which mirrors FM, assess performance with respect to the negative class. Youden's J (Youden 1950) (also known as "Informedness"), Markedness (Powers 2012), and φ (also known as Matthews Correlation Coefficient (Matthews 1975)), which is the geometric mean of J and Markedness, are overall performance metrics.
Other metrics can be used as well. At any rate, using different metrics does not affect the way our approach works.
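As a minimal sketch, the metrics named above can be computed from the four cells of a confusion matrix as follows. Table 2 is not reproduced here, so the formulas are the standard textbook definitions, with NM assumed to be the harmonic mean of NPV and Specificity (mirroring FM). The function name `metrics` and the sample counts are hypothetical, and all denominators are assumed to be nonzero.

```python
import math

def metrics(tp: float, fp: float, fn: float, tn: float) -> dict:
    """Standard confusion-matrix metrics (assumes nonzero denominators)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    npv = tn / (tn + fn)
    specificity = tn / (tn + fp)
    return {
        "Precision": precision,
        "Recall": recall,
        "FM": 2 * precision * recall / (precision + recall),
        "NPV": npv,
        "Specificity": specificity,
        # Assumed definition: NM mirrors FM on the negative class.
        "NM": 2 * npv * specificity / (npv + specificity),
        "J": recall + specificity - 1,
        "Markedness": precision + npv - 1,
        "phi": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Hypothetical confusion matrix for a test set of 200 modules.
m = metrics(tp=30, fp=20, fn=10, tn=140)
```

For this example, φ coincides with the geometric mean of J and Markedness, as stated in the text.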

Defining Reference Performance Values
The performance of the application of a defect prediction model on a dataset can be assessed by comparing the values obtained for one or more metrics against specified reference values, which set minimal performance standards. Classifiers not meeting these minimal standards should be discarded. We use two methods to set reference values of performance metrics for defect prediction models: (1) methods based on random policies (see Section 3.1) and (2) deterministic ones (see Section 3.2). In both cases, our goal is to define reference values whose computation does not use any information about the individual modules.

Methods Based on Random Software Defect Prediction Policies
A random software defect prediction policy associates module m with a probability p(m) that m is estimated positive. Thus, a random policy does not define a single defect prediction model, which deterministically labels each m. Rather, we have a set of defect prediction models, each of which is the result of "tossing a set of coins," each with probability p(m) for each m.
The performance of a random policy can be evaluated based on the values of TP, TN, FP, and FN that can be expected, i.e., via an expected confusion matrix in which each cell contains the expected value of the metric in the cell. These expected confusion matrices have already been introduced in Morasca (2014) for estimating the number of faults in a set of modules. The same metrics of Table 2 can be defined for random policies, by using the cells of the expected confusion matrix, e.g., Precision for a random policy is computed as $\frac{E[TP]}{E[EP]}$, where $E[TP]$ and $E[EP]$ are the expected values of TP and EP, respectively.
Our goal is to define random policies that use no information about m. Lacking any information about the specific characteristics of m, all modules must be treated alike. Thus, each m must be given the same probability p to be estimated positive, i.e., a uniform random policy must be used.
We use uniform random policies as references (see Sections 6 and 7) against which to evaluate the performance of defect prediction models. A classifier that performs worse than what can be expected of a uniform random policy should not be used. Our proposal for reference policies is along the lines of Langdon et al. (2016), Lavazza and Morasca (2017), and Shepperd and MacDonell (2012), where a random approach is used for defining reference effort estimation models. Also, completely randomized classifiers were used as reference models for defect prediction in the empirical studies of Herbold et al. (2018) and in the application examples of Khoshgoftaar et al. (2001), though the performance of random classifiers was not taken into account in the definition of any performance metric. In what follows, we use "uni" to denote the uniform random policy for some specified value of p.
The expected confusion matrix for random policies depends on p. We use the value of p that sets the hardest challenge for models using variables, i.e., its Maximum Likelihood estimate $p = \frac{AP}{n}$. This special case of uni, which we call "pop" (as in "proportion of positives"), is one of the reference policies used in the empirical study of Section 10. Table 3 shows the values of the cells of an expected confusion matrix for uni and pop policies.
In what follows, we write $k = \frac{AN}{AP}$ (following Flach 2003) to summarize properties of the underlying dataset. For instance, instead of $\frac{AP}{n}$, we write $\frac{1}{1+k}$. Note that $\varphi_{uni} = \varphi_{pop} = 0$ (and $J_{uni} = J_{pop} = \text{Markedness}_{uni} = \text{Markedness}_{pop} = 0$), so uni and pop are never associated with the defectiveness of modules in any dataset. Thus, uni, and especially pop, can be used as reference policies in the evaluation of defect prediction models.
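The following sketch illustrates the expected confusion matrix of a uniform random policy. Table 3 is not reproduced here, so the cell values are an assumption derived from the definition of the policy: if every module is estimated positive with the same probability p, a fraction p of the actual positives and a fraction p of the actual negatives is expected to be estimated positive. The function name and the dataset sizes are hypothetical.

```python
def expected_confusion(ap: float, an: float, p: float):
    """Expected (TP, FP, FN, TN) when each module is estimated
    positive with the same probability p (uniform random policy)."""
    return p * ap, p * an, (1 - p) * ap, (1 - p) * an

# Hypothetical dataset: 40 actual positives, 160 actual negatives.
ap, an = 40, 160
k = an / ap                 # k = AN/AP
p_pop = ap / (ap + an)      # the "pop" policy uses p = AP/n = 1/(1+k)

tp, fp, fn, tn = expected_confusion(ap, an, p_pop)
recall_pop = tp / (tp + fn)
precision_pop = tp / (tp + fp)
j_pop = recall_pop - fp / (fp + tn)   # Youden's J = Recall - Fall-out
```

Under these assumptions, both Recall and Precision of pop equal 1/(1+k), and J is zero, in line with the text.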

Deterministic Methods
We set a deterministic reference value for φ, which is the best known of the three overall performance metrics of Table 2, i.e., J, Markedness, and φ. The reference value of φ for the uni and pop policies is zero, which sets too low a standard against which to compare the φ values of defect prediction models. Any model that provides a positive association between a defect prediction model and actual labels of modules, no matter how small, would be considered better than the standard. In our empirical study, we select φ = 0.4 as a reference value for φ for medium/strong association, as it is halfway between the medium (φ = 0.3) and strong (φ = 0.5) values indicated in Cohen (1988). We do not select deterministic reference values for the first six metrics in Table 2 because we can already set reference values based on the pop policy.

ROC: Basic Definitions and Properties
A Receiver Operating Characteristic (ROC) curve (Fawcett 2006) is a graphical plot that illustrates the diagnostic ability of a binary classifier fn(z,t) with a scoring function fp(z) as its discrimination threshold t varies.
A ROC curve, which we denote as a function ROC(x), plots the values of $y = \text{Recall} = \frac{TP}{AP}$ against the values of $x = \text{Fall-out} = \frac{FP}{AN} = 1 - \text{Specificity}$ computed on a test set for all the defect prediction models fn(z,t) obtained by using all possible threshold values t. Examples of ROC curves are in Fig. 1.
The [0, 1]×[0, 1] square to which a ROC curve belongs is called the ROC space (Fawcett 2006). Given a dataset, each point (x, y) of the ROC space corresponds to a defect prediction model's confusion matrix, since the values of x and y allow the direct computation of TP and FP and the indirect computation of TN and FN, since AP and AN are known.
The two variables x and y are related to t in a (non-strictly) monotonically decreasing way. Hence, a ROC curve ROC(x) is a non-strictly monotonically increasing function of x.
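A minimal sketch of how the points of a ROC curve can be obtained from test-set scores follows. The function name `roc_points` and the data are hypothetical. Each distinct score is used as a threshold t, with the rule "positive iff fp(z) ≥ t", so lowering t can only increase both Fall-out and Recall.

```python
def roc_points(scores, labels):
    """Return (Fall-out, Recall) pairs, one per distinct threshold,
    preceded by (0, 0), ordered by decreasing threshold."""
    ap = sum(labels)              # actual positives (label 1)
    an = len(labels) - ap         # actual negatives (label 0)
    points = [(0.0, 0.0)]         # t above every score: no positives estimated
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        points.append((fp / an, tp / ap))
    return points

# Hypothetical defect proneness scores and actual labels (1 = faulty).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
pts = roc_points(scores, labels)
```

The resulting list starts at (0, 0), ends at (1, 1), and never decreases in either coordinate.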
We now introduce the definition of Upper-left Rectangle of a point (x, y) of the ROC space, which will be used in the remainder of the paper.

Definition 1 Upper-left Rectangle (ULR) of a Point.
The Upper-left Rectangle ULR(x,y) of a Point (x,y) is the closed rectangle composed of those points (x',y') of the ROC space such that x' ≤ x ∧ y' ≥ y.
An example of ULR is represented by the highlighted rectangle in Fig. 2, which shows $ULR(\frac{1}{1+k}, \frac{1}{1+k})$. All of the points of ULR(x,y) are not worse than (x,y) itself for any sensible performance metric. They correspond to defect prediction models with no more false negatives nor more false positives than the defect prediction model corresponding to (x,y). Point (0, 1) has no other point in its ULR(0,1), so it corresponds to the best classifier, which provides perfect estimation.
ROC curves have long been used in Software Defect Prediction (Arisholm et al. 2007;Beecham et al. 2010b;Catal 2012;Catal and Diri 2009;Singh et al. 2010), typically to have an overall evaluation of the performance of the defect prediction models learned based on z with all possible values of t.
The evaluation of fn(z,t) for all values of t, i.e., the overall evaluation of fp(z), is typically carried out by computing the Area Under the Curve.

Definition 2 Area Under the Curve (AUC).
The Area Under the Curve is the area below ROC(x) in the ROC space. The longer ROC(x) lingers close to the left and top sides of the ROC space, the larger AUC. Since the total area of the ROC space is 1, the closer AUC is to 1, the better. Hosmer et al. (2013) propose the intervals in Table 4 as interpretation guidelines for AUC as a measure of how well fn(z,t) discriminates between positives and negatives for all values of t.
When comparing defect proneness models, $fp_1(z_1)$ (associated with $AUC_1$) is preferred to $fp_2(z_2)$ (associated with $AUC_2$) if and only if $AUC_1 > AUC_2$ (Hanley and McNeil 1982).
The Gini coefficient $G = 2\,AUC - 1$ is a related metric also used for the same purposes (Gini 1912). G takes values in the [0, 1] range and was defined in such a way as to be applied only to ROC curves that are never below the diagonal y = x. As there is a one-to-one functional correspondence between them, AUC and G provide the same information. Column "G range" in Table 4 shows how Hosmer et al.'s guidelines for AUC (Hosmer et al. 2013) can be rephrased in terms of G.
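As an illustration, AUC can be approximated with the trapezoidal rule on the points of an empirical ROC curve, and G then follows from G = 2 AUC − 1. The function name `auc` and the sample points are hypothetical; the points are assumed to be sorted by x.

```python
def auc(points):
    """Trapezoidal area under a ROC curve given as (x, y) points sorted by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Hypothetical ROC curve (vertical steps contribute no area).
pts = [(0.0, 0.0), (0.0, 2 / 3), (1 / 3, 2 / 3), (1 / 3, 1.0), (1.0, 1.0)]
area = auc(pts)
gini = 2 * area - 1
```

For these points, AUC = 8/9 and G = 7/9, so the two metrics rank this curve identically with respect to any other curve.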
Other metrics have been defined in addition to AUC and G. We concisely review some of them in Section 13.1.

Evaluation Issues
ROC curves have been widely studied and used in several fields, and a few issues have been pointed out about their definition and evaluation. We concisely review them in Section 13. Here, we focus on two issues about the use of AUC as a sensible way for providing (1) an evaluation of a defect proneness model and (2) a comparison between two defect proneness models.

Evaluation of a Defect Proneness Model: The Diagonal
The diagonal of the ROC space represents the expected performance of random policies. Table 3 shows that, given a value of p, the expected values of Fall-out and Recall are both equal to p, so $x = y = p$, i.e., the diagonal y = x is the expected ROC curve under a random policy, for each possible value of p. Since the points in ULR(x, y) are not worse than (x, y), the upper-left triangle of the ROC space delimited by the diagonal is also the set of points corresponding to defect prediction models whose performance is not worse than the expected performance of random policies. In practice, the upper-left triangle is the truly interesting part of the ROC space when building useful defect prediction models. It is well-known (Fawcett 2006) that, if a classifier corresponds to a point in the lower-right triangle, a better classifier can be obtained simply by inverting its estimates.
However, AUC is computed by also taking into account the lower-right triangle of the ROC space. Notice that AUC is the area between ROC(x) and the reference ROC curve y = 0, which corresponds to a defect prediction model that estimates TP = 0 for all values of t. In other words, AUC quantifies how different a ROC curve is from this extremely badly performing classifier, which is even worse than what is expected of any random policy. In practice, instead, AUC is to be compared to random policies, as Table 4 shows.
Using random policies, characterized by y = x, as the reference instead of y = 0 appears more adequate. Instead of the area under the curve in the entire ROC space, one can use the area under the curve in the upper-left triangle, and normalize it by the area of the triangle, i.e., 1/2. This is actually the value of the Gini coefficient G, when ROC(x) is entirely above the diagonal. If that is not the case, one can define a modified ROC curve ROC'(x) that coincides with ROC(x) when ROC(x) is above the diagonal, and coincides with the diagonal otherwise. Practically, this corresponds to using the defect prediction models for all of those values of t in which one obtains better performance than random policies, and random policies otherwise, which are an inexpensive backup estimation technique one can always fall back on.
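The modified curve ROC'(x) can be sketched numerically as max(ROC(x), x). The code below is an illustration under stated assumptions: the ROC curve is taken to be piecewise linear through hypothetical points with strictly increasing x, and the area is approximated with the midpoint rule. The function names are not from the paper.

```python
def interp(points, x):
    """Piecewise-linear ROC(x) through points with strictly increasing x."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x outside the range covered by the points")

def modified_gini(points, n=10_000):
    """Area between ROC'(x) = max(ROC(x), x) and the diagonal,
    normalized by the area 1/2 of the upper-left triangle."""
    h = 1.0 / n
    area = 0.0
    for i in range(n):
        x = (i + 0.5) * h                      # midpoint of the i-th slice
        area += max(interp(points, x), x) - x  # height above the diagonal
    return area * h / 0.5

# Hypothetical ROC curve that dips below the diagonal for x > 0.68.
pts = [(0.0, 0.0), (0.2, 0.6), (0.8, 0.7), (1.0, 1.0)]
g_modified = modified_gini(pts)
```

For this curve, the plain Gini coefficient is 2 · 0.62 − 1 = 0.24, while the modified value is about 0.272, because the part of the curve below the diagonal no longer subtracts area.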
However, even this modified version of G may not be satisfactory for practitioners' and researchers' goals, which may require that one or more performance metrics of a defect prediction model be higher than some specified minimum reference values, and not simply better than a random classifier. The approach proposed in this paper (see Section 6) extends the idea of comparing the performance of models with respect to random policies by taking into account specific performance metrics, and compares their values when they are computed with the defect prediction models obtained based on a defect proneness model and all possible thresholds.

Comparison of Defect Proneness Models with AUC: Deceiving Cases
Suppose we have two defect proneness models $fp_1(z)$ and $fp_2(z)$. Figure 1a shows that the ROC curve of $fp_2(z)$ is always above the ROC curve of $fp_1(z)$, i.e., the Recall and Fall-out of $fp_2(z)$ are never worse than those of $fp_1(z)$. Accordingly, the AUC of $fp_2(z)$ is greater than the AUC of $fp_1(z)$. $fp_2(z)$ is at least as good as $fp_1(z)$ for all choices of t, but we do not need AUC to decide which defect proneness model is better. Figure 1b instead shows two intersecting ROC curves. It is not straightforward to decide which defect proneness model is better by simply looking at the ROC curves, since neither curve dominates the other. Using AUC would have us conclude that $fp_1(z)$ is not worse than $fp_3(z)$, since the AUC of $fp_1(z)$ is greater than the AUC of $fp_3(z)$. However, Fig. 1b also shows that the AUC of $fp_1(z)$ is greater than that of $fp_3(z)$ mainly because the ROC curve of $fp_1(z)$ is above the ROC curve of $fp_3(z)$ when FP/AN > 0.6, i.e., when Fall-out is quite high and the defect prediction models obtained with both defect proneness models provide quite bad estimates.
Thus, just because the AUC of a defect proneness model is greater than that of another does not automatically mean that the former defect proneness model is preferable. Instead, we should restrict the comparison to the zone (i.e., the threshold range) where the defect prediction models obtained behave "acceptably".
In the following sections, we propose methods to "purge" AUC from the noise originating from defect prediction models that do not perform well enough. The resulting indications are expected to be more reliable, hence more useful in practice and also more correct from a theoretical point of view.

Evaluation Based on Relevant Areas
Suppose we select a performance metric PFM and a random or deterministic method MTD for selecting a reference value to evaluate the acceptability of a defect prediction model. Let us denote by $PFM_{MTD}$ the reference performance value of MTD when evaluated via PFM. For instance, if we take FM as PFM and pop as MTD, we have $FM_{pop} = \frac{1}{1+k}$, as shown in Table 2.
Among the defect prediction models that can be generated with fn(z,t), the ones that should be considered performing sufficiently well are those that provide a better value of PFM than the reference value, e.g., better than what can be expected of a random policy. These are the practically useful defect prediction models, hence the ones that should be taken into account when evaluating the overall performance of fp(z). For instance, if we decide to assess the performance of fn(z,t) based on $PFM_1$ = Recall and $MTD_1$ = pop, we should only take into account those values of t for which fn(z,t) has $\text{Recall} > \text{Recall}_{pop} = \frac{1}{1+k}$.
Note that there are always values of t that are so small or so large that the estimates' performance according to some PFM is similar to or even worse than the performance of a reference policy. Hence, it does not make sense to evaluate fn(z,t) for all the values of t. Instead, we take into account the points (x, y) of a ROC curve that satisfy inequality $y > \text{Recall}_{pop} = \frac{1}{1+k}$ (see Table 2), i.e., the points above the horizontal straight line $y = \frac{1}{1+k}$. Recall captures one aspect of performance, mainly based on true positives, but other aspects can be of interest. Suppose we decide to assess the performance of fn(z,t) based on $PFM_2$ = Fall-out, which captures performance by taking into account the false positives, and $MTD_2$ = pop. We should only take into account the values of t for which fn(z,t) has $\text{Fall-out} < \text{Fall-out}_{pop} = 1 - \text{Specificity}_{pop} = \frac{1}{1+k}$. We are thus interested only in the points satisfying inequality $x < \frac{1}{1+k}$, i.e., left of the vertical straight line $x = \frac{1}{1+k}$. If we are interested in the points of the ROC curve that are better than pop for both Recall and Fall-out, then both inequalities must be satisfied, and the evaluation must consider only the points of the ROC curve in $ULR(\frac{1}{1+k}, \frac{1}{1+k})$, i.e., the highlighted rectangle in Fig. 2. It is up to the practitioners and researchers to decide which metrics are of interest for their goals. For instance, they can use FM as $PFM_1$ and NM as $PFM_2$ and pop as both $MTD_1$ and $MTD_2$, to consider defect prediction models that perform better than pop for both FM and NM. The points of the ROC curve to take into account are represented in Fig. 3a, above and to the left of the two oblique straight lines with equations $y = \frac{k}{2k+1}x + \frac{1}{2k+1}$ for FM and $y = (k+2)x - 1$ for NM, as we show in Table 5.
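The two oblique lines can be checked with a short derivation. This is a sketch under assumptions that are not spelled out in this part of the text: FM and NM are taken to be the harmonic means of Precision and Recall and of NPV and Specificity, respectively, and the pop reference values are taken to be $FM_{pop} = \frac{1}{1+k}$ and $NM_{pop} = \frac{k}{1+k}$. It uses $TP = y \cdot AP$, $FP = x \cdot AN$, $FN = (1-y) \cdot AP$, $TN = (1-x) \cdot AN$, and $AN = k \cdot AP$.

```latex
\begin{align*}
FM &= \frac{2\,TP}{2\,TP + FP + FN} = \frac{2y}{1 + y + kx}, \\
FM = \tfrac{1}{1+k} &\iff 2y(1+k) = 1 + y + kx
   \iff y = \frac{k}{2k+1}\,x + \frac{1}{2k+1}, \\[4pt]
NM &= \frac{2\,TN}{2\,TN + FP + FN} = \frac{2k(1-x)}{2k - kx + 1 - y}, \\
NM = \tfrac{k}{1+k} &\iff 2(1-x)(1+k) = 2k - kx + 1 - y
   \iff y = (k+2)\,x - 1.
\end{align*}
```

Both results agree with the equations of the borders quoted in the text.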
More generally, other metrics and reference policies may be defined and used well beyond the ones illustrated in this paper. Different choices of metrics and reference policies may lead to delimiting any subset of the ROC space. Clearly, if one is interested in using several metrics and several corresponding reference policies, the subset of the ROC space is the intersection of the single subsets, each of which is built by means of a metric and a reference policy.
However, not all ROC space subsets are useful or sensible. We introduce the notion of "Region of Interest," to define which ones should be used.

Definition 3 Region of Interest (RoI).
A subset of the ROC space is said to be a Region of Interest (RoI) if and only if it contains the upper-left rectangles of all of its points, i.e., $\forall (x, y) \in RoI,\ ULR(x, y) \subseteq RoI$.
The border of a RoI is the part of the boundary of the RoI in the interior of the ROC space, i.e., it is the boundary of the RoI without its parts that also belong to the ROC space perimeter. The border of the RoI is the part of its boundary that really provides information on how the RoI is delimited, since the perimeter of the ROC space can be taken for granted as a delimitation.
The union of the light blue and grey regions in Fig. 3a is an example of a RoI, in which $FM \geq FM_{pop} \wedge NM \geq NM_{pop}$ (see Section 7.1). An example of a subset of the ROC space that is not a RoI is in Fig. 3b.
RoIs have a few properties, which we prove in Appendix A.
-The intersection of any number of RoIs is a RoI.
-The intersection of any number of RoIs is nonempty.
-The border of a RoI is a (not necessarily strictly) monotonically increasing function. Thus, graphically, a RoI is above and to the left of its border.
With a reference value derived from a random policy, the points on the border of a RoI correspond to unacceptable defect prediction models, as their performance is only as good as what can be expected of a random policy. If, instead, the reference value is selected deterministically, all points on the border correspond to acceptable classifiers, e.g., one takes φ = 0.4 as minimum φ value for an acceptable defect prediction model. In what follows, we implicitly assume that the points on the border of a RoI are included or not in the RoI depending on whether a reference value has been selected via a random policy or deterministically.

The Ratio of Relevant Areas
We propose to assess a defect proneness model fp(z) via the Ratio of the Relevant Areas (RRA), which takes into account only the RoI selected by a practitioner or researcher.

Definition 4 Ratio of the Relevant Areas (RRA).
The Ratio of the Relevant Areas of a ROC curve ROC(x) in a RoI is the ratio of the area of the RoI that is below ROC(x) to the total area of the RoI.
In Fig. 3a, the RoI is the union of the light blue and grey regions, in which the light blue region is the part of the RoI below ROC(x). RRA is the ratio of the area of the light blue region to the area of the RoI.
AUC and G are special cases of RRA, obtained, respectively, when the RoI is the whole ROC space and the upper-left triangle. From a conceptual point of view, it is sounder to consider the area under the portion of the ROC curve in the RoI than to consider the areas under the entire ROC curve taken into account by AUC and G: RoI represents the part of the ROC space in which defect prediction models perform sufficiently well to be used.
Take, for instance, the case in which we use reference random policies of interest along with a set of performance metrics of interest to build a RoI. By considering the parts of ROC(x) outside the RoI, one would also take into account values of t that make f n(z, t) worse than a random estimation method. When we know that a given defect prediction model is worse than a random policy for a set of performance metrics, it is hardly interesting to know precisely how well it performs. However, this is what AUC and G do.
Note that Definition 4 is quite general, as it allows the use of different reference policies for different performance metrics even when they are used together. For instance, one may be interested in the points of a ROC curve that are better than $FM_{pop}$ and, at the same time, better than the NM value obtained with uni with p = 0.7. In what follows, however, we assume that the same reference policy is used for all of the performance metrics selected.

RoIs for Specific Performance Metrics and Reference Values
The requirement that a defect prediction model satisfy a minimum acceptable level c for a performance metric PFM corresponds to a RoI in the ROC space. We here show the equations of the borders of the RoIs corresponding to the metrics in Table 2. Appendix B shows how the equations for these borders were obtained, by explaining how these metrics, defined in terms of the cells of the confusion matrix, can be expressed in terms of x and y. These borders are akin to the "iso-performance lines" or "isometrics" proposed in Flach (2003), Provost and Fawcett (2001), and Vilalta and Oblinger (2000). Table 5 summarizes the formulas about the borders of the RoIs for the performance metrics with respect to the positive and the negative classes of Table 2 with the uni and pop reference policies, for completeness. At any rate, we only use pop in the examples and in the empirical study of Section 10.

RoIs for Performance Metrics with Respect to the Positive and Negative Classes
The columns of Table 5 are as follows.
- Column "Formula" provides the definition of the performance metric in each row in terms of x and y. For instance, $Precision = \frac{y}{kx+y}$.
- Column "Border" shows the equation of the general straight line that represents the border of the RoI obtained when the performance metric corresponding to the row is given a constant value c. For instance, line $y = \frac{c}{1-c}kx$ includes the points where Precision = c. For practical usage, when we select a specific metric PFM and a reference method MTD, we replace the generic parameter c by the specific $PFM_{MTD}$ chosen.
- Column "uni" shows the equation of the border when c is replaced by $PFM_{uni}$ with probability p, where PFM is the metric in the corresponding row. This is the border of the RoI where defect prediction models have a greater value of PFM than expected of the uni policy with probability p.
- Likewise, column "pop" shows the equation of the border when c is replaced by $PFM_{pop}$, where PFM is the metric in the corresponding row.
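The relation between the confusion matrix and the "Formula" and "Border" columns can be sketched as follows for Precision, assuming the usual ROC coordinates x = FP/AN (Fall-out) and y = TP/AP (Recall), with k = AN/AP. The function names and the confusion-matrix values are hypothetical and serve only as an illustration.

```python
def roc_point(tp, fp, tn, fn):
    """Return (x, y, k) for a confusion matrix: x = FP/AN, y = TP/AP, k = AN/AP."""
    ap = tp + fn   # actual positives
    an = fp + tn   # actual negatives
    return fp / an, tp / ap, an / ap

def precision_xy(x, y, k):
    """Precision expressed in ROC coordinates: y / (k*x + y)."""
    return y / (k * x + y)

def precision_border(x, k, c):
    """y-value of the straight line on which Precision equals c (requires c < 1)."""
    return c / (1 - c) * k * x

# Hypothetical confusion matrix, chosen only to exercise the formulas.
tp, fp, tn, fn = 30, 10, 90, 20
x, y, k = roc_point(tp, fp, tn, fn)
c = precision_xy(x, y, k)
print(c, tp / (tp + fp))             # both computations of Precision agree
print(y, precision_border(x, k, c))  # the point lies on its own Precision = c border
```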
It can be shown that each equation in the "uni" column in Table 5 describes a pencil of straight lines (Cremona 2005). Figure 4 shows the ROC curve already shown in Fig. 2, along with all the lines corresponding to the borders mentioned in Table 5 for pop. The portion of the ROC curve above and to the left of the borders of all performance metrics ($ULR\left(\frac{1}{1+k}, \frac{1}{1+k}\right)$, in this case) is quite small, compared to the entire ROC curve. Thus, there is a relatively small range of values of t for which the defect prediction models fn(z, t) perform better than pop according to multiple metrics.
The borders in Table 5 follow expected change patterns when k, c, and p change. Higher values of c are associated with stricter constraints, e.g., the slope of the Precision straight line $y = \frac{c}{1-c}kx$ increases with c. Appendix C details how these borders behave for each metric when k, c, and p change.
We use the pop policy in the empirical validation of Section 10. For notational convenience, we denote by RoI($PFM_1$, $PFM_2$, ...) the RoI defined by the constraints on the selected performance metrics $PFM_1$, $PFM_2$, and so on. Table 6 shows the formula and the border obtained for each of the three overall performance metrics PFM in Table 2 when one sets a minimum acceptable value c for it, i.e., one requires PFM ≥ c. Unlike Table 5, Table 6 does not contain columns "uni" and "pop," because we showed in Table 2 that J, Markedness, and φ are all equal to 0 under random policies. Therefore, a RoI is defined by means of a deterministically chosen value of c.

RoIs for Overall Metrics
The lines for constant Youden's J are straight lines parallel to the diagonal. As for Markedness, it can be shown that the constant lines are parabolas, with symmetry axis $y = -kx + \frac{1+k}{2} + \frac{k(k-1)}{2c(1+k^2)}$. The details are in Appendix D. φ has received the most attention among these three metrics in the past. It can be shown that the border for φ = c is an ellipse that goes through points (0, 0) and (1, 1) for all values of c. Figure 5a shows the ellipse for the berek dataset (from the SEACRAFT repository) with c = 0.4, a value that represents medium/strong association between a defect prediction model and actual faultiness (Cohen 1988). Note that there are two unconnected parts of the ROC space in which φ ≥ c, delimited by the dashed part of the ellipse and the solid part of the ellipse. Based on Definition 3, only the upper-left part above the dotted arc of the ellipse is a legitimate RoI. Figure 5b shows the borders of the RoIs associated with φ = 0.4 (the lowest line), 0.6, and 0.8 (the highest line). By comparing Figs. 5 and 4, it is easy to see that the points of the ROC curve that satisfy the constraints mentioned in Table 5 also satisfy constraint φ ≥ 0.4. However, only a few points of the ROC curve (corresponding to a few selected values of t) satisfy constraint φ ≥ 0.6. No point of the ROC curve satisfies constraint φ ≥ 0.8.
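A small sketch of how one might test whether a ROC point satisfies φ ≥ c, assuming x = FP/AN, y = TP/AP, and k = AN/AP. The closed form used for φ is obtained here by rewriting the standard confusion-matrix definition of φ in terms of x, y, and k; it is not copied from Table 6, and the function names and the sample point are illustrative.

```python
import math

def phi_xy(x, y, k):
    """phi in ROC coordinates: sqrt(k) * (y - x) / sqrt(u * (1 + k - u)), with u = k*x + y."""
    u = k * x + y                # estimated positives divided by AP
    denom = u * (1 + k - u)      # proportional to EP * EN
    if denom <= 0:
        return 0.0               # convention chosen here for the corners (0, 0) and (1, 1)
    return math.sqrt(k) * (y - x) / math.sqrt(denom)

def phi_confusion(tp, fp, tn, fn):
    """Standard definition of phi from the cells of the confusion matrix."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    return num / den if den > 0 else 0.0

def in_phi_roi(x, y, k, c):
    """True if (x, y) is above the diagonal and has phi >= c (border included)."""
    return y > x and phi_xy(x, y, k) >= c

# k = 27/16 is the value reported for the berek dataset; the point (0.1, 0.6) is hypothetical.
print(phi_xy(0.1, 0.6, 27 / 16))
print(in_phi_roi(0.1, 0.6, 27 / 16, 0.4), in_phi_roi(0.1, 0.6, 27 / 16, 0.6))
```

The border is included in the RoI here because c is a deterministically chosen reference value.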

Taking Cost into Account
We have so far considered the evaluation of defect prediction models with respect to the performance of estimates. Though the notion of performance is important, practitioners are usually also interested in other characteristics of estimates, such as the cost of misclassifying a faulty module as not faulty, or vice versa. As we show in Section 8.1, there is a clear relationship between the choice of a performance metric and the cost of misclassification.
We first show how to derive the border of a RoI based on the misclassification cost. Like most of the literature, we assume that each false negative (resp., positive) has the same cost $c_{FN}$ (resp., $c_{FP}$), so total cost TC is (Hand 2009)
$$TC = c_{FN} FN + c_{FP} FP$$
TC can be computed in terms of x and y of the ROC space as follows
$$TC = c_{FN} AP (1 - y) + c_{FP} AN x$$
By setting $\lambda = \frac{c_{FN}}{c_{FN}+c_{FP}}$ and dividing TC by $n(c_{FN}+c_{FP})$ (which is independent of the defect prediction model used), we can focus on Normalized Cost $NC = \frac{TC}{n(c_{FN}+c_{FP})}$ (Khoshgoftaar and Allen 1998)
$$NC = \lambda \frac{AP}{n}(1 - y) + (1 - \lambda)\frac{AN}{n} x$$
NC is related to Unitary Cost $UC = \frac{TC}{n} = (c_{FN}+c_{FP}) NC$, so constraints on UC get immediately translated into constraints on NC and vice versa.
Usually, $c_{FN}$ is much greater than $c_{FP}$, as false negatives have more serious consequences than false positives, financially and otherwise. Accordingly, λ is usually much closer to 1 than to 1/2 (value 1/2 corresponds to $c_{FN} = c_{FP}$).
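The cost model can be illustrated with a short sketch that computes Normalized Cost both from the confusion-matrix counts and from the ROC coordinates. It assumes the linear cost model in which each false negative costs c_FN and each false positive costs c_FP, with x = FP/AN, y = TP/AP, and k = AN/AP; the function names and the numbers are hypothetical.

```python
def normalized_cost_counts(fp, fn, n, c_fn, c_fp):
    """NC = TC / (n * (c_FN + c_FP)), with TC = c_FN * FN + c_FP * FP."""
    total_cost = c_fn * fn + c_fp * fp
    return total_cost / (n * (c_fn + c_fp))

def normalized_cost_xy(x, y, k, lam):
    """NC in ROC coordinates, using AP/n = 1/(1+k) and AN/n = k/(1+k)."""
    return (lam * (1 - y) + (1 - lam) * k * x) / (1 + k)

# Hypothetical project: 150 modules, a false negative 9 times as costly as a false positive.
c_fn, c_fp = 9.0, 1.0
lam = c_fn / (c_fn + c_fp)                       # lambda = 0.9
print(normalized_cost_counts(fp=10, fn=20, n=150, c_fn=c_fn, c_fp=c_fp))
print(normalized_cost_xy(x=0.1, y=0.6, k=2.0, lam=lam))
```

Both calls describe the same confusion matrix (TP = 30, FP = 10, TN = 90, FN = 20), so they are expected to return the same value.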

Borders Based on Random Policies
For any random policy, thanks to basic properties of expected values, we have
$$E[NC] = \lambda \frac{E[FN]}{n} + (1 - \lambda)\frac{E[FP]}{n}$$
and for the uni and pop policies we have, based on Table 3
$$E[NC_{uni}] = \lambda (1-p) \frac{AP}{n} + (1-\lambda) p \frac{AN}{n}, \qquad E[NC_{pop}] = \frac{AP \cdot AN}{n^2} = \frac{k}{(1+k)^2}$$
Thus, one should use only those defect prediction models whose NC is less than the expected normalized cost of a random policy, i.e., $NC < \frac{AP \cdot AN}{n^2}$. Since $E[NC_{pop}]$ is independent of the specific cost per false negative or positive, Formula (7) provides a general result that applies to all defect prediction models, regardless of the way they have been built, e.g., with or without using techniques based on defect proneness models and thresholds.
This constraint is easier to satisfy when AP ≈ AN, i.e., for balanced datasets, and more difficult when this is not the case. As we show in Section 8.2, lower values of NC call for better performing defect prediction models.
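The independence of the expected normalized cost of pop from λ can be checked with a few lines of Python. The sketch assumes that a random policy estimates each module positive with a fixed probability (p for uni, AP/n for pop); the function names and the values AP = 16 and AN = 27 are hypothetical, chosen only so that k = 27/16.

```python
def expected_nc(ap, an, p_pos, lam):
    """Expected NC of a policy that estimates each module positive with probability p_pos."""
    n = ap + an
    exp_fn = ap * (1 - p_pos)   # actual positives expected to be estimated negative
    exp_fp = an * p_pos         # actual negatives expected to be estimated positive
    return (lam * exp_fn + (1 - lam) * exp_fp) / n

def expected_nc_pop(ap, an, lam):
    """pop policy: the probability of a positive estimate is AP/n."""
    return expected_nc(ap, an, ap / (ap + an), lam)

ap, an = 16, 27
for lam in (0.1, 0.5, 0.9):
    # The value does not change with lambda and equals AP * AN / n^2.
    print(lam, expected_nc_pop(ap, an, lam), ap * an / (ap + an) ** 2)
```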
Via mathematical transformations, it can be shown that the borders of the RoIs that satisfy inequality $NC < \frac{k}{(1+k)^2}$ for different values of λ are represented by the following pencil of straight lines, with center in $\left(\frac{1}{1+k}, \frac{1}{1+k}\right)$
$$y = \frac{1-\lambda}{\lambda} k x + 1 - \frac{k}{\lambda(1+k)} \qquad (8)$$
It can be shown that the slope of the border decreases as λ varies from 0 to 1. Thus, the border rotates around center point $\left(\frac{1}{1+k}, \frac{1}{1+k}\right)$ in a clockwise fashion from vertical straight line $x = \frac{1}{1+k}$ to horizontal straight line $y = \frac{1}{1+k}$. A special case occurs when $\lambda = \frac{k}{1+k}$, since the line is the diagonal y = x.
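The behavior of this pencil can be explored with a minimal sketch. The slope and intercept below are obtained by solving NC = k/(1+k)^2 for y, with NC written in ROC coordinates as (λ(1−y) + (1−λ)kx)/(1+k); this rearrangement and the function name are assumptions of the example and may differ in form from the formula in the original text.

```python
def cost_border(k, lam):
    """Slope and intercept of the line NC = k / (1 + k)^2 for a given lambda in (0, 1]."""
    if not 0 < lam <= 1:
        raise ValueError("lam must be in (0, 1]; lam = 0 gives the vertical line x = 1/(1+k)")
    slope = (1 - lam) / lam * k
    intercept = 1 - k / (lam * (1 + k))
    return slope, intercept

k = 27 / 16                 # value reported for the berek dataset
center = 1 / (1 + k)
for lam in (0.25, k / (1 + k), 0.9, 1.0):
    slope, intercept = cost_border(k, lam)
    # Every line of the pencil goes through (center, center).
    print(lam, slope, intercept, slope * center + intercept)
```

For λ = k/(1+k) the sketch returns slope 1 and intercept 0 (up to rounding), i.e., the diagonal, and for λ = 1 it returns the horizontal line through the center.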
Recall that all of the straight lines for all of the performance metrics in Table 5 go through center point $\left(\frac{1}{1+k}, \frac{1}{1+k}\right)$ too. Thus, they are special cases of the straight lines described in Formula (8), for specific values of λ. Specifically, we have, in increasing order of the value of λ: for Fall-out, λ = 0; for NM, $\lambda = \frac{k}{2(1+k)} = \frac{AN}{2n}$; for Precision and for NPV, $\lambda = \frac{k}{1+k} = \frac{AN}{n}$; for FM, $\lambda = \frac{2k+1}{2k+2} = \frac{1}{2} + \frac{AN}{2n}$; and for Recall, λ = 1. Thus, the selection of any of these metrics is not simply an abstract choice on how to assess the performance of defect prediction models, but implies the choice of a specific cost model with a specific $\lambda = \frac{c_{FN}}{c_{FN}+c_{FP}}$, which implies a specific ratio $\frac{c_{FN}}{c_{FP}}$. Based on observations of past projects' faults and fault removal costs, one could estimate a likely value $k_e$ for the ratio AN/AP and a likely range $[\lambda_l, \lambda_u]$ for λ, to evaluate a given fp(z, t) classifier based on the RoI identified by $k_e$, $\lambda_l$, and $\lambda_u$.

Cost-reduction RoIs
Practically useful RoIs should represent strict constraints for defect prediction models. For instance, take λ = 0.9, i.e., suppose that a false negative is 9 times as expensive as a false positive. The corresponding line described in Formula (8) for the ROC curve in Fig. 4 is the lowest one in Fig. 6, which does not appear to set a strict constraint and may be of little practical interest.
To obtain a more useful constraint, a software manager may set a maximum acceptable value for the unitary cost, which translates into a maximum value $NC_{max}$ for NC. $NC_{max}$ can be expressed as a fraction μ of $E[NC_{pop}]$, i.e., $NC_{max} = \mu \frac{k}{(1+k)^2}$. Constraint $NC < NC_{max}$ defines a RoI with border
$$y = \frac{1-\lambda}{\lambda} k x + 1 - \frac{\mu k}{\lambda(1+k)} \qquad (10)$$
The border depends on parameters k, λ, and μ. We describe in Appendix F the effect of having different values of k, which happens with different datasets. We here investigate the effect of selecting different values of μ, possibly in combination with different values of λ.
Formula (10) shows that μ only influences the intercept of the border. For given values of k and λ, the smaller μ (i.e., the smaller $NC_{max}$), the higher the border line. Figure 6 shows the borders for the berek dataset (which has k = 27/16) when λ = 0.9, for μ = 1, μ = 0.75, and μ = 0.5. Formula (8) is a special case of Formula (10) with μ = 1 and it can be easily shown that, for any given value of μ, Formula (10) describes a pencil of straight lines, one for each value of λ, with center in $\left(\frac{\mu}{1+k}, 1 - \frac{\mu k}{1+k}\right)$. Thus, for any given value of μ, different values of λ have the same effect as we described in Section 8.1.
As μ varies, the center point $\left(\frac{\mu}{1+k}, 1 - \frac{\mu k}{1+k}\right)$ moves on the straight line y = −kx + 1. It moves upwards and to the left as μ decreases, as expected, since the constraint becomes stricter. When μ tends to 0, then the center point tends to (0, 1), which represents perfect classification.
We denote as $RRA(\lambda = c_\lambda, \mu = c_\mu)$ the value of RRA obtained for the RoI whose border is identified by using $\lambda = c_\lambda$ and choosing $\mu = c_\mu$.
The value of μ is related to the performance of defect prediction models quantified by any metric. For instance, take Precision and suppose that a technique for defect prediction models guarantees a minimum value of Precision = c. The border corresponding to Precision = c is $y = \frac{c}{1-c}kx$ (see Table 5). This straight line intersects line y = −kx + 1 at point $\left(\frac{1-c}{k}, c\right)$, which is the center of the pencil of Formula (10) for $\mu = \frac{(1-c)(1+k)}{k}$. This is the cost reduction proportion that can be obtained with a technique that improves the value of Precision from $Precision_{pop}$ to c.
Conversely, suppose we focus on Precision and we plan to achieve a μ cost reduction. The required improvement in Precision is $c = 1 - \frac{k\mu}{1+k}$. Similar computations can be carried out for all other performance metrics. The results are in Appendix E.
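The correspondence between a guaranteed Precision level c and the cost fraction μ can be sketched as follows. The conversion from c to μ is obtained here by inverting the relation c = 1 − kμ/(1+k) given above; the function names are hypothetical, and k = 27/16 is the value reported for the berek dataset.

```python
def precision_from_mu(mu, k):
    """Precision level c required to reach the cost fraction mu: c = 1 - k*mu/(1+k)."""
    return 1 - k * mu / (1 + k)

def mu_from_precision(c, k):
    """Inverse relation: mu = (1 - c) * (1 + k) / k."""
    return (1 - c) * (1 + k) / k

def pencil_center(k, mu):
    """Center of the pencil of cost borders for a given mu; it lies on y = -k*x + 1."""
    return mu / (1 + k), 1 - mu * k / (1 + k)

k = 27 / 16
for mu in (1.0, 0.75, 0.5):
    c = precision_from_mu(mu, k)
    x, y = pencil_center(k, mu)
    # The center lies on the Precision = c border y = c/(1-c) * k * x, and y equals c.
    print(mu, c, y, c / (1 - c) * k * x)
```

For μ = 1 the required Precision equals 1/(1+k) = AP/n, i.e., the Precision expected of the pop policy.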

Evaluating RRA
RRA is clearly related to AUC and Gini's G, so one may wonder if RRA is better than AUC or G, and, if so, to what extent.
First, the main difference between RRA, on one side, and AUC and G, on the other side, is that RRA assesses a defect proneness model only based on the points of a ROC curve corresponding to defect prediction models that are worth evaluating, while AUC and G use all of the points of a ROC curve.
We here assess the extent to which our approach restricts the area of the portion of the ROC space that is taken into account as compared to AUC and G, by computing the areas of two types of RoIs that we have already used in the previous sections and that we also use in the empirical study of Section 10. The area of RoI(Recall, Fall-out) is equal to $\frac{k}{(k+1)^2}$. As a function of k, this area has maximum value $\frac{1}{4}$, attained for k = 1, i.e., AP = AN. Thus, the computation of RRA(Recall, Fall-out) takes into account at most one-fourth of the portion of the ROC space taken into account by AUC (whose area is 1), and one-half of the portion taken into account by G (whose area is 0.5). The more k tends to 0 or infinity, the more the area of the RoI shrinks. For instance, for $k = \frac{AN}{AP} = 4$, the area of the RoI is $\frac{4}{25} = 0.16$, which corresponds to 16% of the area considered for AUC and 32% of the area considered by G.
The same phenomenon occurs for the area of RoI(FM, NM), which is equal to $\frac{3k}{(k+2)(2k+1)}$, with maximum value $\frac{1}{3}$ when k = 1. For k = 4, the value is $\frac{2}{9} \approx 0.22$, i.e., 22% of the area considered for AUC and 44% of the area considered by G.
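The two area formulas are easy to tabulate. The following sketch evaluates them for a few values of k; the function names are hypothetical, while the formulas are the ones given above.

```python
def area_recall_fallout(k):
    """Area of RoI(Recall, Fall-out) under pop: k / (k + 1)^2."""
    return k / (k + 1) ** 2

def area_fm_nm(k):
    """Area of RoI(FM, NM) under pop: 3k / ((k + 2)(2k + 1))."""
    return 3 * k / ((k + 2) * (2 * k + 1))

for k in (0.25, 1.0, 4.0, 9.0):
    a1, a2 = area_recall_fallout(k), area_fm_nm(k)
    # Shares of the region used by AUC (area 1) and by G (area 0.5).
    print(k, round(a1, 3), round(a1 / 0.5, 3), round(a2, 3), round(a2 / 0.5, 3))
```

Both areas are symmetric under the exchange of k with 1/k, so datasets with AP/n = 0.2 and AP/n = 0.8 yield the same RoI areas.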
Thus, these two RoIs take into account a rather reduced proportion of the ROC space if compared to AUC or even G. In practice, by considering irrelevant regions, AUC and G incorporate a large "noise" that is higher for projects with a relatively small (or large) defect density. Section 10 shows some results we obtained for these areas in our empirical study, along with results about the proportion of classifiers of ROC curves that fall into our RoIs.
Second, in two cases, the relationships between the values of RRA and of the performance metrics AUC and G are necessarily strong only for models that are either exceptionally good or exceptionally bad. Suppose that a model is so good as to have AUC = 1, then the entire ROC space is under the ROC curve, and both G and RRA are therefore equal to 1. The converse is also true. If RRA = 1, then the ROC curve lies above the entire RoI chosen. Since any RoI includes the perfect estimation point (0, 1), the ROC curve goes through it, and AUC = 1 and G = 1. For continuity, exceptionally good models that achieve near-perfect estimation are very likely to have values of AUC, G, and RRA close to 1. The empirical study in Section 10 provides some evidence on the values of AUC for which this relationship exists between RRA and the existing metrics AUC and G even in approximate form.
When it comes to bad models, the implication is unidirectional, in general. Suppose that a model is so bad as to have AUC = 0.5 and therefore G = 0, then the ROC curve coincides with the diagonal, which implies that RRA = 0 for all kinds of RRA. However, RRA = 0 does not imply AUC = 0.5 and G = 0. As a first example, take the ROC curve in Fig. 5b. We have RRA(φ = 0.8) = 0, while AUC > 0.5. As a second example, suppose that the ROC curve goes through point $\left(\frac{1}{1+k}, \frac{1}{1+k}\right)$. Suppose that the value of AUC is greater than 0.5, which is true unless the ROC curve entirely coincides with the diagonal. We have RRA(Recall, Fall-out) = 0. This happens because the ROC curve never enters the region of interest. Thus, for bad models and, as we see in Section 10, for models that are not exceptionally good, RRA can provide a different perspective and evaluation than AUC and G.
Third, RRA is customizable, since it allows practitioners and researchers to define the set of points of a ROC curve they are interested in, i.e., those in a RoI built by selecting specific performance metrics and reference policies, while this is not possible with AUC and G.
Fourth, the assessment of whether RRA is better than AUC or G requires defining what "better" actually means in this case. RRA, AUC, and G all provide an overall evaluation of the performance of the defect prediction models fn(z, t) for all values of t. As such, RRA, AUC, and G are aggregate functions of these models. Thus, to assess them, we would need another aggregate function that provides the "ideal" figure of merit against which we can compare the performance of RRA, AUC, and G. Unfortunately, such an "ideal" function does not exist or is not known. Total Cost TC would be a suitable "ideal" figure of merit, but it is unknown. That is why metrics like AUC and G were introduced and used in the first place: they would not have been introduced if TC were known. At any rate, RRA can take into account different cost models via parameter λ and different cost requirements via parameter μ in ways that AUC and G cannot.
Therefore, the empirical study we present in Section 10 does not and cannot have the goal of showing whether RRA is "better" than AUC and G, or any other metric, for that matter. Rather, we want to show the differences in the assessment of defect proneness models between RRA and the existing ones, and show that RRA can be more reliable and less misleading.

Empirical Study
We analyzed 67 datasets for a total of 87,185 modules from the SEACRAFT repository (https://zenodo.org/communities/seacraft). These data were collected by Jureczko and Madeyski (2010) from real-life projects of different types, and have been used in several defect prediction studies (e.g., Bowes et al. 2018; Zhang et al. 2017). The number of modules in the datasets ranges between 20 and 23,014 with an average of 1,300, a standard deviation of 3,934, and a median of 241.
For each module (a Java class, in this case), all datasets report data on the same 20 independent variables. In addition, the datasets provide the number of defects found in each class, which we used to label modules as actually negative and positive. The datasets are fairly different in terms of AP/n ratio, which ranges from 2% to 98.8%. The histogram in Fig. 8 shows the frequency distribution of the proportion of defective modules in the datasets. Though there is a majority of small values in the distributions of n, AP/n, and LOC (as is the case for software projects in general), fairly large values are well represented (e.g., half of the projects are larger than 59,000 LOC).
For each dataset, we built a BLR model and a Naive Bayes (NB) model using all available measures as independent variables.
Here are the Research Questions that we address in our empirical study.

RQ1
To evaluate defect proneness models in practice, to what extent are the regions of the ROC space used by RRA more adequate than the regions used by AUC and G?
With RQ1, we investigate if, in real-life projects, it is possible to have substantial differences between RRA and the existing performance metrics AUC and G, as these performance metrics take into account different regions of the ROC space and therefore different defect prediction models.

RQ2
How frequently are there substantial differences between RRA and traditional performance metrics AUC and G in using more adequate regions of the ROC space?
By answering RQ2, we check whether RRA is truly useful only in corner cases for extreme projects, or for a large share of the population of projects.

RQ1: Extent of Adequacy of ROC Space Regions Used
For each of the 67 datasets, the BLR model we obtained had a higher AUC value than the NB model, with only one exception, for the xalan 2.7 project. In what follows, we present the results for the 67 BLR models and also discuss the results for the NB model for the xalan 2.7 project.
To answer RQ1, we consider the interpretation of AUC proposed by Hosmer et al. (2013) and illustrated in Table 4. Accordingly, we split the BLR models we obtained into three classes: acceptable, excellent, and outstanding. Note that we obtained only one BLR model with AUC < 0.7, specifically, with AUC = 0.69. As this value is very close to the 0.7 lower boundary of the acceptable AUC category, we include it here in the acceptable AUC category of models, instead of analyzing it by itself in a separate category. Then, we computed the values of RRA (Recall, Fall-out) and RRA(φ = 0.4): Table 7 provides a summary of RRA values for each AUC category.
The NB model obtained for xalan 2.7 has AUC = 0.95, RRA(Recall, Fall-out) = 0, and RRA(φ = 0.4) = 0.01, while the corresponding BLR model has AUC = 0.69, RRA(Recall, Fall-out) = 0.03, and RRA(φ = 0.4) = 0. Note that the values of RRA are very low for both models, even though the two models greatly differ in the values of AUC. Moreover, RRA shows that the apparently outstanding NB model is actually not acceptably accurate. To better understand the relationship between AUC and RRA, we split the excellent AUC range into two sub-ranges, and also split the resulting model sets according to AP/n. Specifically, we split the model set with excellent AUC into those having AUC ∈ (0.8, 0.87] and those having AUC ∈ (0.87, 0.9]; threshold 0.87 was chosen to have enough models in each subset. As for the 25 models with outstanding AUC, 16 were perfect prediction classifiers, with AUC = 1. We put them in a separate category from those with AUC ∈ (0.9, 1).
The results are in Table 8, where "mid" indicates values between 0.2 and 0.8 of AP /n, i.e., in the middle part of the [0, 1] interval of AP /n and "l/h" indicates values of AP /n that are either in the low or high range, i.e., less than 0.2 or greater than 0.8. By "any," we indicate that we did not split the models based on AP /n. Table 8 shows that BLR models with AUC in the (0.8,0.87] range and "l/h" values for AP /n mostly have low values of RRA: e.g., the median RRA(φ = 0.4) is just 0.07 for these models. Models with AUC in the (0.87,0.9] range and mid values of AP /n have instead higher values of RRA. The other models are characterized by variable RRA, hence they should be evaluated individually. As for models with outstanding AUC values, Table 8 obviously confirms that, as noted in Section 9, models with AUC=1 also have RRA=1. The models with AUC∈ (0.9, 1) have generally high RRA (Recall, Fall-out) and RRA(φ=0.4), even though the median of RRA(φ=0.4)= 0.55 is a bit over half the value of RRA(φ=0.4) of the models with AUC= 1. There are two important exceptions, as noted above, i.e., the BLR model for jedit 4.3 with outstanding AUC= 0.9 and extremely poor RRA(φ=0.4)= 0.02 and the NB model for xalan 2.7 with outstanding AUC= 0.95 and extremely poor RRA(φ = 0.4)= 0.01.
To provide additional evidence, Fig. 7 shows the values of AUC and RRA(φ=0.4) for the models obtained from datasets with AP /n ≤ 0.2 or AP /n ≥ 0.8.
Together, Table 8 and Fig. 7 indicate that the strong relationship between the values of AUC and RRA we described in Section 9 only holds when AUC is extremely close to 1, but no longer exists for values of AUC that are considered excellent or even outstanding. Thus, in response to RQ1, we can observe that RRA appears to take into account a more useful region of the ROC space than AUC and G in the evaluation of a defect prediction model. As a consequence, RRA provides more realistic evaluations than traditional performance metrics, which provide unreliable evaluations, under some conditions.

RQ2: Frequency of Using more Adequate Regions
To answer RQ2, we need to check whether low or high values of defect density AP/n are frequent or rare. Figure 8 shows the distribution of defect density values (rounded to the first decimal) of the projects we considered. Quite a large share (31 projects out of 67) have AP /n ≤ 0.2, i.e., a fairly small defect density, while 2 out of 67 have AP /n ≥ 0.8, i.e., a very high defect density. Thus, at least for the considered datasets, for about one half of the models, traditional indicators are bound to provide responses based on very large portions of ROC curves where predictions are worse than random.
Also, notice that the range of "l/h" values of AP /n is 40% of the entire AP /n range (i.e., [0,1]), but accounts for 49% of datasets, i.e., projects are more concentrated in these subintervals of AP /n, for which larger portions of AUC are meaningless.
To obtain additional quantitative insights, we performed some additional analysis. We computed the areas of RoI(Recall, Fall-out) for the analyzed datasets. We found that the mean area of RoI(Recall, Fall-out) is 0.164, while the median is 0.173, and the standard deviation is 0.067. 49% of datasets have RoI(Recall, Fall-out) whose area is ≤ 0.16, i.e., more than 84% of the ROC space used to compute AUC and 68% of the area above the diagonal considered to compute G are in the region representing classifiers that are random or worse than random.
Similarly, we computed the areas of RoI(FM, NM): the mean area is 0.22, the median is 0.24, and the standard deviation is 0.09. 49% of datasets have RoI(FM, NM) whose area is ≤ 0.242, i.e., more than 75% of the ROC space used to compute AUC and 52% of the area above the diagonal considered to compute G are in the region representing classifiers that are random or worse than random (0.242 is the area of RoI(FM, NM) when AP/n is 0.2 or 0.8).
In conclusion, for a large share of datasets, evaluations based on AUC or G are largely based on regions of the ROC space that should not be considered.
However, one may suppose that the classifiers represented by a ROC curve may be concentrated mostly in RoIs such as RoI (Recall, Fall-out). If this is the case, then some of the issues related to using defect prediction models that have performance worse than random policies may be alleviated. Thus, we investigated how many points in the ROC curves we found are outside RoI (Recall, Fall-out). We found that all the models have over 50% of the classifiers out of the RoI, and 64% of the models have over 2/3 of the classifiers out of the RoI.
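The share of classifiers that fall outside RoI(Recall, Fall-out) can be computed directly from the points of a ROC curve, as in the following minimal sketch. It assumes that the RoI under pop is the rectangle with x < 1/(1+k) and y > 1/(1+k), with border points excluded because the reference value comes from a random policy; the function name and the ROC points are hypothetical.

```python
def share_outside_roi(points, k):
    """Fraction of ROC points (x, y) that are not inside RoI(Recall, Fall-out) for pop."""
    border = 1 / (1 + k)   # Fall-out and Recall expected of the pop policy, i.e., AP/n
    inside = sum(1 for x, y in points if x < border and y > border)
    return 1 - inside / len(points)

# Hypothetical ROC points for a dataset with k = AN/AP = 4 (AP/n = 0.2).
points = [(0.0, 0.0), (0.05, 0.3), (0.1, 0.5), (0.3, 0.7), (0.6, 0.9), (1.0, 1.0)]
print(share_outside_roi(points, k=4))
```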
Finally, Fig. 9 shows boxplots comparing the distributions of AUC, G, RRA(Recall, Fall-out) and RRA(φ = 0.4). Figure 9a concerns all the models: it can be seen that AUC provides quite optimistic evaluations: except for just one case, AUC is greater than 0.7, with mean and median well above 0.8. On the contrary, the values of RRA(Recall, Fall-out) are more widely spread, indicating that RRA(Recall, Fall-out) discriminates models more severely and realistically. Values of RRA(φ = 0.4) are even more widely spread, with lower mean and median. Figure 9b provides the same comparison, considering only the models of the 33 out of 67 datasets having "l/h" AP/n. It is noticeable that the distribution of AUC changes only marginally, with respect to Fig. 9a; on the contrary, the distributions of RRA(Recall, Fall-out) and RRA(φ = 0.4) center on lower values, showing that the indications by AUC are far too optimistic.
Based on the collected evidence, we can answer RQ2 by stating that RRA indicators are useful quite frequently. On the contrary, it appears that AUC indications are seldom reliable.

The Software Engineering Perspective
The meaning of RRA is the same as the meaning of AUC and G, at a high level: all these indicators provide an evaluation of the performance of a defect proneness model. Accordingly, RRA can be applied just like AUC and G. However, as discussed in Section 7, RRA can be adapted to specific needs and goals. For instance, performance evaluation can be based on φ or on FM and NM. Accordingly, the meaning of RRA is more specific than the meaning of AUC or G. We now outline how our proposal can be used during software development.

Defect Prediction Model Selection
Suppose that the software manager of a software project, e.g., ivy 2.0, needs to use a defect prediction model based on several measures. The considered model appears reasonably good, since it has AUC=0.87, RRA(FM,NM)=0.42, RRA(Recall, Figure 10 shows the corresponding ROC curve and RoI(Recall, Fall-out). When it comes to choosing a defect proneness threshold to build a defect prediction model, the software manager realizes that
1. by selecting the models corresponding to points close to (0,1), i.e., those on the dotted curve in Fig. 10, as is often suggested, one obtains a model with Fall-out worse than random estimations;
2. by selecting the model corresponding to the tangent point on the highest isocost line that touches the curve, i.e., the one touched by the dashed line in Fig. 10 (as suggested in Powers (2011), for instance), one chooses a model whose Fall-out is worse than random estimations.
Thus, conventional wisdom concerning the position of the best fault prediction model in the ROC space is not always reliable. On the contrary, the RoI highlighted in Fig. 10 suggests where useful thresholds should be chosen from.

Defect Proneness Model Selection
Suppose that a software manager has two defect proneness models (for instance, built with different modeling techniques) and needs to decide which one to use. AUC provides a spurious assessment of the performance of the defect proneness models, because it is based even on parts of the ROC space that are not of interest for the software manager. As Fig. 1b shows, two ROC curves ROC 1 (x) and ROC 2 (x) can intersect each other in such a way that the final selection is based mostly on parts of the ROC space that should not be considered.
However, by focusing only on the RoI, the software manager may find that ROC 1 (x) is predominantly (or even always) above ROC 2 (x) in the RoI, so the defect proneness model corresponding to ROC 1 (x) should be preferred to the one corresponding to ROC 2 (x) when building a defect prediction model. Thus, the software manager can make a better informed (and even easier) decision as to which defect proneness model to use.

Assessing Costs and Benefits of Additional Measures
Suppose a software manager is in charge of a software project in which no real systematic code measurement process is in place, but only data on modules' size (expressed in LOC) and defectiveness are currently available. The software manager builds a LOC-based defect proneness model fp(LOC) and uses AUC to decide whether the model's performance is good enough. For instance, suppose this was the case of the xerces 1.4 project. The value of AUC for fp(LOC) is 0.75, which is right in the middle of the acceptable range. However, if the project manager also checks performance with the RRA metrics, the values RRA(Recall, Fall-out) = 0.2 and RRA(φ = 0.4) = 0.0006 show that fp(LOC) actually has much poorer performance than AUC would indicate.
Thus, the project manager may decide that more measures are needed to build better performing models, start a systematic code measurement collection process, and finally get an fp model based on multiple code measures. Suppose that the total set of measures obtained after putting the new measurement process in place is the one in the datasets by Jureczko and Madeyski. The BLR model that uses all of them has very high RRA(Recall, Fall-out) = 0.77 and RRA(φ = 0.4) = 0.73. Thus, this model has much better performance than the LOC-based model and therefore higher trustworthiness.
However, there may be costs associated with establishing a systematic measurement program, which need to be weighed against the benefits of having better performing models.
For instance, if additional measures can be had by building or buying an automated tool that analyzes software code, the software manager incurs a one-time cost. If, instead, the collection itself of the measures requires the use of resources, then there is a cost associated with every execution of the measurement program. As an example, suppose that a well-performing model requires the knowledge of the Function Points associated with the software system. Counting Function Points can be quite expensive (Jones 2008; Total Metrics 2007). Thus, by using RRA-based metrics, the software manager can have a better assessment of the costs and benefits due to an expanded measurement program.

Adaptability to Goals
Unlike with AUC and G, software managers can "customize" RRA for their projects and goals. The value of λ to be used in RRA(λ, μ) is derived from the unit costs $c_{FN}$ and $c_{FP}$. Thus, RRA(λ, μ) depends on the characteristics of a project, as different projects have different values of $c_{FN}$ and $c_{FP}$ and therefore of λ. So, RRA(λ, μ) takes into account costs more precisely than AUC and G can for a specific project.
Parameter μ is related to the project goals, since it is the desired proportion of unitary cost reduction that determines the maximum unitary cost (see Section 8.2). Using RRA(λ, μ) allows software managers to restrict the selection of a defect prediction model only among those defect proneness models for which RRA(λ, μ) > 0. This kind of selection cannot be carried out by using AUC and G, which may actually be misleading, as shown in Section 10.
In addition, in Section 8.2, we showed that there is a relationship between performance metrics and the cost reduction proportion μ that can be achieved. Poorly performing models imply low levels of cost reduction and, conversely, high levels of cost reduction imply the need for well-performing models. This may call for building better models, as shown, for instance, in Section 11.3.
Software Defect Prediction researchers can use our proposal to have a more precise assessment of the quality of defect prediction models. Like software managers, they can also customize RRA for their goals. For instance, they can use RRA(φ) to have a general assessment of models or focus on specific performance metrics by using RRA(FM,NM), for instance, or any other ones. The variety of ways in which RRA can be defined and used goes beyond what has already been defined and used to delimit the part of the ROC curve to consider in other fields such as medical and biological disciplines (see the review of the literature in Section 13.1).

Threats to Validity of the Empirical Study
We here address possible threats to the validity of our empirical study, which we used to demonstrate our proposal and provide further evidence about it.
Some external validity threats are mitigated by the number of real-life datasets and the variety in their characteristics such as application domains, AP/n ratio, number of modules, and size, as indicated in Section 10.
The values of AUC and G are computed, according to common practice, based on the training set used to build defect prediction models; similarly, to compute RRA, we took the training set as the test set too. Using a different test set than the training set may change the value of AP/n. This concept drift would affect the RoIs to be taken into account (e.g., RoI(Recall, Fall-out)), and, therefore, the value of RRA (e.g., RRA(Recall, Fall-out)). Thus, we may have obtained different results with different test sets than the training sets.
As for the construct validity of RRA, based on Sections 5-8, Section 9 shows that RRA specifically addresses some of the construct validity issues related to AUC and G, which are widely used in Software Defect Prediction and several other disciplines.
Construct validity may be threatened by the performance metrics used. For instance, FM has been widely used in the literature, but it also has been largely criticized (Shepperd et al. 2014). We also used Precision, Recall, Specificity, NPV, NM, and φ, to have a more comprehensive picture and set of constraints based on different perspectives. At any rate, our approach is not limited to any fixed set of performance metrics, and any other may be used as well, as long as it is based on confusion matrices.
Also, we used BLR and NB because of the reasons explained in Section 10. Other techniques may be used, but the building of models is not the goal of this paper: we simply needed models for demonstrative purposes.

Related Work
Given the importance of defect prediction in Software Engineering, many studies have addressed the definition of defect proneness models. They are too many to mention here. ROC curves have often been used to evaluate defect proneness models, as reported by systematic literature reviews on defect prediction approaches and their performance (Arisholm et al. 2010; Beecham et al. 2010a; Hall et al. 2012).
There has been an increasing interest in ROC curves in the Software Defect Prediction and, more generally, Empirical Software Engineering literature in the last few years. For instance, 82 papers using ROC curves appeared in the 2007-2018 period in three major Software Engineering publication venues, namely, "IEEE Transactions on Software Engineering," "Empirical Software Engineering: an International Journal," and the "International Symposium on Empirical Software Engineering and Measurement," while no papers using ROC curves appeared before 2007.
We here first describe and discuss proposals for performance metrics that have appeared in the general literature on ROC curves analysis, to address some of the issues related to the adoption of AUC (Section 13.1).
Then, we review a few of the related works published in the Empirical Software Engineering literature, to provide an idea of what kind of work has been done with ROC curves (Section 13.2).
Also, we show in Section 13.3 how cost modeling can be addressed by our approach even with different cost models than the one we use in Section 8.

ROC Curve Performance Metrics
A few proposals define performance metrics that take into account only portions of a ROC curve. These approaches define various forms of a partial AUC metric (pAUC), which has also been implemented in the R package pROC, available at https://web.expasy.org/pROC/ (Robin et al. 2011). McClish (1989) computes pAUC as the area under ROC(x) in an interval $[x_1, x_2]$ between two specified values of x. To compare the performance of digital and analog mammography tests, Baker and Pinsky (2001) compare the partial AUCs for the two different ROC curves in an interval between two small x values. For mammography-related applications too, Jiang et al. (1996) propose a different version of partial AUC, in which they only take into account the high-recall portion of the ROC space, in which $y > \bar{y}$, where $\bar{y}$ is a specified value of y. They define a metric as the ratio of the area under ROC(x) and above $y = \bar{y}$ to the area above $y = \bar{y}$, i.e., $1 - \bar{y}$. Dodd and Pepe (2003) introduce a nonparametric estimator for partial AUC, computed based on an interval of x, like in McClish (1989), or based on an interval of y. All four papers carry out further statistical investigations (e.g., the definition of statistical tests for comparing the areas under different ROC curves), based on statistical assumptions.
McClish also defines a "standardized" version of pAUC that takes into account only the part of the vertical slice in the $[x_1, x_2]$ interval that is also above the diagonal, i.e., the trapezoid delimited by the diagonal y = x, the vertical lines $x = x_1$ and $x = x_2$, and the horizontal line y = 1. Specifically, the standardized metric is based on the ratio (which we call here pG) between, on one hand, the area under the curve and above the diagonal and, on the other hand, the area of the trapezoid. The standardized metric is then defined as $\frac{1}{2}(1 + pG)$. Note that pG coincides with the partial Gini index that was defined along the same lines in Pundir and Seshadri (2012), by computing the normalized value of Gini's G in an interval $[x_1, x_2]$, and was used in Lessmann et al. (2015) with x in the interval [0, 0.4]. Like between AUC and G, there is a relationship between pAUC and pG. It can be easily shown analytically that $pAUC = pG(1 - \bar{x}) + \bar{x}$, where $\bar{x} = \frac{x_1 + x_2}{2}$ is the midpoint of the $[x_1, x_2]$ interval.
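The relationship between pAUC and pG can be checked numerically. The following is a minimal sketch, not the implementation of any of the cited papers. It assumes that pAUC denotes the partial area divided by the interval width $x_2 - x_1$ (the reading under which the identity holds), that the ROC curve is piecewise linear with strictly increasing x values, and that the curve used as an example is hypothetical.

```python
def roc_y(roc, x):
    """Linearly interpolate a ROC curve given as (x, y) points with strictly increasing x."""
    for (xa, ya), (xb, yb) in zip(roc, roc[1:]):
        if xa <= x <= xb:
            return ya + (yb - ya) * (x - xa) / (xb - xa)
    raise ValueError("x is outside the ROC curve domain")


def partial_metrics(roc, x1, x2):
    """Return (width-normalized pAUC, pG, standardized pAUC) on [x1, x2]."""
    # Breakpoints: interval ends plus the ROC points strictly inside the interval.
    xs = sorted({x1, x2} | {x for x, _ in roc if x1 < x < x2})
    area = sum((b - a) * (roc_y(roc, a) + roc_y(roc, b)) / 2
               for a, b in zip(xs, xs[1:]))
    width = x2 - x1
    x_bar = (x1 + x2) / 2
    above_diagonal = area - width * x_bar   # area between the curve and y = x
    trapezoid = width * (1 - x_bar)         # area between y = x and y = 1
    pg = above_diagonal / trapezoid
    return area / width, pg, (1 + pg) / 2


# Hypothetical ROC curve, used only for illustration.
roc = [(0.0, 0.0), (0.1, 0.5), (0.4, 0.8), (1.0, 1.0)]
pauc, pg, standardized = partial_metrics(roc, 0.0, 0.4)
x_bar = (0.0 + 0.4) / 2
print(pauc, pg * (1 - x_bar) + x_bar)  # the two values coincide
```

With this normalization, the identity is simply the decomposition of the area under the curve into the area under the diagonal and the area between the curve and the diagonal.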
These approaches and ours share the idea that the evaluation of ROC(x) can be of interest with respect to portions of the curve. However, the portion of the ROC space taken into account is either a vertical slice or trapezoid or a horizontal slice of the ROC space, and does not take different forms, like the ones that are possible with our approach, e.g., the ones depicted in Figs. 3a and 5. Also, these portions of the ROC space are not necessarily RoIs according to Definition 3. Thus, there are some RoIs that are not taken into account by these approaches, and vice versa. The reason lies in the goals of these proposals and ours. Our goal is to identify the classifiers whose performance is better than some reference values. The other approaches aim to limit the set of interesting values of either Recall or Fallout. So, they may take into account classifiers whose performance is worse than reference values, e.g., those obtained via random classifiers.
The meaning of AUC has also been investigated by Hand (2009), who finds that computing AUC is equivalent to computing an average minimum misclassification cost with variable weights. Specifically, Hand finds that the expected minimum misclassification cost is equal to $2\frac{AP \cdot AN}{n^2}(1 - AUC)$ when the values of $c_{FN}$ and $c_{FP}$ are not constant, but depend on the classifier used. This poses a conceptual problem, since $c_{FN}$ and $c_{FP}$ should instead depend on the software process costs and the costs related to delivering software modules with defects, and not on the classifier used. Our proposal partially alleviates Hand's issue, by delimiting and reducing the set of classifiers taken into account, so the variability of the values of $c_{FN}$ and $c_{FP}$ is reduced. Hand also defines an alternative metric, H, which, however, relies on two assumptions: (1) $c_{FN} + c_{FP}$ and $\lambda = \frac{c_{FN}}{c_{FN} + c_{FP}}$ are statistically independent; (2) the probability density function of λ used for the computation of E[TC] is a Beta distribution with specified parameters. As for the second assumption, Hand discards the use of a uniform distribution because it would treat extreme and more central values of λ as equally likely. Hand also advocates the use of different functions that can be more closely related to practitioners' and researchers' goals, if available. Hand shows how H can be estimated. Given the correspondence between straight lines and cost models, the identification of a RoI with our proposal may help delimit the set of the values of λ to take into account.
Other performance metrics for ROC curves are described in Swets and Pickett (1982). These metrics are all based on a binormal ROC space, in which the abscissa and the ordinate represent the normal deviates obtained from x and y. One of the recommended metrics is $A_z$, which is the area under the curve of a transformed ROC curve in the binormal ROC space. Since a border can be transformed into a line in the binormal ROC space, our RoIs can help take into account only the portion of the transformed ROC curve that is relevant, and compute only the area under that portion. Other metrics are $d'$, $d'_e$, and $\Delta m$, which are generalized by $d_a$, which represents the distance of a transformed ROC curve from the origin of the binormal ROC space. Also, Swets and Pickett (1982) mention metric β, which can be computed based on costs and benefits of positive and negative observations, which can only be subjectively assessed.
Papers (Flach 2003;Provost and Fawcett 2001;Vilalta and Oblinger 2000) define "isoperformance lines" or "isometrics," i.e., those ROC space lines composed of classifiers with the same value for some specified performance metric. Our proposal uses those lines as borders for RoIs and shows how to derive them starting from random policies, to delimit RoIs.

ROC Curves and AUC in Empirical Software Engineering
ROC curves have been used in Empirical Software Engineering for the assessment of models for several external software attributes (Fenton and Bieman; Morasca 2009).
Here are just a few recent examples of the variety of ways in which ROC curves have been used to assess defect prediction models: Di Nucci et al. (Nucci et al. 2018) use ROC curves and AUC for models based on information about human-related factors; McIntosh and Kamei (2018) for change-level defect prediction models; Nam et al. (2018) for heterogeneous defect prediction; Herbold et al. (2018) to assess the performance of cross-project defect prediction approaches.
As for other external software attributes: Kabinna et al. (2018) use ROC curves and AUC to assess the change proneness of logging statements; da Costa et al. (2018) to study in which future release a fixed issue will be integrated in a software product; Murgia et al. (2018) to assess models for identifying emotions like love, joy, and sadness in issue report comments; Ragkhitwetsagul et al. (2018) to evaluate code similarity; Arisholm et al. (2007) to assess the performance of predictive models obtained via different techniques to identify parts of a Java system with a high probability of fault; Dallal and Morasca (2014) to evaluate module reusability models; Posnett et al. (2011) to study the risk of having fallacious results by conducting studies at the wrong aggregation level; Cerpa et al. (2010) to evaluate models of the relationships linking variables and factors to project outcomes; Malhotra and Khanna (2013) to assess change proneness models.
ROC curves (with and without AUC) have also been used in Empirical Software Engineering studies to find optimal thresholds t for fn (z,t) to build a defect prediction model. For instance, Shatnawi et al. (2010) use AUC to quantify the strength of the relationship between a variable z and defect proneness. The threshold selected by Tosun and Bener (2009) corresponds to the ROC curve point at minimum Euclidean distance from the ideal point (0, 1), which represents perfect estimation. The threshold selected by Sánchez-González et al. (2012) corresponds to the farthest point from the ROC diagram diagonal (see also Mendling et al. (2012)).
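The two geometric selection criteria mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of the cited studies: the function names and the candidate list of (t, x, y) triples are hypothetical, with x the Fall-out and y the Recall obtained with threshold t.

```python
import math


def closest_to_ideal(points):
    """Threshold whose ROC point has minimum Euclidean distance from (0, 1)."""
    return min(points, key=lambda p: math.hypot(p[1], 1 - p[2]))[0]


def farthest_from_diagonal(points):
    """Threshold whose ROC point is farthest from the diagonal y = x.

    The distance from the diagonal is (y - x) / sqrt(2), so maximizing y - x suffices.
    """
    return max(points, key=lambda p: p[2] - p[1])[0]


# Hypothetical (threshold, Fall-out, Recall) triples.
candidates = [(0.3, 0.35, 0.92), (0.4, 0.15, 0.70), (0.8, 0.05, 0.40)]
print(closest_to_ideal(candidates), farthest_from_diagonal(candidates))
```

In this example the two criteria select different thresholds, which shows that they are not interchangeable in general.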

Cost Modeling
The cost related to the use of defect prediction models has been the subject of several studies in the literature that focused on misclassification costs (Hand 2009;Jiang and Cukic 2009;Khoshgoftaar and Allen 1998;Khoshgoftaar et al. 2001;Khoshgoftaar and Seliya 2004), or used cost curves (Drummond and Holte 2006;Jiang et al. 2008). A recent paper (Herbold) defines a cost model based on the idea that a defect may affect several modules and a module may be affected by several defects. Herbold's cost model also allows the use of different Verification and Validation costs for different modules, different costs for different defects, and different probabilities that Verification and Validation activities miss a defect.
Here we provide a detailed discussion for the cost model investigated by Zhang and Cheung (2013). Specifically, Zhang and Cheung use the overall cost of a prediction model $C_p = c_{FP}(TP + FP) + c_{FN} FN$, which includes $c_{FP} TP$, and derive two inequalities that must be satisfied by the confusion matrix of a defect prediction model. We now show how this cost model can be studied in our approach, by defining the inequalities proposed by Zhang and Cheung as RoI borders.
The first inequality is derived by comparing the value of $C_p$ obtained with a binary classifier and the value obtained by trivially estimating all modules positive. When $\lambda \neq \frac{1}{2}$, the equation of the first border is

$$y = \frac{(1-\lambda)k}{2\lambda-1}(x - 1) + 1$$

This is a pencil of straight lines going through point (1, 1). A straight line from this pencil defines an effective border (i.e., the straight line is in the upper-left triangle) if and only if its slope is between 0 and 1, i.e., $0 < \frac{(1-\lambda)k}{2\lambda-1} < 1$. When $\lambda > \frac{1}{2}$, the slope is nonnegative and it can be shown that the slope is less than 1 if and only if $\frac{c_{FP}}{c_{FN}} < \frac{AP}{n}$. When $\lambda < \frac{1}{2}$, the slope is negative, and the straight lines are outside the ROC space. When $\lambda = \frac{1}{2}$, the equation of the border is x = 1, which is not an effective border in the ROC space.
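The condition for an effective border can be illustrated with a short sketch. It assumes $\lambda = \frac{c_{FN}}{c_{FN} + c_{FP}}$ and $k = \frac{AN}{AP}$, and it takes the border as the straight line through (1, 1) with slope $\frac{(1-\lambda)k}{2\lambda-1}$; the function names and the numbers are hypothetical.

```python
def first_border(c_fn, c_fp, k):
    """Return (slope, intercept, effective) of the first border.

    Returns None when c_fn == c_fp (lambda = 1/2), where the border is x = 1.
    """
    if c_fn == c_fp:
        return None
    lam = c_fn / (c_fn + c_fp)
    slope = (1 - lam) * k / (2 * lam - 1)
    # The line goes through (1, 1), so its intercept is 1 - slope.
    return slope, 1 - slope, 0 < slope < 1


def overall_cost(tp, fp, fn, c_fn, c_fp):
    """Overall cost C_p = c_FP (TP + FP) + c_FN FN."""
    return c_fp * (tp + fp) + c_fn * fn


# Hypothetical project: AP = 100, AN = 200 (k = 2), c_FN = 4, c_FP = 1.
ap, an, c_fn, c_fp = 100, 200, 4.0, 1.0
slope, intercept, effective = first_border(c_fn, c_fp, an / ap)

# A classifier on the border costs as much as estimating all modules positive.
x = 0.4
y = slope * x + intercept
cost_on_border = overall_cost(y * ap, x * an, (1 - y) * ap, c_fn, c_fp)
print(effective, cost_on_border, c_fp * (ap + an))
```

Here $\frac{c_{FP}}{c_{FN}} = 0.25$ is smaller than $\frac{AP}{n} = \frac{1}{3}$, so the border is effective; with $c_{FN} = 2$ the ratio is 0.5 and the slope exceeds 1.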
The second inequality is derived by comparing the value of C p obtained with a defect prediction model and the value obtained with the uni random policy. The second inequality, however, turns out to always set the diagonal y = x as the border straight line, so it does not introduce any real constraints.
At any rate, any cost model based on the cells of the confusion matrix related to a defect prediction model can be dealt with by our approach.

Conclusions and Future Work
In this paper, new concepts and techniques are proposed to assess the performance of a given defect proneness model fp(z) when building defect prediction models. The proposed assessment is based on two fundamental concepts: 1) only models that outperform reference ones (including, but not limited to, random estimation) should be considered, and 2) any combination of performance metrics can be used. Concerning the latter point, not only do we allow researchers and practitioners to use the metric they like best (e.g., F-measure or φ), but we introduce the possibility of evaluating models against cost, and we show that there is a clear correspondence between performance metrics (such as Recall, Precision, F-measure, etc.) and cost.
Using the proposed technique, practitioners and researchers can identify the thresholds worth using to derive defect prediction models based on a given defect proneness model, so that the obtained models perform better than reference ones. Our approach helps practitioners evaluate competing defect prediction models and select adequate ones for their goals and needs. It allows researchers to assess software development techniques based only on those defect prediction models that should be used in practice, and not on irrelevant ones that may bias the results and lead to misleading conclusions.
Unlike the traditional AUC metric, which considers the entire ROC curve, our approach considers only the part of the ROC curve where performance, evaluated via the metrics of choice, is better than reference performance values, which can be provided by reference models, for instance.
We show that RRA, when used with suitable areas of interest, like those that exclude random behaviors, is theoretically sounder than traditional ROC-based metrics (like AUC and Gini's G). The latter are special cases of RRA, but computed on areas that include worse-than-random classifiers.
We also applied RRA, G, and AUC to models obtained from 67 real-life projects, and compared the obtained indications. RRA appeared to provide much deeper insight into the actual performance of models.
RRA proved more adequate than AUC and G in capturing the information used in the evaluation of defect prediction models. Specifically, AUC and G appeared to consider a large amount of information pertaining to random (and worse) performance conditions. As a consequence, AUC and G often reported high performance levels, while the performance of the corresponding models was much lower. In these cases, RRA provided much more realistic indications, revealing these low performance levels. Our analysis showed also that AUC and G can be quite frequently misleading.
Although in the empirical validation (Section 10) only measures taken on modules were used as independent variables, other types of measures (e.g., process measures) could be used in exactly the same way. That is, our approach is applicable to a broader class of models than those considered in this paper.
As a further generalization, our approach can be used outside Software Defect Prediction. What is needed is a scoring function, so that a ROC curve can be built. Thus, if such a model for, say, availability is known, it can be used in our approach in exactly the same way as defect proneness models.
Even more generally, the approach can be conceptually used for any kind of scoring function, so it can be used in disciplines beyond Empirical Software Engineering. Also, the approach can be used with other kinds of constraints that can be set on scoring functions. For instance, one can also set a constraint on the value of the first derivative of the scoring function, as we did in our previous work to define risk-averse thresholds for defect proneness models.
Future work will be needed to provide more evidence about the usefulness and the limitations of the approach, including
- the assessment of the approach on more datasets
- the use of additional performance metrics, to have a more complete idea of the performance of the classification
- a more in-depth study of the characteristics of RRA, for instance, by introducing statistical tests to check whether the differences in the values of RRA of different ROC curves are statistically significant
- the investigation of other techniques for obtaining an overall assessment of a defect proneness model
- the investigation of other cost models, such as the one recently introduced by (Herbold)
- the application to other external attributes that can be quantified by means of probabilistic models (Krantz et al. 1971; Morasca 2009), e.g., maintainability, usability, reusability.
We here show how to compute the performance metrics of Table 2 in terms of x and y. First, they can all be built as functions of the ratios of the cells of a confusion matrix (TP, FP, TN, and FN) to the row and column totals (AP, EP, AN, and EN). This is true by definition for the metrics that assess performance with respect to the positive class, Precision and Recall, and therefore for F-measure too. Likewise, this is true by definition for the metrics that assess performance with respect to the negative class, NPV and Specificity, and therefore for NM too. As for overall metrics, this is the case for J and Markedness by definition, so it is also the case for φ. Now, we show how to express these ratios in terms of x and y. We start with the ratios of the cells to AP: $\frac{TP}{AP} = y$; $\frac{FP}{AP} = \frac{FP}{AN}\frac{AN}{AP} = kx$; $\frac{FN}{AP} = 1 - y$; $\frac{TN}{AP} = \frac{TN}{AN}\frac{AN}{AP} = k(1 - x)$. Via an example, we show how the other ratios can now be derived. Ratio $\frac{TP}{EP}$ is clearly equal to $\frac{TP}{AP}\frac{AP}{EP}$. Now, $\frac{AP}{EP} = \frac{AP}{FP + TP} = \frac{1}{\frac{FP}{AP} + \frac{TP}{AP}} = \frac{1}{kx + y}$. Thus, $\frac{TP}{EP} = \frac{y}{kx + y}$. Likewise, we can compute all other ratios whose denominator is EP, by multiplying the corresponding ratios obtained with AP at the denominator by $\frac{1}{kx + y}$. As for the ratios with denominator AN, we multiply the corresponding ratio obtained with AP as the denominator by $\frac{AP}{AN} = \frac{1}{k}$. Finally, for the ratios whose denominator is EN, we multiply the corresponding ratio obtained with AP as the denominator by $\frac{AP}{EN} = \frac{1}{k(1 - x) + 1 - y}$.
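The construction above can be sketched in a few lines of code. This is a minimal illustration, assuming $k = \frac{AN}{AP}$, x = Fall-out, y = Recall, the usual definitions Markedness = Precision + NPV − 1 and J = Recall + Specificity − 1, and the known identity $\varphi^2 = J \cdot Markedness$; the function name and the confusion matrix are hypothetical, and the degenerate points where EP = 0 or EN = 0 are not handled.

```python
import math


def metrics_from_xy(x, y, k):
    """Performance metrics as functions of x (Fall-out), y (Recall), and k = AN/AP."""
    tp, fp, fn, tn = y, k * x, 1 - y, k * (1 - x)   # cells divided by AP
    ep, en = tp + fp, fn + tn                        # EP/AP and EN/AP
    precision, recall = tp / ep, y
    npv, specificity = tn / en, 1 - x
    j = recall + specificity - 1
    markedness = precision + npv - 1
    return {
        "Precision": precision,
        "Recall": recall,
        "F-measure": 2 * precision * recall / (precision + recall),
        "NPV": npv,
        "Specificity": specificity,
        "NM": 2 * npv * specificity / (npv + specificity),
        "J": j,
        "Markedness": markedness,
        # phi has the same sign as J; max() guards against tiny negative rounding errors.
        "phi": math.copysign(math.sqrt(max(j * markedness, 0.0)), j),
    }


# Hypothetical confusion matrix, used to cross-check against the cell-based definitions.
TP, FP, FN, TN = 60, 80, 40, 120
m = metrics_from_xy(FP / (FP + TN), TP / (TP + FN), (FP + TN) / (TP + FN))
print(m["Precision"], TP / (TP + FP))
```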

Appendix C: The Behavior of Border Straight Lines
We discuss how the borders obtained for each performance metric change depending on k, c, and p (see Table 5).
Precision. The general straight line $y = \frac{c}{1-c}kx$ goes through the origin and has a slope that is proportional to k and increases with c. Thus, the higher k and/or c, the stricter the constraint, i.e., the smaller the RoI, as expected. The uni straight line is y = x regardless of p, so it is also the pop straight line. This is the diagonal of the ROC space, so it does not provide any additional constraints on the values of x and y, as only ROC curves in the upper-left half of the ROC space should be considered, as explained in Section 5.1.

Recall. The general straight line is horizontal. As expected, the constraint becomes stricter as c increases. Unlike with Precision, k has no influence on the strictness of the constraint (unless c itself is a function of k, like in the pop case). The uni straight line shows that the higher p, the stricter the constraint. The pop straight line can be rewritten as $y = \frac{AP}{n}$.

F-measure. The general straight line intersects the x-axis at $x = -\frac{1}{k}$ and the y-axis at $y = \frac{c}{2-c} \leq 1$. Both the intercept and the coefficient of x increase with c, and the coefficient also increases with k. As for the uni straight line, given p, the larger k, the larger the slope and the intercept, so the stricter the constraint. Also, given k, the larger p, the larger the slope and the intercept, so the stricter the constraint. The uni line intersects the vertical line x = 1 at $y = \frac{pk+p}{pk+1} \leq 1$, so it extends across the entire horizontal span of the ROC space. In the pop straight line, when k increases from 0 to ∞, the slope monotonically increases from 0 to 1 and the intercept monotonically decreases from $\frac{1}{2}$ to 0. Thus, when k = 0, we have the horizontal line $y = \frac{1}{2}$, which, as k increases, tilts and tends to the diagonal y = x when k → ∞.

NPV. The general straight line goes through point (1, 1) and has a slope proportional to k and an intercept that decreases with k. Thus, the higher k, the less strict the constraint. Also, the slope decreases and the intercept increases when c increases, i.e., the constraint becomes stricter as the minimum acceptable value of NPV increases. The uni (and therefore pop) straight line is y = x, like in the Precision case.

Specificity. The general straight line is vertical. As expected, the constraint becomes stricter as c increases. Unlike with NPV, k has no influence on the strictness of the constraint (unless, again, c itself is a function of k, like in the pop case). The uni straight line shows that the higher p, the stricter the constraint. The pop straight line can be rewritten as $x = \frac{AP}{n}$.

NM. In the general straight line, given a value of c, the coefficient of x increases with k and the intercept decreases. Given a value of k, the coefficient of x decreases with c and the intercept increases. As for the uni straight line, when p is fixed, the higher k, the steeper the slope, but the lower the intercept. When k is fixed, the slope increases when p increases, but the intercept decreases. In both cases, the constraint becomes less strict.
The uni straight line has a negative intercept $-\frac{p}{1-p}k$, it intersects the horizontal line y = 0 at $x = \frac{pk}{1-p+k} \leq 1$, and the horizontal line y = 1 at $x = \frac{1-p+pk}{1-p+k} \leq 1$, so it extends across the entire vertical span of the ROC space. With any random policy, when k = 0 the straight line is the diagonal y = x, and when k → ∞, the straight line tends to become the vertical straight line x = p.
In the pop straight line, p and k are no longer independent, so the results on the behavior of slope and intercept when k varies while p is fixed no longer hold. Instead, with pop, the slope increases with k and the intercept remains constant.
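The general border straight lines discussed in this appendix can be computed directly from k and c. The sketch below is an illustration only: the slope and intercept expressions are derived from the metric definitions in terms of x, y, and $k = \frac{AN}{AP}$ rather than copied from Table 5, the function names are hypothetical, and 0 < c < 1 is assumed.

```python
def border_line(metric, c, k):
    """(slope, intercept) of the straight line on which `metric` equals c."""
    lines = {
        "Precision": (c * k / (1 - c), 0.0),
        "F-measure": (c * k / (2 - c), c / (2 - c)),
        "NPV": ((1 - c) * k / c, 1 - (1 - c) * k / c),
        "NM": ((2 - c) * k / c, 1 - 2 * (1 - c) * k / c),
    }
    return lines[metric]


def metric_value(metric, x, y, k):
    """Value of `metric` at ROC point (x, y), from the confusion matrix cells divided by AP."""
    tp, fp, fn, tn = y, k * x, 1 - y, k * (1 - x)
    values = {
        "Precision": tp / (tp + fp),
        "F-measure": 2 * tp / (2 * tp + fp + fn),
        "NPV": tn / (tn + fn),
        "NM": 2 * tn / (2 * tn + fn + fp),
    }
    return values[metric]


# Check: every point on a border line has the metric equal to c.
# The check is purely algebraic, so the point may fall outside the ROC space.
k, c, x = 2.0, 0.6, 0.5
for metric in ("Precision", "F-measure", "NPV", "NM"):
    slope, intercept = border_line(metric, c, k)
    print(metric, slope, intercept, metric_value(metric, x, slope * x + intercept, k))
```

The Recall and Specificity borders are omitted because they are simply the horizontal line y = c and the vertical line x = 1 − c.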

Appendix D: Conic Sections as Borders
Here, we show that the border equations for Markedness (see Appendix D.1) and φ (see Appendix D.2) of Table 6 are conic sections, and discuss some of their properties.
Recall that a conic section can be analytically represented via the quadratic equation (which we describe with doubled terms for mathematical convenience)

$$Ax^2 + 2Bxy + Cy^2 + 2Dx + 2Ey + F = 0$$

D.1 Markedness
The border equation for Markedness in Table 6 is, in implicit form,

$$ck^2x^2 + 2ckxy + cy^2 - k(ck + c + 1)x + (k - ck - c)y = 0 \qquad (13)$$

It is immediate to show that $B^2 = AC$. Thus, the curve in Formula (13) is a parabola.
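The parabola condition can be checked numerically. The following sketch assumes that Markedness = Precision + NPV − 1, with Precision and NPV written in terms of x, y, and $k = \frac{AN}{AP}$, and that the conic is written with doubled mixed and linear terms; the coefficient names D, E, F and the function names are hypothetical, and the coefficients are derived here from that definition rather than taken from Table 6.

```python
def markedness(x, y, k):
    """Markedness = Precision + NPV - 1 at ROC point (x, y)."""
    precision = y / (k * x + y)
    npv = k * (1 - x) / (k * (1 - x) + 1 - y)
    return precision + npv - 1


def markedness_conic(c, k):
    """Coefficients (A, B, C, D, E, F) of A x^2 + 2B xy + C y^2 + 2D x + 2E y + F = 0."""
    return (c * k ** 2, c * k, c,
            -k * (c * k + c + 1) / 2, (k - c * k - c) / 2, 0.0)


def conic_value(coeffs, x, y):
    """Left-hand side of the conic equation at (x, y)."""
    a, b, c2, d, e, f = coeffs
    return a * x * x + 2 * b * x * y + c2 * y * y + 2 * d * x + 2 * e * y + f


# Hypothetical ROC point: it must lie on the border built for its own Markedness value.
x, y, k = 0.4, 0.6, 2.0
coeffs = markedness_conic(markedness(x, y, k), k)
A, B, C = coeffs[:3]
print(B * B - A * C, conic_value(coeffs, x, y))  # both are (numerically) zero
```

Since F = 0, the curve goes through the origin, and for c = 0 the equation reduces to k(y − x) = 0, i.e., the diagonal.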
The following properties of the parabola can be proven.
- There is one degenerate case that we deal with immediately, which occurs when c = 0. The parabola of Formula (13) degenerates into the straight line y = x, i.e., the diagonal. From this point on, we therefore assume that c > 0.
- The symmetry axis of the parabola is
- The directrix of the parabola is
It is immediate to show that the parabola goes through the origin, which is below the directrix. Since parabolas never intercept their directrices, the entire parabola of Formula (13) lies below the directrix. Therefore, the vertex of the parabola lies below the directrix too and the parabola lies to the "south-east" of its vertex.

D.2 φ
- The ellipse intercepts the boundaries of the ROC space at (0, 0), $\left(0, \frac{c^2(k+1)}{c^2+k}\right)$, $\left(\frac{1-c^2}{1+c^2k}, 1\right)$, (1, 1), $\left(1, \frac{(1-c^2)k}{c^2+k}\right)$, $\left(\frac{c^2(k+1)}{1+c^2k}, 0\right)$.
- The ellipse is centered in point $\left(\frac{1}{2}, \frac{1}{2}\right)$, in the center of the ROC space.
- The equation of the major axis of the ellipse is
- Regardless of the value of k, when $c^2 \to 0$, the slope tends to +1, i.e., the major axis tends to the diagonal of the ROC space. When $c^2 \to 1$, the slope tends to 0, i.e., the major axis tends to the horizontal straight line $y = \frac{1}{2}$.
- When k < 1, the slope of the major axis is a monotonically decreasing function of $c^2$ and when $c^2 \to 1$, the slope tends to 0, i.e., the major axis tends to the horizontal straight line $y = \frac{1}{2}$.
- When k > 1, the slope of the major axis is a monotonically increasing function of $c^2$, and when $c^2 \to 1$, the slope tends to +∞, i.e., the major axis tends to the vertical straight line $x = \frac{1}{2}$.
- The slope of the major axis is an increasing function of k, regardless of the value of c. When k → 0, the slope tends to 0 and when k → +∞, the slope tends to +∞.
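The intercepts of the ellipse with the boundaries of the ROC space can be verified numerically. This is a minimal sketch, assuming the usual definition of φ in terms of the confusion matrix cells, here divided by AP with $k = \frac{AN}{AP}$; the function names and the values of c and k are hypothetical.

```python
import math


def phi(x, y, k):
    """phi at ROC point (x, y), from the confusion matrix cells divided by AP."""
    tp, fp, fn, tn = y, k * x, 1 - y, k * (1 - x)
    # AP/AP = 1 and AN/AP = k complete the product under the square root.
    return (tp * tn - fp * fn) / math.sqrt((tp + fp) * (fn + tn) * k)


def phi_border_intercepts(c, k):
    """Intercepts of the border phi^2 = c^2 with the sides of the ROC space, corners excluded."""
    c2 = c * c
    return [
        (0.0, c2 * (k + 1) / (c2 + k)),        # on x = 0
        ((1 - c2) / (1 + c2 * k), 1.0),        # on y = 1
        (1.0, (1 - c2) * k / (c2 + k)),        # on x = 1
        (c2 * (k + 1) / (1 + c2 * k), 0.0),    # on y = 0
    ]


c, k = 0.4, 2.0
for px, py in phi_border_intercepts(c, k):
    print((px, py), phi(px, py, k))
```

The first two points lie above the diagonal, where φ = c, while the last two lie below it, where φ = −c; all four satisfy $\varphi^2 = c^2$.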

Appendix E: On the Relationships between μ, c, and k
Here, for each metric in Table 5, we show how to compute 1) the value of the cost reduction proportion μ that can be achieved with a technique that improves the value of a metric PFM from $PFM_{pop}$ to c; 2) the improvement of PFM required to obtain a μ cost reduction. In Section 8.2, we showed the results only for Precision, which we repeat here for completeness. For each metric PFM, we find the intersection point between the straight line that represents the locus in the ROC space where PFM is equal to a constant value c and the straight line y = −kx + 1 to which point $\left(\frac{\mu}{1+k}, 1 - \frac{\mu k}{1+k}\right)$ belongs. Each point on that straight line corresponds to a different value of μ and, conversely, for each value of μ that point belongs to a straight line where PFM = c for a specific value of c.
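Precision. We need to solve the following linear system

$$y = \frac{c}{1-c}kx, \qquad y = -kx + 1$$

whose solution is $x = \frac{1-c}{k}$, $y = c$. By equating $x = \frac{1-c}{k}$ and $x = \frac{\mu}{1+k}$, we obtain $\mu = \frac{(1-c)(1+k)}{k}$ and $c = 1 - \frac{\mu k}{1+k}$.

NPV. The border intersects the straight line y = −kx + 1 at x = 1 − c, y = −(1 − c)k + 1. By equating x = 1 − c and $x = \frac{\mu}{1+k}$, we obtain μ = (1 − c)(1 + k) and $c = 1 - \frac{\mu}{1+k}$.

Specificity. The border is x = 1 − c, which, as we saw for NPV, is associated with y = −(1 − c)k + 1. We therefore obtain exactly the same results as with NPV.

NM. We need to solve the following linear system

$$y = \frac{2-c}{c}kx - 2\frac{1-c}{c}k + 1, \qquad y = -kx + 1$$

Again (and as expected, since NM is the harmonic mean of NPV and Specificity), we have the same results as with NPV and Specificity, since x = 1 − c and y = −(1 − c)k + 1.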
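The relationship between μ, c, and k can be computed for each metric by intersecting its border with the straight line y = −kx + 1. The sketch below is illustrative only: the border expressions are derived from the metric definitions in terms of x, y, and $k = \frac{AN}{AP}$ (they are assumptions, not copied from Table 5), the function name is hypothetical, and 0 < c < 1 is assumed.

```python
def mu_from_c(metric, c, k):
    """Cost reduction proportion mu at the point of y = -kx + 1 where `metric` equals c."""
    borders = {  # (slope, intercept) of the straight line metric = c
        "Precision": (c * k / (1 - c), 0.0),
        "F-measure": (c * k / (2 - c), c / (2 - c)),
        "NPV": ((1 - c) * k / c, 1 - (1 - c) * k / c),
        "NM": ((2 - c) * k / c, 1 - 2 * (1 - c) * k / c),
    }
    if metric == "Recall":          # horizontal border y = c
        x = (1 - c) / k
    elif metric == "Specificity":   # vertical border x = 1 - c
        x = 1 - c
    else:
        slope, intercept = borders[metric]
        x = (1 - intercept) / (slope + k)   # intersection with y = -kx + 1
    # The intersection point has x = mu / (1 + k).
    return x * (1 + k)


k, c = 2.0, 0.9
for metric in ("Precision", "Recall", "F-measure", "NPV", "Specificity", "NM"):
    print(metric, mu_from_c(metric, c, k))
```

For NPV, Specificity, and NM the result is (1 − c)(1 + k); under the stated assumptions, the metrics related to the positive class all give $\frac{(1-c)(1+k)}{k}$.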

Appendix F: On the Effect of k
We here describe the analytic results that show the effect of k in two cases: when NC max is kept fixed (Section F.1) and when μ is kept fixed (Section F.2).

F.1 Keeping NC max Fixed for Different Values of k
Suppose that a software manager has chosen a value of $NC_{max}$, but he or she is not entirely sure about the value of k for the project at hand. Since the pencil of straight lines described by Formula (10) goes through center point $\left(\frac{\mu}{1+k}, 1 - \frac{\mu k}{1+k}\right)$, we need to study how that point varies when $NC_{max}$ is fixed and k varies. Depending on the possible positions of the center point, one can make decisions on the level of performance of the defect prediction models that need to be used. To this end, we solve $NC_{max} = \frac{\mu k}{(1+k)^2}$ for μ, and we obtain $\mu = NC_{max}\frac{(1+k)^2}{k}$. By substituting μ into the coordinates of the center point, we obtain point $\left(NC_{max}\frac{1+k}{k}, 1 - NC_{max}(1+k)\right)$. The x- and y-values of this point vary with k, i.e., they describe a curve in parametric form, where k is the parameter. The parametric equations of the curve are

$$x = NC_{max}\frac{1+k}{k}, \qquad y = 1 - NC_{max}(1+k)$$

By solving the first for k, we obtain k as a function of x. We replace k in the second equation by using this function of x and we obtain the following relation between x and y

$$xy - (1 - NC_{max})x - NC_{max}y + NC_{max} = 0 \qquad (20)$$

It can be shown that it is the equation of an equilateral hyperbola, with vertical asymptote $x = NC_{max}$ and horizontal asymptote $y = 1 - NC_{max}$. We are interested in the part of this hyperbola that is above the diagonal. It can be shown that this hyperbola intersects the diagonal at the two points

$$x = y = \frac{1 \pm \sqrt{1 - 4NC_{max}}}{2}$$

The part under the square root sign is nonnegative, because $NC_{max} \leq \frac{1}{4}$, as we showed in Section 8.1. Thus, the software manager, based only on the desired value of $NC_{max}$, can compute all the points in the ROC space in which $NC = NC_{max}$.
Note also that different projects will have different values of k. If the software manager decides that all projects should have NC = NC max , (20) also describes the set of points in the ROC space where this happens.
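The locus of the center point for a fixed $NC_{max}$ can be traced with a few lines of code. This is a minimal sketch, assuming $NC_{max} = \frac{\mu k}{(1+k)^2}$ and the center point $\left(\frac{\mu}{1+k}, 1 - \frac{\mu k}{1+k}\right)$; the function names and the numeric values are hypothetical.

```python
import math


def center_point(nc_max, k):
    """Center point of the pencil of lines when NC_max is fixed and k is given."""
    mu = nc_max * (1 + k) ** 2 / k
    return mu / (1 + k), 1 - mu * k / (1 + k)


def hyperbola_residual(nc_max, x, y):
    """Left-hand side of xy - (1 - NC_max) x - NC_max y + NC_max = 0."""
    return x * y - (1 - nc_max) * x - nc_max * y + nc_max


def diagonal_intersections(nc_max):
    """Abscissas of the two points where the hyperbola meets the diagonal y = x."""
    if nc_max > 0.25:
        raise ValueError("NC_max cannot exceed 1/4")
    root = math.sqrt(1 - 4 * nc_max)
    return (1 - root) / 2, (1 + root) / 2


nc_max = 0.1
for k in (0.5, 1.0, 2.0):
    x, y = center_point(nc_max, k)
    print(k, (x, y), hyperbola_residual(nc_max, x, y))  # residual is (numerically) zero
print(diagonal_intersections(nc_max))
```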

F.2 Keeping μ Fixed for Different Values of k
Suppose now that the software manager wants the normalized cost for every project to be at most equal to $NC_{max} = \frac{\mu k}{(1+k)^2}$, i.e., with the same proportion μ for every value of k. The value of $NC_{max}$ therefore changes across projects, but the software manager may expect that, because projects with different values of k may have different needs. We here show how center point $\left(\frac{\mu}{1+k}, 1 - \frac{\mu k}{1+k}\right)$ moves in the ROC space when k varies. Based on the definition of the center point, when k varies, we have a curve in parametric form, whose parametric equations are

$$x = \frac{\mu}{1+k}, \qquad y = 1 - \frac{\mu k}{1+k}$$

By eliminating k, we obtain the following relation between x and y

$$y = x + 1 - \mu$$

This straight line is parallel to the diagonal. The part of this straight line that belongs to the ROC space is the segment corresponding to values of x in the interval [0, μ].

Software project management and effort estimation; Software process modeling, measurement and improvement; Open Source Software. He was involved in several international research projects, and he also served as reviewer of EU projects. He is co-author of over 170 scientific articles, published in international journals, in the proceedings of international conferences, or in books. He has served on the PC of several international Software Engineering conferences, and in the editorial board of international journals.