Receiver operating characteristic (ROC) curves: equivalences, beta model, and minimum distance estimation

Receiver operating characteristic (ROC) curves are used ubiquitously to evaluate scores, features, covariates or markers as potential predictors in binary problems. We characterize ROC curves from a probabilistic perspective and establish an equivalence between ROC curves and cumulative distribution functions (CDFs). These results support a subtle shift of paradigms in the statistical modelling of ROC curves, which we view as curve fitting. We propose the flexible two-parameter beta family for fitting CDFs to empirical ROC curves and derive the large sample distribution of minimum distance estimators in general parametric settings. In a range of empirical examples the beta family fits better than the classical binormal model, particularly under the vital constraint of the fitted curve being concave.


Introduction
Through all realms of science and society, the assessment of the predictive ability of scores or features for binary outcomes is of critical importance. To give but a few examples, biomarkers are used to diagnose diseases, weather forecasts serve to anticipate extreme precipitation events, judges need to assess recidivism in convicts, banks use customers' particulars to assess credit risk, and email messages are to be identified as spam or legitimate.
In these and myriads of similar settings, receiver operating characteristic (ROC) curves are key tools in the evaluation of predictive potential (Fawcett 2006; Flach 2016).
A ROC curve is simply a plot of the hit rate (HR) against the false alarm rate (FAR) across the range of thresholds for a real-valued score. Specifically, consider the joint distribution $Q$ of the pair $(S, Y)$, where the score $S$ is real-valued, and the event $Y$ is binary, with the implicit understanding that higher values of $S$ provide stronger support for the event to materialize ($Y = 1$). The joint distribution $Q$ of $(S, Y)$ is characterized by the prevalence $\pi_1 = Q(Y = 1) \in (0, 1)$ along with the conditional cumulative distribution functions (CDFs)
$$F_1(s) = Q(S \le s \mid Y = 1) \quad \text{and} \quad F_0(s) = Q(S \le s \mid Y = 0).$$
Any threshold value $s$ can be used to predict a positive outcome ($Y = 1$) if $S > s$ and a negative outcome ($Y = 0$) if $S \le s$, to yield a classifier with hit rate (HR),
$$\mathrm{HR}(s) = Q(S > s \mid Y = 1) = 1 - F_1(s),$$
and false alarm rate (FAR),
$$\mathrm{FAR}(s) = Q(S > s \mid Y = 0) = 1 - F_0(s).$$
Terminologies abound and differ markedly between communities. For example, the hit rate has also been referred to as true positive rate (TPR), sensitivity or recall. The false alarm rate is also known as false positive rate (FPR) or fall-out and equals one minus the true negative rate, specificity, or selectivity.
The term raw ROC diagnostic refers to the set-theoretic union of the points of the form $(\mathrm{FAR}(s), \mathrm{HR}(s))'$ within the unit square. The ROC curve is a linearly interpolated raw ROC diagnostic, and therefore it also is a point set that may or may not admit a direct interpretation as a function. However, if $F_1$ and $F_0$ are continuous and strictly increasing, the raw ROC diagnostic and the ROC curve can be identified with a function $R$, where $R(0) = 0$,
$$R(p) = 1 - F_1(F_0^{-1}(1 - p)) \quad \text{for } p \in (0, 1), \tag{1}$$
and $R(1) = 1$. High hit rates and low false alarm rates are desirable, so the area under the ROC curve (AUC) is a widely used, positively oriented measure of the predictive potential of a score or feature. In data analytic practice, the measure $Q$ is the empirical distribution of a sample $(s_i, y_i)_{i=1}^n$ of real-valued scores $s_i$ and corresponding binary observations $y_i$. In this setting, it suffices to consider the unique values of $s_1, \ldots, s_n$ to generate the raw ROC diagnostic, and linear interpolation yields the empirical ROC curve, as illustrated in our data examples.
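As a minimal sketch of the empirical construction, the following Python code computes the vertices of an empirical ROC curve from scores and binary labels, and the AUC of the linearly interpolated curve by the trapezoidal rule. The function names `empirical_roc` and `auc` are illustrative assumptions and do not refer to the software used for the data examples.

```python
def empirical_roc(scores, labels):
    """Vertices (FAR, HR) of the empirical ROC curve, from (0, 0) to (1, 1).

    labels are 0/1; higher scores are taken to support the event Y = 1.
    """
    n1 = sum(labels)
    n0 = len(labels) - n1
    if n0 == 0 or n1 == 0:
        raise ValueError("need at least one instance of each outcome")
    # Sort by decreasing score, so that lowering the threshold adds instances.
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    hits = false_alarms = 0
    i = 0
    while i < len(pairs):
        value = pairs[i][0]
        # Tied scores enter together: only unique score values give vertices.
        while i < len(pairs) and pairs[i][0] == value:
            hits += pairs[i][1]
            false_alarms += 1 - pairs[i][1]
            i += 1
        points.append((false_alarms / n0, hits / n1))
    return points


def auc(points):
    """Area under the linearly interpolated curve (trapezoidal rule)."""
    return sum((a1 - a0) * (b0 + b1) / 2
               for (a0, b0), (a1, b1) in zip(points, points[1:]))


points = empirical_roc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(points)
print(auc(points))
```

In this toy sample three of the four (positive, negative) score pairs are correctly ordered, which matches the trapezoidal area of 0.75.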
The remainder of our note is organized as follows. Section 2 establishes rigorous versions of fundamental theoretical results that thus far have been available informally or in special cases only. In particular, we demonstrate an equivalence between ROC curves and CDFs. In Sect. 3 we introduce the flexible yet parsimonious two-parameter beta model, which uses the CDFs of beta distributions to model ROC curves, and we discuss estimation and testing in this context. The paper closes with data examples in Sect. 4.

Fundamental properties of ROC curves
Consider the random vector $(S, Y)$ where $S$ is a real-valued score, feature, covariate, or marker, and $Y$ is the binary response. As before, we refer to the joint distribution of $(S, Y)$ as $Q$. Let $\pi_1 = Q(Y = 1) \in (0, 1)$ and $\pi_0 = 1 - \pi_1$, and let $F(s) = Q(S \le s) = \pi_0 F_0(s) + \pi_1 F_1(s)$ denote the marginal cumulative distribution function (CDF) of the score $S$. We write $G(s-)$ for the left-hand limit of a function $G$ at $s \in \mathbb{R}$, and we let $(a, b)'_{(1)} = a$ and $(a, b)'_{(2)} = b$ denote coordinate projections.

Raw ROC diagnostics and ROC curves
In this setting ROC diagnostics concern the points of the form $(\mathrm{FAR}(s), \mathrm{HR}(s))'$ within the unit square. Formally, the raw ROC diagnostic for the bivariate distribution $Q$ is the point set
$$R^* = \{ (\mathrm{FAR}(s), \mathrm{HR}(s))' : s \in \mathbb{R} \}. \tag{2}$$
The raw ROC diagnostic along with a single marginal does not characterize $Q$, due to its well known invariance properties (Fawcett 2006; Krzanowski and Hand 2009). However, the raw ROC diagnostic along with both marginal distributions determines $Q$.

Theorem 1
The joint distribution Q of (S, Y) is characterized by the raw ROC diagnostic and the marginal distributions of S and Y.

Proof
The mapping $g : [0, 1]^2 \to [0, 1]$ defined by $(a, b)' \mapsto (1 - a) \pi_0 + (1 - b) \pi_1$ induces a bijection between the raw ROC diagnostic $R^*$ and the range of $F$. Therefore, it suffices to note that $Q(S \le s, Y \le y) = 0$ for $y < 0$, $Q(S \le s, Y \le y) = \pi_0 (1 - \mathrm{FAR}(s))$ for $y \in [0, 1)$, where $\mathrm{FAR}(s)$ is the first coordinate of the point in $R^*$ that $g$ maps to $F(s)$, and $Q(S \le s, Y \le y) = F(s)$ for $y \ge 1$. ◻

Briefly, a ROC curve is obtained from the raw ROC diagnostic by linear interpolation. Formally, the full ROC diagnostic or ROC curve is the point set
$$R = \{(0, 0)'\} \cup \Big( \bigcup_{s \in \mathbb{R}} L_s \Big) \cup \{(1, 1)'\} \tag{3}$$
within the unit square, where
$$L_s = \{ t \, (\mathrm{FAR}(s), \mathrm{HR}(s))' + (1 - t) \, (\mathrm{FAR}(s-), \mathrm{HR}(s-))' : t \in [0, 1] \}$$
is a possibly degenerate, nondecreasing line segment. The choice of linear interpolation to complete the raw ROC diagnostic into the ROC curve (3) is natural and persuasive, as the line segment $L_s$ represents randomized combinations of the classifiers associated with its end points. The raw ROC diagnostic can be recovered from the ROC curve and the two marginal distributions, as the mapping $g$ in the proof of Theorem 1 induces a bijection between the raw ROC diagnostic and the range of $F$ that can be expressed in terms of $\pi_1$ and $\pi_0$. From this simple fact the following result is immediate.

Corollary 1
The joint distribution Q of (S, Y) is characterized by the ROC curve and the marginal distributions of S and Y.
Given a ROC curve $R$, an obvious task is to find CDFs $F_0$ and $F_1$ that realize $R$. For a particularly simple and appealing construction, let $F_0$ be the CDF of the uniform distribution on the unit interval, and define $F_1$ as $F_1(s) = 0$ for $s \le 0$,

Concave ROC curves
We proceed to elucidate the critical role of concavity in the interpretation and modelling of ROC curves. Its significance is well known and has been reviewed in monographs (Pepe 2003; Zhou et al. 2011). Nevertheless, we are unaware of any rigorous treatment in the extant literature that applies in both continuous and discrete settings, which is what we address now.
In the regular setting we suppose that $F_1$ and $F_0$ have continuous, strictly positive Lebesgue densities $f_1$ and $f_0$ in the interior of an interval, which is their common support. For every $s$ in the interior of the support, we can define the likelihood ratio,
$$\mathrm{LR}(s) = \frac{f_1(s)}{f_0(s)},$$
and the conditional event probability,
$$\mathrm{CEP}(s) = Q(Y = 1 \mid S = s) = \frac{\pi_1 f_1(s)}{\pi_0 f_0(s) + \pi_1 f_1(s)}.$$
We note the equivalence of the following three conditions:
(a) The ROC curve is concave.
(b) The likelihood ratio is nondecreasing.
(c) The conditional event probability is nondecreasing.
Theorem 2 In the regular setting statements (a), (b), and (c) are equivalent.
Next we consider the discrete setting where we assume that the support of the score $S$ is finite or countably infinite. This setting includes, but is not limited to, the case of empirical ROC curves. For every $s$ in the discrete support of $S$, we can define the likelihood ratio,
$$\mathrm{LR}(s) = \frac{Q(S = s \mid Y = 1)}{Q(S = s \mid Y = 0)} = \frac{F_1(s) - F_1(s-)}{F_0(s) - F_0(s-)},$$
and the conditional event probability,
$$\mathrm{CEP}(s) = Q(Y = 1 \mid S = s).$$
Theorem 3 In the discrete setting statements (a), (b), and (c) are equivalent.
The critical role of concavity in the interpretation and modelling of ROC curves stems from the monotonicity condition (c) on the conditional event probability, which needs to be invoked to justify the thresholding that lies at the heart of ROC analysis. When theoretical models are fit to empirical ROC curves, the model parameters can be restricted to guarantee concavity. Empirical ROC curves typically fail to be concave, but can be morphed into their concave hull, by subjecting a score to the pool-adjacent violators algorithm, thereby converting it into an isotonic, calibrated probabilistic classifier (Fawcett and Niculescu-Mizil 2007).
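The concave hull of an empirical ROC curve can be illustrated with a short sketch. Instead of running the pool-adjacent violators algorithm on the scores, the code below computes the hull geometrically, as the upper hull of the ROC vertices via a monotone-chain scan. The function names are illustrative assumptions.

```python
def cross(o, a, b):
    """Cross product of the vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def concave_hull(points):
    """Vertices of the least concave majorant of a set of ROC points."""
    hull = []
    for p in sorted(set(points)):  # ordered by FAR, then by HR
        # Drop the last vertex while it lies on or below the chord to p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


roc = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(concave_hull(roc))
```

For this input the non-concave vertex (0.5, 0.5) is removed, and the slopes of the remaining segments decrease from left to right.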

An equivalence between ROC curves and probability measures
We move on to provide concise and practically relevant characterizations of ROC curves.

Theorem 4 There is a one-to-one correspondence between ROC curves and probability measures on the unit interval.
Proof Given a ROC curve, we remove any vertical line segments, except for the respective upper endpoints, to yield the CDF of a probability measure on the unit interval. Conversely, given the CDF of a probability measure on the unit interval, we interpolate vertically at any jump points to obtain a ROC curve. This mapping is a bijection, and save for the symmetries in (4) it is realized by the construction in the proof of Corollary 1. ◻

We say that a curve $C$ in the Euclidean plane is nondecreasing if $a_0 \le a_1$ is equivalent to $b_0 \le b_1$ for points $(a_0, b_0)', (a_1, b_1)' \in C$. The following result then is immediate.

Corollary 2
The ROC curves are the nondecreasing curves in the unit square that connect the points $(0, 0)'$ and $(1, 1)'$.
Natural analogues apply under the further constraints of either strict or non-strict concavity. From applied perspectives, these results support a shift of paradigms in the statistical modelling of ROC curves. In extant practice, the emphasis is on modelling the conditional distributions $F_0$ and $F_1$. Our characterizations suggest a subtle but important change of perspective, in that ROC modelling can be approached as an exercise in curve fitting, with any nondecreasing curve that connects $(0, 0)'$ to $(1, 1)'$ being a permissible candidate.

Parametric models, estimation, and testing
The binormal model is by far the most frequently used parametric model and "plays a central role in ROC analysis" (Pepe 2003, p. 81). Specifically, the binormal model assumes that $F_0$ and $F_1$ are Gaussian with means $\mu_0 \le \mu_1$ and strictly positive variances $\sigma_0^2$ and $\sigma_1^2$, respectively. We are in the regular setting of Sect. 2.2, and the resulting ROC curve is represented by the function $R : [0, 1] \to [0, 1]$ with $R(0) = 0$,
$$R(p) = \Phi(\mu + \sigma \Phi^{-1}(p)) \quad \text{for } p \in (0, 1), \tag{5}$$
and $R(1) = 1$, where $\Phi$ is the CDF of the standard normal distribution, $\mu = (\mu_1 - \mu_0)/\sigma_1 \ge 0$ is a scaled difference in expectations, and $\sigma = \sigma_0/\sigma_1$ is the ratio of the respective standard deviations. The associated area under the curve is $\mathrm{AUC}(\mu, \sigma) = \Phi(\mu / \sqrt{1 + \sigma^2})$. A binormal ROC curve is concave only if $\sigma = 1$ or, equivalently, if $F_0$ and $F_1$ differ in location only.
Our next result demonstrates that this restriction is unavoidable if location-scale families are used to model the class conditional distributions.

Proposition 1 Given any strictly increasing CDF $F$ on $\mathbb{R}$, suppose that $F_0(s) = F((s - \mu_0)/\sigma_0)$ and $F_1(s) = F((s - \mu_1)/\sigma_1)$
for some $\mu_0, \mu_1 \in \mathbb{R}$ and $\sigma_0, \sigma_1 > 0$. Then the ROC curve associated with the conditional CDFs $F_0$ and $F_1$ is non-concave whenever $\sigma_0 \ne \sigma_1$.

The beta model
Motivated and supported by the characterizations in Sect. 2, we propose a curve fitting approach to the statistical modelling of ROC curves, with the two-parameter family of the cumulative distribution functions (CDFs) of beta distributions being a particularly attractive model. Specifically, consider the beta family with ROC curves represented by the function
$$R(p) = \int_0^p b_{\alpha,\beta}(q) \, dq \quad \text{for } p \in [0, 1], \tag{6}$$
where $b_{\alpha,\beta}(q) \propto q^{\alpha - 1} (1 - q)^{\beta - 1}$ is the density of the beta distribution with parameter values $\alpha > 0$ and $\beta > 0$. This type of ROC curve arises as a special case of the setting in the case study by Zou et al. (2004, p. 1263), which models the class conditional distributions via beta densities. A beta ROC curve is concave if $\alpha \le 1$ and $\beta \ge 2 - \alpha$, and its AUC value is $\mathrm{AUC}(\alpha, \beta) = \beta / (\alpha + \beta)$ (Vogel 2019, Appendix 3.C). The condition for concavity is much less stringent than for the binormal family, where it constrains the admissible parameter space to a single dimension.
If further flexibility is desired, mixtures of beta CDFs, i.e., functions of the form
$$R(p) = \sum_{k=1}^{n} w_k \int_0^p b_{\alpha_k,\beta_k}(q) \, dq \quad \text{for } p \in [0, 1],$$
where $w_1, \ldots, w_n > 0$ with $w_1 + \cdots + w_n = 1$, $\alpha_1, \ldots, \alpha_n > 0$, and $\beta_1, \ldots, \beta_n > 0$, approximate any regular ROC curve to any desired accuracy, as demonstrated by the following result. Recall from Sect. 2.2 that in the regular setting the ROC curve can be identified with the function $R$ in (1), where $F_1$ and $F_0$ have continuous, strictly positive Lebesgue densities $f_1$ and $f_0$ in the interior of an interval, which is their common support. A ROC curve is regular if it arises in this way and strongly regular if furthermore the derivative $R'$ is bounded.
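The binormal and beta ROC curves and their AUC formulas can be evaluated numerically as in the following sketch, which uses the Python standard library only. The function names are illustrative assumptions. The beta CDF is computed with a hand-written continued fraction of the standard modified Lentz type; in practice a library routine such as `scipy.special.betainc` would be the more natural choice.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def binormal_roc(p, mu, sigma):
    """Binormal ROC curve: Phi(mu + sigma * Phi^{-1}(p)) on (0, 1)."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    return STD_NORMAL.cdf(mu + sigma * STD_NORMAL.inv_cdf(p))


def binormal_auc(mu, sigma):
    return STD_NORMAL.cdf(mu / sqrt(1.0 + sigma ** 2))


def _guard(value, tiny=1e-300):
    """Keep a continued-fraction term away from zero."""
    return tiny if abs(value) < tiny else value


def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 / _guard(1.0 - qab * x / qap)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))          # even step
        d = 1.0 / _guard(1.0 + aa * d)
        c = _guard(1.0 + aa / c)
        h *= d * c
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))  # odd step
        d = 1.0 / _guard(1.0 + aa * d)
        c = _guard(1.0 + aa / c)
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_roc(p, alpha, beta):
    """Beta ROC curve: the CDF of the Beta(alpha, beta) distribution at p."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    front = exp(lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta)
                + alpha * log(p) + beta * log(1.0 - p))
    if p < (alpha + 1.0) / (alpha + beta + 2.0):
        return front * _beta_cf(alpha, beta, p) / alpha
    return 1.0 - front * _beta_cf(beta, alpha, 1.0 - p) / beta


def beta_auc(alpha, beta):
    return beta / (alpha + beta)


def numeric_auc(roc, m=2000):
    """Midpoint-rule approximation of the area under a ROC function."""
    return sum(roc((i + 0.5) / m) for i in range(m)) / m


print(numeric_auc(lambda p: binormal_roc(p, 1.0, 1.0)), binormal_auc(1.0, 1.0))
print(numeric_auc(lambda p: beta_roc(p, 0.5, 2.0)), beta_auc(0.5, 2.0))
```

Each printed pair compares a numerically integrated area with the corresponding closed-form AUC, which serves as a consistency check on the curve formulas.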

Theorem 5 For every strongly regular ROC curve R there is a sequence of mixtures of beta CDFs with integer parameters that converges uniformly to R.
Proof We apply the construction in the proof of Corollary 1 and define $F_1$ as in (4). Due to the assumption of strong regularity, $F_1$ admits a density on $(0, 1)$ that can be extended to a continuous function $f_1$ on $[0, 1]$. The arguments in Bernstein's probabilistic proof of the Weierstrass approximation theorem (Levasseur 1984) show that as $n \to \infty$ the sequence
$$m_n(q) = \sum_{k=0}^{n} f_1(k/n) \binom{n}{k} q^k (1 - q)^{n-k}$$
converges to $f_1(q)$ uniformly in $q \in [0, 1]$. Furthermore, $a_n = \int_0^1 m_n(q) \, dq \to \int_0^1 f_1(q) \, dq = 1$ as $n \to \infty$, and for $n = 1, 2, \ldots$ the mapping $p \mapsto M_n(p) = \int_0^p m_n(q) \, dq / a_n$ represents a mixture of beta CDFs. The uniform convergence of $m_n$ to $f_1$ implies that for every $\varepsilon > 0$ there exists an $n'$ such that for all integers $n > n'$
$$|M_n(p) - F_1(p)| < \varepsilon$$
uniformly in $p \in [0, 1]$. The statement of the theorem follows. ◻
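The approximation by mixtures of beta CDFs with integer parameters can be illustrated numerically. The sketch below is an assumption-laden simplification: it applies Bernstein-type weights directly to the derivative of a target ROC curve rather than passing through $F_1$, and it uses the identity that the Beta(k + 1, n - k + 1) CDF at p equals the probability that a Binomial(n + 1, p) variable is at least k + 1. The target curve and all names are hypothetical choices for illustration.

```python
from math import comb, exp


def integer_beta_cdf_mixture(r, n):
    """Mixture of Beta(k + 1, n - k + 1) CDFs, k = 0, ..., n.

    r is the (bounded, positive) derivative of the target ROC curve on [0, 1];
    the mixture weights are proportional to r(k / n).
    """
    raw = [r(k / n) for k in range(n + 1)]
    total = sum(raw)
    weights = [v / total for v in raw]  # positive weights summing to one

    def mixture_cdf(p):
        pmf = [comb(n + 1, j) * p ** j * (1.0 - p) ** (n + 1 - j)
               for j in range(n + 2)]
        tail = 0.0
        value = 0.0
        for k in range(n, -1, -1):
            tail += pmf[k + 1]           # tail = P(Binomial(n + 1, p) >= k + 1)
            value += weights[k] * tail
        return value

    return mixture_cdf


C = 3.0  # hypothetical shape constant of the target curve


def target(p):
    """A concave ROC curve with bounded derivative (illustrative choice)."""
    return (1.0 - exp(-C * p)) / (1.0 - exp(-C))


def target_derivative(p):
    return C * exp(-C * p) / (1.0 - exp(-C))


def sup_error(n, grid=200):
    """Largest absolute gap between mixture and target on a regular grid."""
    fit = integer_beta_cdf_mixture(target_derivative, n)
    return max(abs(fit(i / grid) - target(i / grid)) for i in range(grid + 1))


for n in (5, 20, 80):
    print(n, sup_error(n))
```

The printed errors should shrink as n grows, in line with the uniform convergence asserted in Theorem 5.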

Minimum distance estimation
For the parametric estimation of ROC curves for continuous scores various methods have been proposed, including maximum likelihood, approaches based on generalized linear models, and minimum distance estimation (Pepe 2003; Zhou et al. 2011). Here we pursue the minimum distance estimator, which is much in line with our curve fitting approach.
We assume a parametric model in the regular setting of Sect. 2.2, where now the ROC curve depends on a parameter $\theta \in \Theta \subseteq \mathbb{R}^k$. Specifically, we suppose that for each $\theta \in \Theta$ the ROC curve is represented by a smooth function
$$R(p; \theta) = 1 - F_{1,\theta}(F_{0,\theta}^{-1}(1 - p)), \quad p \in (0, 1),$$
where $F_{1,\theta}$ and $F_{0,\theta}$ admit continuous, strictly positive densities $f_{1,\theta}$ and $f_{0,\theta}$ in the interior of an interval, which is their common support. We also require that the true parameter value $\theta_0$ is in the interior of the parameter space $\Theta$, where the derivative
$$R'(p; \theta) = \frac{\partial R(p; \theta)}{\partial p} = \frac{f_{1,\theta}(F_{0,\theta}^{-1}(1 - p))}{f_{0,\theta}(F_{0,\theta}^{-1}(1 - p))}$$
exists and is finite for $p \in (0, 1)$, and the partial derivative $R_{(i)}(p; \theta)$ with respect to component $i$ of the parameter vector $\theta = (\theta_1, \ldots, \theta_k)$ exists and is continuous for $i = 1, \ldots, k$ and $p \in (0, 1)$.

We adopt the asymptotic scenario of Hsieh and Turnbull (1996), where at sample size $n$ there are $n_0$ and $n_1 = n - n_0$ independent draws from $F_{0,\theta}$ and $F_{1,\theta}$ with corresponding binary outcomes of zero and one, respectively, where $\lambda_n = n_0 / n_1$ converges to some $\lambda \in (0, \infty)$ as $n \to \infty$. For $\theta \in \Theta$ we define the difference process $\xi_n(p; \theta) = \hat{R}_n(p) - R(p; \theta)$, where the function $\hat{R}_n(p)$ represents the empirical ROC curve. The minimum distance estimator $\hat{\theta}_n = (\hat{\theta}_1, \ldots, \hat{\theta}_k)_n$ then satisfies
$$\| \xi_n(\cdot \, ; \hat{\theta}_n) \| = \min_{\theta \in \Theta} \| \xi_n(\cdot \, ; \theta) \|,$$
where $\| \xi_n(\cdot \, ; \theta) \| = ( \int_0^1 \xi_n(p; \theta)^2 \, dp )^{1/2}$ is the standard $L_2$-norm. If $n$ is large, $\hat{\theta}_n$ exists and is unique with probability approaching one (Millar 1984), and so we follow the extant literature in ignoring issues of existence and uniqueness.
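A minimal sketch of minimum distance estimation is given below, under several assumptions that are not part of the text: the $L_2$-distance is approximated by a midpoint rule on a fixed grid, the minimization uses a simple derivative-free compass search, and the "empirical" curve in the usage example is an exact binormal curve standing in for data, so that the recovered parameters can be checked. All function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
GRID = [(i + 0.5) / 200 for i in range(200)]      # midpoints in (0, 1)
Z = [STD_NORMAL.inv_cdf(p) for p in GRID]          # Phi^{-1}(p), precomputed


def binormal_curve(theta):
    """Binormal ROC curve on GRID for theta = (mu, sigma)."""
    mu, sigma = theta
    return [STD_NORMAL.cdf(mu + sigma * z) for z in Z]


def l2_distance(curve_a, curve_b):
    """Midpoint-rule approximation of the L2 distance on [0, 1]."""
    total = sum((a - b) ** 2 for a, b in zip(curve_a, curve_b))
    return (total / len(GRID)) ** 0.5


def minimum_distance_fit(empirical, model, theta0, valid=lambda th: True,
                         step=0.5, tol=1e-4):
    """Compass search for the parameter that minimises the L2 distance.

    empirical: ROC values on GRID; model: theta -> ROC values on GRID;
    valid: predicate that restricts the search to the parameter space.
    """
    theta = list(theta0)
    best = l2_distance(empirical, model(theta))
    while step > tol:
        improved = False
        for i in range(len(theta)):
            for delta in (step, -step):
                trial = theta.copy()
                trial[i] += delta
                if not valid(trial):
                    continue
                d = l2_distance(empirical, model(trial))
                if d < best:
                    theta, best, improved = trial, d, True
        if not improved:
            step /= 2.0   # refine the mesh once no axis move helps
    return theta, best


# Stand-in for an empirical ROC curve: an exact binormal curve.
empirical = binormal_curve((1.2, 0.8))
theta_hat, distance = minimum_distance_fit(
    empirical, binormal_curve, (1.0, 1.0), valid=lambda th: th[1] > 0.0)
print(theta_hat, distance)
```

With real data, `empirical` would hold the values of the empirical ROC curve on the same grid, and a concavity constraint could be imposed through the `valid` predicate, for instance by fixing sigma at one in the binormal case.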
The minimum distance estimator has a multivariate normal limit distribution in this setting, as suggested by the asymptotic result of Hsieh and Turnbull (1996) that under the usual $\sqrt{n}$ scaling the difference process $\xi_n(p; \theta)$ has the limit process (7) at $\theta = \theta_0$, where $B_1$ and $B_2$ are independent copies of a Brownian bridge. In contrast to the results in Section 4 of Hsieh and Turnbull (1996), which concern minimum distance estimation for the binormal model and ordinal dominance curves, the following theorem applies to general parametric families and ROC curves.

Proof
We are in the setting of Theorem 2.2 of Hsieh and Turnbull (1996), according to which there exists a probability space with sequences $(B_{1,n})$ and $(B_{2,n})$ of independent versions of Brownian bridges such that (11) holds almost surely, and uniformly in $p$ on every interval $[a, b] \subset (0, 1)$. We proceed to verify the regularity conditions for Theorem 3.6 of Millar (1984). As regards the identifiability condition (3.2) and the differentiability condition (3.5) it suffices to note that $\xi_n(p; \theta) - \xi_n(p; \theta_0) = R(p; \theta_0) - R(p; \theta)$ is nonrandom, continuously differentiable with respect to $p$ and the components of the parameter vector $\theta$, and independent of $n$. The boundedness condition (3.3) is trivially satisfied and the convergence condition (3.4) is implied by (11). Finally, we apply (2.17), (2.18), (2.19), and (2.20) in Section II of Millar (1984) to yield (8) and (9), where the covariance function of the process in (7) determines $K(s, t; \theta_0)$ as stated in (10). ◻

This result allows for asymptotic inference about model parameters, by plugging in $\hat{\theta}_n$ for $\theta_0$ in the expression for the asymptotic covariance. For the binormal model (5) we have $\theta = (\mu, \sigma)$, $R_{(\mu)}(p; \theta) = \varphi(\mu + \sigma \Phi^{-1}(p))$, $R_{(\sigma)}(p; \theta) = \Phi^{-1}(p) \, \varphi(\mu + \sigma \Phi^{-1}(p))$, and $R'(p; \theta) = \sigma \varphi(\mu + \sigma \Phi^{-1}(p)) / \varphi(\Phi^{-1}(p))$, where $\varphi$ is the standard normal density, so that the integrals in (9) can readily be evaluated numerically. Under the beta model (6) we have $\theta = (\alpha, \beta)$ and $R'(p; \theta) = b_{\alpha,\beta}(p)$. While closed form expressions for the partial derivatives of $R(p; \theta)$ with respect to $\alpha$ and $\beta$ exist, they are difficult to evaluate, and we approximate them with finite differences.
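The finite difference approximation of the partial derivatives can be sketched as follows. Because the binormal model has simple closed forms, the sketch checks a central difference scheme against them; the same helper would apply to a beta ROC function of the form `roc(p, alpha, beta)`. The step size and the function names are illustrative assumptions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def binormal_roc(p, mu, sigma):
    """Binormal ROC curve at p in (0, 1)."""
    return STD_NORMAL.cdf(mu + sigma * STD_NORMAL.inv_cdf(p))


def closed_form_partials(p, mu, sigma):
    """Partial derivatives of the binormal ROC curve w.r.t. mu and sigma."""
    z = STD_NORMAL.inv_cdf(p)
    density = STD_NORMAL.pdf(mu + sigma * z)
    return [density, z * density]


def finite_difference_partials(roc, p, theta, h=1e-5):
    """Central differences of roc(p, *theta) in each parameter component."""
    partials = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        partials.append((roc(p, *up) - roc(p, *down)) / (2.0 * h))
    return partials


for p in (0.1, 0.5, 0.9):
    print(p, closed_form_partials(p, 1.0, 0.8),
          finite_difference_partials(binormal_roc, p, (1.0, 0.8)))
```

The two sets of values should agree to many digits, which suggests that central differences are adequate where closed forms are inconvenient.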

Testing goodness-of-fit
A natural question is whether a given parametric model fits the data at hand, and for doing this we propose a simple Monte Carlo test that applies to any parametric model class C . Specifically, given a dataset of size n with n 0 instances where the binary outcome is zero and n 1 = n − n 0 instances where it is one, our goodness-of-fit test proceeds as follows. We use the notation of Sect. 3.2 and denote the number of Monte Carlo replicates by M.
1. Fit a model from class C to the empirical ROC curve, to yield the minimum distance estimate $\theta_{\mathrm{data}}$. Compute $d_{\mathrm{data}}$ as the $L_2$-distance between the fitted and the empirical ROC curve.
2. For $m = 1, \ldots, M$,
(a) draw a sample of size $n$ under $\theta_{\mathrm{data}}$, with $n_0$ and $n_1$ instances from $F_{0,\theta_{\mathrm{data}}}$ and $F_{1,\theta_{\mathrm{data}}}$ and associated binary outcomes of zero and one, respectively,
(b) fit a model from class C to the empirical ROC curve, to yield the minimum distance estimate, and
(c) compute $d_m$ as the $L_2$-distance between the fitted and the empirical ROC curve.
Under the null hypothesis of the ROC curve being generated by a random sample within class C the Monte Carlo p-value is very nearly uniform, as is readily seen in simulation experiments.
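The steps above can be organised as in the following skeleton. It is a sketch under stated assumptions: `fit` and `simulate` are hypothetical callables supplied by the user, and the p-value is computed from the rank of the observed distance among the simulated distances as (1 + number of simulated distances at least as large as the observed one) / (M + 1), a common Monte Carlo convention that is assumed here rather than taken from the text.

```python
import random


def monte_carlo_gof(data, fit, simulate, M=99, seed=0):
    """Monte Carlo goodness-of-fit p-value for a parametric ROC model class.

    fit(sample) -> (theta, distance): minimum distance estimate and the L2
        distance between the fitted and the empirical ROC curve of the sample.
    simulate(theta, rng) -> sample: a draw of the same size as data under
        theta, with the same numbers n0 and n1 of zero and one outcomes.
    """
    rng = random.Random(seed)
    theta_data, d_data = fit(data)            # step 1
    simulated = []
    for _ in range(M):                        # step 2
        sample = simulate(theta_data, rng)    # (a) draw under theta_data
        _, d_m = fit(sample)                  # (b) refit, (c) record distance
        simulated.append(d_m)
    # Assumed convention: rank-based p-value with the usual +1 correction.
    exceed = sum(d >= d_data for d in simulated)
    return (1 + exceed) / (M + 1)


# Toy usage with stand-in callables: the "sample" is itself a distance.
toy_fit = lambda x: (None, x)
toy_simulate = lambda theta, rng: rng.random()    # distances in [0, 1)
print(monte_carlo_gof(2.0, toy_fit, toy_simulate))    # observed distance is extreme
print(monte_carlo_gof(-1.0, toy_fit, toy_simulate))   # observed distance is small
```

In a real application `fit` would wrap a minimum distance routine for the chosen model class, and `simulate` would draw scores from the fitted class conditional distributions.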

Empirical examples
Basic information about the datasets in our empirical examples is given in Table 1. In the dataset from Etzioni et al. (1999), the ratio of free to total prostate-specific antigen (PSA) two years prior to diagnosis in serum from patients later found to have prostate cancer is compared to age-matched controls. The datasets from Sing et al. (2005, Figure 1a) and Robin et al. (2011, Figure 1) are prominent examples in the widely used ROCR and pROC packages in R. They concern a score from a linear support vector machine (SVM) trained to predict the usage of HIV coreceptors, and the S100 biomarker as it relates to a binary clinical outcome. The dataset from Vogel et al. (2018, Figure 6d) considers probability of precipitation forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF) for the binary event of precipitation occurrence in the West Sahel region.

Table 1 Basic information about the datasets, and minimum distance estimates under the unrestricted and concave binormal and beta models for the ROC curves in Fig. 1. Fit is in terms of the $L_2$-distance to the empirical ROC curve, and the p-value is from the goodness-of-fit test.

Figure 1 shows binormal and beta ROC curves fitted to the empirical ROC curves, both in the unrestricted case and under the constraint of concavity. The respective unrestricted and restricted minimum distance estimates, the fit in terms of the $L_2$-distance to the empirical ROC curve, and the p-value from the goodness-of-fit test are given in Table 1. In the unrestricted case, the binormal and beta fits are visually nearly indistinguishable. The fitted binormal ROC curves fail to be concave and change markedly when concavity is enforced. For the beta ROC curves, the differences between restricted and unrestricted fits are less pronounced, and in the example from Vogel et al. (2018) the unrestricted fit is concave.
For this dataset, our goodness-of-fit test rejects both the unrestricted and the concave binormal model, but does not reject the beta model, so the use of the more flexible beta family is of relevance and import. Generally, in the constrained case the improvement in the fit under the more flexible beta model as compared to the classical binormal model is substantial. For more detailed analyses of these datasets, which include a demonstration of the use of our asymptotic results to generate confidence bands, as well as four-parameter extensions of the beta family, we refer to Vogel (2019, Section 3.4 learning. Datasets and code in R (R Core Team 2021) for replicating our results and implementing the proposed estimators and tests are available online (Vogel and Jordan 2021).