1 Introduction

A receiver operating characteristic (ROC) curve is a graphical representation of two performance measures of binary classifiers: the false positive rate (FPR) and the true positive rate (TPR). The FPR is the probability of erroneously reporting negative instances as positive, whereas the TPR is the probability of correctly reporting positive instances. The ROC space is the set of (FPR, TPR) pairs. Traditionally the ROC space is visualized by plotting the FPR on the x axis and the TPR on the y axis. A classifier that reports a class label corresponds to a point in the ROC space. To be specific, suppose a diagnostic test uses a continuous variable \(X\) to diagnose a certain disease; if the value of \(X\) is larger than a critical value \(c\), the subject of the diagnosis is classified into the disease (positive) class, and otherwise into the non-disease (negative) class. Each critical value corresponds to a classifier. We use class conditional survival functions, defined as the complements of the class conditional distribution functions, for notational convenience. Let \(S_{\mathrm{ND}}\) and \(S_{\mathrm{D}}\) be the class conditional survival functions of \(X\) for the negative and the positive classes, respectively. That is, \(S_i(c)=P_i(X>c)\) for \(i=\mathrm{ND},\mathrm{D}\), where \(P_i\) is the probability measure of \(X\) on class \(i\). The FPR and the TPR of the given classifier are

$$ \begin{array}{l} \mathrm{FPR}(c) = S_\mathrm{ND}(c),\\[3pt] \mathrm{TPR}(c) = S_\mathrm{D}(c). \end{array} $$
(1)

The ROC curve plots \(\mathrm{TPR}(c)\) against \(\mathrm{FPR}(c)\) for all values of \(c\). Explicitly, for \(p\in[0,1]\),

$$ R(p)={S}_{\mathrm{D}} \bigl( S^{-1}_{\mathrm{ND}} (p) \bigr) $$
(2)

which is obtained by substituting \(p\) for \(\mathrm{FPR}(c)\) and eliminating \(c\).
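For example, if \(X\) is distributed as \(N(0,1)\) on class ND and as \(N(\mu,1)\) on class D, then \(S_{\mathrm{ND}}(c)=\varPhi(-c)\), so \(S^{-1}_{\mathrm{ND}}(p)=-\varPhi^{-1}(p)\) with \(\varPhi\) the standard normal distribution function, and (2) reduces to the familiar binormal ROC curve

$$ R(p)=\varPhi \bigl( \mu + \varPhi^{-1}(p) \bigr). $$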

A natural question that arises is how to estimate R(p) from observed data. If the TPR and the FPR are estimated by the empirical survival functions \(\hat{S}_{\mathrm{D}}\) and \(\hat{S}_{\mathrm{ND}}\), i.e., the proportions of true positives and false positives in the training data set, the estimated curve is a piecewise constant function called the empirical ROC curve. The empirical ROC curve is a nonparametric maximum likelihood estimator (NPMLE) of R(p). We may also impose geometric constraints, such as convexity, when estimating R(p). Lloyd (2002) studies nonparametric and semiparametric maximum likelihood estimation of a convex ROC curve. Parametric methods have the advantage of producing a smooth, as well as convex, ROC curve. These methods assume that \(S_{\mathrm{ND}}\) and \(S_{\mathrm{D}}\) belong to a specific parametric family of distributions that guarantees (2) is convex. Pan and Metz (1997) and Metz and Pan (1999) use the normal error distribution, Dorfman et al. (1997) consider the gamma distribution, and Campbell and Ratnaparkhi (1993) introduce the Lomax family of curves.

The ROC curve of randomized diagnoses traces the least convex majorant (LCM) of the given ROC curve, known to the machine learning community as the ROC convex hull (ROCCH) (Provost and Fawcett 2001). The properties and applications of the ROCCH have been extensively studied in the machine learning literature: Pareto optimality (Kim et al. 2006), repairing local non-convexity (Flach and Wu 2005), and cost-sensitive classification (Lim and Pyun 2009), to name a few. In particular, in the use of the ROCCH for classifier calibration, i.e., to transform classifier scores into posterior class probabilities, Fawcett and Niculescu-Mizil (2007) show that their ROCCH-based calibration method is equivalent to the pool-adjacent-violators (PAV) isotonic regression-based method of Zadrozny and Elkan (2002). However, the connection of the ROCCH with maximum likelihood estimation has not been much explored, to the best of our knowledge.

In this paper, we show that the ROCCH is the NPMLE of the true ROC curve when the latter is assumed convex. We formulate the NPMLE problem as a convex optimization problem whose solution yields the ROCCH (Sect. 2). This convex programming formulation allows us to consider a conditional NPMLE, which also has the ROCCH as its optimal solution (Sect. 3). The benefit of the conditional NPMLE interpretation is that the uncertainty in the ROCCH can be systematically evaluated. To demonstrate this, we propose a conditional bootstrap procedure (Sect. 4).

2 Nonparametric maximum likelihood estimation of convex ROC curves

2.1 Likelihood

We can formulate the problem of NPMLE of ROC curves (under no constraint) using the class conditional survival functions (1) and the prior class probability that will be introduced shortly. Consider independent random samples from two classes, ND (negative) and D (positive). Let \(x_{i1},\ldots,x_{i n_{i}}\) be the observed diagnostic scores from class \(i\) whose survival function is \(S_i\) for \(i=\mathrm{ND},\mathrm{D}\); that is, \(n_{\mathrm{ND}}\) and \(n_{\mathrm{D}}\) are the sizes of the negative and the positive classes, respectively. The observations occur at the distinct scores \(x_1<x_2<\cdots<x_m\), where \(\{x_1,\ldots,x_m\}\) is the union of all observed scores \(x_{i1},\ldots,x_{i n_{i}}\) for \(i=\mathrm{ND},\mathrm{D}\). Note that this union partitions the axis of scores into \(m+1\) intervals. Assuming that the observations are mutually independent, the interval frequencies for each class follow a multinomial distribution (Metz et al. 1998). The likelihood of the observations is then written as

$$ \mathcal{L} ( \pi_0, S_{\mathrm{ND}}, S_{\mathrm{D}} ) = \pi_0^{n_{\mathrm{D}}} (1-\pi_0)^{n_{\mathrm{ND}}} \prod_{i= \mathrm{ND}, \mathrm{D}} \prod_{j=1}^{m+1} \bigl\{ S_i (x_{j-1}) - S_i (x_{j}) \bigr\}^{d_{ij}} = \mathcal{L}( \pi_{0} )\, \mathcal{L} ( S_{\mathrm{ND}}, S_{\mathrm{D}} ), $$
(3)

where \(\pi_0\) is the prior probability of class D, and \(d_{ij}\) denotes the number of observations of class \(i\) in the semi-closed interval \((x_{j-1},x_{j}]\). (We interpret \(x_0=-\infty\) and \(x_{m+1}=\infty\) so that \(S_i(x_0)=1\) and \(S_i(x_{m+1})=0\).) \(\mathcal{L}( \pi_{0} )\) is maximized at \(\hat{\pi}_{0} = n_{\mathrm{D}}/(n_{\mathrm{ND}}+n_{\mathrm{D}})\), independently of \(\mathcal{L} ( S_{\mathrm{ND}}, S_{\mathrm{D}} )\). It turns out that the maximizer of \(\mathcal{L} ( S_{\mathrm{ND}}, S_{\mathrm{D}} )\) is the pair of empirical class conditional survival functions \((\hat{S}_{\mathrm{ND}}, \hat{S}_{\mathrm{D}})\). Hence the NPMLE of the ROC curve with no constraint is the empirical ROC curve

$$ \hat{R}(p)=\hat{S}_{\mathrm{D}} \bigl( \hat{S}^{-1}_{\mathrm{ND}} (p) \bigr), $$
(4)

or a plug-in estimator of (2). Note that only \(\mathcal{L} ( S_{\mathrm{ND}}, S_{\mathrm{D}} )\) is needed to estimate the ROC curve.
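For concreteness, here is a minimal Python sketch of the empirical estimator (4); the function name empirical_roc and the toy data are ours, not the paper's:

```python
import numpy as np

def empirical_roc(neg, pos):
    """Empirical ROC curve, i.e., the unconstrained NPMLE (4)."""
    x = np.unique(np.concatenate([neg, pos]))              # x_1 < ... < x_m
    # empirical survival functions at c = -infinity and at each x_j
    fpr = np.array([1.0] + [np.mean(neg > c) for c in x])  # \hat S_ND
    tpr = np.array([1.0] + [np.mean(pos > c) for c in x])  # \hat S_D
    return fpr[::-1], tpr[::-1]                            # ascending FPR

# toy usage with hypothetical normal scores
rng = np.random.default_rng(0)
fpr, tpr = empirical_roc(rng.normal(0, 1, 100), rng.normal(0.5, 1, 100))
```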

2.2 Geometric programming formulation

The NPMLE of a convex ROC curve can be obtained by solving (3) after imposing appropriate constraints, and we show in this section that the resulting optimization problem is formulated as a geometric program (GP), a special class of convex optimization problems. What we want to solve is the following problem.

$$ \everymath{\displaystyle} \begin{array}{l@{\quad }l} \mathrm{maximize} & \mathcal{L} ( S_{\mathrm{ND}}, S_{\mathrm{D}} ) = \prod_{i= \mathrm{ND}, \mathrm{D}} \prod_{j=1}^{m+1} \bigl\{{S}_i (x_{j-1}) -S_i (x_{j}) \bigr\}^{d_{ij}} \\[6pt] \mbox{subject to} & R(p)= S_{\mathrm{D}} \bigl( S^{-1}_{\mathrm{ND}} ( p) \bigr) \mbox{ is convex in } p \in [0,1]. \end{array} $$
(5)

The solution to (5) is a pair of distributions that change their values only at the finite number of points \(x_1,\ldots,x_m\): if an estimated pair \((\tilde{S}_{\mathrm{ND}}^{\mathrm{con}}, \tilde{S}_{\mathrm{D}}^{\mathrm{con}})\) does not have such a property, we could find an alternative solution satisfying the convexity constraint whose likelihood is larger than that of \((\tilde{S}_{\mathrm{ND}}^{\mathrm{con}}, \tilde{S}_{\mathrm{D}}^{\mathrm{con}} )\) (Kaplan and Meier 1958; Johansen 1978; Feltz and Dykstra 1985). Therefore we can fully specify convexity of the ROC curve in terms of the observed points and write problem (5) as

$$ \everymath{\displaystyle} \begin{array}{l@{\quad }l} \mbox{maximize} & \mathcal{L} ( S_{\mathrm{ND}}, S_{\mathrm{D}} ) = \prod_{i={\mathrm{ND}}, {\mathrm{D}}} \prod_{j=1}^{m+1} \bigl\{ S_i (x_{j-1}) - S_i (x_{j}) \bigr\}^{d_{ij}} \\[13pt] \mbox{subject to} & \displaystyle \frac{{S}_{\mathrm{D}}(x_j)- S_{\mathrm{D}} (x_{j-1})}{{S}_{\mathrm{ND}}(x_j) - S_{\mathrm{ND}} (x_{j-1})} \le \frac{{S}_{\mathrm{D}}(x_{j+1})- S_{\mathrm{D}} (x_{j})}{{S}_{\mathrm{ND}}(x_{j+1}) - S_{\mathrm{ND}} (x_{j})},\quad j=1,\ldots,m. \end{array} $$
(6)

Note that \(S_{\mathrm{ND}}\) and \(S_{\mathrm{D}}\) are non-increasing in \(x\). Now write

$$p_{ij}={S}_i(x_j) / S_i(x_{j-1}), \quad i=\mathrm{ND},\mathrm{D}, \ j=1,\ldots,m, $$

so that \(S_{i}(x_{j}) = \prod^{j}_{r=1}p_{ir}\), and introduce auxiliary variables \(q_{ij}=1-p_{ij}\). Then (6) is written as:

$$ \everymath{\displaystyle} \begin{array}{l@{\quad }l} \mbox{maximize} & \mathcal{L} \bigl( \bigl\{ (p_{ij},q_{ij}) \bigr\} \bigr) = \prod_{i=\mathrm{ND},\mathrm{D}} \biggl[ \prod_{j=1}^{m} \biggl( q_{ij} \prod_{r=1}^{j-1} p_{ir} \biggr)^{d_{ij}} \biggr] \biggl( \prod_{r=1}^{m} p_{ir} \biggr)^{d_{i,m+1}} \\[13pt] \mbox{subject to} & \displaystyle \biggl(\frac{p_{\mathrm{ND},j}}{q_{\mathrm{ND},j}} \biggr) \biggl( \frac{q_{\mathrm{D},j}}{p_{\mathrm{D},j}} \biggr) \biggl( \frac{q_{\mathrm{ND},(j+1)}}{q_{\mathrm{D},(j+1)}}\biggr) \le 1, \quad j=1,\ldots,m, \\[12pt] & p_{ij}+q_{ij} = 1, \quad i=\mathrm{ND},\mathrm{D},~j=1,\ldots,m, \end{array} $$
(7)

where we use the convention \(q_{i,(m+1)}=1\), consistent with \(S_i(x_{m+1})=0\).

If we further relax the equality constraints \(p_{ij}+q_{ij}=1\) to the inequalities \(p_{ij}+q_{ij} \le 1\), problem (7) becomes a GP:

$$ \begin{array}{l@{\quad }l} \mbox{maximize} & \mathcal{L} \bigl( \bigl\{ (p_{ij},q_{ij})\bigr\} \bigr) \\[9pt] \mbox{subject to} & \displaystyle \biggl(\frac{p_{\mathrm{ND},j}}{q_{\mathrm{ND},j}} \biggr) \biggl( \frac{q_{\mathrm{D},j}}{p_{\mathrm{D},j}} \biggr) \biggl( \frac{q_{\mathrm{ND},(j+1)}}{q_{\mathrm{D},(j+1)}}\biggr) \le 1, \quad \mbox{for}\ j=1,\ldots,m,\\[12pt] & p_{ij}+q_{ij} \le 1, \quad i=\mathrm{ND},\mathrm{D},~j=1,\ldots,m. \end{array} $$
(8)

A standard GP is not in general convex, but it can be transformed to a convex form using a simple change of variables, e.g., \(u_{ij}=\log p_{ij}\) and \(v_{ij}=\log q_{ij}\) here. To see the equivalence of the relaxed GP formulation (8) to the original problem (7), observe that the quantity \(p_{ij}+q_{ij}\) is monotone increasing in both \(p_{ij}\) and \(q_{ij}\), and that the objective is increasing in these variables. We see that at the optimal point \(\{ (\bar{p}_{ij},\bar{q}_{ij} ) \}\), the inequality constraints \(p_{ij}+q_{ij} \le 1\) must be tight for all \(i\) and \(j\). Otherwise there exist \(i'\) and \(j'\) with \(\bar{p}_{i'j'}+ \bar{q}_{i'j'} <1\). Then \(\{ (\bar{p}_{ij},\bar{q}_{ij} ) \}\) cannot be optimal, since the point \(\{ (\hat{p}_{ij}, \hat{q}_{ij} ) \}\) with \(\hat{p}_{ij}=\bar{p}_{ij}\) and \(\hat{q}_{ij}= 1-\bar{p}_{ij}\) for \(i=\mathrm{ND},\mathrm{D}\), \(j=1,\ldots,m\) is feasible for (8) and \(\hat{q}_{i'j'}= 1-\bar{p}_{i'j'} > \bar{q}_{i'j'}\). This results in

$$\mathcal{L} \bigl( \bigl\{ (\hat{p}_{ij}, \hat{q}_{ij} ) \bigr\} \bigr) > \mathcal{L} \bigl( \bigl\{ (\bar {p}_{ij}, \bar {q}_{ij} ) \bigr\} \bigr), $$

which is a contradiction. Therefore the two problems are equivalent.
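To make the change of variables concrete, the following is a minimal sketch of the convexified program, assuming the CVXPY modeling package; the function name and the data handling are ours, and cells with \(d_{ij}=0\) may push some log-variables toward the boundary, which conic solvers tolerate:

```python
import numpy as np
import cvxpy as cp

def convex_roc_npmle(neg, pos):
    """NPMLE of a convex ROC curve: the GP (8) after u = log p, v = log q."""
    x = np.unique(np.concatenate([neg, pos]))        # x_1 < ... < x_m
    m = len(x)
    # d[i, j]: class-i observations in (x_{j-1}, x_j]; with point masses these
    # are the counts at x_j itself, and the last interval (x_m, inf) is empty
    d = np.stack([(neg[:, None] == x).sum(0), (pos[:, None] == x).sum(0)])
    u = cp.Variable((2, m))                          # u_ij = log p_ij
    v = cp.Variable((2, m))                          # v_ij = log q_ij
    # log-likelihood: sum_ij d_ij { v_ij + sum_{r<j} u_ir }
    loglik = sum(d[i, j] * (v[i, j] + (cp.sum(u[i, :j]) if j else 0))
                 for i in range(2) for j in range(m))
    cons = [cp.log_sum_exp(cp.hstack([u[i, j], v[i, j]])) <= 0  # p + q <= 1
            for i in range(2) for j in range(m)]
    for j in range(m):  # log of the monomial convexity constraints in (8)
        tail = v[0, j + 1] - v[1, j + 1] if j + 1 < m else 0    # q_{i,m+1} = 1
        cons.append(u[0, j] - v[0, j] + v[1, j] - u[1, j] + tail <= 0)
    cp.Problem(cp.Maximize(loglik), cons).solve()
    S = np.cumprod(np.exp(u.value), axis=1)          # S_i(x_j) = prod p_ir
    return S[0], S[1]                                # (FPR, TPR) at x_1..x_m
```

Plotting the second output against the first should trace the ROCCH, which Sect. 2.3 establishes analytically.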

2.3 NPMLE yields the ROCCH

The solution to the GP (8) is readily available as the ROCCH of the empirical ROC curve (4), without needing a numerical GP solver, e.g., ggplab (Mutapcic et al. 2006). To see this, write the full likelihood (3) in terms of class conditional densities, instead of survival functions, in two alternative factorizations:

$$ \mathcal{L} = \prod_{j=1}^{m} \bigl\{ (1-\pi_0)\, f_{\mathrm{ND}}(x_j) \bigr\}^{d_{\mathrm{ND},j}} \bigl\{ \pi_0\, f_{\mathrm{D}}(x_j) \bigr\}^{d_{\mathrm{D},j}} $$
(9)
$$ \phantom{\mathcal{L}} = \prod_{j=1}^{m} f(x_j)^{d_{\mathrm{ND},j}+d_{\mathrm{D},j}}\, \bigl( 1-\pi(x_j) \bigr)^{d_{\mathrm{ND},j}} \pi(x_j)^{d_{\mathrm{D},j}}, $$
(10)

where \(f_{\mathrm{ND}}(x)\) and \(f_{\mathrm{D}}(x)\) are class conditional densities with \(f_i(x_j)=S_i(x_{j-1})-S_i(x_j)\), \(i=\mathrm{ND},\mathrm{D}\); \(f(x)=(1-\pi_0)f_{\mathrm{ND}}(x)+\pi_0 f_{\mathrm{D}}(x)\) is the marginal density of the score \(X\); and \(\pi(x)\) is the posterior class probability given \(X=x\), so that \(\pi_0=\int \pi(x)f(x)\,dx\). Lloyd (2002) shows that the following two-step optimization procedure maximizes (9) (equivalently (10)) subject to the convexity constraint on the ROC curve being estimated, and hence solves the GP (8).

Step 1: estimate \(\pi(x)\) nonparametrically by solving

$$\everymath{\displaystyle} \begin{array}{l@{\quad }l} \mbox{maximize} & \mathcal{L}(\pi) = \prod_{j=1}^{m}\pi(x_j)^{d_{\mathrm{D},j}} \bigl(1-\pi(x_j)\bigr)^{d_{\mathrm{ND},j}}\\[6pt] \mbox{subject to} & \pi(x)\mbox{~monotone nondecreasing.} \end{array} $$
Step 2: estimate \(f_{\mathrm{ND}}\) and \(f_{\mathrm{D}}\) by solving

$$\everymath{\displaystyle} \begin{array}{l@{\quad }l} \mbox{maximize} & \mathcal{L}(f_{\mathrm{ND}},f_{\mathrm{D}}) = \prod_{i=\mathrm{ND},\mathrm{D}} \prod_{j=1}^{m} f_i(x_{j})^{d_{ij}} \\[9pt] \mbox{subject to} & f_{\mathrm{D}}(x) / f_{\mathrm{ND}}(x) \propto \pi(x) / \bigl(1-\pi(x) \bigr). \end{array} $$

The solution to Step 1 is specified by the discrete density \(\hat{\pi}(x)\) that is the PAV isotonic regression of the observed proportion \(d_{\mathrm{D},j}/(d_{\mathrm{ND},j}+d_{\mathrm{D},j})\) of the positive class at each \(x=x_j\). The solution to Step 2 is given by

$$ \hat{f}_{\mathrm{ND}}^{\mathrm{lloyd}}(x_j) = \left\{ \begin{array}{l@{\quad }l} (d_{\mathrm{ND},j}+d_{\mathrm{D},j}) \mu /(n_{\mathrm{D}} \hat{\phi}(x_j) + n_{\mathrm{ND}} \mu ), & \hat{\phi}(x_j) < \infty, \\[6pt] 0, & \hat{\phi}(x_j) = \infty, \end{array} \right. $$
(11)

and

$$ \hat{f}_{\mathrm{D}}^{\mathrm{lloyd}}(x_j) = \left\{ \begin{array}{l@{\quad }l} (d_{\mathrm{ND},j}+d_{\mathrm{D},j}) \hat{\phi}(x_j) /(n_{\mathrm{D}} \hat{\phi}(x_j) + n_{\mathrm{ND}} \mu ), & \hat{\phi}(x_j) < \infty, \\[6pt] (d_{\mathrm{ND},j}+d_{\mathrm{D},j}) / n_{\mathrm{D}}, & \hat{\phi}(x_j) = \infty , \end{array} \right. $$
(12)

where \(\hat{\phi}(x) = \hat{\pi}(x)/(1-\hat{\pi}(x))\), with \(\hat{\phi}(x)=\infty\) whenever \(\hat{\pi}(x)=1\), and \(\mu\) is chosen so that both \(\hat{f}_{\mathrm{ND}}^{\mathrm{lloyd}}(x_{j})\) and \(\hat{f}_{\mathrm{D}}^{\mathrm{lloyd}}(x_{j})\) sum to one. This solution yields the estimate \(\hat{\pi}_{0} =n_{\mathrm{D}}/(n_{\mathrm{ND}}+n_{\mathrm{D}})\) of the class probability. The NPMLE of the convex ROC curve is obtained by reconstructing the class conditional survival functions \(\tilde{S}_{\mathrm{D}}^{\mathrm{lloyd}}(x_{j})= \sum_{l > j} \hat{f}_{\mathrm{D}}^{\mathrm{lloyd}}(x_{l})\) and \(\tilde{S}_{\mathrm{ND}}^{\mathrm{lloyd}}(x_{j}) = \sum_{l > j} \hat{f}_{\mathrm{ND}}^{\mathrm{lloyd}}(x_{l})\), and plotting \(\tilde{S}_{\mathrm{D}}^{\mathrm{lloyd}}(x_{j})\) against \(\tilde{S}_{\mathrm{ND}}^{\mathrm{lloyd}}(x_{j})\).

To see why this estimate coincides with the ROCCH, it suffices to recognize that \(\hat{\pi}(x)\), the PAV isotonic regression of the proportion of the positive class at a given x, is essentially the classifier score calibrated using the same regression method (Zadrozny and Elkan 2002). The estimated ROC curve is that determined by the calibrated scores. This is precisely the ROCCH, because of the equivalence between the ROCCH and the PAV regression-based calibration as discussed in Sect. 1 (for more details, see Fawcett and Niculescu-Mizil 2007). This connection between the NPMLE and the ROCCH has not been known previously.
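To see this equivalence in code, here is a minimal sketch (ours, not the authors') that computes \(\hat{\pi}(x)\) with off-the-shelf isotonic regression, assuming scikit-learn, whose IsotonicRegression implements the PAV algorithm, and then traces the ROC curve of the calibrated scores:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def rocch_via_pav(neg, pos):
    """ROC curve of PAV-calibrated scores; its vertices are the ROCCH's."""
    scores = np.concatenate([neg, pos])
    labels = np.concatenate([np.zeros(len(neg)), np.ones(len(pos))])
    # Step 1: \hat\pi(x), a monotone nondecreasing fit of labels on scores
    pi_hat = IsotonicRegression(increasing=True).fit_transform(scores, labels)
    # thresholding the calibrated scores merges empirical-ROC vertices inside
    # each pooled block, which is exactly the convexification by the ROCCH
    thr = np.unique(pi_hat)
    fpr = np.array([1.0] + [np.mean(pi_hat[labels == 0] > t) for t in thr])
    tpr = np.array([1.0] + [np.mean(pi_hat[labels == 1] > t) for t in thr])
    return fpr[::-1], tpr[::-1]                      # ROCCH vertices
```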

The GP formulation (8) can be used to impose a wider class of constraints on the NPMLE problem, e.g., an ordering of several convex ROC curves that establishes the superiority of one classifier over another; for various order constraints in GP, see Lim et al. (2009). More importantly for this paper, the GP (8) provides a crucial insight leading to the results of the next section.

3 Conditional NPMLE that yields the ROCCH

Surprisingly, even if we condition on each FPR estimate being equal to the corresponding empirical FPR, the resulting NPMLE of the convex ROC curve still coincides with the ROCCH. To be specific, assume that \(S_{\mathrm{ND}} := \hat{S}_{\mathrm{ND}}\). Let \(\{\nu_1,\ldots,\nu_l\}\), \(\nu_1<\cdots<\nu_l\), be a subset of \(\{1,\ldots,m\}\) such that each \(\hat{S}_{\mathrm{ND}} (x_{\nu_{j}})\) is unique, i.e., \(\hat{S}_{\mathrm{ND}} (x_{\nu_{j-1}}) \neq \hat{S}_{\mathrm{ND}} (x_{\nu_{j}})\) for any \(j\). Under this assumption, we fix \(p_{\mathrm{ND},j}\) at its empirical estimate

$$ \hat{p}_{\mathrm{ND},j}= \hat{{S}}_{\mathrm{ND}}(x_{\nu_j}) / \hat{{S}}_{\mathrm{ND}}(x_{\nu_{j-1}}) $$
(13)

for j=1,…,l. (We interpret \(x_{\nu_{0}}=-\infty\) and \(x_{\nu_{l+1}} = \infty\) so that \(S_{i}(x_{\nu_{0}})=1\) and \(S_{i}(x_{\nu_{l+1}})=0\), i=ND,D.) The conditional NPMLE is then formulated as follows.

$$ \everymath{\displaystyle} \begin{array}{l@{\quad }l} \mbox{maximize} & \mathcal{L} \bigl( S_{\mathrm{D}} \mid \hat{S}_{\mathrm{ND}} \bigr) = \prod_{j=1}^{l+1} \bigl\{ S_{\mathrm{D}}(x_{\nu_{j-1}}) - S_{\mathrm{D}}(x_{\nu_{j}}) \bigr\}^{d_{\mathrm{D},j}} \\[13pt] \mbox{subject to} & \displaystyle \frac{S_{\mathrm{D}}(x_{\nu_{j}}) - S_{\mathrm{D}}(x_{\nu_{j-1}})}{\hat{S}_{\mathrm{ND}}(x_{\nu_{j}}) - \hat{S}_{\mathrm{ND}}(x_{\nu_{j-1}})} \le \frac{S_{\mathrm{D}}(x_{\nu_{j+1}}) - S_{\mathrm{D}}(x_{\nu_{j}})}{\hat{S}_{\mathrm{ND}}(x_{\nu_{j+1}}) - \hat{S}_{\mathrm{ND}}(x_{\nu_{j}})},\quad j=1,\ldots,l, \end{array} $$
(14)

where the variables are \(\{p_{\mathrm{D},j}\}\) with \(p_{\mathrm{D},j}= S_{\mathrm{D}}(x_{\nu_{j}}) / S_{\mathrm{D}}(x_{\nu_{j-1}})\). Note that each \(d_{\mathrm{D},j}\) is appropriately redefined to be the number of observations of class D in the semi-closed interval \((x_{\nu_{j-1}},x_{\nu_{j}}]\) of scores. We refer to this problem as the conditional NPMLE. Note that (14) can also be rewritten as a GP in a similar fashion to the unconditional NPMLE (7).

That the ROCCH, or the LCM of the empirical ROC curve, is the conditional NPMLE of the convex ROC curve can be summarized by the following theorem.

Theorem 1

Let \(\tilde{{S}}_{\mathrm{D}}^{\mathrm{lcm}}\) be the class conditional survival function such that the curve \((\hat{{S}}_{\mathrm{ND}}, \tilde{{S}}_{\mathrm{D}}^{\mathrm{lcm}} )\) is the LCM of the empirical ROC curve \((\hat{{S}}_{\mathrm{ND}}, \hat{{S}}_{\mathrm{D}} )\). Then, \(\tilde{{S}}_{\mathrm{D}}^{\mathrm{lcm}}\) solves the conditional NPMLE (14). More precisely, \(\tilde{p}_{\mathrm{D},j} = \tilde{{S}}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{j}})/\tilde{{S}}_{\mathrm{D}}^{\mathrm{lcm}} (x_{\nu_{j-1}})\) solves (14).

Proof

We consider an iterative (coordinate ascent) procedure to solve (14), which iteratively updates \(p_{\mathrm{D},k}\) by maximizing (14) with respect to \(p_{\mathrm{D},k}\) for \(k=1,\ldots,l\). In updating \(p_{\mathrm{D},k}\), all other \(p_{\mathrm{D},j}\) with \(j \neq k\) are held fixed at their current estimates. Since problem (14) is equivalent to a GP, which can be converted to a convex problem, and since the auxiliary variables \(q_{\mathrm{D},j}\) introduced to construct the GP satisfy \(q_{\mathrm{D},j}=1-p_{\mathrm{D},j}\) for all \(j\) (see Sect. 2.2), the suggested coordinate ascent procedure solves the problem. It is easy to see that each step of the iterative procedure solves the following subproblem.

$$ \begin{array}{l@{\quad }l} \mbox{maximize} & {n_{\mathrm{D},k}} \log p_{\mathrm{D},k} + d_{\mathrm{D},k} \log (1- p_{\mathrm{D},k})\\[6pt] \mbox{subject to} & L_k \le p_{\mathrm{D},k} \le U_k, \end{array} $$
(15)

where \(n_{\mathrm{D},j}=\sum_{r=j+1}^{l+1} d_{\mathrm{D},r}\) denotes the number of observations of class D whose scores are greater than \(x_{\nu_{j}}\), and \(L_k\) and \(U_k\) are bounds determined by the other coordinates \(p_{\mathrm{D},j}\), \(j \neq k\).

By construction, it suffices to show that \(\{\tilde{p}_{\mathrm{D},j} \}\) is a fixed point of the iterative procedure. For each \(k\), we fix all the coordinates except for \(p_{\mathrm{D},k}\) at \(\tilde{p}_{\mathrm{D},j}\), i.e., \(p_{\mathrm{D},j} = \tilde{p}_{\mathrm{D},j}\), \(j \neq k\). Then,

$$L_{k} = \frac{ \{ \hat{p}_{\mathrm{ND},k}(1-\hat{p}_{\mathrm{ND},(k+1)} ) \} / \{(1- \hat{p}_{\mathrm{ND},k} ) ( 1- \tilde{p}_{\mathrm{D},(k+1)}) \} }{ 1+ \{\hat{p}_{\mathrm{ND},k}(1-\hat{p}_{\mathrm{ND},(k+1)} ) \} / \{(1-\hat{p}_{\mathrm{ND},k} ) ( 1- \tilde{p}_{\mathrm{D},(k+1)}) \}}, $$

for \(k=1,\ldots,l\), and

$$U_{k} = 1- \frac{ \hat{p}_{\mathrm{ND},(k-1)} (1-\hat{p}_{\mathrm{ND},k} ) ( 1- \tilde{p}_{\mathrm{D},(k-1)} )}{ \tilde{p}_{\mathrm{D},(k-1)} (1- \hat{p}_{\mathrm{ND}, (k-1)} )} $$

for \(k=2,\ldots,l\). For \(k=1\), we set \(U_1=1\).

Now consider

$$\hat{p}_{\mathrm{D},j}= \hat{{S}}_{\mathrm{D}}(x_{\nu_j}) / \hat{{S}}_{\mathrm{D}}(x_{\nu_{j-1}}), \quad j=1,\ldots,l. $$

Together with \(\{ \hat{p}_{\mathrm{ND},j} \}\) defined in (13), \(\{ \hat{p}_{\mathrm{D},j} \}\) constitutes the empirical ROC curve, which solves the unconstrained version of the unconditional NPMLE (7). Therefore \(\{ \hat{p}_{\mathrm{D},j} \}\) also solves the unconstrained version of the conditional NPMLE (14). It follows that \(\hat{p}_{\mathrm{D},k}\) maximizes the objective of (15) when the constraint is removed. Let \(\hat{p}_{\mathrm{D},k}^{\mathrm{local}}\) denote the (constrained) solution to (15). Showing that \(\hat{p}_{\mathrm{D},k}^{\mathrm{local}} = \tilde{p}_{\mathrm{D},k}\) completes the proof.

The following property of \(\tilde{p}_{\mathrm{D},k}\) locally characterizes not only the LCM but also the empirical ROC curve in the neighborhood of \([ x_{\nu_{k-1}}, x_{\nu_{k}} ]\), leading to identification of the solution \(\hat{p}_{\mathrm{D},k}^{\mathrm{local}}\). Observe that \(\tilde{p}_{\mathrm{D},k}<U_{k}\) if and only if the convexity constraint in (14) holds strictly for \(j=k-1\). In other words, the LCM changes its slope at \((\hat{S}_{\mathrm{ND}}(x_{\nu_{k-1}}),\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}} (x_{\nu_{k-1}}))\). A change of slope of the LCM occurs if and only if it touches the empirical ROC curve, hence we have \(\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{k-1}}) = \hat{S}_{\mathrm{D}}(x_{\nu_{k-1}})\) (note that in general \(\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x) \ge \hat{S}_{\mathrm{D}}(x) \)). Similarly, \(L_{k} < \tilde{p}_{\mathrm{D},k}\) if and only if \(\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{k}}) = \hat{S}_{\mathrm{D}}(x_{\nu_{k}}) \). Since the LCM is convex by construction, \(\tilde{p}_{\mathrm{D},k}\) always satisfies \(L_{k} \le \tilde{p}_{\mathrm{D},k} \le U_{k}\). Depending on the tightness of these bounds, there are four cases to consider:

  1. \(L_{k}< \tilde{p}_{\mathrm{D},k} < U_{k}\): From the observation above, \(\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}} (x_{\nu_{k-1}}) = \hat{S}_{\mathrm{D}}(x_{\nu_{k-1}})\) and \(\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{k}}) = \hat{S}_{\mathrm{D}}(x_{\nu_{k}}) \). Then,

    $$\tilde{p}_{\mathrm{D},k} = \tilde{S}_{\mathrm{D}}^{\mathrm{lcm}} (x_{\nu_k})/\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{k-1}}) = \hat{S}_{\mathrm{D}}(x_{\nu_k})/\hat{S}_{\mathrm{D}}(x_{\nu_{k-1}}) = \hat{p}_{\mathrm{D},k}, $$

    i.e., \(L_{k}< \hat{p}_{\mathrm{D},k} < U_{k}\). Since \(\hat{p}_{\mathrm{D},k}\) is the unconstrained maximizer of the objective of (15), which is concave in \(p_{\mathrm{D},k}\), \(\hat{p}_{\mathrm{D},k}\) also solves the constrained problem (15). Therefore \(\hat{p}_{\mathrm{D},k}^{\mathrm{local}} = \hat{p}_{\mathrm{D},k} = \tilde{p}_{\mathrm{D},k}\).

  2. \(L_{k} = \tilde{p}_{\mathrm{D},k} < U_{k}\): From the observation above, the LCM touches the empirical ROC curve at \(x_{\nu_{k-1}}\) but not at \(x_{\nu_{k}}\), so that

    $$ \tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{k-1}}) = \hat{S}_{\mathrm{D}}(x_{\nu_{k-1}}), $$
    (16)

    $$ \tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{k}}) > \hat{S}_{\mathrm{D}}(x_{\nu_{k}}). $$
    (17)

    Dividing (17) by (16) we obtain

    $$ \hat{p}_{\mathrm{D},k} = \frac{\hat{S}_{\mathrm{D}}(x_{\nu_{k}})}{\hat{S}_{\mathrm{D}}(x_{\nu_{k-1}})} < \frac{\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{k}})}{\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{k-1}})} = \tilde{p}_{\mathrm{D},k} = L_{k}, $$
    (18)

    i.e., the unconstrained maximizer \(\hat{p}_{\mathrm{D},k}\) of (15) is less than \(L_{k}\). Combined with the concavity of the objective, this implies that the constrained maximizer of (15) satisfies \(\hat{p}_{\mathrm{D},k}^{\mathrm{local}}=L_{k} =\tilde{p}_{\mathrm{D},k}\).

  3. \(L_{k} < \tilde{p}_{\mathrm{D},k} = U_{k}\): This case is essentially the same as case 2, with \(L_{k}\) replaced by \(U_{k}\) and the inequality in (18) reversed.

  4. \(L_{k}=\tilde{p}_{\mathrm{D},k} = U_{k}\): Since \(L_{k}=U_{k}\), the constrained maximizer of (15) is \(\hat{p}_{\mathrm{D},k}^{\mathrm{local}}=L_{k}=U_{k}=\tilde{p}_{\mathrm{D},k}\).

Therefore we have \(\hat{p}_{\mathrm{D},k}^{\mathrm{local}} = \tilde{p}_{\mathrm{D},k}\) in all four cases. □
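Computationally, Theorem 1 says that the conditional NPMLE can be read off from the geometry alone. The following is a minimal sketch (ours) that extracts the LCM vertices from the empirical ROC points, e.g., the output of the empirical_roc sketch in Sect. 2.1, with a monotone-chain upper-hull scan:

```python
import numpy as np

def lcm_of_roc(fpr, tpr):
    """Vertices of the LCM (ROCCH) of an empirical ROC curve."""
    pts = sorted(set(zip(fpr, tpr)) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for b in pts:                        # pts sorted by FPR, ties by TPR
        # pop the last vertex while the new point b lies on or above the ray
        # through the last two vertices (non-clockwise turn)
        while len(hull) >= 2:
            o, a = hull[-2], hull[-1]
            if (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]) >= 0:
                hull.pop()
            else:
                break
        hull.append(b)
    return hull                          # vertices of the upper hull
```

Interpolating linearly between the returned vertices gives \(\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}\) at every empirical FPR value.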

Although both the conditional and the unconditional NPMLEs result in the ROCCH as the estimated ROC curve, the two methods produce distinct estimates of the TPR and the FPR. In general, the unconditional NPMLE gives smoother estimates, since it assigns distinct scores to the samples. To illustrate this, Table 1 reports the estimated quantities, including TPRs and FPRs, obtained by the two NPMLE methods on the example presented in Fawcett and Niculescu-Mizil (2007). The first two columns represent the observations, with the scores sorted in decreasing order. Class label 1 corresponds to the positive class (D), and 0 to the negative class (ND). The third and the fourth columns are the empirical FPRs and TPRs, so that \((\hat{S}_{\mathrm{ND}},\hat{S}_{\mathrm{D}})\) constitutes the empirical ROC curve. The fifth column consists of numerical solutions of the GP (14) given the empirical FPRs, so that \((\hat{S}_{\mathrm{ND}},\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}})\) constitutes the conditional NPMLE. Columns 6 through 11 are for the unconditional NPMLE computed using Lloyd's method discussed in Sect. 2.3. In particular, the last two columns, \(\tilde{S}_{\mathrm{ND}}^{\mathrm{lloyd}}\) and \(\tilde{S}_{\mathrm{D}}^{\mathrm{lloyd}}\), together make up the unconditional NPMLE of the convex ROC curve. Boldface entries indicate where both NPMLEs coincide with the ROCCH and meet the empirical ROC curve at the same points; these points are the vertices of the ROCCH. The conditional NPMLE effectively estimates the TPRs only at these vertex points, whereas the unconditional NPMLE also estimates them (and the FPRs) in between.

Table 1 An illustration of the unconditional and the conditional NPMLEs

Finally, it is worth noting that, by the (anti-)symmetry in the formulation (6) and in the proof of Theorem 1, we can obtain the ROCCH by fixing the TPR estimate instead:

Corollary 1

Let \(\tilde{S}_{\mathrm{ND}}^{\mathrm{lcm}}\) be the class conditional survival function such that the curve \((\tilde{S}_{\mathrm{ND}}^{\mathrm{lcm}}, \hat{S}_{\mathrm{D}})\) is the least concave majorant of the empirical ROC curve \((\hat{S}_{\mathrm{ND}}, \hat{{S}}_{\mathrm{D}} )\) seen from the TPR axis. Then, \(\tilde{S}_{\mathrm{ND}}^{\mathrm{lcm}}\) solves the conditional NPMLE (14) with the roles of ND and D switched.

4 Conditional bootstrap of the ROCCH

That the conditional NPMLE of a convex ROC curve coincides with the ROCCH of the empirical ROC curve suggests a useful bootstrap procedure to estimate the variance of the ROCCH (see, e.g., Macskassy et al. 2005). This conditional bootstrap procedure, which samples separately from the positive and the negative groups, allows us to evaluate the variance component contributed by each of the groups being compared (Hinkley 1988; Tibshirani and Knight 1999). Decomposing the variance components is advantageous because each term constitutes the achievable minimum total variance of the ROCCH estimate as the size of the other group increases. Therefore, this procedure also provides a simple means to compute the total variance of the ROCCH when the sample is imbalanced (Mladenic and Grobelnik 1999). The pointwise confidence limit for a convex ROC curve \(R(p)=S_{\mathrm{D}}(S_{\mathrm{ND}}^{-1}(p))\) relies on the variance of its conditional NPMLE \(\tilde{\mathbf{R}}(p)=\tilde{\mathbf{S}}_{\mathrm{D}}( \hat{\mathbf{S}}_{\mathrm{ND}}^{-1} (p) )\), i.e., the ROCCH. (We use boldface letters to emphasize that the corresponding quantities are random; normal-faced letters are their realizations.) From the law of total variance, the variance of the ROCCH can be decomposed as

$$ \operatorname{Var} \bigl[ \tilde{\mathbf{R}}(p) \bigr] = \operatorname{E} \bigl[ \operatorname{Var} \bigl( \tilde{\mathbf{R}}(p) \mid \mathbf{S}_{\mathrm{D}} \bigr) \bigr] + \operatorname{Var} \bigl[ \operatorname{E} \bigl( \tilde{\mathbf{R}}(p) \mid \mathbf{S}_{\mathrm{D}} \bigr) \bigr], $$
(19)

where the first and the second terms indicate the sampling variability from the negative (ND) and the positive (D) groups, respectively. The expectations in (19) can be approximated using a mode (or mode-type) approximation as

$$ \operatorname{E} \bigl[ \operatorname{Var} \bigl( \tilde{\mathbf{R}}(p) \mid \mathbf{S}_{\mathrm{D}} \bigr) \bigr] \approx \operatorname{Var} \bigl[ \hat{S}_{\mathrm{D}} \bigl( \tilde{\mathbf{S}}_{\mathrm{ND}}^{-1}(p) \bigr) \bigr] $$
(20)

and

$$ \operatorname{Var} \bigl[ \operatorname{E} \bigl( \tilde{\mathbf{R}}(p) \mid \mathbf{S}_{\mathrm{D}} \bigr) \bigr] \approx \operatorname{Var} \bigl[ \tilde{\mathbf{S}}_{\mathrm{D}} \bigl( \hat{S}_{\mathrm{ND}}^{-1}(p) \bigr) \bigr], $$
(21)

where \(\hat{S}_{\mathrm{D}}\) and \(\hat{S}_{\mathrm{ND}}\) (normal-faced) are the observed TPR and FPR. Now it is seen that

$$ \operatorname{Var} \bigl[ \tilde{\mathbf{R}}(p) \bigr] \approx \operatorname{Var} \bigl[ \tilde{\mathbf{S}}_{\mathrm{D}} \bigl( \hat{S}_{\mathrm{ND}}^{-1}(p) \bigr) \bigr] + \operatorname{Var} \bigl[ \hat{S}_{\mathrm{D}} \bigl( \tilde{\mathbf{S}}_{\mathrm{ND}}^{-1}(p) \bigr) \bigr], $$
(22)

a separation of the contributions to the variance from the positive and the negative groups, respectively. Observe that the first term (resp. the second term) of the right-hand side becomes the achievable minimum total variance (the left-hand side) as \(n_{\mathrm{ND}}\) (resp. \(n_{\mathrm{D}}\)) increases; the second term (resp. the first term) degenerates to zero as \(n_{\mathrm{ND}}\) (resp. \(n_{\mathrm{D}}\)) increases.

The first (“positive”) term is computed as follows.

Algorithm 1
Conditional bootstrap method for variance component decomposition

Note that Theorem 1 comes into play in line 5 of Algorithm 1, where the ROCCH of each bootstrap replicate serves as its conditional NPMLE. For the second ("negative") term, switch the positive and the negative groups in the above procedure. Note that this separate evaluation is not possible with a naive bootstrap, which resamples the whole set of \(n_{\mathrm{D}}+n_{\mathrm{ND}}\) observations. When the sample is imbalanced, e.g., \(n_{\mathrm{D}} \ll n_{\mathrm{ND}}\), then

$$\operatorname {Var}\bigl[ \hat{S}_{\mathrm{D}} \bigl( \tilde{ \mathbf{S}}_{\mathrm{ND}}^{-1}(p) \bigr) \bigr]\approx 0 \quad \mbox{and} \quad \operatorname {Var}\bigl[ \tilde{\mathbf{S}}_{\mathrm{D}} \bigl( \tilde{ \mathbf{S}}_{\mathrm{ND}}^{-1} (p) \bigr) \bigr] \approx \operatorname {Var}\bigl[ \tilde{\mathbf{S}}_{\mathrm{D}} \bigl( \hat{S}_{\mathrm{ND}}^{-1}(p) \bigr) \bigr], $$

so that only the “positive” conditional bootstrap suffices.
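A minimal sketch of the "positive" term of Algorithm 1 follows, reusing the empirical_roc and lcm_of_roc helpers sketched in the previous sections; the function names, the number of replicates B, the FPR grid, and the seed are our illustrative choices, not the paper's:

```python
import numpy as np

def rocch_on_grid(neg, pos, grid):
    """LCM of the empirical ROC curve, evaluated at the FPR values in grid."""
    hull = lcm_of_roc(*empirical_roc(neg, pos))
    hx, hy = zip(*hull)
    return np.interp(grid, hx, hy)       # the LCM is piecewise linear

def positive_variance_term(neg, pos, grid, B=500, seed=0):
    rng = np.random.default_rng(seed)
    curves = np.empty((B, len(grid)))
    for b in range(B):
        boot_pos = rng.choice(pos, size=len(pos), replace=True)
        # negatives are kept fixed: we condition on the empirical FPR, and by
        # Theorem 1 the conditional NPMLE of each replicate is its ROCCH
        curves[b] = rocch_on_grid(neg, boot_pos, grid)
    return curves.var(axis=0)            # pointwise bootstrap variance

# usage mirroring the study below: n_D = 50, n_ND = 1000
rng = np.random.default_rng(1)
neg, pos = rng.normal(0, 1, 1000), rng.normal(0.5, 1, 50)
var_pos = positive_variance_term(neg, pos, grid=np.linspace(0, 1, 101))
```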

We conducted a simple numerical study to demonstrate how the proposed conditional bootstrap procedure approximates the variance components in (19) and how these components vary as the sample becomes imbalanced. We set \(n_{\mathrm{D}}=50\) and varied \(n_{\mathrm{ND}}=50\), 100, 200, 300, and 1000. The scores of the negative group were distributed normally with mean 0 and variance 1, and the scores of the positive group were distributed normally with mean 0.5 and variance 1. For each choice of \(n_{\mathrm{ND}}\), we generated \(B=500\) data sets and applied the conditional bootstrap to estimate the "positive" and the "negative" variance terms. We compared them with their true values in (19). The results are shown in Fig. 1 for \(n_{\mathrm{ND}}=100\) and 1000. Note that the bootstrap estimates of both variance terms are very close to the true values. In particular, the "negative" variance estimate almost vanishes at \(n_{\mathrm{ND}}=1000\).

Fig. 1
Illustration of the conditional bootstrap for \(n_{\mathrm{D}}=50\). Circles indicate the true variance; crosses, the bootstrap estimates of the variance; dash-dot lines, the 5 % and 95 % bootstrap confidence limits of the variance, obtained from 500 bootstrap samples

5 Conclusion

In this paper we interpreted the ROC convex hull, known as an efficient tool for accounting for class-dependent misclassification costs in designing a classifier, from a maximum likelihood estimation perspective. We provided two nonparametric maximum likelihood formulations subject to the convexity constraint on the ROC curve and showed that the ROCCH solves both NPMLE problems. In particular, the conditional NPMLE interpretation of the ROCCH enables standard machinery, such as the bootstrap, to assess uncertainties in the ROCCH. The proposed conditional bootstrap method can estimate the finite-sample variabilities of the ROCCH arising from the positive and the negative classes separately, and allows us to find the achievable confidence limit of the ROCCH efficiently for imbalanced samples.