1 Introduction

A receiver operating characteristic (ROC) curve is a graphical representation of two performance measures of binary classifiers: the false positive rate (FPR) and the true positive rate (TPR). The FPR is the probability of erroneously reporting negative instances as positive, whereas the TPR is the probability of correctly reporting positive instances. The ROC space is the set of (FPR, TPR) pairs. Traditionally the ROC space is visualized by plotting the FPR on the x axis and the TPR on the y axis. A classifier that reports a class label corresponds to a point in the ROC space. To be specific, suppose a diagnostic test uses a continuous variable \(X\) to diagnose a certain disease; if the value of \(X\) is larger than a critical value \(c\), the subject of the diagnosis is classified into the disease (positive) class, and otherwise into the non-disease (negative) class. Each critical value corresponds to a classifier. We use class conditional survival functions, defined as the complements of the class conditional distribution functions, for notational convenience. Let \(S_{\mathrm{ND}}\) and \(S_{\mathrm{D}}\) be the class conditional survival functions of \(X\) for the negative and the positive classes, respectively. That is, \(S_i(c)=P_i(X>c)\) for \(i=\mathrm{ND},\mathrm{D}\), where \(P_i\) is the probability measure of \(X\) on class \(i\). The FPR and the TPR of the given classifier are

$$ \begin{array}{l} \mathrm{FPR}(c) = S_\mathrm{ND}(c),\\[3pt] \mathrm{TPR}(c) = S_\mathrm{D}(c). \end{array} $$
(1)

The ROC curve plots \(\mathrm{TPR}(c)\) against \(\mathrm{FPR}(c)\) for all values of \(c\). Explicitly, for \(p\in[0,1]\),

$$ R(p)={S}_{\mathrm{D}} \bigl( S^{-1}_{\mathrm{ND}} (p) \bigr) $$
(2)

which is obtained by substituting \(p\) for \(\mathrm{FPR}(c)\) and eliminating \(c\).
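For example, if \(X\) is distributed as \(N(0,1)\) on class ND and as \(N(\mu,1)\) on class D, then \(S_{\mathrm{ND}}(c)=\varPhi(-c)\), so \(S^{-1}_{\mathrm{ND}}(p)=-\varPhi^{-1}(p)\) with \(\varPhi\) the standard normal distribution function, and (2) reduces to the familiar binormal ROC curve

$$ R(p)=\varPhi \bigl( \mu + \varPhi^{-1}(p) \bigr). $$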

A natural question that arises is how to estimate R(p) from observed data. If the TPR and the FPR are estimated by the empirical survival functions \(\hat{S}_{\mathrm{D}}\) and \(\hat{S}_{\mathrm{ND}}\), i.e., the proportions of true positives and false positives in the training data set, the estimated curve is a piecewise constant function called the empirical ROC curve. The empirical ROC curve is a nonparametric maximum likelihood estimator (NPMLE) of R(p). We may also impose geometric constraints, such as convexity, when estimating R(p). Lloyd (2002) studies nonparametric and semiparametric maximum likelihood estimation of a convex ROC curve. Parametric methods have the advantage of producing a smooth, as well as convex, ROC curve. These methods assume that \(S_{\mathrm{ND}}\) and \(S_{\mathrm{D}}\) belong to a specific parametric family of distributions that guarantees (2) is convex. Pan and Metz (1997) and Metz and Pan (1999) use the normal error distribution, Dorfman et al. (1997) consider the gamma distribution, and Campbell and Ratnaparkhi (1993) introduce the Lomax family of curves.

The ROC curve of randomized diagnoses traces the least convex majorant (LCM) of the given ROC curve, known to the machine learning community as the ROC convex hull (ROCCH) (Provost and Fawcett 2001). The properties and applications of the ROCCH have been extensively studied in the machine learning literature: Pareto optimality (Kim et al. 2006), repairing local non-convexity (Flach and Wu 2005), and cost-sensitive classification (Lim and Pyun 2009), to name a few. In particular, in the use of the ROCCH for classifier calibration, i.e., to transform classifier scores into posterior class probabilities, Fawcett and Niculescu-Mizil (2007) show that their ROCCH-based calibration method is equivalent to the pool-adjacent-violators (PAV) isotonic regression-based method of Zadrozny and Elkan (2002). However, the connection of the ROCCH with maximum likelihood estimation has not been much explored, to the best of our knowledge.

In this paper, we show that the ROCCH is the NPMLE of the true ROC curve when the latter is assumed convex. We formulate the NPMLE problem as a convex optimization problem whose solution yields the ROCCH (Sect. 2). This convex programming formulation allows us to consider a conditional NPMLE, which also has the ROCCH as its optimal solution (Sect. 3). The benefit of the conditional NPMLE interpretation is that the uncertainty in the ROCCH can be systematically evaluated. To demonstrate this, we propose a conditional bootstrap procedure (Sect. 4).

2 Nonparametric maximum likelihood estimation of convex ROC curves

2.1 Likelihood

We can formulate the problem of NPMLE of ROC curves (under no constraint) using the class conditional survival functions (1) and the prior class probability that will be introduced shortly. Consider independent random samples from two classes, ND (negative) and D (positive). Let \(x_{i1},\ldots,x_{i n_{i}}\) be the observed diagnostic scores from class \(i\) whose survival function is \(S_i\) for \(i=\mathrm{ND},\mathrm{D}\); that is, \(n_{\mathrm{ND}}\) and \(n_{\mathrm{D}}\) are the sizes of the negative and the positive classes, respectively. The observations occur at the distinct scores \(x_1<x_2<\cdots<x_m\), where \(\{x_1,\ldots,x_m\}\) is the union of all observed scores \(x_{i1},\ldots,x_{i n_{i}}\) for \(i=\mathrm{ND},\mathrm{D}\). Note that this union partitions the axis of scores into \(m+1\) intervals. Assuming that the observations are mutually independent, the interval frequencies for each class follow a multinomial distribution (Metz et al. 1998). The likelihood of the observations is then written as

$$ \mathcal{L} ( \pi_0, S_{\mathrm{ND}}, S_{\mathrm{D}} ) = \pi_0^{n_{\mathrm{D}}} (1-\pi_0)^{n_{\mathrm{ND}}} \prod_{i= \mathrm{ND}, \mathrm{D}} \prod_{j=1}^{m+1} \bigl\{ S_i (x_{j-1}) - S_i (x_{j}) \bigr\}^{d_{ij}} = \mathcal{L}( \pi_{0} )\, \mathcal{L} ( S_{\mathrm{ND}}, S_{\mathrm{D}} ), $$
(3)

where \(\pi_0\) is the prior probability of class D, and \(d_{ij}\) denotes the number of observations of class \(i\) in the semi-closed interval \((x_{j-1},x_{j}]\). (We interpret \(x_0=-\infty\) and \(x_{m+1}=\infty\) so that \(S_i(x_0)=1\) and \(S_i(x_{m+1})=0\).) \(\mathcal{L}( \pi_{0} )\) is maximized at \(\hat{\pi}_{0} = n_{\mathrm{D}}/(n_{\mathrm{ND}}+n_{\mathrm{D}})\), independently of \(\mathcal{L} ( S_{\mathrm{ND}}, S_{\mathrm{D}} )\). It turns out that the maximizer of \(\mathcal{L} ( S_{\mathrm{ND}}, S_{\mathrm{D}} )\) is the pair of empirical class conditional survival functions \((\hat{S}_{\mathrm{ND}}, \hat{S}_{\mathrm{D}})\). Hence the NPMLE of the ROC curve with no constraint is the empirical ROC curve

$$ \hat{R}(p)=\hat{S}_{\mathrm{D}} \bigl( \hat{S}^{-1}_{\mathrm{ND}} (p) \bigr), $$
(4)

or a plug-in estimator of (2). Note that only \(\mathcal{L} ( S_{\mathrm{ND}}, S_{\mathrm{D}} )\) is needed to estimate the ROC curve.
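For concreteness, here is a minimal Python sketch of the empirical estimator (4); the function name empirical_roc and the toy data are ours, not the paper's:

```python
import numpy as np

def empirical_roc(neg, pos):
    """Empirical ROC curve, i.e., the unconstrained NPMLE (4)."""
    x = np.unique(np.concatenate([neg, pos]))              # x_1 < ... < x_m
    # empirical survival functions at c = -infinity and at each x_j
    fpr = np.array([1.0] + [np.mean(neg > c) for c in x])  # \hat S_ND
    tpr = np.array([1.0] + [np.mean(pos > c) for c in x])  # \hat S_D
    return fpr[::-1], tpr[::-1]                            # ascending FPR

# toy usage with hypothetical normal scores
rng = np.random.default_rng(0)
fpr, tpr = empirical_roc(rng.normal(0, 1, 100), rng.normal(0.5, 1, 100))
```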

2.2 Geometric programming formulation

The NPMLE of a convex ROC curve can be obtained by solving (3) after imposing appropriate constraints, and we show in this section that the resulting optimization problem is formulated as a geometric program (GP), a special class of convex optimization problems. What we want to solve is the following problem.

$$ \everymath{\displaystyle} \begin{array}{l@{\quad }l} \mathrm{maximize} & \mathcal{L} ( S_{\mathrm{ND}}, S_{\mathrm{D}} ) = \prod_{i= \mathrm{ND}, \mathrm{D}} \prod_{j=1}^{m+1} \bigl\{{S}_i (x_{j-1}) -S_i (x_{j}) \bigr\}^{d_{ij}} \\[6pt] \mbox{subject to} & R(p)= S_{\mathrm{D}} \bigl( S^{-1}_{\mathrm{ND}} ( p) \bigr) \mbox{ is convex in } p \in [0,1]. \end{array} $$
(5)

The solution to (5) is a pair of distributions that change their values only at the finite number of points \(x_1,\ldots,x_m\): if an estimated pair \((\tilde{S}_{\mathrm{ND}}^{\mathrm{con}}, \tilde{S}_{\mathrm{D}}^{\mathrm{con}})\) does not have such a property, we could find an alternative solution satisfying the convexity constraint whose likelihood is larger than that of \((\tilde{S}_{\mathrm{ND}}^{\mathrm{con}}, \tilde{S}_{\mathrm{D}}^{\mathrm{con}} )\) (Kaplan and Meier 1958; Johansen 1978; Feltz and Dykstra 1985). Therefore we can fully specify convexity of the ROC curve in terms of the observed points and write problem (5) as

$$ \everymath{\displaystyle} \begin{array}{l@{\quad }l} \mbox{maximize} & \mathcal{L} ( S_{\mathrm{ND}}, S_{\mathrm{D}} ) = \prod_{i={\mathrm{ND}}, {\mathrm{D}}} \prod_{j=1}^{m+1} \bigl\{ S_i (x_{j-1}) - S_i (x_{j}) \bigr\}^{d_{ij}} \\[13pt] \mbox{subject to} & \displaystyle \frac{{S}_{\mathrm{D}}(x_j)- S_{\mathrm{D}} (x_{j-1})}{{S}_{\mathrm{ND}}(x_j) - S_{\mathrm{ND}} (x_{j-1})} \le \frac{{S}_{\mathrm{D}}(x_{j+1})- S_{\mathrm{D}} (x_{j})}{{S}_{\mathrm{ND}}(x_{j+1}) - S_{\mathrm{ND}} (x_{j})},\quad j=1,\ldots,m. \end{array} $$
(6)

Note that \(S_{\mathrm{ND}}\) and \(S_{\mathrm{D}}\) are non-increasing in \(x\). Now write

$$p_{ij}={S}_i(x_j) / S_i(x_{j-1}), \quad i=\mathrm{ND},\mathrm{D}, \ j=1,\ldots,m, $$

so that \(S_{i}(x_{j}) = \prod^{j}_{r=1}p_{ir}\), and introduce auxiliary variables \(q_{ij}=1-p_{ij}\). Then (6) is written as:

$$ \everymath{\displaystyle} \begin{array}{l@{\quad }l} \mbox{maximize} & \mathcal{L} \bigl( \bigl\{ (p_{ij},q_{ij}) \bigr\} \bigr) = \prod_{i=\mathrm{ND},\mathrm{D}} \biggl[ \prod_{j=1}^{m} \biggl( q_{ij} \prod_{r=1}^{j-1} p_{ir} \biggr)^{d_{ij}} \biggr] \biggl( \prod_{r=1}^{m} p_{ir} \biggr)^{d_{i,m+1}} \\[13pt] \mbox{subject to} & \displaystyle \biggl(\frac{p_{\mathrm{ND},j}}{q_{\mathrm{ND},j}} \biggr) \biggl( \frac{q_{\mathrm{D},j}}{p_{\mathrm{D},j}} \biggr) \biggl( \frac{q_{\mathrm{ND},(j+1)}}{q_{\mathrm{D},(j+1)}}\biggr) \le 1, \quad j=1,\ldots,m, \\[12pt] & p_{ij}+q_{ij} = 1, \quad i=\mathrm{ND},\mathrm{D},~j=1,\ldots,m, \end{array} $$
(7)

where we use the convention \(q_{i,(m+1)}=1\), consistent with \(S_i(x_{m+1})=0\).

If we further relax the equality constraints \(p_{ij}+q_{ij}=1\) to the inequalities \(p_{ij}+q_{ij} \le 1\), problem (7) becomes a GP:

$$ \begin{array}{l@{\quad }l} \mbox{maximize} & \mathcal{L} \bigl( \bigl\{ (p_{ij},q_{ij})\bigr\} \bigr) \\[9pt] \mbox{subject to} & \displaystyle \biggl(\frac{p_{\mathrm{ND},j}}{q_{\mathrm{ND},j}} \biggr) \biggl( \frac{q_{\mathrm{D},j}}{p_{\mathrm{D},j}} \biggr) \biggl( \frac{q_{\mathrm{ND},(j+1)}}{q_{\mathrm{D},(j+1)}}\biggr) \le 1, \quad \mbox{for}\ j=1,\ldots,m,\\[12pt] & p_{ij}+q_{ij} \le 1, \quad i=\mathrm{ND},\mathrm{D},~j=1,\ldots,m. \end{array} $$
(8)

A standard GP is not in general convex, but it can be transformed to a convex form using a simple change of variables, e.g., \(u_{ij}=\log p_{ij}\) and \(v_{ij}=\log q_{ij}\) here. To see the equivalence of the relaxed GP formulation (8) to the original problem (7), observe that the quantity \(p_{ij}+q_{ij}\) is monotone increasing in both \(p_{ij}\) and \(q_{ij}\), and that the objective is increasing in these variables. We see that at the optimal point \(\{ (\bar{p}_{ij},\bar{q}_{ij} ) \}\), the inequality constraints \(p_{ij}+q_{ij} \le 1\) must be tight for all \(i\) and \(j\). Otherwise there exist \(i'\) and \(j'\) with \(\bar{p}_{i'j'}+ \bar{q}_{i'j'} <1\). Then \(\{ (\bar{p}_{ij},\bar{q}_{ij} ) \}\) cannot be optimal, since the point \(\{ (\hat{p}_{ij}, \hat{q}_{ij} ) \}\) with \(\hat{p}_{ij}=\bar{p}_{ij}\) and \(\hat{q}_{ij}= 1-\bar{p}_{ij}\) for \(i=\mathrm{ND},\mathrm{D}\), \(j=1,\ldots,m\) is feasible for (8) and \(\hat{q}_{i'j'}= 1-\bar{p}_{i'j'} > \bar{q}_{i'j'}\). This results in

$$\mathcal{L} \bigl( \bigl\{ (\hat{p}_{ij}, \hat{q}_{ij} ) \bigr\} \bigr) > \mathcal{L} \bigl( \bigl\{ (\bar {p}_{ij}, \bar {q}_{ij} ) \bigr\} \bigr), $$

which is a contradiction. Therefore the two problems are equivalent.
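To make the change of variables concrete, the following is a minimal sketch of the convexified program, assuming the CVXPY modeling package; the function name and the data handling are ours, and cells with \(d_{ij}=0\) may push some log-variables toward the boundary, which conic solvers tolerate:

```python
import numpy as np
import cvxpy as cp

def convex_roc_npmle(neg, pos):
    """NPMLE of a convex ROC curve: the GP (8) after u = log p, v = log q."""
    x = np.unique(np.concatenate([neg, pos]))        # x_1 < ... < x_m
    m = len(x)
    # d[i, j]: class-i observations in (x_{j-1}, x_j]; with point masses these
    # are the counts at x_j itself, and the last interval (x_m, inf) is empty
    d = np.stack([(neg[:, None] == x).sum(0), (pos[:, None] == x).sum(0)])
    u = cp.Variable((2, m))                          # u_ij = log p_ij
    v = cp.Variable((2, m))                          # v_ij = log q_ij
    # log-likelihood: sum_ij d_ij { v_ij + sum_{r<j} u_ir }
    loglik = sum(d[i, j] * (v[i, j] + (cp.sum(u[i, :j]) if j else 0))
                 for i in range(2) for j in range(m))
    cons = [cp.log_sum_exp(cp.hstack([u[i, j], v[i, j]])) <= 0  # p + q <= 1
            for i in range(2) for j in range(m)]
    for j in range(m):  # log of the monomial convexity constraints in (8)
        tail = v[0, j + 1] - v[1, j + 1] if j + 1 < m else 0    # q_{i,m+1} = 1
        cons.append(u[0, j] - v[0, j] + v[1, j] - u[1, j] + tail <= 0)
    cp.Problem(cp.Maximize(loglik), cons).solve()
    S = np.cumprod(np.exp(u.value), axis=1)          # S_i(x_j) = prod p_ir
    return S[0], S[1]                                # (FPR, TPR) at x_1..x_m
```

Plotting the second output against the first should trace the ROCCH, which Sect. 2.3 establishes analytically.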

2.3 NPMLE yields the ROCCH

The solution to the GP (8) is readily available as the ROCCH of the empirical ROC curve (4), without needing a numerical GP solver, e.g., ggplab (Mutapcic et al. 2006). To see this, write the full likelihood (3) in terms of class conditional densities, instead of survival functions, in two alternative factorizations:

$$ \mathcal{L} = \prod_{j=1}^{m} \bigl\{ (1-\pi_0)\, f_{\mathrm{ND}}(x_j) \bigr\}^{d_{\mathrm{ND},j}} \bigl\{ \pi_0\, f_{\mathrm{D}}(x_j) \bigr\}^{d_{\mathrm{D},j}} $$
(9)
$$ \phantom{\mathcal{L}} = \prod_{j=1}^{m} f(x_j)^{d_{\mathrm{ND},j}+d_{\mathrm{D},j}}\, \bigl( 1-\pi(x_j) \bigr)^{d_{\mathrm{ND},j}} \pi(x_j)^{d_{\mathrm{D},j}}, $$
(10)

where \(f_{\mathrm{ND}}(x)\) and \(f_{\mathrm{D}}(x)\) are class conditional densities with \(f_i(x_j)=S_i(x_{j-1})-S_i(x_j)\), \(i=\mathrm{ND},\mathrm{D}\); \(f(x)=(1-\pi_0)f_{\mathrm{ND}}(x)+\pi_0 f_{\mathrm{D}}(x)\) is the marginal density of the score \(X\); and \(\pi(x)\) is the posterior class probability given \(X=x\), so that \(\pi_0=\int \pi(x)f(x)\,dx\). Lloyd (2002) shows that the following two-step optimization procedure maximizes (9) (equivalently (10)) subject to the convexity constraint on the ROC curve being estimated, and hence solves the GP (8).

Step 1: estimate \(\pi(x)\) nonparametrically by solving

$$\everymath{\displaystyle} \begin{array}{l@{\quad }l} \mbox{maximize} & \mathcal{L}(\pi) = \prod_{j=1}^{m}\pi(x_j)^{d_{\mathrm{D},j}} \bigl(1-\pi(x_j)\bigr)^{d_{\mathrm{ND},j}}\\[6pt] \mbox{subject to} & \pi(x)\mbox{~monotone nondecreasing.} \end{array} $$
Step 2: estimate \(f_{\mathrm{ND}}\) and \(f_{\mathrm{D}}\) by solving

$$\everymath{\displaystyle} \begin{array}{l@{\quad }l} \mbox{maximize} & \mathcal{L}(f_{\mathrm{ND}},f_{\mathrm{D}}) = \prod_{i=\mathrm{ND},\mathrm{D}} \prod_{j=1}^{m} f_i(x_{j})^{d_{ij}} \\[9pt] \mbox{subject to} & f_{\mathrm{D}}(x) / f_{\mathrm{ND}}(x) \propto \pi(x) / \bigl(1-\pi(x) \bigr). \end{array} $$

The solution to Step 1 is specified by the discrete density \(\hat{\pi}(x)\) that is the PAV isotonic regression of the observed proportion \(d_{\mathrm{D},j}/(d_{\mathrm{ND},j}+d_{\mathrm{D},j})\) of the positive class at each \(x=x_j\). The solution to Step 2 is given by

$$ \hat{f}_{\mathrm{ND}}^{\mathrm{lloyd}}(x_j) = \left\{ \begin{array}{l@{\quad }l} (d_{\mathrm{ND},j}+d_{\mathrm{D},j}) \mu /(n_{\mathrm{D}} \hat{\phi}(x_j) + n_{\mathrm{ND}} \mu ), & \hat{\phi}(x_j) < \infty, \\[6pt] 0, & \hat{\phi}(x_j) = \infty, \end{array} \right. $$
(11)

and

$$ \hat{f}_{\mathrm{D}}^{\mathrm{lloyd}}(x_j) = \left\{ \begin{array}{l@{\quad }l} (d_{\mathrm{ND},j}+d_{\mathrm{D},j}) \hat{\phi}(x_j) /(n_{\mathrm{D}} \hat{\phi}(x_j) + n_{\mathrm{ND}} \mu ), & \hat{\phi}(x_j) < \infty, \\[6pt] (d_{\mathrm{ND},j}+d_{\mathrm{D},j}) / n_{\mathrm{D}}, & \hat{\phi}(x_j) = \infty , \end{array} \right. $$
(12)

where \(\hat{\phi}(x) = \hat{\pi}(x)/(1-\hat{\pi}(x))\), with \(\hat{\phi}(x)=\infty\) whenever \(\hat{\pi}(x)=1\), and \(\mu\) is chosen so that both \(\hat{f}_{\mathrm{ND}}^{\mathrm{lloyd}}(x_{j})\) and \(\hat{f}_{\mathrm{D}}^{\mathrm{lloyd}}(x_{j})\) sum to one. This solution yields the estimate \(\hat{\pi}_{0} =n_{\mathrm{D}}/(n_{\mathrm{ND}}+n_{\mathrm{D}})\) of the class probability. The NPMLE of the convex ROC curve is obtained by reconstructing the class conditional survival functions \(\tilde{S}_{\mathrm{D}}^{\mathrm{lloyd}}(x_{j})= \sum_{l > j} \hat{f}_{\mathrm{D}}^{\mathrm{lloyd}}(x_{l})\) and \(\tilde{S}_{\mathrm{ND}}^{\mathrm{lloyd}}(x_{j}) = \sum_{l > j} \hat{f}_{\mathrm{ND}}^{\mathrm{lloyd}}(x_{l})\), and plotting \(\tilde{S}_{\mathrm{D}}^{\mathrm{lloyd}}(x_{j})\) against \(\tilde{S}_{\mathrm{ND}}^{\mathrm{lloyd}}(x_{j})\).

To see why this estimate coincides with the ROCCH, it suffices to recognize that \(\hat{\pi}(x)\), the PAV isotonic regression of the proportion of the positive class at a given x, is essentially the classifier score calibrated using the same regression method (Zadrozny and Elkan 2002). The estimated ROC curve is that determined by the calibrated scores. This is precisely the ROCCH, because of the equivalence between the ROCCH and the PAV regression-based calibration as discussed in Sect. 1 (for more details, see Fawcett and Niculescu-Mizil 2007). This connection between the NPMLE and the ROCCH has not been known previously.
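To see this equivalence in code, here is a minimal sketch (ours, not the authors') that computes \(\hat{\pi}(x)\) with off-the-shelf isotonic regression, assuming scikit-learn, whose IsotonicRegression implements the PAV algorithm, and then traces the ROC curve of the calibrated scores:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def rocch_via_pav(neg, pos):
    """ROC curve of PAV-calibrated scores; its vertices are the ROCCH's."""
    scores = np.concatenate([neg, pos])
    labels = np.concatenate([np.zeros(len(neg)), np.ones(len(pos))])
    # Step 1: \hat\pi(x), a monotone nondecreasing fit of labels on scores
    pi_hat = IsotonicRegression(increasing=True).fit_transform(scores, labels)
    # thresholding the calibrated scores merges empirical-ROC vertices inside
    # each pooled block, which is exactly the convexification by the ROCCH
    thr = np.unique(pi_hat)
    fpr = np.array([1.0] + [np.mean(pi_hat[labels == 0] > t) for t in thr])
    tpr = np.array([1.0] + [np.mean(pi_hat[labels == 1] > t) for t in thr])
    return fpr[::-1], tpr[::-1]                      # ROCCH vertices
```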

The GP formulation (8) can be used to impose a wider class of constraints on the NPMLE problem, e.g., an ordering of several convex ROC curves that establishes the superiority of one classifier over another; for various order constraints in GP, see Lim et al. (2009). More importantly for this paper, the GP (8) provides a crucial insight leading to the results of the next section.

3 Conditional NPMLE that yields the ROCCH

Surprisingly, even if we condition on each FPR estimate being equal to the corresponding empirical FPR, the resulting NPMLE of the convex ROC curve still coincides with the ROCCH. To be specific, assume that \(S_{\mathrm{ND}} := \hat{S}_{\mathrm{ND}}\). Let \(\{\nu_1,\ldots,\nu_l\}\), \(\nu_1<\cdots<\nu_l\), be a subset of \(\{1,\ldots,m\}\) such that each \(\hat{S}_{\mathrm{ND}} (x_{\nu_{j}})\) is unique, i.e., \(\hat{S}_{\mathrm{ND}} (x_{\nu_{j-1}}) \neq \hat{S}_{\mathrm{ND}} (x_{\nu_{j}})\) for any \(j\). Under this assumption, we fix \(p_{\mathrm{ND},j}\) at its empirical estimate

$$ \hat{p}_{\mathrm{ND},j}= \hat{{S}}_{\mathrm{ND}}(x_{\nu_j}) / \hat{{S}}_{\mathrm{ND}}(x_{\nu_{j-1}}) $$
(13)

for j=1,…,l. (We interpret \(x_{\nu_{0}}=-\infty\) and \(x_{\nu_{l+1}} = \infty\) so that \(S_{i}(x_{\nu_{0}})=1\) and \(S_{i}(x_{\nu_{l+1}})=0\), i=ND,D.) The conditional NPMLE is then formulated as follows.

$$ \everymath{\displaystyle} \begin{array}{l@{\quad }l} \mbox{maximize} & \mathcal{L} \bigl( S_{\mathrm{D}} \mid \hat{S}_{\mathrm{ND}} \bigr) = \prod_{j=1}^{l+1} \bigl\{ S_{\mathrm{D}}(x_{\nu_{j-1}}) - S_{\mathrm{D}}(x_{\nu_{j}}) \bigr\}^{d_{\mathrm{D},j}} \\[13pt] \mbox{subject to} & \displaystyle \frac{S_{\mathrm{D}}(x_{\nu_{j}}) - S_{\mathrm{D}}(x_{\nu_{j-1}})}{\hat{S}_{\mathrm{ND}}(x_{\nu_{j}}) - \hat{S}_{\mathrm{ND}}(x_{\nu_{j-1}})} \le \frac{S_{\mathrm{D}}(x_{\nu_{j+1}}) - S_{\mathrm{D}}(x_{\nu_{j}})}{\hat{S}_{\mathrm{ND}}(x_{\nu_{j+1}}) - \hat{S}_{\mathrm{ND}}(x_{\nu_{j}})},\quad j=1,\ldots,l, \end{array} $$
(14)

where the variables are \(\{p_{\mathrm{D},j}\}\) with \(p_{\mathrm{D},j}= S_{\mathrm{D}}(x_{\nu_{j}}) / S_{\mathrm{D}}(x_{\nu_{j-1}})\). Note that each \(d_{\mathrm{D},j}\) is appropriately redefined to be the number of observations of class D in the semi-closed interval \((x_{\nu_{j-1}},x_{\nu_{j}}]\) of scores. We refer to this problem as the conditional NPMLE. Note that (14) can also be rewritten as a GP in a similar fashion to the unconditional NPMLE (7).

That the ROCCH, or the LCM of the empirical ROC curve, is the conditional NPMLE of the convex ROC curve can be summarized by the following theorem.

Theorem 1

Let \(\tilde{{S}}_{\mathrm{D}}^{\mathrm{lcm}}\) be the class conditional survival function such that the curve \((\hat{{S}}_{\mathrm{ND}}, \tilde{{S}}_{\mathrm{D}}^{\mathrm{lcm}} )\) is the LCM of the empirical ROC curve \((\hat{{S}}_{\mathrm{ND}}, \hat{{S}}_{\mathrm{D}} )\). Then, \(\tilde{{S}}_{\mathrm{D}}^{\mathrm{lcm}}\) solves the conditional NPMLE (14). More precisely, \(\tilde{p}_{\mathrm{D},j} = \tilde{{S}}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{j}})/\tilde{{S}}_{\mathrm{D}}^{\mathrm{lcm}} (x_{\nu_{j-1}})\) solves (14).

Proof

We consider an iterative (coordinate ascent) procedure to solve (14), which iteratively updates \(p_{\mathrm{D},k}\) by maximizing (14) with respect to \(p_{\mathrm{D},k}\) for \(k=1,\ldots,l\). In updating \(p_{\mathrm{D},k}\), all other \(p_{\mathrm{D},j}\) with \(j \neq k\) are held fixed at their current estimates. Since problem (14) is equivalent to a GP, which can be converted to a convex problem, and since the auxiliary variables \(q_{\mathrm{D},j}\) introduced to construct the GP satisfy \(q_{\mathrm{D},j}=1-p_{\mathrm{D},j}\) for all \(j\) (see Sect. 2.2), the suggested coordinate ascent procedure solves the problem. It is easy to see that each step of the iterative procedure solves the following subproblem.

$$ \begin{array}{l@{\quad }l} \mbox{maximize} & {n_{\mathrm{D},k}} \log p_{\mathrm{D},k} + d_{\mathrm{D},k} \log (1- p_{\mathrm{D},k})\\[6pt] \mbox{subject to} & L_k \le p_{\mathrm{D},k} \le U_k, \end{array} $$
(15)

where \(n_{\mathrm{D},j}=\sum_{r=j+1}^{l+1} d_{\mathrm{D},r}\) denotes the number of observations of class D whose scores are greater than \(x_{\nu_{j}}\), and \(L_k\) and \(U_k\) are bounds determined by the other coordinates \(p_{\mathrm{D},j}\), \(j \neq k\).

By construction, it suffices to show that \(\{\tilde{p}_{\mathrm{D},j} \}\) is a fixed point of the iterative procedure. For each \(k\), we fix all the coordinates except for \(p_{\mathrm{D},k}\) at \(\tilde{p}_{\mathrm{D},j}\), i.e., \(p_{\mathrm{D},j} = \tilde{p}_{\mathrm{D},j}\), \(j \neq k\). Then,

$$L_{k} = \frac{ \{ \hat{p}_{\mathrm{ND},k}(1-\hat{p}_{\mathrm{ND},(k+1)} ) \} / \{(1- \hat{p}_{\mathrm{ND},k} ) ( 1- \tilde{p}_{\mathrm{D},(k+1)}) \} }{ 1+ \{\hat{p}_{\mathrm{ND},k}(1-\hat{p}_{\mathrm{ND},(k+1)} ) \} / \{(1-\hat{p}_{\mathrm{ND},k} ) ( 1- \tilde{p}_{\mathrm{D},(k+1)}) \}}, $$

for \(k=1,\ldots,l\), and

$$U_{k} = 1- \frac{ \hat{p}_{\mathrm{ND},(k-1)} (1-\hat{p}_{\mathrm{ND},k} ) ( 1- \tilde{p}_{\mathrm{D},(k-1)} )}{ \tilde{p}_{\mathrm{D},(k-1)} (1- \hat{p}_{\mathrm{ND}, (k-1)} )} $$

for \(k=2,\ldots,l\). For \(k=1\), we set \(U_1=1\).

Now consider

$$\hat{p}_{\mathrm{D},j}= \hat{{S}}_{\mathrm{D}}(x_{\nu_j}) / \hat{{S}}_{\mathrm{D}}(x_{\nu_{j-1}}), \quad j=1,\ldots,l. $$

Together with \(\{ \hat{p}_{\mathrm{ND},j} \}\) defined in (13), \(\{ \hat{p}_{\mathrm{D},j} \}\) constitutes the empirical ROC curve, which solves the unconstrained version of the unconditional NPMLE (7). Therefore \(\{ \hat{p}_{\mathrm{D},j} \}\) also solves the unconstrained version of the conditional NPMLE (14). It follows that \(\hat{p}_{\mathrm{D},k}\) maximizes the objective of (15) when the constraint is removed. Let \(\hat{p}_{\mathrm{D},k}^{\mathrm{local}}\) denote the (constrained) solution to (15). Showing that \(\hat{p}_{\mathrm{D},k}^{\mathrm{local}} = \tilde{p}_{\mathrm{D},k}\) completes the proof.

The following property of \(\tilde{p}_{\mathrm{D},k}\) locally characterizes not only the LCM but also the empirical ROC curve in the neighborhood of \([ x_{\nu_{k-1}}, x_{\nu_{k}} ]\), leading to identification of the solution \(\hat{p}_{\mathrm{D},k}^{\mathrm{local}}\). Observe that \(\tilde{p}_{\mathrm{D},k}<U_{k}\) if and only if the convexity constraint in (14) holds strictly for \(j=k-1\). In other words, the LCM changes its slope at \((\hat{S}_{\mathrm{ND}}(x_{\nu_{k-1}}),\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}} (x_{\nu_{k-1}}))\). A change of slope of the LCM occurs if and only if it touches the empirical ROC curve, hence we have \(\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{k-1}}) = \hat{S}_{\mathrm{D}}(x_{\nu_{k-1}})\) (note that in general \(\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x) \ge \hat{S}_{\mathrm{D}}(x) \)). Similarly, \(L_{k} < \tilde{p}_{\mathrm{D},k}\) if and only if \(\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{k}}) = \hat{S}_{\mathrm{D}}(x_{\nu_{k}}) \). Since the LCM is convex by construction, \(\tilde{p}_{\mathrm{D},k}\) always satisfies \(L_{k} \le \tilde{p}_{\mathrm{D},k} \le U_{k}\). Depending on the tightness of these bounds, there are four cases to consider:

  1. \(L_{k}< \tilde{p}_{\mathrm{D},k} < U_{k}\): From the observation above, \(\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}} (x_{\nu_{k-1}}) = \hat{S}_{\mathrm{D}}(x_{\nu_{k-1}})\) and \(\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{k}}) = \hat{S}_{\mathrm{D}}(x_{\nu_{k}}) \). Then,

    $$\tilde{p}_{\mathrm{D},k} = \tilde{S}_{\mathrm{D}}^{\mathrm{lcm}} (x_{\nu_k})/\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{k-1}}) = \hat{S}_{\mathrm{D}}(x_{\nu_k})/\hat{S}_{\mathrm{D}}(x_{\nu_{k-1}}) = \hat{p}_{\mathrm{D},k}, $$

    i.e., \(L_{k}< \hat{p}_{\mathrm{D},k} < U_{k}\). Since \(\hat{p}_{\mathrm{D},k}\) is the unconstrained maximizer of the objective of (15), which is concave in \(p_{\mathrm{D},k}\), \(\hat{p}_{\mathrm{D},k}\) also solves the constrained problem (15). Therefore \(\hat{p}_{\mathrm{D},k}^{\mathrm{local}} = \hat{p}_{\mathrm{D},k} = \tilde{p}_{\mathrm{D},k}\).

  2. \(L_{k} = \tilde{p}_{\mathrm{D},k} < U_{k}\): From the observation above, the LCM touches the empirical ROC curve at \(x_{\nu_{k-1}}\) but not at \(x_{\nu_{k}}\), so that

    $$ \tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{k-1}}) = \hat{S}_{\mathrm{D}}(x_{\nu_{k-1}}), $$
    (16)

    $$ \tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{k}}) > \hat{S}_{\mathrm{D}}(x_{\nu_{k}}). $$
    (17)

    Dividing (17) by (16) we obtain

    $$ \hat{p}_{\mathrm{D},k} = \frac{\hat{S}_{\mathrm{D}}(x_{\nu_{k}})}{\hat{S}_{\mathrm{D}}(x_{\nu_{k-1}})} < \frac{\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{k}})}{\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}(x_{\nu_{k-1}})} = \tilde{p}_{\mathrm{D},k} = L_{k}, $$
    (18)

    i.e., the unconstrained maximizer \(\hat{p}_{\mathrm{D},k}\) of (15) is less than \(L_{k}\). Combined with the concavity of the objective, this implies that the constrained maximizer of (15) satisfies \(\hat{p}_{\mathrm{D},k}^{\mathrm{local}}=L_{k} =\tilde{p}_{\mathrm{D},k}\).

  3. \(L_{k} < \tilde{p}_{\mathrm{D},k} = U_{k}\): This case is essentially the same as case 2, with \(L_{k}\) replaced by \(U_{k}\) and the inequality in (18) reversed.

  4. \(L_{k}=\tilde{p}_{\mathrm{D},k} = U_{k}\): Since \(L_{k}=U_{k}\), the constrained maximizer of (15) is \(\hat{p}_{\mathrm{D},k}^{\mathrm{local}}=L_{k}=U_{k}=\tilde{p}_{\mathrm{D},k}\).

Therefore we have \(\hat{p}_{\mathrm{D},k}^{\mathrm{local}} = \tilde{p}_{\mathrm{D},k}\) in all four cases. □
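Computationally, Theorem 1 says that the conditional NPMLE can be read off from the geometry alone. The following is a minimal sketch (ours) that extracts the LCM vertices from the empirical ROC points, e.g., the output of the empirical_roc sketch in Sect. 2.1, with a monotone-chain upper-hull scan:

```python
import numpy as np

def lcm_of_roc(fpr, tpr):
    """Vertices of the LCM (ROCCH) of an empirical ROC curve."""
    pts = sorted(set(zip(fpr, tpr)) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for b in pts:                        # pts sorted by FPR, ties by TPR
        # pop the last vertex while the new point b lies on or above the ray
        # through the last two vertices (non-clockwise turn)
        while len(hull) >= 2:
            o, a = hull[-2], hull[-1]
            if (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]) >= 0:
                hull.pop()
            else:
                break
        hull.append(b)
    return hull                          # vertices of the upper hull
```

Interpolating linearly between the returned vertices gives \(\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}}\) at every empirical FPR value.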

Although both the conditional and the unconditional NPMLEs result in the ROCCH as the estimated ROC curve, the two methods produce distinct estimates of the TPR and the FPR. In general, the unconditional NPMLE gives smoother estimates, since it assigns distinct scores to the samples. To illustrate this, Table 1 reports the estimated quantities, including TPRs and FPRs, obtained by the two NPMLE methods on the example presented in Fawcett and Niculescu-Mizil (2007). The first two columns represent the observations, with the scores sorted in decreasing order. Class label 1 corresponds to the positive class (D), and 0 to the negative class (ND). The third and the fourth columns are the empirical FPRs and TPRs, so that \((\hat{S}_{\mathrm{ND}},\hat{S}_{\mathrm{D}})\) constitutes the empirical ROC curve. The fifth column consists of numerical solutions of the GP (14) given the empirical FPRs, so that \((\hat{S}_{\mathrm{ND}},\tilde{S}_{\mathrm{D}}^{\mathrm{lcm}})\) constitutes the conditional NPMLE. Columns 6 through 11 are for the unconditional NPMLE computed using Lloyd's method discussed in Sect. 2.3. In particular, the last two columns, \(\tilde{S}_{\mathrm{ND}}^{\mathrm{lloyd}}\) and \(\tilde{S}_{\mathrm{D}}^{\mathrm{lloyd}}\), together make up the unconditional NPMLE of the convex ROC curve. Boldface entries indicate where both NPMLEs coincide with the ROCCH and meet the empirical ROC curve at the same points; these points are the vertices of the ROCCH. The conditional NPMLE effectively estimates the TPRs only at these vertex points, whereas the unconditional NPMLE also estimates them (and the FPRs) in between.

Table 1 An illustration of the unconditional and the conditional NPMLEs

Finally, it is worth noting that, by the (anti-)symmetry in the formulation (6) and in the proof of Theorem 1, we can obtain the ROCCH by fixing the TPR estimate instead:

Corollary 1

Let \(\tilde{S}_{\mathrm{ND}}^{\mathrm{lcm}}\) be the class conditional survival function such that the curve \((\tilde{S}_{\mathrm{ND}}^{\mathrm{lcm}}, \hat{S}_{\mathrm{D}})\) is the least concave majorant of the empirical ROC curve \((\hat{S}_{\mathrm{ND}}, \hat{{S}}_{\mathrm{D}} )\) seen from the TPR axis. Then, \(\tilde{S}_{\mathrm{ND}}^{\mathrm{lcm}}\) solves the conditional NPMLE (14) with the roles of ND and D switched.

4 Conditional bootstrap of the ROCCH

That the conditional NPMLE of a convex ROC curve coincides with the ROCCH of the empirical ROC curve suggests a useful bootstrap procedure to estimate the variance of the ROCCH (see, e.g., Macskassy et al. 2005). This conditional bootstrap procedure, which samples separately from the positive and the negative groups, allows us to evaluate the variance component contributed by each of the groups being compared (Hinkley 1988; Tibshirani and Knight 1999). Decomposing the variance components is advantageous because each term constitutes the achievable minimum total variance of the ROCCH estimate as the size of the other group increases. Therefore, this procedure also provides a simple means to compute the total variance of the ROCCH when the sample is imbalanced (Mladenic and Grobelnik 1999). The pointwise confidence limit for a convex ROC curve \(R(p)=S_{\mathrm{D}}(S_{\mathrm{ND}}^{-1}(p))\) relies on the variance of its conditional NPMLE \(\tilde{\mathbf{R}}(p)=\tilde{\mathbf{S}}_{\mathrm{D}}( \hat{\mathbf{S}}_{\mathrm{ND}}^{-1} (p) )\), i.e., the ROCCH. (We use boldface letters to emphasize that the corresponding quantities are random; normal-faced letters are their realizations.) From the law of total variance, the variance of the ROCCH can be decomposed as

$$ \operatorname{Var} \bigl[ \tilde{\mathbf{R}}(p) \bigr] = \operatorname{E} \bigl[ \operatorname{Var} \bigl( \tilde{\mathbf{R}}(p) \mid \mathbf{S}_{\mathrm{D}} \bigr) \bigr] + \operatorname{Var} \bigl[ \operatorname{E} \bigl( \tilde{\mathbf{R}}(p) \mid \mathbf{S}_{\mathrm{D}} \bigr) \bigr], $$
(19)

where the first and the second terms indicate the sampling variability from the negative (ND) and the positive (D) groups, respectively. The expectations in (19) can be approximated using a mode (or mode-type) approximation as

$$ \operatorname{E} \bigl[ \operatorname{Var} \bigl( \tilde{\mathbf{R}}(p) \mid \mathbf{S}_{\mathrm{D}} \bigr) \bigr] \approx \operatorname{Var} \bigl[ \hat{S}_{\mathrm{D}} \bigl( \tilde{\mathbf{S}}_{\mathrm{ND}}^{-1}(p) \bigr) \bigr] $$
(20)

and

$$ \operatorname{Var} \bigl[ \operatorname{E} \bigl( \tilde{\mathbf{R}}(p) \mid \mathbf{S}_{\mathrm{D}} \bigr) \bigr] \approx \operatorname{Var} \bigl[ \tilde{\mathbf{S}}_{\mathrm{D}} \bigl( \hat{S}_{\mathrm{ND}}^{-1}(p) \bigr) \bigr], $$
(21)

where \(\hat{S}_{\mathrm{D}}\) and \(\hat{S}_{\mathrm{ND}}\) (normal-faced) are the observed TPR and FPR. Now it is seen that

$$ \operatorname{Var} \bigl[ \tilde{\mathbf{R}}(p) \bigr] \approx \operatorname{Var} \bigl[ \tilde{\mathbf{S}}_{\mathrm{D}} \bigl( \hat{S}_{\mathrm{ND}}^{-1}(p) \bigr) \bigr] + \operatorname{Var} \bigl[ \hat{S}_{\mathrm{D}} \bigl( \tilde{\mathbf{S}}_{\mathrm{ND}}^{-1}(p) \bigr) \bigr], $$
(22)

a separation of the contributions to the variance from the positive and the negative groups, respectively. Observe that the first term (resp. the second term) of the right-hand side becomes the achievable minimum total variance (the left-hand side) as \(n_{\mathrm{ND}}\) (resp. \(n_{\mathrm{D}}\)) increases; the second term (resp. the first term) degenerates to zero as \(n_{\mathrm{ND}}\) (resp. \(n_{\mathrm{D}}\)) increases.

The first (“positive”) term is computed as follows.

Algorithm 1
Conditional bootstrap method for variance component decomposition

Note that Theorem 1 comes into play in line 5 of Algorithm 1, where the ROCCH of each bootstrap replicate serves as its conditional NPMLE. For the second ("negative") term, switch the positive and the negative groups in the above procedure. Note that this separate evaluation is not possible with a naive bootstrap, which resamples the whole set of \(n_{\mathrm{D}}+n_{\mathrm{ND}}\) observations. When the sample is imbalanced, e.g., \(n_{\mathrm{D}} \ll n_{\mathrm{ND}}\), then

$$\operatorname {Var}\bigl[ \hat{S}_{\mathrm{D}} \bigl( \tilde{ \mathbf{S}}_{\mathrm{ND}}^{-1}(p) \bigr) \bigr]\approx 0 \quad \mbox{and} \quad \operatorname {Var}\bigl[ \tilde{\mathbf{S}}_{\mathrm{D}} \bigl( \tilde{ \mathbf{S}}_{\mathrm{ND}}^{-1} (p) \bigr) \bigr] \approx \operatorname {Var}\bigl[ \tilde{\mathbf{S}}_{\mathrm{D}} \bigl( \hat{S}_{\mathrm{ND}}^{-1}(p) \bigr) \bigr], $$

so that only the “positive” conditional bootstrap suffices.
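A minimal sketch of the "positive" term of Algorithm 1 follows, reusing the empirical_roc and lcm_of_roc helpers sketched in the previous sections; the function names, the number of replicates B, the FPR grid, and the seed are our illustrative choices, not the paper's:

```python
import numpy as np

def rocch_on_grid(neg, pos, grid):
    """LCM of the empirical ROC curve, evaluated at the FPR values in grid."""
    hull = lcm_of_roc(*empirical_roc(neg, pos))
    hx, hy = zip(*hull)
    return np.interp(grid, hx, hy)       # the LCM is piecewise linear

def positive_variance_term(neg, pos, grid, B=500, seed=0):
    rng = np.random.default_rng(seed)
    curves = np.empty((B, len(grid)))
    for b in range(B):
        boot_pos = rng.choice(pos, size=len(pos), replace=True)
        # negatives are kept fixed: we condition on the empirical FPR, and by
        # Theorem 1 the conditional NPMLE of each replicate is its ROCCH
        curves[b] = rocch_on_grid(neg, boot_pos, grid)
    return curves.var(axis=0)            # pointwise bootstrap variance

# usage mirroring the study below: n_D = 50, n_ND = 1000
rng = np.random.default_rng(1)
neg, pos = rng.normal(0, 1, 1000), rng.normal(0.5, 1, 50)
var_pos = positive_variance_term(neg, pos, grid=np.linspace(0, 1, 101))
```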

We conducted a simple numerical study to demonstrate how the proposed conditional bootstrap procedure approximates the variance components in (19) and how these components vary as the sample becomes imbalanced. We set \(n_{\mathrm{D}}=50\) and varied \(n_{\mathrm{ND}}=50\), 100, 200, 300, and 1000. The scores of the negative group were distributed normally with mean 0 and variance 1, and the scores of the positive group were distributed normally with mean 0.5 and variance 1. For each choice of \(n_{\mathrm{ND}}\), we generated \(B=500\) data sets and applied the conditional bootstrap to estimate the "positive" and the "negative" variance terms. We compared them with their true values in (19). The results are shown in Fig. 1 for \(n_{\mathrm{ND}}=100\) and 1000. Note that the bootstrap estimates of both variance terms are very close to the true values. In particular, the "negative" variance estimate almost vanishes at \(n_{\mathrm{ND}}=1000\).

Fig. 1
Illustration of the conditional bootstrap for \(n_{\mathrm{D}}=50\). Circles indicate the true variance; crosses, the bootstrap estimates of the variance; dash-dot lines, the 5 % and 95 % bootstrap confidence limits of the variance, obtained from 500 bootstrap samples

5 Conclusion

In this paper we interpreted the ROC convex hull, known as an efficient tool for accounting for class-dependent misclassification costs in designing a classifier, from a maximum likelihood estimation perspective. We provided two nonparametric maximum likelihood formulations subject to the convexity constraint on the ROC curve and showed that the ROCCH solves both NPMLE problems. In particular, the conditional NPMLE interpretation of the ROCCH enables standard machinery, such as the bootstrap, to assess uncertainties in the ROCCH. The proposed conditional bootstrap method can estimate the finite-sample variabilities of the ROCCH arising from the positive and the negative classes separately, and allows us to find the achievable confidence limit of the ROCCH efficiently for imbalanced samples.