1 Introduction

The receiver operating characteristic (ROC) curve is commonly used to describe the accuracy of a medical or other diagnostic test that classifies individuals into “healthy” and “diseased” categories. For a comprehensive review of the literature, see the books by Zhou et al. (2002), Pepe (2003) and Krzanowski and Hand (2009). Suppose that the independent real random variables X and Y denote the test scores of healthy and diseased patients, respectively, and that for a given cutoff point c, the test result is positive if it is greater than c. Let F and G be the completely unknown distribution functions of the random variables X and Y, respectively. The sensitivity of the test is defined as SE(c)=1−G(c), which is the probability that a truly diseased individual has a positive test result. Similarly, the specificity of the test is given by SP(c)=F(c) and describes the probability that a truly non-diseased individual has a negative test result. The ROC curve is defined as a plot of SE(c) versus 1−SP(c) for −∞≤c≤∞, or equivalently as a plot of

\[\mathit{ROC}(t)=1-G(F^{-1}(1-t))\tag{1}\]

against t, for t∈[0,1].

There exist many different methods of estimating the ROC curve, but most of them are based on parametric or semiparametric models (e.g. Pepe 2000; Qin and Zhang 2003; Davidov and Nov 2012). In this paper we are interested in nonparametric estimation, which seems to be more reliable because it does not require a model for F and G. Let \(\pmb{X}_{m}=(X_{1},\ldots,X_{m})\) and \(\pmb{Y}_{n}=(Y_{1},\ldots,Y_{n})\) be simple independent random samples from the healthy and diseased populations, respectively. There exist several methods of estimating the ROC curve nonparametrically from such data. A commonly used nonparametric estimator is the empirical ROC curve of the form

\[\mathit{ROC}_{m,n}(t)=1-G_{n}(F_{m}^{-1}(1-t)),\tag{2}\]

where \(F_{m}^{-1}\) and \(G_{n}\) denote the empirical quantile function of the sample \(\pmb{X}_{m}\) and the empirical cumulative distribution function of the sample \(\pmb{Y}_{n}\), respectively (e.g. Hsieh and Turnbull 1996; Bowyer et al. 2001). Asymptotic properties of this estimator were studied by Hsieh and Turnbull (1996). They showed that, under some basic assumptions on the distribution functions F and G,

\[\sup_{t\in[0,1]}\bigl|\mathit{ROC}_{m,n}(t)-\mathit{ROC}(t)\bigr|\longrightarrow 0\quad\text{almost surely}\]

when m=m(n) is a nondecreasing function of n and m(n)→∞ as n→∞.

The empirical ROC curve retains many properties of the empirical distribution function. It is uniformly convergent to the theoretical curve (Hsieh and Turnbull 1996), but it is also not continuous and not very accurate for small sample sizes. Other methods are needed to obtain a smooth estimator of the ROC curve.
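As a minimal sketch of how the empirical ROC curve can be evaluated, the following Python code composes an empirical distribution function with an empirical quantile function. The function names are illustrative and do not come from the paper, and the quantile convention \(\inf\{x: F_{m}(x)\geq p\}\) is an assumption.

```python
from bisect import bisect_right
import math

def ecdf(sample):
    """Return the empirical distribution function of `sample`."""
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect_right(xs, x) / n

def empirical_quantile(sample):
    """Return p -> inf{x : F_m(x) >= p} (assumed quantile convention)."""
    xs = sorted(sample)
    m = len(xs)

    def q(p):
        if p <= 0:
            return -math.inf
        # Smallest k with k/m >= p; the tolerance guards against float error.
        k = math.ceil(p * m - 1e-12)
        return xs[min(k, m) - 1]

    return q

def empirical_roc(healthy, diseased):
    """Return t -> 1 - G_n(F_m^{-1}(1 - t)), a step function on [0, 1]."""
    G_n = ecdf(diseased)
    F_inv = empirical_quantile(healthy)
    return lambda t: 1.0 - G_n(F_inv(1.0 - t))
```

The result is a step function, which illustrates why the empirical ROC curve is not continuous.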

Lloyd (1998) used the kernel smoothing technique to obtain a smooth ROC curve estimator given by

\[\widetilde{\mathit{ROC}}_{m,n}(t)=1-\widetilde{G}_{n}(\widetilde{F}_{m}^{-1}(1-t)),\tag{3}\]

where

\[\widetilde{F}_{m}(x)=\frac{1}{m}\sum_{i=1}^{m}\mathcal{K}\biggl(\frac{x-X_{i}}{h_{m}}\biggr),\qquad \widetilde{G}_{n}(y)=\frac{1}{n}\sum_{j=1}^{n}\mathcal{K}\biggl(\frac{y-Y_{j}}{h_{n}}\biggr)\]

are standard kernel estimators of F and G with kernel function K, \(\mathcal {K}(v)= \int_{- \infty}^{v}K(z)dz\) and bandwidth parameters \(h_{m}\) and \(h_{n}\). Lloyd and Yong (1999) showed that estimator (3) has better mean squared error properties than the empirical ROC curve. In the problem of kernel density estimation, the choice between the many available kernel functions is relatively unimportant, as all give comparable results, but more care needs to be taken over the selection of the bandwidth. Therefore, in kernel ROC curve estimation the main emphasis is put on bandwidth selection (Zhou and Harezlak 2002; Hall and Hyndmann 2003). Unfortunately, to the best of our knowledge, only pointwise, and not uniform, convergence to the theoretical ROC curve has been established for estimator (3).
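The following Python sketch shows one way a kernel ROC estimator of this type could be computed. It assumes a Gaussian kernel, takes the two bandwidths as plain arguments rather than selecting them from the data, and inverts the smoothed distribution function by bisection. All three choices, as well as the function names, are assumptions made for illustration and are not taken from Lloyd (1998).

```python
import math

def normal_cdf(v):
    """Integrated Gaussian kernel, i.e. the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def kernel_cdf(sample, h):
    """Kernel distribution function estimator with bandwidth h (assumed Gaussian kernel)."""
    n = len(sample)
    return lambda x: sum(normal_cdf((x - s) / h) for s in sample) / n

def invert_increasing(func, p, lo, hi, tol=1e-10):
    """Solve func(x) = p by bisection, assuming func is increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if func(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def kernel_roc(healthy, diseased, h_m, h_n):
    """Return t -> 1 - G(F^{-1}(1 - t)) for the kernel estimators F and G."""
    F = kernel_cdf(healthy, h_m)
    G = kernel_cdf(diseased, h_n)
    # Search bracket: ten bandwidths beyond the healthy sample range,
    # where the Gaussian tails are numerically negligible.
    lo = min(healthy) - 10.0 * h_m
    hi = max(healthy) + 10.0 * h_m

    def roc(t):
        if t <= 0.0:
            return 0.0  # limiting value as the cutoff tends to +infinity
        if t >= 1.0:
            return 1.0  # limiting value as the cutoff tends to -infinity
        return 1.0 - G(invert_increasing(F, 1.0 - t, lo, hi))

    return roc
```

Unlike the empirical ROC curve, the resulting curve is smooth, but its shape depends on the bandwidths supplied.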

In the problem of kernel distribution function estimation, Zieliński (2007) proposed to replace the standard smoothing parameter \(h_{m}\) by the random bandwidth

(4)

where \(X_{1:m}\leq X_{2:m}\leq\cdots\leq X_{m:m}\) are the order statistics from the sample \(\pmb{X}_{m}\). He obtained a continuous estimator of the unknown distribution function with asymptotic properties similar to those of the empirical distribution function. Unfortunately, his estimator is not invertible, because it is constant on some subintervals of the real line ℝ, so it cannot be used to obtain a continuous ROC curve estimator.

In the next section, building on the idea of Zieliński (2007), we propose a construction of a continuous and easily invertible estimator of the distribution function. It leads to a continuous and strictly increasing nonparametric estimator of the ROC curve, which is in fact a smoothed version of the empirical ROC curve. We prove that the proposed estimator converges uniformly to the theoretical ROC curve, almost surely. In Sect. 3 we report the results of simulation studies and compare the efficiency of the proposed estimator with that of some other nonparametric estimators. In Sect. 4 we apply the proposed estimator to a real data set. Section 5 contains conclusions and some prospects.

2 Smoothed empirical ROC curve

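Let \(\pmb{X}_{m}=(X_{1},\ldots,X_{m})\) and \(\pmb{Y}_{n}=(Y_{1},\ldots,Y_{n})\) be random samples from unknown continuous distribution functions F and G defined on the real line ℝ, respectively. Let \(X_{1:m}\leq X_{2:m}\leq\cdots\leq X_{m:m}\) and \(Y_{1:n}\leq Y_{2:n}\leq\cdots\leq Y_{n:n}\) denote the order statistics from the samples \(\pmb {X}_{m}\) and \(\pmb{Y}_{n}\). We set

\[Q_{1}(\pmb{X}_{m})=L,\qquad Q_{j}(\pmb{X}_{m})=\frac{X_{j-1:m}+X_{j:m}}{2},\quad j=2,\ldots,m,\qquad Q_{m+1}(\pmb{X}_{m})=U,\]
\[Q_{1}(\pmb{Y}_{n})=L,\qquad Q_{j}(\pmb{Y}_{n})=\frac{Y_{j-1:n}+Y_{j:n}}{2},\quad j=2,\ldots,n,\qquad Q_{n+1}(\pmb{Y}_{n})=U,\]

where L, U are random variables such that \(L\leq\min\{X_{1:m},Y_{1:n}\}\) and \(U\geq\max\{X_{m:m},Y_{n:n}\}\) almost surely. Denote

\[R_{j}(\pmb{X}_{m})=Q_{j+1}(\pmb{X}_{m})-Q_{j}(\pmb{X}_{m}),\quad j=1,\ldots,m,\qquad R_{j}(\pmb{Y}_{n})=Q_{j+1}(\pmb{Y}_{n})-Q_{j}(\pmb{Y}_{n}),\quad j=1,\ldots,n.\]

With this notation we define the distribution function estimators given by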

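\[\widehat{F}_{m}(x)=\frac{1}{m}\sum_{j=1}^{m}\mathcal{K}\biggl(\frac{x-Q_{j}(\pmb{X}_{m})}{R_{j}(\pmb{X}_{m})}\biggr),\tag{5}\]
\[\widehat{G}_{n}(x)=\frac{1}{n}\sum_{j=1}^{n}\mathcal{K}\biggl(\frac{x-Q_{j}(\pmb{Y}_{n})}{R_{j}(\pmb{Y}_{n})}\biggr),\tag{6}\]

where

\[\mathcal{K}(x)=\begin{cases}0, & x<0,\\ r(x), & 0\leq x\leq 1,\\ 1, & x>1,\end{cases}\tag{7}\]

and \(r\colon[0,1]\to[0,1]\) is a continuous, strictly increasing function such that r(0)=0, r(1)=1, e.g. r(x)=x. In comparison with the kernel estimator proposed by Zieliński (2007), we have replaced the random bandwidth \(H_{m}\) of the form (4) by the differences \(R_{j}(\cdot)=Q_{j+1}(\cdot)-Q_{j}(\cdot)\) to make the estimators \(\widehat{F}_{m}(x)\) and \(\widehat{G}_{n}(x)\) continuous and strictly increasing on [L,U]. The plain order statistics have been replaced by the statistics \(Q_{j}(\pmb{X}_{m})\) and \(Q_{j}(\pmb{Y}_{n})\), which indicate the centers of the intervals between consecutive order statistics. The purpose of this change is to avoid a situation in which the smoothed estimators are always below or always above the empirical distribution functions.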
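The construction described above can be sketched in Python as follows. The sketch assumes that the knots are the lower bound L, the midpoints of consecutive order statistics and the upper bound U, and that on each interval between knots the estimator rises by 1/m according to the function r. This is a reading of the surrounding description, and the function names are hypothetical. The sample values are assumed to be distinct, as they are almost surely for continuous F.

```python
from bisect import bisect_right

def knots(sample, lower, upper):
    """Assumed knots: lower bound, midpoints of consecutive order statistics, upper bound."""
    xs = sorted(sample)
    mids = [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    return [lower] + mids + [upper]

def smoothed_ecdf(sample, lower, upper, r=lambda u: u):
    """Continuous, strictly increasing (on [lower, upper]) version of the ECDF.

    `r` must be continuous and strictly increasing with r(0) = 0 and r(1) = 1;
    the default r(u) = u gives a piecewise-linear estimator.
    """
    q = knots(sample, lower, upper)
    m = len(sample)

    def F_hat(x):
        if x <= q[0]:
            return 0.0
        if x >= q[-1]:
            return 1.0
        j = bisect_right(q, x) - 1  # 0-based index with q[j] <= x < q[j + 1]
        return (j + r((x - q[j]) / (q[j + 1] - q[j]))) / m

    return F_hat
```

For the sample 1, 2, 3, 4 with L=0 and U=5, the knots are 0, 1.5, 2.5, 3.5, 5, and the estimator crosses the jumps of the empirical distribution function instead of staying on one side of them.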

Lemma 1

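For each \(x\in\mathbb{R}\)

\[\bigl|\widehat{F}_{m}(x)-F_{m}(x)\bigr|\leq\frac{1}{m},\qquad \bigl|\widehat{G}_{n}(x)-G_{n}(x)\bigr|\leq\frac{1}{n}.\]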

Proof

We give the proof only of the first inequality. The proof of the second inequality is essentially the same.

Consider an arbitrary \(x\in\mathbb{R}\). It belongs to exactly one of the disjoint intervals \((-\infty, Q_{1}(\pmb{X}_{m}))\), \([Q_{j}(\pmb {X}_{m}), Q_{j+1}(\pmb{X}_{m}))\), j=1,2,…,m, \([Q_{m+1}(\pmb {X}_{m}),\infty)\). It is easy to check that \(\widehat{F}_{m}(x)=F_{m}(x)=0\) for \(x<Q_{1}(\pmb{X}_{m})\) and \(\widehat{F}_{m}(x)=F_{m}(x)=1\) for \(x\geq Q_{m+1}(\pmb{X}_{m})\). Consider one of the other cases. Let \(x \in [Q_{j}(\pmb{X}_{m}),Q_{j+1}(\pmb{X}_{m}))\) for some j=1,2,…,m. Then

Moreover, for i<j

and for i>j

Hence

and consequently

Of course, the same inequality holds in the case of the empirical distribution function \(F_{m}(x)\). It follows that

which completes the proof. □
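The displayed steps of this proof are not reproduced in this copy. The following LaTeX block is a sketch of one chain of steps consistent with the surrounding text. It assumes that \(\widehat{F}_{m}\) is an average of m terms \(\mathcal{K}((x-Q_{i})/R_{i})\), where \(\mathcal{K}\) equals 0 below 0, r on [0,1] and 1 above 1. This form is inferred from the description of the estimator and is not quoted from the original displays.

```latex
% Assumption: \widehat{F}_m(x) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{K}\bigl((x-Q_i)/R_i\bigr),
% with Q_i = Q_i(\pmb{X}_m), R_i = Q_{i+1}-Q_i, and x \in [Q_j, Q_{j+1}).
\begin{align*}
  & 0 \le \frac{x-Q_j}{R_j} < 1, \\
  & i<j \;\Longrightarrow\; \frac{x-Q_i}{R_i} \ge \frac{Q_{i+1}-Q_i}{R_i} = 1
      \;\Longrightarrow\; \mathcal{K}\Bigl(\frac{x-Q_i}{R_i}\Bigr) = 1, \\
  & i>j \;\Longrightarrow\; \frac{x-Q_i}{R_i} < 0
      \;\Longrightarrow\; \mathcal{K}\Bigl(\frac{x-Q_i}{R_i}\Bigr) = 0, \\
  & \widehat{F}_m(x) = \frac{j-1}{m} + \frac{1}{m}\, r\Bigl(\frac{x-Q_j}{R_j}\Bigr)
      \quad\text{and hence}\quad \frac{j-1}{m} \le \widehat{F}_m(x) \le \frac{j}{m}, \\
  & X_{j-1:m} \le Q_j \le x < Q_{j+1} \le X_{j+1:m}
      \;\Longrightarrow\; \frac{j-1}{m} \le F_m(x) \le \frac{j}{m}, \\
  & \bigl|\widehat{F}_m(x) - F_m(x)\bigr| \le \frac{1}{m}.
\end{align*}
% For j = 1 and j = m the bounds on F_m(x) follow in the same way from
% Q_1 = L \le X_{1:m} and Q_{m+1} = U \ge X_{m:m}.
```

Both functions lie in the same interval of length 1/m, which gives the bound.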

The following lemma is an immediate consequence of Lemma 1 and the Glivenko-Cantelli theorem.

Lemma 2

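We have

\[\sup_{x\in\mathbb{R}}\bigl|\widehat{F}_{m}(x)-F(x)\bigr|\longrightarrow 0\ \text{ as } m\to\infty,\qquad \sup_{x\in\mathbb{R}}\bigl|\widehat{G}_{n}(x)-G(x)\bigr|\longrightarrow 0\ \text{ as } n\to\infty,\]

with probability one.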

The inverse function of \(\widehat{F}_{m}(t)\) on [L,U] can be written as

It is clear that \(\widehat{F}_{m}^{-1}(t)\) is continuous and strictly increasing on [0,1]. Since \(\widehat{G}_{n}(t)\) is continuous and strictly increasing on [L,U], it follows that the composition \(\widehat{G}_{n}(\widehat {F}_{m}^{-1}(t))\) is continuous and strictly increasing on [0,1].

Hence we can define the continuous and strictly increasing ROC curve estimator given by

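\[\widehat{\mathit{ROC}}_{m,n}(t)=1-\widehat{G}_{n}(\widehat{F}_{m}^{-1}(1-t)),\quad t\in[0,1].\tag{8}\]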

An appropriate choice of the function r appearing in formula (7) can guarantee differentiability of the estimator (e.g. if the function r is differentiable and \(r_{+}'(0)=r_{-}'(1)=0\)). At the same time, computing the estimator (8) remains as easy as computing the empirical ROC curve.
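To illustrate how simple the computation can be, here is a Python sketch of the smoothed ROC estimator for the special case r(x)=x. Under the assumption that the knots are L, the midpoints of consecutive order statistics and U, the smoothed distribution function is then piecewise linear, so both it and its inverse reduce to linear interpolation. The helper names are hypothetical, and the sample values are assumed to be distinct.

```python
from bisect import bisect_right

def knots(sample, lower, upper):
    """Assumed knots: lower bound, midpoints of consecutive order statistics, upper bound."""
    xs = sorted(sample)
    return [lower] + [(a + b) / 2.0 for a, b in zip(xs, xs[1:])] + [upper]

def piecewise_linear(xs, ys):
    """Increasing piecewise-linear map through (xs[i], ys[i]), constant outside [xs[0], xs[-1]]."""
    def f(x):
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        j = bisect_right(xs, x) - 1
        w = (x - xs[j]) / (xs[j + 1] - xs[j])
        return ys[j] + w * (ys[j + 1] - ys[j])
    return f

def smoothed_roc(healthy, diseased, lower, upper):
    """Return t -> 1 - G_hat(F_hat^{-1}(1 - t)) for r(x) = x.

    Requires lower <= min of both samples and upper >= max of both samples.
    """
    m, n = len(healthy), len(diseased)
    qx = knots(healthy, lower, upper)
    qy = knots(diseased, lower, upper)
    # F_hat maps knot j to j/m, so its inverse maps j/m back to knot j.
    F_inv = piecewise_linear([j / m for j in range(m + 1)], qx)
    G_hat = piecewise_linear(qy, [j / n for j in range(n + 1)])
    return lambda t: 1.0 - G_hat(F_inv(1.0 - t))
```

A differentiable choice of r would require replacing the linear interpolation by r and its inverse on each interval between knots.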

Figure 1 shows an example of a true ROC curve corresponding to the normal distributions \(\mathcal{N}(0,1)\) and \(\mathcal{N}(1,1)\), the fitted empirical ROC curve \(\mathit{ROC}_{m,n}\) and the smoothed nonparametric ROC curve \(\widehat{\mathit{ROC}}_{m,n}\) for m=n=50.

Fig. 1 The fitted empirical ROC curve (dotted curve), the smoothed nonparametric estimator of the ROC curve (dashed curve) and the true ROC curve (solid curve) for sample data generated from the normal distributions \(\mathcal{N}(0,1)\) and \(\mathcal{N}(1,1)\)

Lemma 3

Let \(\{f_{n}\}_{n\in\mathbb{N}}\) be a sequence of nondecreasing continuous surjective functions \(f_{n} \colon\mathbb {R} \stackrel{\mathrm{onto}}{\longrightarrow} [0,1]\). Let \(f_{n}^{-1}(y)=\inf{\{x: f_{n}(x)=y\}}\). Assume that \(\sup_{x \in\mathbb {R}}|f_{n}(x)-f(x)| \stackrel{n\rightarrow\infty}{\longrightarrow} 0\), where \(f \colon\mathbb{R} \stackrel{\mathrm{onto}}{\longrightarrow} (0,1)\) is a differentiable surjective function such that \(f'(x)>0\) for all \(x\in\mathbb{R}\). Then:

(1) \(f_{n}^{-1}(y) \stackrel{n\rightarrow\infty}{\longrightarrow} f^{-1}(y)\) for all \(y\in(0,1)\),

(2) \(\sup_{y \in[a,b]}|f_{n}^{-1}(y)-f^{-1}(y)| \stackrel{n\rightarrow\infty}{\longrightarrow} 0\) for every \(0<a<b<1\).

Proof

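Fix \(y\in(0,1)\) and \(\varepsilon>0\). To prove statement (1) of the lemma we need to show that for sufficiently large n

\[f^{-1}(y)-\varepsilon\leq f_{n}^{-1}(y)\leq f^{-1}(y)+\varepsilon.\]

The function \(f_{n}\) is nondecreasing, so the above inequality is equivalent to

\[f_{n}(f^{-1}(y)-\varepsilon)\leq f_{n}(f_{n}^{-1}(y))\leq f_{n}(f^{-1}(y)+\varepsilon).\]

Let us remark that \(f_{n}(f_{n}^{-1}(y))=f_{n}(\inf{\{x: f_{n}(x)=y\}})=y\). Therefore, it suffices to show that

(i) \(f_{n}(f^{-1}(y)-\varepsilon)\leq y\),

(ii) \(f_{n}(f^{-1}(y)+\varepsilon)\geq y\),

for all \(n\geq n_{0}\), for some \(n_{0}\in\mathbb{N}\).

Let us now consider the expressions \(f(f^{-1}(y)-\varepsilon)\) and \(f(f^{-1}(y)+\varepsilon)\). By Lagrange's mean value theorem, there exist \(c_{1}\in(f^{-1}(y)-\varepsilon,f^{-1}(y))\) and \(c_{2}\in(f^{-1}(y),f^{-1}(y)+\varepsilon)\) such that

\[f(f^{-1}(y)-\varepsilon)=y-\varepsilon f'(c_{1}),\qquad f(f^{-1}(y)+\varepsilon)=y+\varepsilon f'(c_{2}).\]

Let \(\varepsilon_{0}=\min\{\varepsilon f'(c_{1}),\varepsilon f'(c_{2})\}>0\). The sequence \(\{f_{n}\}_{n\in\mathbb{N}}\) converges uniformly to the function f, so there exists \(n_{0}\in\mathbb{N}\) such that for every \(n\geq n_{0}\) and for every \(x\in\mathbb{R}\)

\[|f_{n}(x)-f(x)|\leq\varepsilon_{0}.\]

Thus, in particular, for \(x=f^{-1}(y)\pm\varepsilon\) and for \(n\geq n_{0}\)

\[f_{n}(f^{-1}(y)-\varepsilon)\leq y-\varepsilon f'(c_{1})+\varepsilon_{0}\leq y,\qquad f_{n}(f^{-1}(y)+\varepsilon)\geq y+\varepsilon f'(c_{2})-\varepsilon_{0}\geq y,\]

which proves (i), (ii) and hence statement (1) of the lemma.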

Let us remark that the argument above does not yield uniform convergence, because the constant \(n_{0}\) depends on \(\varepsilon_{0}\), whose choice depends on the constants \(c_{1}\) and \(c_{2}\), and hence on y. If we consider only \(y\in[a,b]\), \(0<a<b<1\), then we can take \(\varepsilon_{0} = \varepsilon \inf_{c \in[f^{-1}(a)-\varepsilon,f^{-1}(b)+\varepsilon]}{f'(c)} > 0\), which does not depend on y. Therefore there exists \(n_{1}\in\mathbb{N}\), independent of y, such that for \(n\geq n_{1}\) inequalities (i) and (ii) are satisfied. This proves statement (2) of the lemma. □

We can now formulate the main theorem of the paper.

Theorem 1

Let \(\pmb{X}_{m}=(X_{1},\ldots,X_{m})\) and \(\pmb{Y}_{n}=(Y_{1},\ldots,Y_{n})\) be simple random samples from continuous cumulative distribution functions F and G, respectively. Let \(\widehat{\mathit{ROC}}_{m,n}(t)\) denote the ROC curve estimator given by (8) and let \(\mathit{ROC}(t)=1-G(F^{-1}(1-t))\). Then

\[\sup_{t\in[0,1]}\bigl|\widehat{\mathit{ROC}}_{m,n}(t)-\mathit{ROC}(t)\bigr|\longrightarrow 0\]

almost surely when m=m(n) is a nondecreasing function of n and m(n)→∞ as n→∞.

Proof

Consider the inequality

(9)

It is clear that

By the above inequality and Lemma 2, we have

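What is left is to show that the first term on the right side of the inequality (9) converges to zero almost surely as m→∞. To prove this, fix \(\varepsilon>0\) and denote \(M=\sup_{x\in\mathbb{R}}G'(x)<\infty\). Let \(\varepsilon_{1}=\varepsilon/2\), \(\varepsilon_{2}=\varepsilon/(2M)\) and let \(\delta,\eta\in(0,1)\) be such that \(G(F^{-1}(\delta))\leq\varepsilon_{1}\) and \(G(F^{-1}(\eta))\geq1-\varepsilon_{1}\).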

Let us remark that

(10)

With respect to the first element of the maximum appearing in the equality (10), the following inequalities hold:

Similarly, we obtain

Finally, we have

Replacing expressions appearing in formula (10), by their upper estimates obtained above, we get

(11)

Let us now remark that the sequence \(\{\widehat{F}_{m}\}_{m \in\mathbb {N}}\) of cumulative distribution function estimators satisfies the conditions of Lemma 3 almost surely. Indeed, \(\widehat{F}_{m} \colon\mathbb{R} \stackrel{\mathrm{onto}}{\longrightarrow} [0,1]\) is a nondecreasing continuous surjective function and \(\widehat {F}_{m}(\widehat{F}_{m}^{-1}(t))=t\) for all \(t\in(0,1)\). Moreover, by Lemma 2, the sequence \(\{\widehat{F}_{m}\}_{m \in\mathbb {N}}\) converges uniformly to F and \(F'(x)=f(x)>0\) for all \(x\in\mathbb{R}\). Therefore, by Lemma 3, statement (1), there exists \(m_{0}=m_{0}(\delta,\eta)\in\mathbb{N}\) such that for \(m\geq m_{0}\)

By Lemma 3, statement (2), there exists \(m_{1}\in\mathbb{N}\) such that for \(m\geq m_{1}\)

It follows that for \(m\geq\max\{m_{0},m_{1}\}\), we have

This completes the proof. □

3 Simulation study

To investigate the performance of the proposed ROC curve estimator (8) for realistic sample sizes (m,n), a small simulation study was conducted. We considered three different combinations of distributions for X and Y, where \(\mathcal{LG}\) denotes the logistic distribution: (1) \(X \sim\mathcal{N}(0,1)\), \(Y \sim \mathcal{N}(1,1)\); (2) \(X \sim\mathcal{LG}(0,1)\), \(Y \sim\mathcal {LG}(3,1)\); (3) \(X \sim\mathcal{N}(0,9)\), \(Y \sim\mathcal{LG}(3,1)\). The corresponding true ROC curves are plotted in Fig. 2. The areas under the considered ROC curves equal 0.76025, 0.88697 and 0.80561, respectively.
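The first of these areas can be checked by hand, since for independent normal X and Y the area under the ROC curve equals P(X<Y), which has a closed form. The short Python sketch below evaluates it for case (1); the function names are illustrative, and the other two cases are not covered.

```python
import math

def normal_cdf(v):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def binormal_auc(mu_x, sigma_x, mu_y, sigma_y):
    """AUC = P(X < Y) for independent X ~ N(mu_x, sigma_x^2), Y ~ N(mu_y, sigma_y^2)."""
    return normal_cdf((mu_y - mu_x) / math.sqrt(sigma_x ** 2 + sigma_y ** 2))

# Case (1): X ~ N(0, 1), Y ~ N(1, 1), so AUC = Phi(1 / sqrt(2)).
auc_case_1 = binormal_auc(0.0, 1.0, 1.0, 1.0)
print(round(auc_case_1, 5))  # 0.76025
```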

Fig. 2 The ROC curves corresponding to: (1) \(X \sim\mathcal{N}(0,1)\), \(Y \sim\mathcal{N}(1,1)\) (solid curve); (2) \(X \sim\mathcal {LG}(0,1)\), \(Y \sim\mathcal{LG}(3,1)\) (dotted curve); (3) \(X \sim \mathcal{N}(0,9)\), \(Y \sim\mathcal{LG}(3,1)\) (dashed curve)

For each ROC curve we have generated 1000 training data sets with (m,n)∈{(16,24),(20,20),(24,16),(40,60),(50,50),(60,40),(100,150),(125,125),(150,100)}. Next, for each data set, we have computed the empirical ROC curve estimator (2), the kernel ROC curve estimator (3) and the ROC curve estimator (8) proposed in this paper. In the problem of bandwidth selection for the kernel estimator, we have used the Normal-reference method proposed by Hall and Hyndmann (2003), which is recommended when the sampled distributions are not far from Normal. They found that in the context of the ROC curve estimation the method proposed gives substantial improvement in the mean integrated squared error (MISE) over other known methods of bandwidth selection.

The computations were carried out using procedures of the package Mathematica 8.0.

To compare the accuracy of the proposed estimator with that of the empirical ROC curve estimator and the kernel ROC curve estimator, we have estimated their MISE and integrated bias. We have also checked the accuracy of the considered methods in estimating the area under the curve (AUC) by calculating the mean squared error (MSE) of the areas under the considered estimators. The estimated values of all these error measures are reported in Tables 1, 2 and 3.
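A Monte Carlo estimate of the MISE of this kind can be sketched in Python as follows. The paper does not describe the numerical integration or the random number generation, so the midpoint rule, the grid size, the seed handling and all function names are assumptions made for illustration.

```python
import random

def integrated_squared_error(est, true_roc, grid_size=200):
    """Midpoint-rule approximation of the integral over [0, 1] of (est(t) - true_roc(t))^2."""
    h = 1.0 / grid_size
    total = 0.0
    for i in range(grid_size):
        t = (i + 0.5) * h
        total += (est(t) - true_roc(t)) ** 2
    return total * h

def estimate_mise(fit, true_roc, draw, reps=1000, seed=0):
    """Average the integrated squared error over `reps` simulated data sets.

    `fit(healthy, diseased)` returns an estimated ROC curve as a function of t;
    `draw(rng)` returns one simulated pair (healthy, diseased) of samples.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        healthy, diseased = draw(rng)
        total += integrated_squared_error(fit(healthy, diseased), true_roc)
    return total / reps
```

The integrated bias can be approximated in the same way by averaging the integral of the difference between the estimate and the true curve instead of its square.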

Table 1 Estimated values of MISE and percentage reduction in MISE from using the kernel ROC curve estimator \(\widetilde{\mathit{ROC}}_{m,n}\) and the proposed estimator \(\widehat{\mathit{ROC}}_{m,n}\), instead of the empirical ROC curve \(\mathit{ROC}_{m,n}\)
Table 2 Estimated values of the integrated bias of the kernel ROC curve estimator \(\widetilde{\mathit{ROC}}_{m,n}\) and the proposed estimator \(\widehat{\mathit{ROC}}_{m,n}\)
Table 3 Estimated values of MSE in the problem of AUC estimation and percentage reduction (+) or percentage increase (−) in MSE from using the kernel ROC curve estimator \(\widetilde{\mathit{ROC}}_{m,n}\) and the proposed estimator \(\widehat{\mathit{ROC}}_{m,n}\), instead of the empirical ROC curve \(\mathit{ROC}_{m,n}\)

Table 1 shows that there are considerable differences in the accuracy of the investigated estimators. Both \(\widetilde{\mathit{ROC}}_{m,n}\) and \(\widehat{\mathit{ROC}}_{m,n}\) are more accurate than the empirical ROC curve. In most cases the observed improvement is bigger for the kernel estimator, but it becomes substantial only when the total sample size m+n gets larger. Interestingly, the proposed estimator \(\widehat{\mathit{ROC}}_{m,n}\) performs better than the kernel estimator when both X and Y arise from logistic distributions. This is because the corresponding ROC curve lies closer to the upper left-hand corner of the graph, and in such cases the kernel estimators of the ROC curve may perform worse than the empirical ones.

Comparing the integrated bias of the estimators, we do not consider the empirical ROC curve. Note that the integrated bias of the empirical ROC curve is equal to the difference between the expectation of the area under the empirical ROC curve and the real value of AUC. This is obviously equal to zero as the area under the empirical ROC curve is equal to the Mann-Whitney U-statistic, which is an unbiased estimator of the AUC (see e.g. Bamber 1975). For this reason, in Table 2, we only present the estimated integrated bias of the estimators \(\widetilde{\mathit{ROC}}_{m,n}\) and \(\widehat{\mathit{ROC}}_{m,n}\). Both of them have a negative bias, which means that they tend to underestimate the real curve. However, the bias of the estimator proposed in this paper is not big and its absolute value is always much smaller than in the case of the kernel estimator.
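The Mann-Whitney form of the AUC estimate mentioned above can be computed directly from the two samples. The Python sketch below uses a hypothetical function name and counts ties as one half, which is a common convention and an assumption here; for continuous F and G ties occur with probability zero.

```python
def mann_whitney_auc(healthy, diseased):
    """Proportion of (healthy, diseased) pairs with x < y, counting ties as one half."""
    wins = 0.0
    for x in healthy:
        for y in diseased:
            if x < y:
                wins += 1.0
            elif x == y:
                wins += 0.5  # tie convention (assumption)
    return wins / (len(healthy) * len(diseased))

# 13 of the 16 pairs satisfy x < y.
auc = mann_whitney_auc([0, 1, 2, 3], [1.5, 2.5, 3.5, 4.5])
print(auc)  # 0.8125
```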

Results from Table 3 indicate that the proposed estimator, after integration, provides the most accurate estimator of the AUC (in the sense of MSE). The kernel estimator performs even worse than the empirical ROC curve, which is probably caused by the larger bias of this estimator. The estimator of the AUC obtained from the empirical ROC curve is unbiased, so its MSE is in fact equal to its variance.

4 Real data analysis

To illustrate our method, we apply it to a real data set from a clinical study performed from November 2008 to August 2011 by a research team led by Dr. Krzysztof Tupikowski from the Department of Urology and Oncological Urology, Wroclaw Medical University (article in press). The study investigated the effectiveness of combined treatment with interferon alpha and metronomic cyclophosphamide in patients with metastatic kidney cancer who were not eligible for tyrosine kinase inhibitor treatment and had various negative prognostic factors for survival. It was approved by an independent local bioethics committee. One of the secondary goals of the study was to assess whether there are any predictive factors for response to this novel combination treatment. Table 4 contains the presence (1) or absence (0) of clinical response (CR) observed at the 24th week of treatment, the hemoglobin level (HL) and the serum fibrinogen concentration (FC) of 31 patients treated per protocol. Missing data are denoted by x. Low HL has previously been associated with short survival and poor response to treatment in disseminated disease (Tonini et al. 2011). High FC is examined here for the first time as a negative predictor of response to treatment in metastatic kidney cancer patients.

Table 4 The real data in the form (CR, HL, FC)

The estimators of the ROC curves for HL (left) and FC (right) as the predictive factors (positive and negative, respectively) are plotted in Fig. 3.

Fig. 3 The fitted empirical ROC curve (dotted curve) and the smoothed nonparametric estimator of the ROC curve (dashed curve) for HL (left) and FC (right)

Figure 3 shows that in the case considered, where the sample sizes are small (m=17, n=14 for HL, and m=12, n=14 for FC), the proposed estimator seems to fit a (continuous) ROC curve better than the empirical ROC curve does.

The estimated values of the AUC, obtained using the empirical ROC curve estimator and the proposed estimator, are 0.710107 and 0.69524 for HL and 0.678643 and 0.680438 for FC, respectively, and they do not differ significantly.

5 Concluding remarks

In this article we have proposed a nonparametric estimator of the ROC curve. The new procedure is simple and yields a strongly consistent estimator of the unknown ROC curve; the strong consistency was established in Theorem 1. The simulation study shows that the proposed method of estimation has satisfactory finite sample performance. The new estimators of the ROC curve are more accurate (with respect to the MISE) than the empirical ROC curve and, on average, more accurate for small sample sizes than the kernel estimators. Applying the proposed method of estimating the ROC curve to the estimation of the AUC leads to more accurate (with respect to the MSE) estimators of the AUC than applying the empirical ROC curve in all cases considered, and to significantly more accurate estimators than applying the kernel estimators. Using the kernel estimators even leads to less accurate estimators of the AUC than using the empirical ROC curve. If the sample sizes are small and the distribution functions are completely unknown, we recommend using the proposed method of estimating the ROC curve. In future research, we are going to use the proposed estimator to construct a confidence region for the ROC curve in the case of small sample sizes.

In this paper we have applied a nonparametric approach to the problem of ROC estimation. Another approach, often used in ROC estimation, is to assume that some unknown transformation converts both populations to a specific form, for example normal, logistic or Weibull. Such an approach is often referred to as either a semiparametric or a parametric distribution-free approach. It was considered, for example, by Hsieh and Turnbull (1996), Zou and Hall (2000), Cai and Moskowitz (2004) and Davidov and Nov (2012). Hsieh and Turnbull (1996) considered the problem of estimating the ordinal dominance curve (ODC) given by \(F(G^{-1}(t))\), 0≤t≤1, which is the plot of SP(c) versus 1−SE(c), or equivalently F(c) versus G(c), for −∞≤c≤∞. They assumed that some unspecified monotonic transformation of the measurement scale simultaneously converts the F and G distributions to normal ones. In this case the ODC has the known parametric form

where Φ denotes the standard normal cumulative distribution function. They suggested estimating the parameters μ and σ by minimum distance estimators (MDEs) (see Wolfowitz 1957), that is, by the values that minimize the \(L_{2}\) distance between the empirical and the theoretical ODCs, namely the values that solve

(12)

They studied the asymptotic properties of the MDEs, but did not provide any concrete procedure to compute them. In a remark they proposed, as an object for future research, to modify (12) by applying a \(\varPhi^{-1}\) transformation to both the empirical ODC and the true ODC. Davidov and Nov (2012) followed up on this suggestion and studied in detail the estimators

They used 0<a<b<1 as the integration endpoints rather than 0 and 1, as \(F_{m}(G_{n}^{-1}(t))=0\), and hence \(\varPhi^{-1}(F_{m}(G_{n}^{-1}(t)))=-\infty\), for values of t too close to 0. A similar problem arises with values t too close to 1. In future research, we are going to compare the performance of the nonparametric estimator proposed in this paper with semiparametric estimators considered by Hsieh and Turnbull (1996) and Davidov and Nov (2012), and also with the semiparametric and rank-based estimator considered by Zou and Hall (2000), the maximum profile likelihood estimator and pseudo maximum likelihood estimator proposed by Cai and Moskowitz (2004).