1 Introduction

The receiver operating characteristic (ROC) curve is used to describe the performance of a diagnostic test which, on the basis of some observable measurements, assigns individuals to one of two different groups. For definiteness, let us think of them as the groups of diseased and healthy patients. This medical terminology reflects the fact that, in practice, ROC curves are mainly used in medicine. However, their applications have recently been extended to many other fields, such as economics and data mining. More information about ROC curves and their possible applications can be found, for example, in Swets (1988), Pepe (2003) and Krzanowski and Hand (2009). For a given cutoff point \(c \in \mathbb {R}\), let an individual be classified as diseased if its test score is greater than c and as healthy otherwise. Suppose that the real random variables X and Y denote the test score in the groups of healthy and diseased individuals, respectively, and let \(F(x)=P(X\le x)\) and \(G(x)=P(Y\le x)\) be their continuous and strictly increasing distribution functions. The accuracy of the test is typically summarized by the sensitivity and specificity, given by \(SE(c)=1-G(c)\) and \(SP(c)=F(c)\), respectively. The ROC curve is a plot of SE(c) versus \(1-SP(c)\) for all possible cutoff values \(c \in \mathbb {R}\cup \{ -\infty , \infty \}\). Equivalently, it can be defined as

$$\begin{aligned} R(t)=1-G(F^{-1}(1-t)), \quad t\in [0,1]. \end{aligned}$$
(1)

Let \(\pmb {X}_m=(X_1,\ldots ,X_m)\) and \(\pmb {Y}_n=(Y_1,\ldots ,Y_n)\) be independent simple samples from healthy and diseased populations, respectively, and let \(F_m\) and \(G_n\) denote their empirical cumulative distribution functions. The most commonly used nonparametric estimator of R(t) is the empirical ROC curve, which is of the form

$$\begin{aligned} R_{m,n}(t)=1-G_n(F_m^{-1}(1-t)),\quad t \in [0,1]. \end{aligned}$$
(2)

Asymptotic properties of this estimator were studied by Hsieh and Turnbull (1996). Among other things, they proved that, under some basic assumptions on F and G, \(R_{m,n}(t)\) converges to the true ROC curve uniformly on [0, 1] with probability one.
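As a minimal sketch of definition (2), the empirical ROC curve can be computed directly from the two samples. The function name `empirical_roc` and the convention \(F_m^{-1}(p)=\inf \{x: F_m(x)\ge p\}\) (with value \(-\infty \) at \(p=0\)) are assumptions made here for illustration; other conventions for the empirical quantile are possible.

```python
import numpy as np

def empirical_roc(x, y, t):
    """Empirical ROC curve R_{m,n}(t) = 1 - G_n(F_m^{-1}(1 - t)).

    x: scores of healthy individuals, y: scores of diseased individuals,
    t: point(s) in [0, 1]. Returns an array of the same length as t.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    t = np.atleast_1d(np.asarray(t, dtype=float))
    m = x.size
    # Generalized inverse F_m^{-1}(p) = inf{x : F_m(x) >= p} is the
    # ceil(m p)-th order statistic; the small tolerance guards rounding.
    k = np.ceil(m * (1.0 - t) - 1e-12).astype(int)
    cutoff = np.where(k >= 1, x[np.clip(k - 1, 0, m - 1)], -np.inf)
    # G_n(cutoff) = fraction of diseased scores not exceeding the cutoff.
    g_n = np.searchsorted(y, cutoff, side="right") / y.size
    return 1.0 - g_n

roc_values = empirical_roc([0, 1, 2, 3], [1.5, 2.5, 3.5, 4.5], [0, 0.25, 0.5, 1])
```

The returned values form a nondecreasing step function of t, which is the feature that motivates the smoothed estimators discussed next.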

Although the empirical ROC curve is very simple and very popular, its obvious weakness is being a step function, while R(t) is continuous and smooth. One of the ways to obtain a continuous estimator of R(t) is to use the kernel smoothing method. Zou et al. (1997) proposed a nonparametric estimator of R(t) from kernel estimates for the density functions of F and G. Lloyd (1998), using kernel estimates directly for F and G, obtained a smooth ROC curve estimator given by

$$\begin{aligned} \widetilde{R}_{m,n}(t)=1-\widetilde{G}_n(\widetilde{F}_m^{-1}(1-t)),\quad t \in [0,1], \end{aligned}$$
(3)

where

$$\begin{aligned} \widetilde{F}_m(x)=\frac{1}{m}\sum _{j=1}^{m}\mathscr {Q}\left( \frac{x-X_j}{h_F}\right) , \quad \widetilde{G}_n(x)=\frac{1}{n}\sum _{j=1}^{n}\mathscr {Q}\left( \frac{x-Y_j}{h_G}\right) \end{aligned}$$

are kernel estimators of F and G with a kernel function \(Q\), its integral \(\mathscr {Q}(v)= \int _{- \infty }^{v}Q(z)dz\), and bandwidths \(h_F\) and \(h_G\). Lloyd and Zhou (1999) proved that estimator (3) has better asymptotic mean squared error (MSE) properties than the empirical ROC curve. Unfortunately, to the best of our knowledge, only pointwise, and not uniform, convergence to R(t) has been established for estimator (3). Moreover, the kernel ROC curve estimator is not invariant under monotone data transformations, which may be undesirable in some practical applications. The problem of transformation-invariant nonparametric estimation of the ROC curve is considered, e.g., in Du and Tang (2009) and Tang et al. (2010). Finally, estimator (3) involves two separate bandwidth parameters, so special care is required for bandwidth selection (Zhou and Harezlak 2002; Hall and Hyndmann 2003).

To overcome some of the mentioned drawbacks, different methods of smoothing the empirical ROC curve were proposed, including local linear smoothing (Peng and Zhou 2004), Bayesian bootstrap (Gu et al. 2008) and bandwidth-free smoothing of the empirical CDFs (Jokiel-Rokita and Pulit 2012). In this paper, instead of estimating the ROC curve as the composition of estimators of \(F^{-1}\) and G, we use the fact that for \(Z=1-F(Y)\),

$$\begin{aligned} P(Z\le t)=P(Y > F^{-1}(1-t))=1-G(F^{-1}(1-t))=R(t), \end{aligned}$$
(4)

and propose to estimate R(t) as the cumulative distribution function of Z. Clearly, without any knowledge of F, we need to obtain a predictor of the unobservable random sample \(\pmb {Z}_n=(1-F(Y_1),\ldots ,1-F(Y_n))\). The simplest way to do this is to replace the unknown distribution function F by any estimator \(\hat{F}\) of it. Based on the vector \(\pmb {\hat{Z}}_{n}=(1-\hat{F}(Y_1),\ldots ,1-\hat{F}(Y_n))\), we can directly estimate R(t) using the well-known method of kernel distribution function estimation.
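Identity (4) can be checked numerically. The sketch below assumes, purely for illustration, the binormal model \(X \sim \mathscr {N}(0,1)\), \(Y \sim \mathscr {N}(\mu ,1)\) with \(\mu =1\), for which \(R(t)=1-\varPhi (\varPhi ^{-1}(1-t)-\mu )\); the function names `true_roc` and `ecdf_of_z` are hypothetical. It compares the empirical distribution function of simulated values of \(Z=1-F(Y)\) with R(t).

```python
import random
from statistics import NormalDist

def true_roc(t, mu=1.0):
    """R(t) = 1 - G(F^{-1}(1 - t)) for F = N(0,1) and G = N(mu,1), 0 < t < 1."""
    std = NormalDist()
    return 1.0 - std.cdf(std.inv_cdf(1.0 - t) - mu)

def ecdf_of_z(t, n=100_000, mu=1.0, seed=0):
    """Monte Carlo estimate of P(Z <= t), where Z = 1 - F(Y) and Y ~ N(mu,1)."""
    rng = random.Random(seed)
    std = NormalDist()
    hits = sum(1.0 - std.cdf(rng.gauss(mu, 1.0)) <= t for _ in range(n))
    return hits / n

# The two numbers should agree up to Monte Carlo error.
gap = abs(ecdf_of_z(0.2) - true_roc(0.2))
```

In practice F is unknown, which is why the \(Z_i\) have to be replaced by the predictors \(1-\hat{F}(Y_i)\).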

In Sect. 2 we define a new kernel smoothing estimator of the ROC curve, which is invariant to nondecreasing data transformations and involves only one bandwidth parameter. We also show some asymptotic results, including a MSE comparison of the proposed estimator and the kernel-smoothed estimator proposed by Lloyd (1998). In Sect. 3 we propose a method of bandwidth selection. Section 4 contains results of simulation studies. Finally, in Sect. 5 we apply the proposed estimator to real data. All proofs are put in Appendices 1 and 2.

2 Main results

Let \(\pmb {X}_m=(X_1,\ldots ,X_m)\) and \(\pmb {Y}_n=(Y_1,\ldots ,Y_n)\) be independent simple samples from unknown distribution functions F and G with the same supports \(I_F=I_G\subseteq \mathbb {R}\) and with density functions f and g (with respect to Lebesgue measure), respectively. Let K be a continuous symmetric density function with support \([-1,1]\) and denote \(\mathscr {K}(x)=\int _{-\infty }^{x}K(y)dy\). Define the smooth ROC curve estimator as

$$\begin{aligned} \hat{R}_{m,n}(t)=\frac{1}{n}\sum _{i=1}^n\mathscr {K}\left( \frac{t-1+F_m(Y_i)}{h_n}\right) , \end{aligned}$$
(5)

where \(h_n>0\) is a bandwidth parameter and \(F_m\) denotes the empirical distribution function of the sample \(\pmb {X}_m\). For the estimator \(\hat{R}_{m,n}\) to have better asymptotic MSE properties than some other estimators, the kernel function K should satisfy certain conditions, e.g. \(\int _{-1}^{1}K''(x)dx<0\). Therefore, for simplicity, we assume that K is the Epanechnikov kernel \(K(x)=\frac{3}{4}(1-x^2)\mathbf {1}_{[-1,1]}(x)\).

If we consider the ROC curve as the distribution function of the random variable \(Z=1-F(Y)\) [see (4)], it is easily seen that \(\hat{R}_{m,n}(t)\) is the kernel distribution function estimator based on \(\hat{Z}_i=1-F_m(Y_i)\), which are estimators of \(Z_i=1-F(Y_i), i=1,2,\ldots ,n\).

Remark 1

If we apply the same nondecreasing transformation to samples \(\pmb {X}_m\) and \(\pmb {Y}_n\), the estimator given by (5) does not change, which means that \(\hat{R}_{m,n}(t)\) is transformation invariant. Therefore, without loss of generality, we can assume that \(I_F=I_G =\mathbb {R}\).
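Estimator (5) and the invariance stated in Remark 1 can be illustrated with a short sketch. It uses the closed form of the integrated Epanechnikov kernel, \(\mathscr {K}(v)=(2+3v-v^3)/4\) for \(|v|\le 1\), with \(\mathscr {K}(v)=0\) for \(v<-1\) and \(\mathscr {K}(v)=1\) for \(v>1\). The function names, the simulated samples, the fixed bandwidth \(h=0.2\) and the transformation \(x \mapsto e^x\) are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np

def epanechnikov_cdf(v):
    """Integrated Epanechnikov kernel, evaluated elementwise."""
    v = np.clip(np.asarray(v, dtype=float), -1.0, 1.0)
    return (2.0 + 3.0 * v - v**3) / 4.0

def smooth_roc(x, y, t, h):
    """Kernel estimator (5): mean over i of K((t - 1 + F_m(Y_i)) / h)."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.asarray(y, dtype=float)
    t = np.atleast_1d(np.asarray(t, dtype=float))
    # F_m(Y_i): fraction of healthy scores not exceeding Y_i.
    f_m_y = np.searchsorted(x, y, side="right") / x.size
    # One row per value of t, one column per diseased observation.
    u = (t[:, None] - 1.0 + f_m_y[None, :]) / h
    return epanechnikov_cdf(u).mean(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=20)   # healthy sample (simulated)
y = rng.normal(1.0, 1.0, size=20)   # diseased sample (simulated)
t = np.linspace(0.0, 1.0, 11)

r_hat = smooth_roc(x, y, t, h=0.2)
# Applying the same increasing map to both samples leaves F_m(Y_i) unchanged.
r_hat_exp = smooth_roc(np.exp(x), np.exp(y), t, h=0.2)
```

The estimator depends on the data only through the values \(F_m(Y_i)\), that is, through the ranks of the \(Y_i\) among the \(X_j\), which is why `r_hat` and `r_hat_exp` coincide.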

Theorem 1

Assume that \(R''(s)\) exists for s near \(t \in (0,1)\), that \(R''(s)\) is continuous at \(s=t\), and that \(h_n\rightarrow 0\). Then

$$\begin{aligned} \mathrm {Bias}\big (\hat{R}_{m,n}(t)\big ) =&\frac{R''(t)}{10}h_n^2 + O\left( \frac{1}{m} + \frac{1}{m^2h_n^2}\right) + o(h_n^2),\end{aligned}$$
(6)
$$\begin{aligned} \mathrm {Var}\big ( \hat{R}_{m,n}(t) \big ) =&\frac{R(t)\left( 1-R(t) \right) }{n} + \frac{t(1-t)[R'(t)]^2}{m} - \frac{\frac{9}{35}R'(t)\left[ \frac{m}{n}-R'(t) \right] h_n}{m} \nonumber \\&\quad - \frac{3 t^2(1-t)^2[R'(t)]^2}{m^2h_n^2} + \frac{\frac{15}{4} t^3(1-t)^3[R'(t)]^2}{m^3h_n^4}\nonumber \\&\quad + O\left( \frac{m+n}{m^2nh_n} + \frac{m+n}{m^3nh_n^3} + \frac{m+nh_n}{m^4nh_n^5} \right) + o\left( \frac{h_n}{m} \right) . \end{aligned}$$
(7)

Let \(k_n \in \mathbb {N}\) denote the minimal sample size for which the MSE of the empirical ROC curve is no greater than the MSE of estimator (3), based on a sample of size \(n\in \mathbb {N}\). Lloyd and Zhou (1999) showed that, under some assumptions on the kernel function and the bandwidths \(h_F\) and \(h_G\), the difference \(k_n-n\) is divergent to infinity and \(k_n-n\sim n \sqrt{h_F h_G}\). The proposed estimator \(\hat{R}_{m,n}\) has an analogous advantage, not only over the empirical ROC curve, but also over estimator (3) proposed by Lloyd (1998). Assume that

$$\begin{aligned} m=m(n)=n\lambda _n, \quad \lambda _n\rightarrow \lambda \in (0, \infty ), \end{aligned}$$
(M)

and denote

$$\begin{aligned} b_n(t)=\min \left\{ j \in \mathbb {N}: MSE\big ( \tilde{R}_{j}^{\star }(t) \big ) \le MSE\big ( \hat{R}_{n}(t) \big ) \right\} , \end{aligned}$$
(8)

where \(\tilde{R}_{n}^{\star }\) is the kernel ROC curve estimator given by (3), with the asymptotic optimal (in the sense of minimizing the MSE) bandwidths \(h_{F}^{\star }\) and \(h_{G}^{\star }\), which are \(O(m^{-1/3})\) and \(O(n^{-1/3})\), respectively (Lloyd and Zhou 1999; Hall and Hyndmann 2003). The asymptotic bias and variance of \(\widetilde{R}_{m,n}\) are given by

$$\begin{aligned} \mathrm {b}\left( \tilde{R}_{n}(t)\right)= & {} \frac{\alpha R''(t)}{2}\left[ h_F^2 f^2(F^{-1}(1-t)) + \frac{t(1-t)}{m}\right] \nonumber \\&\quad + \frac{1}{2}(h_F^2-h_G^2)g'(F^{-1}(1-t))+ o\left( h_F^2+\frac{1}{m}\right) ,\end{aligned}$$
(9)
$$\begin{aligned} \mathrm {Var}\left( \tilde{R}_{n}(t) \right)= & {} \frac{R(t)\left[ 1-R(t) \right] }{n}+\frac{t(1-t)[R'(t)]^2}{m} + O\left( \frac{h_G}{n}\right) , \end{aligned}$$
(10)

and were derived by Lloyd (1998). Therefore, using the condition (M), we get

$$\begin{aligned} \mathrm {MSE}\left( \tilde{R}_{n}^{\star }(t) \right) = \frac{R(t)\left[ 1-R(t) \right] }{n}+\frac{t(1-t)[R'(t)]^2}{\lambda _n n} + O\left( n^{-\frac{4}{3}}\right) . \end{aligned}$$
(11)

Theorem 2

Suppose that assumptions of Theorem 1 hold and the condition (M) is satisfied. Then, if \(nh_n^2\rightarrow \infty \) and \(nh_n^4\rightarrow 0\), we have

$$\begin{aligned} \lim _{n\rightarrow \infty }\frac{b_n(t)}{n}=1. \end{aligned}$$

Under the additional assumptions \(nh_n^3\rightarrow 0\) and \(\lambda -\lambda _n=O\left( n^{-1/3} \right) \), we get

$$\begin{aligned} \lim _{n\rightarrow \infty }h_n^2(b_n(t)-n)=\frac{3 t^2(1-t)^2[R'(t)]^2}{R(t)\left[ 1-R(t) \right] \lambda ^2+t(1-t)[R'(t)]^2\lambda }\ge 0, \end{aligned}$$

and the above limit is strictly positive if \(R'(t)>0\). Finally, if we assume that \(nh_n^2\rightarrow \delta \in (0,\infty )\) and \(\delta >\frac{5}{4\lambda }t(1-t)\), then

$$\begin{aligned} \lim _{n\rightarrow \infty }\frac{b_n(t)}{n}=\frac{\delta ^2}{\delta ^2- \frac{3t^2(1-t)^2[R'(t)]^2\left\{ \lambda \delta - \frac{5}{4}t(1-t)\right\} }{R(t)\left[ 1-R(t)\right] \lambda ^3+t(1-t)[R'(t)]^2\lambda ^2}}\ge 1, \end{aligned}$$

and the above limit is strictly greater than 1 if \(R'(t)>0\).

Remark 2

The MSE of the ROC curve estimator proposed by Peng and Zhou (2004), with an asymptotically optimal choice of bandwidth, has the same form as the MSE of the estimator proposed by Lloyd (1998), given by (11) (see Peng and Zhou (2004), Sect. 3). Therefore, Theorem 2 remains true if the estimator \(\tilde{R}_{n}^{\star }\) appearing in definition (8) of \(b_n(t)\) is replaced by the estimator of Peng and Zhou (2004).

3 Bandwidth selection

In this section we deal with the choice of the parameter \(h_n\) appearing in (5). For bandwidth selection in distribution function estimation, to the best of our knowledge, only two methods have been investigated: plug-in and cross-validation. The plug-in bandwidth choice was studied, e.g., by Altman and Leger (1995) and Polansky and Baker (2000). The least-squares cross-validation method was analyzed in Sarda (1993) and in Bowman et al. (1998). It seems that the idea presented in the last paper may be adapted to our problem. Bowman et al. proposed a method which minimizes the function

$$\begin{aligned} CV_0(h)=\frac{1}{n}\sum _{i=1}^{n}\int \left( I(x-x_i)-\widetilde{F}_{n,-i}(x,h) \right) ^2dx, \end{aligned}$$
(12)

where \(I(x-x_i)=1\) if \(x-x_i\ge 0\) and 0 otherwise, and \(\widetilde{F}_{n,-i}(x,h)\) denotes the kernel distribution function estimator constructed from the data with observation \(x_i\) omitted. Analogously, one can choose the bandwidth parameter \(h_n\) by minimizing the function

$$\begin{aligned} CV(h)=\frac{1}{n}\sum _{i=1}^{n}\int _{0}^{1}\left( I(t-1+F_m(y_i))-\hat{R}_{m,n,-i}(t) \right) ^2dt, \end{aligned}$$
(13)

where

$$\begin{aligned} \hat{R}_{m,n,-i}(t)=\frac{1}{n-1}\sum _{j \ne i}\mathscr {K}\left( \frac{t-1+F_m(y_j)}{h_n}\right) . \end{aligned}$$
(14)
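A rough numerical sketch of criterion (13) is given below. It assumes the Epanechnikov kernel, approximates the integral over [0, 1] by the trapezoidal rule on a uniform grid, and minimizes over a user-supplied list of candidate bandwidths. The function names `cv_score` and `select_bandwidth_cv`, the grid size and the candidate values are illustrative assumptions rather than the procedure used in the paper.

```python
import numpy as np

def epanechnikov_cdf(v):
    """Integrated Epanechnikov kernel, evaluated elementwise."""
    v = np.clip(np.asarray(v, dtype=float), -1.0, 1.0)
    return (2.0 + 3.0 * v - v**3) / 4.0

def cv_score(x, y, h, grid_size=401):
    """Approximate CV(h) from (13) with leave-one-out estimators (14)."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.asarray(y, dtype=float)
    n = y.size
    f_m_y = np.searchsorted(x, y, side="right") / x.size
    t = np.linspace(0.0, 1.0, grid_size)
    a = t[:, None] - 1.0 + f_m_y[None, :]          # shape (grid, n)
    kern = epanechnikov_cdf(a / h)
    # Column i holds the estimator built without observation y_i.
    loo = (kern.sum(axis=1, keepdims=True) - kern) / (n - 1)
    indicator = (a >= 0.0).astype(float)           # I(t - 1 + F_m(y_i))
    sq = (indicator - loo) ** 2
    # Trapezoidal rule in t for each i, then the average over i.
    dt = t[1] - t[0]
    integrals = ((sq[:-1] + sq[1:]) * 0.5 * dt).sum(axis=0)
    return float(integrals.mean())

def select_bandwidth_cv(x, y, candidates):
    """Return the candidate bandwidth with the smallest CV score."""
    scores = [cv_score(x, y, h) for h in candidates]
    return candidates[int(np.argmin(scores))]

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=20)
y = rng.normal(1.0, 1.0, size=20)
candidates = [0.05, 0.1, 0.2, 0.4]
h_cv = select_bandwidth_cv(x, y, candidates)
```

A grid search is used here only to keep the sketch short; any one-dimensional numerical minimizer could be substituted.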

This method of bandwidth selection works reasonably well and usually leads to an estimator \(\hat{R}_{m,n}\) with a smaller MSE than the empirical ROC curve. However, when the sample sizes are small, it is not very stable and often gives values of \(h_n\) that are too small or too large, which results in under- or oversmoothed estimated ROC curves, respectively. Moreover, the numerical minimization of the function CV, repeated many times, is time consuming.

For that reason we propose another method of choosing the parameter \(h_n\). From Theorem 2 it follows that for fixed \(t \in (0,1)\) and for \(nh_n^2 \rightarrow \delta \), where \(\frac{5}{4\lambda }t(1-t)<\delta <\infty \), we have

$$\begin{aligned} \lim _{n\rightarrow \infty }\frac{b_n}{n}=\frac{\delta ^2}{\delta ^2- \frac{3t^2(1-t)^2[R'(t)]^2\left\{ \lambda \delta - \frac{5}{4}t(1-t)\right\} }{R(t)\left[ 1-R(t)\right] \lambda ^3+t(1-t)[R'(t)]^2\lambda ^2}}:=\varPsi (\delta )\ge 1, \end{aligned}$$

and it is easy to check that the function \(\varPsi (\delta )\) is maximized for

$$\begin{aligned} \delta ^{\star }=\frac{5t(1-t)}{2\lambda }. \end{aligned}$$
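One way to carry out this check is sketched here under the assumption \(R'(t)>0\); the abbreviations a, C and \(\varphi \) are introduced only for this purpose. Put \(a=\frac{5}{4}t(1-t)\) and \(C=3t^2(1-t)^2[R'(t)]^2/\left\{ R(t)\left[ 1-R(t)\right] \lambda ^3+t(1-t)[R'(t)]^2\lambda ^2\right\} \). Then

$$\begin{aligned} \varPsi (\delta )=\frac{1}{1-C\varphi (\delta )}, \quad \varphi (\delta )=\frac{\lambda \delta -a}{\delta ^2}, \quad \varphi '(\delta )=\frac{2a-\lambda \delta }{\delta ^3}. \end{aligned}$$

Since \(\varPsi \) is increasing in \(\varphi \) as long as \(C\varphi <1\), and \(\varphi '\) is positive for \(\delta <2a/\lambda \) and negative for \(\delta >2a/\lambda \), both \(\varphi \) and \(\varPsi \) attain their maximum at \(\delta =2a/\lambda =5t(1-t)/(2\lambda )\). At this point \(C\varphi (\delta )=C\lambda ^2/(4a)\le 3/5\), so the denominator of \(\varPsi \) stays positive.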

Therefore, for fixed \(t \in (0,1)\), to maximize the asymptotic relative efficiency of \(\hat{R}_{n}(t)\) with respect to \(\tilde{R}_{n}^{\star }(t)\), the bandwidth parameter \(h_n\) should be selected in such a way that \(nh_n^2\rightarrow \delta ^{\star }\). Hence, we propose to choose the bandwidth parameter which depends on t and is of the form

$$\begin{aligned} h_n^{\star }(t)=c_n\sqrt{\frac{\delta ^{\star }}{n}}=\frac{c_n\sqrt{5t(1-t)}}{\sqrt{2n\lambda }}, \end{aligned}$$
(15)

where \(c_n\) is some sequence converging to 1. Note that our proposed method of bandwidth selection gives the smoothing parameter \(h_n^{\star }(t)\) which, in contrast to the optimal bandwidth(s) obtained by other methods relating to some other kernel ROC curve estimators (e.g. Lloyd and Zhou 1999; Hall and Hyndmann 2003; Peng and Zhou 2004), does not depend on the unknown distribution functions F and G. Therefore the parameter \(h_n^{\star }(t)\) is easy to compute. Moreover, \(h_n^{\star }(t)\) becomes small near the ends of the interval [0, 1], which results in a reduction of the bias of the proposed estimator, especially for t close to 0.
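Formula (15) is straightforward to evaluate. In the sketch below the function name `optimal_bandwidth` and the default \(c_n=1\) are assumptions made for illustration; any sequence \(c_n\) converging to 1 may be passed instead.

```python
import math

def optimal_bandwidth(t, n, lam=1.0, c_n=1.0):
    """Bandwidth (15): h*(t) = c_n * sqrt(5 t (1 - t) / (2 n lam)).

    t: point in [0, 1]; n: size of the diseased sample;
    lam: ratio m / n of the sample sizes; c_n: correction factor.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if n <= 0 or lam <= 0:
        raise ValueError("n and lam must be positive")
    return c_n * math.sqrt(5.0 * t * (1.0 - t) / (2.0 * n * lam))

h_mid = optimal_bandwidth(0.5, n=20)   # sqrt(1/32), about 0.177
h_end = optimal_bandwidth(0.0, n=20)   # the bandwidth vanishes at the endpoints
```

Because \(h_n^{\star }(t)\) equals 0 at \(t=0\) and \(t=1\), an implementation of estimator (5) with this bandwidth has to treat the endpoints separately; how to do so is not specified here.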

4 Simulation study

A small simulation study was performed to investigate the efficiency of the proposed estimator of the ROC curve for limited sample sizes. We considered four different combinations of the distribution functions. In the first two studies, both F and G belong to the same family of distributions, normal or logistic. The parameters are selected so that the resulting ROC curves have similar shapes (see Fig. 1). In the other two studies, F and G are different: if one is normal, the other is logistic. In this case the corresponding ROC curves are also completely different. In the simulations we used 1000 samples of equal sizes \(m=n=20,50\). For each of the considered ROC curves and sample sizes, we computed the empirical ROC curve (EM), the smoothed empirical ROC curve (SEM) of Jokiel-Rokita and Pulit (2012), based on smoothed empirical CDFs, the Bayesian bootstrap estimator (BB) proposed by Gu et al. (2008), the local linear smoothing estimator (LLS) of Peng and Zhou (2004), Lloyd's kernel-smoothed estimator (KS) and the new kernel smoothing estimator (NKS) proposed in this paper.

Fig. 1 The ROC curves corresponding to: (1) \(X \sim \mathscr {N}(0,1), Y \sim \mathscr {N}(1,1)\) (solid curve); (2) \(X \sim \mathscr {LG}(0,2), Y \sim \mathscr {LG}(3,2)\) (dotted curve); (3) \(X \sim \mathscr {LG}(0,1), Y \sim \mathscr {N}(2.5,9)\) (dashed curve); (4) \(X \sim \mathscr {N}(0,9), Y \sim \mathscr {LG}(2.5,1)\) (dot-dashed curve)

Fig. 2 The graphs of estimated MSE on the unit interval for the sample sizes \(m=n=20\) and for data from: (a) \(X \sim \mathscr {N}(0,1), Y \sim \mathscr {N}(1,1)\); (b) \(X \sim \mathscr {LG}(0,2), Y \sim \mathscr {LG}(3,2)\); (c) \(X \sim \mathscr {LG}(0,1), Y \sim \mathscr {N}(2.5,9)\); (d) \(X \sim \mathscr {N}(0,9), Y \sim \mathscr {LG}(2.5,1)\)

Fig. 3 The graphs of estimated MSE on the unit interval for the sample sizes \(m=n=50\) and for data from: (a) \(X \sim \mathscr {N}(0,1), Y \sim \mathscr {N}(1,1)\); (b) \(X \sim \mathscr {LG}(0,2), Y \sim \mathscr {LG}(3,2)\); (c) \(X \sim \mathscr {LG}(0,1), Y \sim \mathscr {N}(2.5,9)\); (d) \(X \sim \mathscr {N}(0,9), Y \sim \mathscr {LG}(2.5,1)\)

Although the choice of the sequence \(c_n\) appearing in (15) does not affect the asymptotic behavior of the estimator \(\hat{R}_{m,n}\), for limited sample sizes the best results are achieved when \(c_n\) lies roughly between 1.5 and 2.5, depending on the estimated ROC curve and the value of n. In the simulation study, when choosing \(h_n^{\star }(t)\) for our estimator, we took, for simplicity, \(c_n=1+1.8n^{-1/5}\) in all the considered cases. For bandwidth selection for the kernel estimator (KS), we used the normal-reference method proposed by Hall and Hyndmann (2003), which is recommended when the sampled distributions are not far from normal. The authors found that, in the context of ROC curve estimation, their method gives a substantial improvement in the mean integrated squared error over other known methods of bandwidth selection. Finally, in the case of the local linear smoothing estimator (LLS), we chose the smoothing parameter which minimizes the mean trimmed integrated squared error, assuming knowledge of the distribution functions F and G [see Peng and Zhou (2004), Sect. 3].

Figures 2 and 3 display the results of the simulations for the sample sizes \(m=n=20\) and \(m=n=50\), respectively. Each figure contains four plots corresponding to the four different ROC curves to be estimated (see Fig. 1), and each plot compares the considered ROC curve estimators in terms of their mean squared error (MSE) on the unit interval. The obtained results indicate that the proposed estimator (NKS) is competitive with the other estimators, also for limited sample sizes. In estimating the ROC curve, it performs better than the empirical ROC curve (EM), the smoothed empirical ROC curve (SEM) and the Bayesian bootstrap estimator (BB). In some cases it is also better than the two other estimators, (KS) and (LLS).

Supplementary materials to the paper, containing some box-plots comparing the accuracy of the estimators in terms of MSE when estimating AUC, are available at https://drive.google.com/file/d/0B3L4pdDwuWxvT0RRbmUzWGtaa28/view?pli=1.

5 Real data analysis

To illustrate our method, we apply it to a set of real data from a clinical study performed from November 2008 to August 2011 by a research team led by Dr. Krzysztof Tupikowski from the Department of Urology and Oncological Urology, Wroclaw Medical University.

The study investigated the efficacy of combined treatment with interferon alpha and metronomic cyclophosphamide in patients with metastatic kidney cancer and various negative prognostic factors for survival who were not eligible for tyrosine kinase inhibitor treatment. The study was approved by an independent local bioethics committee. One of its secondary goals was to assess whether there are any predictive factors for response to this novel combination treatment.

Table 1 contains the presence (1) or absence (0) of clinical response (CR) observed at the 24th week of treatment, the hemoglobin level (HL) and the serum fibrinogen concentration (FC) of 31 patients treated per protocol. Missing data are denoted by x. Low HL has previously been associated with short survival and poor response to treatment in disseminated disease (Tonini et al. 2011). High FC is examined here for the first time as a negative predictor of response to treatment in metastatic kidney cancer patients.

Table 1 The real data in the form (CR, HL, FC)

The estimators of the ROC curves for HL (left) and FC (right) as the predictive factors (positive and negative, respectively) are plotted in Fig. 4.

Fig. 4 The fitted empirical ROC curve (dotted curve) and the proposed estimator of the ROC curve (solid curve) for HL (a) and FC (b)