Abstract
This paper examines the asymptotic properties of a binary response model estimator based on maximization of the area under the receiver operating characteristic curve (AUC). Under certain assumptions, AUC maximization is a consistent method for estimating binary response models, up to normalization. Because the AUC is equivalent to the Mann-Whitney U statistic and the Wilcoxon rank test statistic, maximizing the area under the ROC curve is equivalent to maximizing these statistics. Compared with parametric methods such as logit and probit, AUC maximization relaxes the assumptions about the error distribution but imposes some restrictions on the distribution of the explanatory variables. These restrictions can be easily checked, since the explanatory variables are observable.
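The equivalence between the AUC and the Mann-Whitney U statistic can be illustrated with a small sketch. The function names `empirical_auc` and `mann_whitney_u` and the toy data are illustrative assumptions, not part of the paper; ties are counted as one half, which is one common convention.

```python
def empirical_auc(scores, labels):
    """Share of (positive, negative) pairs in which the positive
    observation has the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mann_whitney_u(scores, labels):
    """U statistic of the positive class, computed from midranks."""
    # Midrank of s: (number of values below s) + (number equal to s + 1) / 2
    ranks = [
        sum(t < s for t in scores) + (sum(t == s for t in scores) + 1) / 2
        for s in scores
    ]
    n_pos = sum(1 for y in labels if y == 1)
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return rank_sum - n_pos * (n_pos + 1) / 2


# Hypothetical toy data: five scores, three positives, one tie at 0.4.
scores = [0.1, 0.4, 0.35, 0.8, 0.4]
labels = [0, 0, 1, 1, 1]

n_pos = sum(labels)
n_neg = len(labels) - n_pos
auc = empirical_auc(scores, labels)
u = mann_whitney_u(scores, labels)
print(auc, u / (n_pos * n_neg))
```

Dividing U by the number of positive-negative pairs gives the same value as the pairwise definition, so maximizing one over the parameter vector maximizes the other.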
Notes
The case \(k=1\) is degenerate: the parameter of interest is normalized to 1 or \(-1\).
Acknowledgments
I would like to thank the participants at the 12th Symposium of Mathematics and its Applications (2009) in Timisoara. Furthermore, I wish to thank Alfredas Račkauskas, Dmitrij Celov and Irena Mikolajun for their useful comments and Steve Guttenberg for his help with the English language.
Appendix
1.1 Proof of Lemma 2
Proof
Using the definition of conditional probability, the expression for \(AUC_\infty (b)\) in Eq. (4) can be rewritten:
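The rewriting step can be sketched as follows. The sketch assumes that Eq. (4) defines \(AUC_\infty (b)\) as the probability that a randomly drawn positive observation receives a higher index than a randomly drawn negative one; this definition is an assumption here, and the paper's own display may differ in notation.

```latex
\begin{align*}
AUC_\infty(b)
  &= P\big(b' X_1 > b' X_2 \,\big|\, Y_1 = 1,\; Y_2 = 0\big) \\
  &= \frac{P\big(b' X_1 > b' X_2,\; Y_1 = 1,\; Y_2 = 0\big)}
          {P\big(Y_1 = 1,\; Y_2 = 0\big)} \\
  &= \frac{P\big(Y_1 = 1,\; Y_2 = 0 \,\big|\, b' X_1 > b' X_2\big)\,
           P\big(b' X_1 > b' X_2\big)}
          {P\big(Y_1 = 1,\; Y_2 = 0\big)} .
\end{align*}
```

Under this reading, the denominator and the factor \(P(b^{\prime } X_1>b^{\prime }X_2)\) do not depend on which observation is labelled positive, which is why they can be collected into a single constant.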
For a randomly drawn pair \(X_1\), \(X_2\), the probability that the inequality \(b^{\prime } X_1>b^{\prime }X_2\) holds is constant and, by assumption 6, equal to 0.5. To keep the notation as simple as possible, this constant is absorbed into \(C\). Using this notation and the law of total probability, we find:
Equation (8) follows because the inner integral in Eq. (7) can be treated as a probability and \(Y_1\) and \(Y_2\) are independent events.
Next, we take the true parameter \(\beta \) and compare it with an arbitrary parameter \(\tilde{\beta }\). In Eq. (8), the parameter alters the range of integration and also affects \(P_\epsilon (Y_1=1\vert X_1)P_\epsilon (Y_2=0 \vert X_2)\), because \(b\) determines whether \(Y\) is treated as \(Y_1\) or \(Y_2\). Note that for an observation with explanatory factors \(X_r\), \(P(Y_r=0\vert X_r)=F_\epsilon (-b^{\prime } X_r)\) and \(P(Y_r=1\vert X_r)=1-F_\epsilon (-b^{\prime } X_r)\) for all \(X_r\), \(r=1,2\).
Consider \(X_1\) and \(X_2\) from \(D_X\). Without loss of generality, assume that \(\beta ^{\prime } X_1> \beta ^{\prime } X_2\). The observations are ranked correctly when \(Y_1=1\) and \(Y_2=0\). With parameter \(\beta \), the probability of ranking the observations correctly is \(A= P_\epsilon (Y_1=1\vert X_1)P_\epsilon (Y_2=0\vert X_2)= (1-F_\epsilon (-\beta ^{\prime } X_1))F_\epsilon (-\beta ^{\prime } X_2)\). Now consider another parameter, \(\tilde{\beta }\). When \(\tilde{\beta }^{\prime } X_1 > \tilde{\beta }^{\prime } X_2\), the probability of ranking the observations correctly remains the same, because \(\tilde{\beta }\) does not generate the data: the term \(1-F_\epsilon (-\beta ^{\prime } X_1)\) remains, because the true probability of \(Y_1=1\) is \(1-F_\epsilon (-\beta ^{\prime } X_1)\). The other case is \(\tilde{\beta }^{\prime } X_2 > \tilde{\beta }^{\prime } X_1\). Now the probability of correct ranking is \(\tilde{A}= P_\epsilon (Y_2=1\vert X_2)P_\epsilon (Y_1=0\vert X_1)= F_\epsilon (-\beta ^{\prime } X_1)(1-F_\epsilon (-\beta ^{\prime } X_2)) =F_\epsilon (-\beta ^{\prime } X_1)-F_\epsilon (-\beta ^{\prime } X_1)F_\epsilon (-\beta ^{\prime } X_2)\). Comparing this with \(A=F_\epsilon (-\beta ^{\prime } X_2)-F_\epsilon (-\beta ^{\prime } X_1)F_\epsilon (-\beta ^{\prime } X_2)\), the two expressions differ only in their first terms, and \(F_\epsilon (-\beta ^{\prime } X_2)\ge F_\epsilon (-\beta ^{\prime } X_1)\), because \(\beta ^{\prime } X_1> \beta ^{\prime } X_2\) and \(F_\epsilon \) is nondecreasing. Hence, \(A\ge \tilde{A}\).
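The comparison of \(A\) and \(\tilde{A}\) can be checked numerically. The sketch below assumes a logistic error distribution purely for illustration (the argument itself only needs \(F_\epsilon \) to be nondecreasing); `logistic_cdf` and `ranking_probabilities` are hypothetical helper names, and the index values stand for \(\beta ^{\prime } X_1\) and \(\beta ^{\prime } X_2\).

```python
import math


def logistic_cdf(t):
    """Logistic distribution function, used here as an example F_eps."""
    return 1.0 / (1.0 + math.exp(-t))


def ranking_probabilities(index_1, index_2, cdf=logistic_cdf):
    """Return (A, A_tilde) for index_r = beta' X_r.

    A       = P(Y_1 = 1 | X_1) * P(Y_2 = 0 | X_2)
    A_tilde = P(Y_2 = 1 | X_2) * P(Y_1 = 0 | X_1)
    """
    a = (1.0 - cdf(-index_1)) * cdf(-index_2)
    a_tilde = cdf(-index_1) * (1.0 - cdf(-index_2))
    return a, a_tilde


# Example with beta' X_1 = 1.0 > beta' X_2 = -0.5.
a, a_tilde = ranking_probabilities(1.0, -0.5)
gap = logistic_cdf(0.5) - logistic_cdf(-1.0)  # F(-beta'X_2) - F(-beta'X_1)
print(a, a_tilde, a - a_tilde, gap)
```

The difference \(A-\tilde{A}\) reduces to \(F_\epsilon (-\beta ^{\prime } X_2)-F_\epsilon (-\beta ^{\prime } X_1)\), so its sign depends only on the ordering of the two indices.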
The range of integration for \(AUC_\infty (\beta )\) is \(\beta ^{\prime }(X_1-X_2)>0\), whereas for a parameter \(\tilde{\beta }\) it is \(\tilde{\beta }^{\prime }(X_1-X_2)>0\). It may be the case that \(X\) is concentrated in the region \(\tilde{\beta }^{\prime }(X_1-X_2)>0\), with relatively few observations in \(\beta ^{\prime }(X_1-X_2)>0\). To ensure that this is not the case, it is assumed that \(X\) is drawn from a distribution that is symmetric around zero. The lines \(\beta ^{\prime }(X_1-X_2)=0\) and \(\tilde{\beta }^{\prime }(X_1-X_2)=0\) both pass through the origin in \(D_X\times D_X\) space; therefore, the symmetry of \(X\) around zero ensures that \(AUC_\infty (\beta )\ge AUC_\infty (\tilde{\beta })\). \(\square \)
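A small Monte Carlo experiment can illustrate the inequality \(AUC_\infty (\beta )\ge AUC_\infty (\tilde{\beta })\) in a finite sample. Everything specific in this sketch is an assumption made for illustration: standard normal explanatory variables (symmetric around zero), standard normal errors, the true parameter \((1, 1)\), the alternative \((1, -1)\), and the helper names.

```python
import random


def empirical_auc(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def simulate(beta, n, rng):
    """Draw n observations from Y = 1{beta'X + eps > 0}."""
    xs, ys = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in beta]  # symmetric around zero
        index = sum(b * v for b, v in zip(beta, x))
        eps = rng.gauss(0.0, 1.0)                # assumed error law
        xs.append(x)
        ys.append(1 if index + eps > 0 else 0)
    return xs, ys


def auc_at(b, xs, ys):
    """Empirical AUC when observations are ranked by the index b'X."""
    return empirical_auc([sum(bj * v for bj, v in zip(b, x)) for x in xs], ys)


rng = random.Random(0)
beta = [1.0, 1.0]          # hypothetical true parameter
beta_tilde = [1.0, -1.0]   # hypothetical alternative, not proportional to beta
xs, ys = simulate(beta, 400, rng)

auc_true = auc_at(beta, xs, ys)
auc_other = auc_at(beta_tilde, xs, ys)
print(auc_true, auc_other)
```

In this design the alternative index is uncorrelated with the true one, so its empirical AUC should lie near 0.5, well below the value obtained at the true parameter.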
1.2 Proof of Lemma 3
Proof
Suppose there exists a pair \(X_1\in D_X\), \(X_2 \in D_X\) such that \(\beta ^{\prime } X_1> \beta ^{\prime } X_2\) but \(\tilde{\beta }^{\prime } X_1< \tilde{\beta }^{\prime } X_2\). Then there exists a neighborhood \(\tilde{U}(X_1)\) of the point \(X_1\) such that, if \(X_1\) is replaced by any element \(\tilde{X}\) of \(\tilde{U}(X_1)\), the inequalities \(\beta ^{\prime } \tilde{X}> \beta ^{\prime } X_2\) and \(\tilde{\beta }^{\prime } \tilde{X}< \tilde{\beta }^{\prime } X_2\) still hold.
Define \(E_r\):
Now define the neighborhood \(\tilde{U}(X_1)\) of the point \(X_1\) as the set of all \(\tilde{X}\) whose components satisfy \(X_{1,r}-E_r \le \tilde{X}_r \le X_{1,r}+E_r\), \(r=1, \ldots , k\). In general, \(\tilde{U}(X_1)\) need not be a subset of \(D_X\).
The convergence in assumption 5 implies that there exists \(r<\infty \) such that \(U_r(X_1)\subset \tilde{U}(X_1)\). Hence, \(P_X( \tilde{\beta }^{\prime } \tilde{X}< \tilde{\beta }^{\prime } X_2 \;\text{and}\; \beta ^{\prime } \tilde{X}> \beta ^{\prime } X_2)>0\). \(\square \)
1.3 Proof of Lemma 4
Proof
By the previous lemma, if there exists a pair \(X_1\in D_X\), \(X_2 \in D_X\) such that \(\beta ^{\prime } X_1> \beta ^{\prime } X_2\) but \(\tilde{\beta }^{\prime } X_1< \tilde{\beta }^{\prime } X_2\), then these inequalities hold with nonzero probability. Together with assumption 4, this gives \(AUC_\infty (\beta )>AUC_\infty (\tilde{\beta })\) (see the proof of Lemma 2).
Suppose that \(\tilde{\beta }\) exists such that \(\tilde{\beta }^{\prime } X_1> \tilde{\beta }^{\prime } X_2\) holds whenever \(\beta ^{\prime } X_1> \beta ^{\prime } X_2\) holds. We assumed that the first element of \(X\) has a strictly increasing continuous distribution function. Take a pair \(X_1,X_2 \in D_X\), and consider a sequence \(\eta _r\), \(r=1,2,3,\ldots \), such that \(\eta _r<\beta ^{\prime }(X_{1}-X_{2})/\beta _1\), where \(\beta _1\) is the first element of the vector \(\beta \), and \(\lim _{r\rightarrow \infty }\eta _r=\beta ^{\prime }(X_{1}-X_{2})/\beta _1\). Thus the inequality \(\beta ^{\prime } X_{1}> \beta ^{\prime } X_{2}+\beta _1\eta _r\) is satisfied, and likewise the corresponding inequality with \(\tilde{\beta }\): \(\tilde{\beta }^{\prime } X_{1}> \tilde{\beta }^{\prime } X_{2}+\tilde{\beta }_1\eta _r\). As \(r\rightarrow \infty \), the last inequality may be rewritten as \((\tilde{\beta }^{\prime }- (\tilde{\beta }_1/\beta _1)\beta ^{\prime })(X_{1}-X_2)\ge 0\). Taking instead a sequence \(\eta _r>\beta ^{\prime }(X_{1}-X_{2})/\beta _1\) converging to \(\beta ^{\prime }(X_{1}-X_{2})/\beta _1\), we obtain the opposite inequality: \((\tilde{\beta }^{\prime }- (\tilde{\beta }_1/\beta _1)\beta ^{\prime })(X_{1}-X_2)\le 0\). Hence, \((\tilde{\beta }^{\prime }- (\tilde{\beta }_1/\beta _1)\beta ^{\prime })(X_{1}-X_2)= 0\). Therefore, \(\tilde{\beta }/\tilde{\beta }_1=\beta /\beta _1\): the coefficients \(\beta \) and \(\tilde{\beta }\) are proportional.
If \(\tilde{\beta }=c\beta \) for a \(c>0\), the proof that \(AUC_{\infty }(\tilde{\beta })=AUC_{\infty }(\beta )\) is trivial. It follows directly from the definition of the AUC.
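The invariance of the AUC to positive rescaling can be seen on a toy sample. The data, the parameter values and the helper names below are hypothetical; the scale factors 2 and 0.5 are chosen so that the rescaled indices are exact in floating-point arithmetic.

```python
def empirical_auc(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_at(b, xs, ys):
    """Empirical AUC when observations are ranked by the index b'X."""
    return empirical_auc([sum(bj * v for bj, v in zip(b, x)) for x in xs], ys)


def scaled(b, c):
    """Return the parameter vector c * b."""
    return [c * bj for bj in b]


# Hypothetical toy sample with two explanatory variables.
xs = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8], [-1.2, -0.4], [0.9, 1.1], [-0.7, 0.1]]
ys = [1, 1, 0, 0, 1, 0]
beta = [1.0, 0.5]

auc_base = auc_at(beta, xs, ys)
auc_double = auc_at(scaled(beta, 2.0), xs, ys)    # c = 2 > 0
auc_half = auc_at(scaled(beta, 0.5), xs, ys)      # c = 0.5 > 0
auc_flipped = auc_at(scaled(beta, -1.0), xs, ys)  # c = -1 < 0
print(auc_base, auc_double, auc_half, auc_flipped)
```

Multiplying by \(c>0\) preserves the ordering of the indices, so the AUC is unchanged. A negative factor reverses the ordering, and in the absence of ties the AUC becomes one minus its original value, which is why the sign restriction on \(c\) matters.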
It follows that \(AUC_\infty (\tilde{\beta })=AUC_\infty (\beta )\) is equivalent to \(\tilde{\beta }=c\beta \), where \(c\) is a positive constant. \(\square \)
1.4 Proof of Lemma 5
Proof
To show that \(AUC_\infty (b)\) is continuous in \(b\), it is sufficient to show that \(AUC_\infty (b+\Delta b)\rightarrow AUC_\infty (b)\) as \(\Delta b\rightarrow 0\). Rewrite Eq. (4) for \(b+\Delta b\):
A similar strategy can be applied to \(AUC_\infty (b)\):
Subtracting \(AUC_\infty (b)\) from \(AUC_\infty (b+\Delta b)\) we get:
The first term on the right-hand side of Eq. (13) may be rewritten as \(CP_{X,\epsilon }\big (b^{\prime }X_1= b^{\prime }X_2; Y_1=1, Y_2=0\big )\). It is equal to zero because of assumption 6. In the second term, the events \(b^{\prime }X_1\le b^{\prime } X_2\) and \(b^{\prime }X_1> b^{\prime }X_2\) are mutually exclusive, so the probability of their joint occurrence is also equal to zero. Therefore, \(AUC_\infty (b)\) is continuous in \(b\). \(\square \)
Cite this article
Fedotenkov, I. Consistency of the estimator of binary response models based on AUC maximization. Stat Methods Appl 22, 381–390 (2013). https://doi.org/10.1007/s10260-013-0229-4
DOI: https://doi.org/10.1007/s10260-013-0229-4