Generalized Gini Correlation and its Application in Data-Mining

Data Mining and Knowledge Discovery

Abstract

An asymmetric correlation measure commonly used in social economics, called the Gini correlation, is defined between a numerical response and a rank. We generalize the definition of this correlation so that it can be applied to data mining. The new definition, called the generalized Gini correlation, is found to include special cases that are equivalent to common evaluation measures used in data mining, for example, the LIFT measures for a binary response and the expected profit measure for a monetary response. We consider estimation and inference regarding this generalized Gini correlation. The asymptotic distribution of the estimated correlation is derived with the help of some empirical process theory. We consider several ways of constructing confidence intervals and demonstrate their performance numerically. Our paper is interdisciplinary and makes contributions to both the Gini literature and the literature of statistical inference of performance measures in data mining.

Notes

  1. In general, a conditional expectation E(Z|A) of a random variable Z conditional on an event A is defined to be the expectation under the conditional distribution P(Z|A), which can be computed by \(E(Z|A)=E(ZI(A))/P(A)\), where the indicator I(A) is 1 if A happens and 0 if not. See the Wikipedia entry under “conditional expectation”. https://en.wikipedia.org/wiki/Conditional_expectation.

  2. Note that \(E(Z|F(S)\ge c) = E(ZI[F(S)\ge c])/(1-c)= cov( Z, I[F(S)\ge c])/(1-c) + EZ.\)

  3. Note that \(cov(Z,G(F_Z(Z))) \) does not depend on S and typically \( cov(Z,G(F_Z(Z)))>0\) (see the proof of Proposition 1). Then \(E(Z|F(S)\ge c) \sim E(ZI[F(S)\ge c]) \sim cov( Z, I[F(S)\ge c]) \sim \Gamma _G(Z,S)\).

  4. Note that F(S) is Uniform[0,1] and EG(F(S)) does not depend on S, so \(\Gamma _G(Z,S) \sim cov( Z, G(F(S))) \sim E Z G(F(S))=\int _0^1 E(ZI[F(S)\ge c]) dG(c).\)

  5. This is also equivalent to the accuracy ratio for a cumulative accuracy profile in credit rating. See, e.g., Tasche (2006).

  6. Related graphs in the economics literature include the Absolute Concentration Curve (ACC) and LMA; see Yitzhaki and Schechtman (2012b), who use ACC and LMA to determine whether a monotonic increasing transformation can change the sign of the regression coefficient.

  7. As a referee commented, part of this lemma follows from an observation in Kalkbrener (2005). On page 434, Kalkbrener concludes that a certain functional is a “diversifying capital allocation” with respect to Expected Shortfall. For continuous distributions, this observation implies the \(\ge \) inequality of our Lemma 1. However, our Lemma 1 also includes an additional result on a strict inequality. For completeness, we have included a proof in the on-line supplementary material (Gao et al. 2015).

  8. As pointed out in Sect. 2.4 of Yitzhaki and Schechtman (2012a), inconsistencies may be introduced by discreteness of random variables, and some of the properties described in this paper may no longer hold. In practice, when necessary, one can apply adjustments for discrete variables by adding a small random error to separate ties. An example of such treatment can be found in Sect. 8.

  9. This condition may not be necessary. We believe that by showing Hadamard differentiability and using the functional delta method, nonsmooth G can also be included. Simulation results later will include nonsmooth G and will support this conjecture.

  10. Data can be downloaded at https://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html.

  11. A two-step model is used in GainSmarts, winner of the KDD-98 Cup. See, e.g., http://www.kdnuggets.com/meetings/kdd98/gain-kddcup98-release.html.

  12. We thank Professor Shlomo Yitzhaki for pointing out this possibility and providing insightful comments.
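The identities in notes 1 and 2 can be checked numerically. The following is a minimal Python sketch, not taken from the paper: the simulated data, the helper name `top_fraction_mean`, and the cutoff c = 0.8 are illustrative assumptions. It replaces F by the empirical rank and 1 − c by the observed selection fraction, in which case the identity of note 2 holds exactly in the sample.

```python
import numpy as np

def top_fraction_mean(z, s, c):
    """Mean of z over observations whose empirical cdf value F_hat(s) is >= c.

    Hypothetical helper for illustration; assumes no ties in s.
    """
    n = len(s)
    ranks = np.argsort(np.argsort(s)) + 1      # ranks 1..n
    selected = ranks / n >= c                  # empirical version of F(S) >= c
    return z[selected].mean(), selected

rng = np.random.default_rng(0)
n = 1000
s = rng.normal(size=n)                         # score (assumed data)
z = 0.5 * s + rng.normal(size=n)               # response (assumed data)

c = 0.8
lhs, selected = top_fraction_mean(z, s, c)     # sample analogue of E(Z | F(S) >= c)

ind = selected.astype(float)                   # indicator I[F(S) >= c]
p = ind.mean()                                 # sample analogue of 1 - c
cov_z_ind = np.mean(z * ind) - z.mean() * p    # cov(Z, I[F(S) >= c])
rhs = cov_z_ind / p + z.mean()                 # right-hand side of note 2
```

Here `lhs` and `rhs` agree up to floating-point rounding, which mirrors the step from the conditional mean (a LIFT-type quantity) to a covariance with an indicator of the rank.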

References

  • Bekkar M, Djemaa HK, Alitouche TA (2013) Evaluation measures for models assessment over imbalanced data sets. J Inf Eng Appl 3(10):27–38

  • Dong Y, Li X, Li J, Zhao H (2012) Analysis on weighted AUC for imbalanced data learning through isometrics. J Computat Inf Syst 8(1):371–378

  • Gao Y (2016) On a generalization of the Gini correlation for statistical data mining. Ph.D. Dissertation, Department of Statistics, Northwestern University (in preparation)

  • Gao Y, Jiang W, Tanner MA (2015) Supplementary materials for “Generalized Gini Correlation and its Application in Data-Mining”, Technical Report, Department of Statistics, Northwestern University, http://faculty.wcas.northwestern.edu/~wji047/documents/ggcsupp1

  • Jiang W, Zhao Y (2015) On asymptotic distributions and confidence intervals for lift measures in data mining. J Am Stat Assoc (accepted)

  • Kalkbrener M (2005) An axiomatic approach to capital allocation. Math Financ 15(3):425–437

  • Schechtman E, Yitzhaki S (1987) A measure of association based on Gini’s mean difference. Commun Stat Theory Methods 16(1):207–231

  • Schechtman E, Yitzhaki S (2003) A family of correlation coefficients based on the extended Gini index. J Econ Inequal 1(2):129–146

  • Tasche D (2006) Validation of internal rating systems and PD estimates. Anal Risk Model Valid 28:169–196

  • Van der Vaart AW (2000) Asymptotic statistics. Cambridge University Press, New York

  • Van der Vaart AW, Wellner JA (1996) Weak convergence and empirical processes. Springer, New York

  • Walter SD (2005) The partial area under the summary ROC curve. Stat Med 24(13):2025–2040

  • Yitzhaki S, Schechtman E (2005) The properties of the extended Gini measures of variability and inequality. METRON 63(3):401–433

  • Yitzhaki S, Schechtman E (2012a) The Gini methodology: a primer on a statistical methodology. Springer, New York

  • Yitzhaki S, Schechtman E (2012b) Identifying monotonic and non-monotonic relationships. Econ Lett 116(1):23–25

  • Zhao Y (2012) R and data mining: examples and case studies. Academic Press, Elsevier

Acknowledgments

We thank Professor Shlomo Yitzhaki and Professor Edna Schechtman for their comments and suggestions on a previous draft of this paper. We also thank Professor Hongmei Jiang for introducing to us the literature on Gini correlation. Finally, we thank the editor and referees for their comments and suggestions which greatly improved this paper.

Author information

Correspondence to Yi Gao.

Additional information

Responsible editor: Johannes Fürnkranz.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 167 KB)

Appendices

Appendix A: Proof of Proposition 2

  1. The upper bound is due to Proposition 1. The lower bound is due to the counterexamples given in Yitzhaki and Schechtman (2005) in the context of the Extended Gini correlation for \(\nu >1\), which is a special example of a GG correlation as commented in Sect. 2.

  2. If S is a monotone increasing function of Z, then \(F_S(S)=F_Z(Z)\). Hence the ratio is +1.

  3. When S and Z are independent, the corresponding cumulative gain curve is simply the random selection curve. By the proof of Proposition 1, we know that the improvement of S is 0. Hence the generalized Gini correlation is also 0.

  4. The first part is the result of P4 in Sect. 2 of Schechtman and Yitzhaki (1987). For general G, by the definition of \(\Gamma _G\), we know that \(\Gamma _G(Z,S)=-\Gamma _G(-Z,S)\) only when \(cov(Z,G(F_Z(Z)))\) is invariant under replacement of Z by \(-Z\). However, by Yitzhaki and Schechtman (2005, Eq. (2.6)), this invariance fails when G is the cdf corresponding to certain Extended Gini correlations: \(G(c)=1-(1-c)^{\nu -1}, \nu \ne 2, \nu >1\).

  5. Let \(S'=f(S)\) be a strictly increasing transformation of S. Then \(F_{S'}(s')=P(f(S)\le f(s))=P(S\le s)=F_S(s)\) for all s, where \(s'=f(s)\). Therefore \(Cov(Z,G(F_{S'}(S')))=Cov(Z,G(F_S(S)))\), and the generalized Gini correlation is unchanged by an increasing transformation of S.

  6. This is based on the property of covariance used in the definition of generalized Gini correlation.

  7. By definition,

    $$\begin{aligned} \Gamma _G(Z,S)=\frac{Cov(Z,G(F_S))}{Cov(Z,G(F_Z))}=\frac{E(Z\cdot G(F_S))-EZ\cdot E(G(F_S))}{E(Z\cdot G(F_Z))-EZ\cdot E(G(F_Z))}. \end{aligned}$$

    Without loss of generality, we assume \((Z,S)\buildrel d \over = (S,Z)\). It is obvious that \(E(Z\cdot G(F_Z))-EZ\cdot E(G(F_Z))=E(S\cdot G(F_S))-ES\cdot E(G(F_S))\). For the numerator:

    $$\begin{aligned} Cov(Z,G(F_S))= & {} \int _{\mathcal {Z}\times \mathcal {S}} z \cdot G(F_S(s)) d F_{Z,S}(z,s)-EZ \cdot E(G(F_S))\\= & {} \int _{\mathcal {S}\times \mathcal {Z}} z \cdot G(F_Z(s)) d F_{S,Z}(z,s)-ES \cdot E(G(F_Z))\\= & {} \int _{\mathcal {S}\times \mathcal {Z}} s \cdot G(F_Z(z)) d F_{S,Z}(s,z)-ES \cdot E(G(F_Z))\\= & {} Cov(S,G(F_Z)). \end{aligned}$$

    Hence we have the desired result.

  8. By properties 5 and 6, if \((Z,S)\sim \mathcal {N}\left( \left( \begin{array}{c} \mu _1\\ \mu _2 \end{array}\right) ,\left( \begin{array}{cc} \sigma _1^2&{}\rho \sigma _1\sigma _2\\ \rho \sigma _1\sigma _2&{}\sigma _2^2 \end{array}\right) \right) \), we can always find \((Z',S')\) such that \((Z',S')\sim \mathcal {N}\left( \left( \begin{array}{c} 0\\ 0 \end{array}\right) ,\left( \begin{array}{cc} 1&{}\rho \\ \rho &{}1 \end{array}\right) \right) \) and \(\Gamma _G(Z,S)=\Gamma _G(Z',S')\). We know that \(E(Z'|S'=s')=\rho s'\). Recall that the cumulative gain has the form:

    $$\begin{aligned} C_{Z',S'}(1-c)=\int _c^1E(Z'|F_{S'}(S')=u)du=\rho \int _c^1 F^{-1}_{S'}(u)du, \end{aligned}$$

    for any \(c\in (0,1)\). We can see that \(\forall p\in (0,1), C_{Z',S'}(p)=\rho C_{Z',Z'}(p)\) and \(C_{Z',S_0}(p)=0\). We can relate the GG correlation to the cumulative gain using the relation in Proposition 1. Therefore,

    $$\begin{aligned} \Gamma _G(Z,S)= & {} \Gamma _G(Z',S')\\= & {} \frac{\int _0^1 [C_{Z',S'}(1-c)-C_{Z',S_0}(1-c)] d G(c)}{\int _0^1 [C_{Z',Z'}(1-c)-C_{Z',S_0}(1-c)] d G(c)}\\= & {} \frac{\int _0^1[\rho C_{Z',Z'}(1-c)-0] d G(c)}{\int _0^1 [C_{Z',Z'}(1-c)-0 ] d G(c)}=\rho . \end{aligned}$$

    \(\square \)
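Several of the properties proved above can be illustrated by simulation. The sketch below is an assumption-laden illustration rather than the authors' code: the helper names `ecdf_values` and `gg_correlation`, the sample size, and the choice ρ = 0.6 are all hypothetical. It uses a plug-in estimate that replaces each cdf by the empirical rank divided by n, with G(c) = c (the Gini case) and G(c) = 1 − (1 − c)^(ν − 1) with ν = 3 (an Extended Gini case).

```python
import numpy as np

def ecdf_values(x):
    """Empirical cdf at each sample point (rank / n); assumes no ties."""
    return (np.argsort(np.argsort(x)) + 1) / len(x)

def gg_correlation(z, s, G=lambda u: u):
    """Plug-in estimate of cov(Z, G(F_S(S))) / cov(Z, G(F_Z(Z)))."""
    num = np.cov(z, G(ecdf_values(s)))[0, 1]
    den = np.cov(z, G(ecdf_values(z)))[0, 1]
    return num / den

rng = np.random.default_rng(1)
n, rho = 100_000, 0.6

# Bivariate normal pair with correlation rho (property 8).
z = rng.normal(size=n)
s = rho * z + np.sqrt(1 - rho**2) * rng.normal(size=n)

G_ext = lambda u: 1 - (1 - u) ** 2            # Extended Gini weight, nu = 3

gamma_normal = gg_correlation(z, s)           # close to rho
gamma_normal_ext = gg_correlation(z, s, G_ext)  # also close to rho

# Strictly increasing transformation of S leaves the ranks unchanged (property 5).
gamma_transformed = gg_correlation(z, np.exp(s))

# S a monotone increasing function of Z gives +1 (property 2).
gamma_self = gg_correlation(z, 2 * z + 1)

# Independent S gives a value near 0 (property 3).
gamma_indep = gg_correlation(z, rng.normal(size=n))
```

Because the estimate depends on S only through its ranks, `gamma_transformed` equals `gamma_normal` exactly, while the values for the normal and independent cases match ρ and 0 only up to sampling error.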

Appendix B: Proof of Proposition 3

We will mainly use the following lemmas to obtain the asymptotic distribution of the sample GG correlation. Here Z and S are continuous random variables with bounded density functions and E(Z|S) is continuous in S.

Lemma 2

$$\begin{aligned} \sqrt{n}(\hat{E}- E) (Z\cdot G(\hat{F}_S)-Z\cdot G(F_S))= & {} o_p(1)\\ \sqrt{n}(\hat{E}- E) (Z\cdot G(\hat{F}_Z)-Z\cdot G(F_Z))= & {} o_p(1). \end{aligned}$$

Proof of Lemma 2

The first statement can be shown by Lemma 19.24 of Van der Vaart (2000). The conditions to be checked are:

  1. \(E( |Z\cdot G(\hat{F}_{S,n})- Z\cdot G(F_S)|^2)\overset{p}{\rightarrow }0\).

  2. The set of functions \(\mathcal {F}=\{f(z,u)=z\cdot G(u): G \text { is a distribution function}\}\) is Donsker with respect to the distribution of \(F_{Z,U}\).

For the first condition, note that \(E( |Z\cdot G(\hat{F}_{S,n})- Z\cdot G(F_S)|^2) \le |ZG'|_\infty ^2 |\hat{F}_{S,n} -F_S|_\infty ^2 \overset{p}{\rightarrow }0\).

Now we prove the second condition. Since both G and \(F_S\) are distribution functions, \(G(F_S(\cdot ))\) is also a distribution function. By Theorem 2.7.5 of Van der Vaart and Wellner (1996), the family of distribution functions \(\mathcal G\) is Q-Donsker for every probability measure Q:

$$\begin{aligned} \log N_{[\, ]}(\epsilon ,\mathcal G, L_2(Q))\le K\cdot \frac{1}{\epsilon }. \end{aligned}$$

Here \(N_{[\, ]}\) represents the bracketing number described in Definition 2.1.6 of Van der Vaart and Wellner (1996). We are able to find bracketing functions \([l_i,u_i]\) covering all \(G(F_S(\cdot ))\) with total number satisfying the above Donsker condition. Suppose a function \(G(F_S)\) is covered by \([l,u]\); we construct a new pair of bracketing functions for \(z\cdot G(F_S)\):

$$\begin{aligned} l^*=\left\{ \begin{array}{ll} z\cdot l, &{} \text {if } z\ge 0\\ z\cdot u, &{} \text {if } z<0\\ \end{array} \right. , \quad u^*=\left\{ \begin{array}{ll} z\cdot u, &{} \text {if } z\ge 0\\ z\cdot l, &{} \text {if } z <0\\ \end{array} \right. . \end{aligned}$$

It is easy to see \(l^*\le z G(F_S)\le u^*\) and \(\int _{-\infty }^{+\infty }(u^*-l^*)^2dF_{Z,S}\le |Z|_\infty ^2 \int _{-\infty }^{+\infty }(u-l)dF_S\le |Z|_\infty ^2\epsilon \). For each pair of functions \([l_i,u_i]\), a pair \([l_i^*,u_i^*]\) can be constructed for \(z\cdot G(F_S(s))\). The bracketing number for the set of functions \(\mathcal F=\{f(z,s): f=z\cdot G(F_S(s))\}\) satisfies:

$$\begin{aligned} \log N_{[\, ]}(\epsilon ,\mathcal F, L_2(Q))\le K\cdot \frac{|Z|_\infty ^2}{\epsilon }=K^* \cdot \frac{1}{\epsilon }. \end{aligned}$$

Therefore

$$\begin{aligned} J_{[\, ]}(1,\mathcal F,L_2(Q))=\int _0^1 \sqrt{\log N_{[\, ]}(\epsilon ,\mathcal F, L_2(Q))} d\epsilon < \infty . \end{aligned}$$

The class of functions \(\mathcal F\) is Donsker, and thereby Condition (2) is satisfied. Here \(J_{[\, ]}\) is the bracketing integral and we applied Theorem 19.5 of Van der Vaart (2000).

The second statement is a special case of the first statement where \(S=Z\). There is no condition specifying the relationship between the two random variables for the first statement, so the second part is automatically proved. \(\square \)

Lemma 3

$$\begin{aligned} \hat{E}(G(\hat{F}_S))- E(G(F_S))= & {} O\left( \frac{1}{n}\right) \\ \hat{E}(G(\hat{F}_Z))- E(G(F_Z))= & {} O\left( \frac{1}{n}\right) \end{aligned}$$

Proof of Lemma 3

We prove only the first statement, since the second one is entirely similar. According to the definition of empirical distribution and expectation:

$$\begin{aligned} \hat{E}(G(\hat{F}_S))= & {} \frac{1}{n}\sum _{i=1}^n G(\hat{F}_S(S_i))\\= & {} \frac{1}{n} \sum _{i=1}^n G\left( \frac{1}{n}\sum _{j=1}^n I(S_j\le S_i)\right) \\= & {} \frac{1}{n}\sum _{i=1}^n G\left( \frac{i}{n}\right) . \end{aligned}$$

Here \(S_1,\ldots ,S_n\) denotes the sample of S. Since S is continuous, there are almost surely no ties, so the ranks \(\sum _{j=1}^n I(S_j\le S_i)\) are a permutation of \(1,\ldots ,n\), which gives the last equality. Due to the fact that \(F_S(S)\sim U(0,1)\), we can see that \(E(G(F_S))\) has the form of a deterministic integral: \(E(G(F_S))=\int _0^1G(t) dt\). Hence \(\hat{E}(G(\hat{F}_S))\) is a Riemann sum approximating the integral. Since G is a distribution function and hence nondecreasing, \(\hat{E}(G(\hat{F}_S))\) is also the upper Riemann sum, and we are able to bound the difference in this way:

$$\begin{aligned} |\hat{E}(G(\hat{F}_S))-E(G(F_S))|= & {} \left| \frac{1}{n}\sum _{i=1}^n G\left( \frac{i}{n}\right) - \int _0^1G(t) dt\right| \\\le & {} \left| \frac{1}{n}\sum _{i=1}^n G\left( \frac{i}{n}\right) -\frac{1}{n}\sum _{i=1}^n G\left( \frac{i-1}{n}\right) \right| \\= & {} \left| \frac{1}{n} (G(1) - G(0)) \right| =\frac{1}{n}. \end{aligned}$$

We have proven the result in Lemma 3. \(\square \)
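The Riemann-sum bound in the proof of Lemma 3 is easy to verify numerically. The sketch below is illustrative only: the function name `upper_riemann_gap` and the two choices of G (the Gini case G(c) = c and the Extended Gini case G(c) = 1 − (1 − c)² with closed-form integrals 1/2 and 2/3) are assumptions made for the example.

```python
import numpy as np

def upper_riemann_gap(G, integral, n):
    """|(1/n) * sum_{i=1}^n G(i/n) - integral| for a cdf G on [0, 1]."""
    grid = np.arange(1, n + 1) / n
    return abs(G(grid).mean() - integral)

# (G, exact value of the integral of G over [0, 1])
cases = {
    "gini": (lambda u: u, 0.5),
    "extended_nu3": (lambda u: 1 - (1 - u) ** 2, 2.0 / 3.0),
}

# Gap between the upper Riemann sum and the integral for several n.
gaps = {
    (name, n): upper_riemann_gap(G, integral, n)
    for name, (G, integral) in cases.items()
    for n in (10, 100, 1000)
}
```

For both choices of G the gap stays below 1/n for every n tried, in line with the bound derived above.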

With the lemmas, we prove Proposition 3 as follows:

Generalized Gini Correlation can be written in the form:

$$\begin{aligned} \Gamma _G(Z,S)=\frac{E_1-E_2}{E_3-E_4}, \end{aligned}$$

where:

$$\begin{aligned} E_1= & {} E(Z\cdot G(F_S))\\ E_2= & {} E(Z)\cdot E(G(F_S))\\ E_3= & {} E(Z\cdot G(F_Z))\\ E_4= & {} E(Z)\cdot E(G(F_Z)). \end{aligned}$$

Let the sample version of the GG correlation be \(\hat{\Gamma }_G\), which uses sample expectations \(\hat{E}\) to replace the expectations E in \(\Gamma _G\). Then we have:

$$\begin{aligned} \hat{\Gamma }_G=\Gamma _G +\frac{(E_3-E_4)(\delta _1-\delta _2)+(E_1-E_2)(\delta _4-\delta _3)}{(E_3-E_4)^2(1+(\delta _3-\delta _4)/(E_3-E_4))}, \end{aligned}\tag{6}$$

where \(\delta _i\) is the difference between sample expectation and corresponding population expectation:

$$\begin{aligned} \delta _1= & {} \hat{E}(Z \cdot G(\hat{F}_S)) - E(Z\cdot G(F_S))\\ \delta _2= & {} \hat{E}(Z) \cdot \hat{E}(G(\hat{F}_S))- E(Z)\cdot E(G(F_S))\\ \delta _3= & {} \hat{E}(Z\cdot G(\hat{F}_Z))-E(Z\cdot G(F_Z))\\ \delta _4= & {} \hat{E}(Z) \cdot \hat{E}(G(\hat{F}_Z))- E(Z)\cdot E(G(F_Z)). \end{aligned}$$
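The algebra behind (6) can be spelled out in a few lines. As a sketch, write \(N=E_1-E_2\) and \(D=E_3-E_4\) (shorthand introduced here for illustration, not notation from the paper), so that \(\Gamma _G=N/D\) and \(\hat{\Gamma }_G=(N+\delta _1-\delta _2)/(D+\delta _3-\delta _4)\). Then

$$\begin{aligned} \hat{\Gamma }_G-\Gamma _G= & {} \frac{N+\delta _1-\delta _2}{D+\delta _3-\delta _4}-\frac{N}{D}\\= & {} \frac{D(N+\delta _1-\delta _2)-N(D+\delta _3-\delta _4)}{D(D+\delta _3-\delta _4)}\\= & {} \frac{D(\delta _1-\delta _2)+N(\delta _4-\delta _3)}{D^2(1+(\delta _3-\delta _4)/D)}, \end{aligned}$$

which is (6) once N and D are written out as \(E_1-E_2\) and \(E_3-E_4\).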

We can decompose the four \(\delta _i\)’s as:

$$\begin{aligned} \delta _1= & {} [(\hat{E} -E) (Z G(\hat{F}_S)-Z G(F_S))]+[(\hat{E}-E)(Z G(F_S))]\\&+ [E(Z(G(\hat{F}_S)-G(F_S)))]\\\equiv & {} D_1+A_1+C_1\\ \delta _2= & {} [(\hat{E}-E)(Z) ]E(G(F_S)) +[\hat{E}(Z)] (\hat{E}(G(\hat{F}_S))-E(G(F_S)))\\\equiv & {} A_2E(G(F_S))+[A_2+EZ]B_1\\ \delta _3= & {} [(\hat{E} -E) (Z G(\hat{F}_Z)-Z G(F_Z))]+[(\hat{E}-E)(Z G(F_Z))]\\&+ [E(Z(G(\hat{F}_Z)-G(F_Z)))]\\\equiv & {} D_2+A_3+C_2\\ \delta _4= & {} [(\hat{E}-E)(Z) ] E(G(F_Z))+[\hat{E}(Z)](\hat{E}(G(\hat{F}_Z))-E(G(F_Z))) \\\equiv & {} A_2E(G(F_Z))+[A_2+EZ] B_2. \end{aligned}$$

For \(C_1\), note that \(C_1=EZG'(\hat{F}_S-F_S)+O_p(0.5|ZG^{''}|_\infty |\hat{F}_S-F_S|_\infty ^2) =EZG'(\hat{F}_S-F_S) +O_p(n^{-1}) \equiv C^*_1+O_p(n^{-1})\). Similarly \(C_2=C^*_2+O_p(n^{-1})\) where \(C^*_2\equiv EZG'(\hat{F}_Z-F_Z)\). Now rewriting \(\hat{F}_S(s')-F_S(s')=(\hat{E}-E)I(S\le s')\) for any \(s'\), we can rewrite \(C_1^*=(\hat{E}-E)\int z' G'(F_S(s'))I(S\le s')dF_{Z,S}(z',s') =(\hat{E}-E)\int E(Z|S=s') I(S\le s')dG(F_S(s')) =(\hat{E}-E)\int _S^\infty E(Z|S=s') dG(F_S(s')) \). By taking \(S=Z\), we get \(C_2^*=(\hat{E}-E)\int _Z^\infty z' dG(F_Z(z'))\). Then \(C_{1,2}^*\) and \(A_{1,2,3}\) are \(O_p(1/\sqrt{n})\) by the Central Limit Theorem.

Based on Lemmas 2 and 3, we know that \(B_1,B_2=O(1/n)\), and \(D_1,D_2=o_p(1/\sqrt{n})\), and therefore (6) becomes

$$\begin{aligned} \hat{\Gamma }_G-\Gamma _G= & {} \frac{(E_3-E_4)(A_1+C_1^*-A_2EG(F_S))+(E_1-E_2)(A_2EG(F_Z)-A_3-C_2^*)}{(E_3-E_4)^2}\\&\times [1+o_p(1)], \end{aligned}$$

which can be written in the form

$$\begin{aligned} (\hat{E}-E)f(Z,S) [1+o_p(1)]. \end{aligned}$$

Now applying Slutsky’s theorem and the Central Limit Theorem shows that

$$\begin{aligned} \sqrt{n}(\hat{\Gamma }_G-\Gamma _G)\overset{d}{\rightarrow } N(0,var(f(Z,S))). \end{aligned}$$

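By combining the contributions from \(A_{1,2,3}\) and \(C^*_{1,2}\), and reading off the terms from the expressions for \(A_{1,2,3}\) and \(C^*_{1,2}\) given above, we obtain

$$\begin{aligned} f(Z,S)= & {} \frac{1}{E_3-E_4}\left[ Z\cdot G(F_S(S))+\int _S^\infty E(Z|S=s') dG(F_S(s'))-Z\cdot EG(F_S)\right] \\&-\frac{E_1-E_2}{(E_3-E_4)^2}\left[ Z\cdot G(F_Z(Z))+\int _Z^\infty z' dG(F_Z(z'))-Z\cdot EG(F_Z)\right] . \end{aligned}$$

These lead to the proof of the proposition. \(\square \)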
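The root-n rate in Proposition 3 can be illustrated with a small Monte Carlo experiment. The following is a rough sketch under stated assumptions, not the simulation design of the paper: the helper names, the bivariate normal data with ρ = 0.5, the sample sizes 400 and 1600, and the 400 replications are all hypothetical choices, and the normal response is not bounded as the proof of Lemma 2 assumes. If the standard deviation of the estimate scales like 1/√n, quadrupling n should roughly halve it.

```python
import numpy as np

def ecdf_values(x):
    """Empirical cdf at each sample point (rank / n); assumes no ties."""
    return (np.argsort(np.argsort(x)) + 1) / len(x)

def gg_correlation(z, s, G=lambda u: u):
    """Plug-in estimate of cov(Z, G(F_S(S))) / cov(Z, G(F_Z(Z)))."""
    num = np.cov(z, G(ecdf_values(s)))[0, 1]
    den = np.cov(z, G(ecdf_values(z)))[0, 1]
    return num / den

def simulate_sd(n, reps, rho, rng):
    """Monte Carlo standard deviation of the estimate at sample size n."""
    estimates = np.empty(reps)
    for r in range(reps):
        z = rng.normal(size=n)
        s = rho * z + np.sqrt(1 - rho**2) * rng.normal(size=n)
        estimates[r] = gg_correlation(z, s)
    return estimates.std(ddof=1)

rng = np.random.default_rng(2)
sd_small = simulate_sd(400, 400, 0.5, rng)
sd_large = simulate_sd(1600, 400, 0.5, rng)

# Under root-n scaling this ratio should be near sqrt(1600 / 400) = 2.
ratio = sd_small / sd_large
```

A ratio close to 2 is consistent with the \(\sqrt{n}\) normalization in the limit theorem; it does not by itself check normality or the form of the asymptotic variance.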

Cite this article

Gao, Y., Jiang, W. & Tanner, M.A. Generalized Gini Correlation and its Application in Data-Mining. Data Min Knowl Disc 30, 1455–1479 (2016). https://doi.org/10.1007/s10618-016-0450-5

  • DOI: https://doi.org/10.1007/s10618-016-0450-5
