Data Mining and Knowledge Discovery, Volume 30, Issue 6, pp 1455–1479

Generalized Gini Correlation and its Application in Data-Mining

Article

Abstract

An asymmetric correlation measure commonly used in social economics, called the Gini correlation, is defined between a numerical response and a rank. We generalize the definition of this correlation so that it can be applied to data mining. The new definition, called the generalized Gini correlation, is found to include special cases that are equivalent to common evaluation measures used in data mining, for example, the LIFT measures for a binary response and the expected profit measure for a monetary response. We consider estimation and inference regarding this generalized Gini correlation. The asymptotic distribution of the estimated correlation is derived with the help of some empirical process theory. We consider several ways of constructing confidence intervals and demonstrate their performance numerically. Our paper is interdisciplinary and makes contributions to both the Gini literature and the literature of statistical inference of performance measures in data mining.
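To make the starting point concrete, the following is a minimal sketch of the classical (not the generalized) Gini correlation of Schechtman and Yitzhaki, written as the ratio cov(Y, rank of X) / cov(Y, rank of Y). The function names `average_ranks`, `gini_correlation`, and `bootstrap_ci` are hypothetical and are not taken from the paper. The percentile bootstrap interval is a generic technique included only for illustration; it is not claimed to be one of the interval constructions studied in the article.

```python
import numpy as np


def average_ranks(values):
    """Ranks 1..n, with tied values sharing their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(values.size, dtype=float)
    ranks[order] = np.arange(1, values.size + 1)
    # Average the ranks within each group of tied values.
    _, inverse = np.unique(values, return_inverse=True)
    tie_sums = np.bincount(inverse, weights=ranks)
    tie_counts = np.bincount(inverse)
    return (tie_sums / tie_counts)[inverse]


def gini_correlation(y, x):
    """Sample Gini correlation of a numerical response y with the rank of x.

    Assumption: the classical definition cov(y, rank(x)) / cov(y, rank(y)).
    The measure is asymmetric: swapping y and x generally changes the value.
    """
    y = np.asarray(y, dtype=float)
    numerator = np.cov(y, average_ranks(x))[0, 1]
    denominator = np.cov(y, average_ranks(y))[0, 1]
    return numerator / denominator


def bootstrap_ci(y, x, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (illustrative, not the paper's method)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = y.size
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample (y, x) pairs with replacement
        stats[b] = gini_correlation(y[idx], x[idx])
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lower, upper


if __name__ == "__main__":
    y = [1, 2, 3, 4, 10]
    x = [2, 1, 3, 5, 4]
    print(gini_correlation(y, x))  # response y, ranking by x
    print(gini_correlation(x, y))  # roles swapped; the value differs
```

Because only the ranks of the second argument enter the numerator, any increasing transformation of `x` leaves the value unchanged, which is why the measure suits settings where a model score is used only to order cases.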

Keywords

Asymptotic distribution, Confidence interval, Data mining, Empirical process, Gini correlation, LIFT measures

Mathematics Subject Classification

62E20, 62P99

Supplementary material

10618_2016_450_MOESM1_ESM.pdf (167 kb)


Copyright information

© The Author(s) 2016

Authors and Affiliations

  1. Department of Statistics, Northwestern University, Evanston, USA
