Data Mining and Knowledge Discovery, Volume 24, Issue 1, pp 136–158

Hellinger distance decision trees are robust and skew-insensitive

  • David A. Cieslak
  • T. Ryan Hoens
  • Nitesh V. Chawla
  • W. Philip Kegelmeyer


Learning from imbalanced data is an important and common problem. Decision trees, supplemented with sampling techniques, have proven to be an effective way to address the imbalanced data problem. Despite their effectiveness, however, sampling methods add complexity and the need for parameter selection. To bypass these difficulties we propose a new decision tree technique called Hellinger Distance Decision Trees (HDDT) which uses Hellinger distance as the splitting criterion. We analytically and empirically demonstrate the strong skew insensitivity of Hellinger distance and its advantages over popular alternatives such as entropy (gain ratio). We apply a comprehensive empirical evaluation framework testing against commonly used sampling and ensemble methods, considering performance across 58 varied datasets. We demonstrate the superiority (using robust tests of statistical significance) of HDDT on imbalanced data, as well as its competitive performance on balanced datasets. We thereby arrive at the particularly practical conclusion that for imbalanced data it is sufficient to use Hellinger trees with bagging (BG) without any sampling methods. We provide all the datasets and software for this paper online (


Keywords: Imbalanced data, Decision tree, Hellinger distance
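The abstract names Hellinger distance as the splitting criterion but does not state the formula. As a minimal sketch, assuming the commonly stated two-class form in which a candidate split is scored by the Hellinger distance between how the positive class and the negative class distribute over the branches, the criterion might look like the following. The function name `hellinger_split_value` and the count-list inputs are hypothetical choices for illustration, not the authors' software.

```python
import math


def hellinger_split_value(pos_counts, neg_counts):
    """Score a candidate split for a two-class problem (illustrative sketch).

    pos_counts[j] and neg_counts[j] are the numbers of positive and negative
    examples that fall into branch j of the split. The score is the Hellinger
    distance between the two per-class branch distributions; larger values
    mean the split separates the classes better.
    """
    n_pos = sum(pos_counts)
    n_neg = sum(neg_counts)
    if n_pos == 0 or n_neg == 0:
        # Assumption: with one class absent there is nothing to separate.
        return 0.0
    total = 0.0
    for p, n in zip(pos_counts, neg_counts):
        # Each class is normalised by its own size, so class priors cancel.
        total += (math.sqrt(p / n_pos) - math.sqrt(n / n_neg)) ** 2
    return math.sqrt(total)


# Same split evaluated at a 1:10 and at a 1:1000 class ratio.
mild = hellinger_split_value([8, 2], [20, 80])
severe = hellinger_split_value([8, 2], [2000, 8000])
```

Because each class is normalised by its own size, `mild` and `severe` agree: scaling the majority class changes the counts but not the per-class branch proportions, which is one way to see the skew insensitivity the abstract refers to. Under this form the score ranges from 0 (identical branch distributions) to the square root of 2 (perfect separation).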





Copyright information

© The Author(s) 2011

Authors and Affiliations

  • David A. Cieslak (1)
  • T. Ryan Hoens (1)
  • Nitesh V. Chawla (1)
  • W. Philip Kegelmeyer (1)

  1. University of Notre Dame, Notre Dame, USA
