Making Class Bias Useful: A Strategy of Learning from Imbalanced Data

  • Jie Gu
  • Yuanbing Zhou
  • Xianqiang Zuo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4881)


The performance of many learning methods is often degraded by the class imbalance problem, in which the training data is dominated by instances belonging to one class. In this paper, we propose a novel method that combines random-forest-based techniques with sampling methods for effective learning from imbalanced data. Our method consists of two phases: data cleaning, and classification based on a random forest. First, the training data is cleaned by eliminating dangerous negative instances. The cleaning process is supervised by a negative-biased random forest, in which negative instances form the major proportion of the training data for each tree in the forest. Second, we develop a variant of random forest in which each tree is biased towards the positive class, and a majority vote over the trees provides the prediction. In the experimental evaluation, we compared our method with existing methods on real data sets, and the results demonstrate a significant performance improvement for our method in terms of the area under the ROC curve (AUC).
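The two phases described above can be sketched with scikit-learn. This is a minimal illustration, not the authors' implementation: `class_weight` stands in for the paper's per-tree class-distribution bias, and the 0.5 probability threshold for flagging "dangerous" negatives is an assumption for the sake of the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data: class 1 (the positive class) is the minority.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Phase 1: data cleaning with a negative-biased forest. class_weight tilts
# each tree towards the negative class; negatives that even this biased
# forest assigns a high positive probability are treated as "dangerous"
# borderline instances and removed.
cleaner = RandomForestClassifier(
    n_estimators=100, class_weight={0: 5.0, 1: 1.0}, random_state=0
).fit(X, y)
pos_prob = cleaner.predict_proba(X)[:, 1]
dangerous = (y == 0) & (pos_prob > 0.5)  # threshold chosen for illustration
X_clean, y_clean = X[~dangerous], y[~dangerous]

# Phase 2: classification with a positive-biased forest; predict() returns
# the majority vote over the trees.
clf = RandomForestClassifier(
    n_estimators=100, class_weight={0: 1.0, 1: 5.0}, random_state=0
).fit(X_clean, y_clean)
predictions = clf.predict(X)
```

AUC, the evaluation measure used in the paper, can then be computed on held-out data with `sklearn.metrics.roc_auc_score` applied to `clf.predict_proba`.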


Keywords: Random Forest, Class Distribution, Data Cleaning, Class Imbalance, Positive Instance





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Jie Gu (1)
  • Yuanbing Zhou (2)
  • Xianqiang Zuo (2)

  1. Software School of Tsinghua University
  2. State Power Economic Research Institute, China
