A Hybrid Approach Handling Imbalanced Datasets

  • Paolo Soda
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5716)


Many binary classification problems exhibit an imbalanced class distribution, which affects how the system learns. Traditional machine learning algorithms are biased towards the majority class and therefore produce poor predictive accuracy on the minority class. To overcome this limitation, many approaches have been proposed to build artificially balanced training sets. Besides their specific drawbacks, these approaches achieve more balanced per-class accuracies at the expense of global accuracy. This paper first reviews recent methods for coping with imbalanced datasets and then proposes a strategy that overcomes the main drawbacks of existing approaches. The strategy is based on an ensemble of classifiers trained on balanced subsets of the original imbalanced training set, working in conjunction with a classifier trained on the original imbalanced dataset. The performance of the method has been estimated on six public datasets, demonstrating its effectiveness also in comparison with other approaches. The method also makes it possible to modify the system behaviour according to the operating scenario.
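As a rough illustration of what "balanced subsets of the original imbalanced training set" could look like, the sketch below partitions the majority class into chunks of about the minority-class size and pairs each chunk with every minority sample. The helper name `balanced_subsets`, the partition-without-replacement scheme, the 0/1 labels, and the toy data are all assumptions made for illustration; the abstract does not specify how the subsets are built.

```python
import random


def balanced_subsets(majority, minority, seed=0):
    """Hypothetical helper: build balanced training subsets.

    Assumption (not stated in the abstract): the majority class is
    shuffled and split into disjoint chunks of about the minority-class
    size, and each chunk is paired with all minority samples.
    """
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    size = max(1, len(minority))
    subsets = []
    for start in range(0, len(shuffled), size):
        chunk = shuffled[start:start + size]
        # Label 0 marks the majority class, label 1 the minority class.
        labelled = [(x, 0) for x in chunk] + [(x, 1) for x in minority]
        subsets.append(labelled)
    return subsets


# Toy data: 90 majority samples and 10 minority samples.
majority = list(range(100, 190))
minority = list(range(10))
subsets = balanced_subsets(majority, minority)
```

In such a setup, each subset would train one member of the ensemble, while a further classifier would be trained on the full imbalanced set. How the outputs are combined is not described in the abstract, so it is not sketched here.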


Keywords: Class Distribution, Minority Class, Global Accuracy, Imbalanced Dataset, Minority Class Sample



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Paolo Soda, Integrated Research Centre, Medical Informatics & Computer Science Laboratory, University Campus Bio-Medico of Rome, Rome, Italy
