Multiple Classifier Prediction Improvements against Imbalanced Datasets through Added Synthetic Examples

  • Herna L. Viktor
  • Hongyu Guo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3138)


Ensembles of classifiers have been used successfully to improve overall predictive accuracy in many domains. In particular, boosting, which focuses on hard-to-learn examples, is applicable to difficult learning problems. In a two-class imbalanced data set, the number of examples of the majority class is much higher than that of the minority class. This implies that a traditional boosting ensemble trained on such data may have poor predictive accuracy against the minority class. This paper introduces an approach to address this shortcoming through the generation of synthetic examples, which are added to the original training set. In this way, the ensemble is able to focus not only on hard examples, but also on rare examples. The experimental results, obtained by applying our Databoost-IM algorithm to eleven datasets, indicate that it surpasses a benchmarking individual classifier as well as a popular boosting method when evaluated in terms of the overall accuracy, the G-mean and the F-measures.
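
The three evaluation measures named above can be illustrated with a minimal sketch. It assumes the standard definitions (G-mean as the geometric mean of the true positive and true negative rates, F-measure as the harmonic mean of precision and recall); the function name `imbalance_metrics` and the toy label vectors are hypothetical and are not taken from the paper's experiments.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Overall accuracy, G-mean and F-measure for a two-class problem.

    `positive` marks the class of interest (here, the minority class).
    Standard textbook definitions are assumed; this is an illustration,
    not the evaluation code used by the authors.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    accuracy = (tp + tn) / len(pairs)
    g_mean = math.sqrt(recall * specificity)
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "g_mean": g_mean, "f_measure": f_measure}


# Toy imbalanced set: 95 majority examples (0) and 5 minority examples (1).
y_true = [0] * 95 + [1] * 5

# A classifier that always predicts the majority class.
always_majority = [0] * 100

# A classifier that finds 4 of the 5 minority examples,
# at the price of 10 false alarms on the majority class.
finds_minority = [1] * 10 + [0] * 85 + [1] * 4 + [0] * 1

baseline = imbalance_metrics(y_true, always_majority)
improved = imbalance_metrics(y_true, finds_minority)
print(baseline)
print(improved)
```

On this toy data the always-majority classifier has the higher overall accuracy (0.95 against 0.89) but a G-mean and F-measure of zero, which is why accuracy alone is a weak yardstick on imbalanced data. Passing `positive=0` gives the F-measure for the majority class instead.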


Predictive Accuracy, Majority Class, Class Distribution, Minority Class, Data Generation Process



Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Herna L. Viktor (1)
  • Hongyu Guo (1)
  1. School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada
