An Empirical Study of Bagging Predictors for Imbalanced Data with Different Levels of Class Distribution

  • Guohua Liang
  • Xingquan Zhu
  • Chengqi Zhang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7106)

Abstract

Learning from imbalanced data has attracted growing attention from both academia and industry, especially for problems where the class distribution is highly skewed. This paper compares the Area Under the Receiver Operating Characteristic Curve (AUC) performance of bagging when learning from data with different levels of class imbalance. Despite the popularity of bagging in many real-world applications, existing research has not clearly answered some questions, e.g., which bagging predictors achieve the best performance, and whether bagging remains superior to single learners as the level of class imbalance changes. We perform a comprehensive evaluation of the AUC performance of bagging predictors with 12 base learners at different levels of class imbalance, obtained by using a sampling technique on 14 imbalanced datasets. Our experimental results indicate that Decision Table (DTable) and RepTree are the learning algorithms with the best bagging AUC performance. For most base learners, the AUC performance of the bagging predictor is statistically superior to that of the single learner; the exceptions are Support Vector Machines (SVM) and Decision Stump (DStump).
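
The kind of comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' experimental setup: the data are synthetic Gaussians rather than the 14 datasets, the imbalance level is set by generating fewer minority examples rather than by the paper's sampling technique, the base learner is a hand-written decision stump, and all function names (`auc`, `make_data`, `fit_stump`, `bagged`, `compare`) are hypothetical. It only shows how a single learner and its bagged counterpart might be scored by AUC at several minority-class fractions.

```python
import random

def auc(scores, labels):
    """AUC as the chance a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_data(n_neg, n_pos, rng):
    """Synthetic two-feature data (an assumption): negatives ~ N(0,1), positives ~ N(1,1)."""
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n_neg)]
    X += [[rng.gauss(1, 1), rng.gauss(1, 1)] for _ in range(n_pos)]
    y = [0] * n_neg + [1] * n_pos
    return X, y

def fit_stump(X, y):
    """One-split decision stump; each leaf scores with its smoothed positive fraction."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            # Training errors if each leaf predicts its majority class.
            err = (min(sum(left), len(left) - sum(left))
                   + min(sum(right), len(right) - sum(right)))
            if best is None or err < best[0]:
                p_left = (sum(left) + 1) / (len(left) + 2)
                p_right = (sum(right) + 1) / (len(right) + 2)
                best = (err, f, t, p_left, p_right)
    _, f, t, p_left, p_right = best
    return lambda row: p_left if row[f] <= t else p_right

def bagged(X, y, n_estimators, rng):
    """Bagging: fit one stump per bootstrap sample and average their scores."""
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda row: sum(m(row) for m in models) / len(models)

def compare(minority_frac, seed=0, n_neg=150, n_estimators=10):
    """Return (single-learner AUC, bagged AUC) at one minority-class fraction."""
    rng = random.Random(seed)
    n_pos = max(2, round(n_neg * minority_frac / (1 - minority_frac)))
    X_tr, y_tr = make_data(n_neg, n_pos, rng)
    X_te, y_te = make_data(n_neg, n_pos, rng)
    single = fit_stump(X_tr, y_tr)
    ensemble = bagged(X_tr, y_tr, n_estimators, rng)
    return (auc([single(r) for r in X_te], y_te),
            auc([ensemble(r) for r in X_te], y_te))

if __name__ == "__main__":
    for frac in (0.5, 0.2, 0.05):
        s, b = compare(frac)
        print(f"minority={frac:.2f}  single AUC={s:.3f}  bagged AUC={b:.3f}")
```

Numbers printed by this sketch depend on the synthetic data and the seed and say nothing about the paper's reported results; a real replication would also need repeated runs and a statistical test across datasets.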

Keywords

imbalanced class distribution, AUC performance, bagging

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Guohua Liang
    • 1
  • Xingquan Zhu
    • 1
  • Chengqi Zhang
    • 1
  1. The Centre for Quantum Computation & Intelligent Systems, FEIT, University of Technology, Sydney, Australia
