Effects of Training Data Size and Class Imbalance on the Performance of Classifiers

  • Conference paper
Artificial Intelligence and Natural Language (AINL 2019)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1119))


Abstract

This study examines the effects of training data size and class imbalance on the performance of classifiers. An empirical study was performed on nine classifiers with twenty benchmark datasets. First, two groups of datasets (those with few variables and those with many variables) were prepared. Then the class imbalance of each dataset in each group was progressively increased by under-sampling both classes, so as to clarify to what extent the predictive power of each classifier was adversely affected. The kappa coefficient (kappa) was chosen as the performance metric, and the Nemenyi post-hoc test was used to find significant differences between classifiers. Additionally, the ranks of the nine classifiers under different conditions were discussed. The results indicated that (1) Naïve Bayes, logistic regression, and the logit leaf model are less susceptible to class imbalance; (2) datasets with a balanced class distribution and a sufficient number of instances appear to be the ideal condition for maximizing the performance of classifiers; (3) increasing the number of instances is more effective than increasing the number of variables for improving the predictive performance of Random Forest. Furthermore, our experiment identified the optimal classifiers for four types of datasets.
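The sampling and evaluation procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: `undersample_to_ratio` is a hypothetical helper reflecting one plausible reading of the protocol (under-sampling both classes to a target imbalance ratio while holding the total sample size fixed), and `cohens_kappa` implements the standard kappa coefficient used as the performance metric.

```python
import random
from collections import Counter


def undersample_to_ratio(majority, minority, ratio, total, seed=0):
    """Draw a sub-sample of `total` instances whose majority:minority
    ratio equals `ratio`, under-sampling both classes
    (hypothetical reading of the paper's protocol)."""
    rng = random.Random(seed)
    n_min = int(round(total / (1 + ratio)))
    n_maj = total - n_min
    return rng.sample(majority, n_maj), rng.sample(minority, n_min)


def cohens_kappa(y_true, y_pred):
    """Kappa coefficient: observed agreement corrected for chance,
    kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts.get(c, 0)
              for c in true_counts) / n ** 2
    return (p_o - p_e) / (1 - p_e)
```

For example, `undersample_to_ratio(maj, mino, ratio=9, total=100)` yields a 90/10 split (a 9:1 imbalance), and a kappa of 0 corresponds to chance-level agreement, which is why kappa is less forgiving than accuracy on imbalanced data.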


Notes

  1. NFL theorem: If algorithm A outperforms algorithm B on some cost functions, then, loosely speaking, there must exist exactly as many other functions on which B outperforms A.
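     In Wolpert and Macready's formulation (a sketch in their notation, where $d_m^y$ is the sequence of cost values observed after $m$ evaluations and $f$ ranges over all cost functions), any two algorithms $A$ and $B$ satisfy:

     $$\sum_{f} P\big(d_m^y \mid f, m, A\big) \;=\; \sum_{f} P\big(d_m^y \mid f, m, B\big)$$

     That is, averaged uniformly over all possible cost functions, no algorithm outperforms any other.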

  2. https://www.openml.org/.


Author information


Corresponding author

Correspondence to Wanwan Zheng.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Zheng, W., Jin, M. (2019). Effects of Training Data Size and Class Imbalance on the Performance of Classifiers. In: Ustalov, D., Filchenkov, A., Pivovarova, L. (eds) Artificial Intelligence and Natural Language. AINL 2019. Communications in Computer and Information Science, vol 1119. Springer, Cham. https://doi.org/10.1007/978-3-030-34518-1_1


  • DOI: https://doi.org/10.1007/978-3-030-34518-1_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-34517-4

  • Online ISBN: 978-3-030-34518-1

  • eBook Packages: Computer Science (R0)
