Effects of Training Data Size and Class Imbalance on the Performance of Classifiers

Zheng, Wanwan; Jin, Mingzhe

doi:10.1007/978-3-030-34518-1_1

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1119)

Included in the following conference series:

Conference on Artificial Intelligence and Natural Language

Abstract

This study examines the effects of training data size and class imbalance on the performance of classifiers. An empirical study was performed on nine classifiers with twenty benchmark datasets. First, two groups of datasets (those with few variables and those with numerous variables) were prepared. We then progressively increased the class imbalance of each dataset in each group by under-sampling both classes, in order to clarify the extent to which the predictive power of each classifier was adversely affected. The kappa coefficient (kappa) was chosen as the performance metric, and the Nemenyi post hoc test was used to find significant differences between classifiers. Additionally, the ranks of the nine classifiers under different conditions were discussed. The results indicated that (1) naïve Bayes, logistic regression and the logit leaf model are less susceptible to class imbalance; (2) datasets with a balanced class distribution and sufficient instances are presumed to be the ideal condition for maximizing the performance of classifiers; (3) increasing the number of instances is more effective than using more variables for improving the predictive performance of Random Forest. Furthermore, our experiment clarified the optimal classifiers for four types of datasets.
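The procedure in the abstract (under-sample both classes to reach a target class ratio, then score with kappa) can be illustrated with a minimal sketch. This is not the authors' code: the function names `undersample` and `cohen_kappa`, the toy labels, the class sizes and the trivial majority-class predictor are all assumptions chosen for illustration. It shows why kappa is a reasonable metric here: a predictor that always outputs the majority class looks more accurate as imbalance grows, yet its kappa stays at zero.

```python
import random
from collections import Counter


def undersample(labels, n_per_class, rng):
    """Draw n_per_class[c] random indices for each class c, without replacement."""
    chosen = []
    for cls, n in n_per_class.items():
        pool = [i for i, y in enumerate(labels) if y == cls]
        chosen.extend(rng.sample(pool, n))
    return sorted(chosen)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Only one class present in both vectors; kappa is undefined.
        # Returning 0.0 is a convention assumed for this sketch.
        return 0.0
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): 100 negatives (0) and 100 positives (1).
labels = [0] * 100 + [1] * 100
rng = random.Random(0)

# Progressively increase the imbalance by shrinking both classes.
for n_neg, n_pos in [(50, 50), (50, 25), (50, 10), (50, 5)]:
    idx = undersample(labels, {0: n_neg, 1: n_pos}, rng)
    y_true = [labels[i] for i in idx]
    y_pred = [0] * len(y_true)  # trivial predictor: always the majority class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(f"neg={n_neg} pos={n_pos} "
          f"accuracy={accuracy:.3f} kappa={cohen_kappa(y_true, y_pred):.3f}")
```

In a real experiment the trivial predictor would be replaced by each classifier trained on the under-sampled data, with kappa computed on held-out instances.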

Notes

1. NFL theorem: If algorithm A outperforms algorithm B on some cost functions, then loosely speaking there must exist exactly as many other functions where B outperforms A.
2. https://www.openml.org/

Author information

Authors and Affiliations

Doshisha University, Kyoto, Japan
Wanwan Zheng & Mingzhe Jin

Corresponding author

Correspondence to Wanwan Zheng.

Editor information

Editors and Affiliations

Krasovskii Institute of Mathematics and Mechanics, Yekaterinburg, Russia
Dmitry Ustalov
ITMO University, St. Petersburg, Russia
Andrey Filchenkov
Computer Science, University of Helsinki, Helsinki, Finland
Lidia Pivovarova

About this paper

Cite this paper

Zheng, W., Jin, M. (2019). Effects of Training Data Size and Class Imbalance on the Performance of Classifiers. In: Ustalov, D., Filchenkov, A., Pivovarova, L. (eds) Artificial Intelligence and Natural Language. AINL 2019. Communications in Computer and Information Science, vol 1119. Springer, Cham. https://doi.org/10.1007/978-3-030-34518-1_1

DOI: https://doi.org/10.1007/978-3-030-34518-1_1
Published: 13 November 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34517-4
Online ISBN: 978-3-030-34518-1
eBook Packages: Computer Science, Computer Science (R0)
