On the Role of Cost-Sensitive Learning in Imbalanced Data Oversampling
- 1.1k Downloads
Learning from imbalanced data is still considered as one of the most challenging areas of machine learning. Among plethora of methods dedicated to alleviating the challenge of skewed distributions, two most distinct ones are data-level sampling and cost-sensitive learning. The former modifies the training set by either removing majority instances or generating additional minority ones. The latter associates a penalty cost with the minority class, in order to mitigate the classifiers’ bias towards the better represented class. While these two approaches have been extensively studied on their own, no works so far have tried to combine their properties. Such a direction seems as highly promising, as in many real-life imbalanced problems we may obtain the actual misclassification cost and thus it should be embedded in the classification framework, regardless of the selected algorithm. This work aims to open a new direction for learning from imbalanced data, by investigating an interplay between the oversampling and cost-sensitive approaches. We show that there is a direct relationship between the misclassification cost imposed on the minority class and the oversampling ratios that aim to balance both classes. This becomes vivid when popular skew-insensitive metrics are modified to incorporate the cost-sensitive element. Our experimental study clearly shows a strong relationship between sampling and cost, indicating that this new direction should be pursued in the future in order to develop new and effective algorithms for imbalanced data.
KeywordsMachine learning Imbalanced data Cost-sensitive learning Data preprocessing Oversampling SMOTE
This work was supported by the Polish National Science Centre under the grant No. 2017/27/B/ST6/01325 as well as by the statutory funds of the Department of Systems and Computer Networks, Faculty of Electronics, Wroclaw University of Science and Technology.
- 8.Domingos, P.M.: Metacost: a general method for making classifiers cost-sensitive. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 15–18 August 1999, pp. 155–164 (1999)Google Scholar
- 13.Ksieniewicz, P., Woźniak, M.: Dealing with the task of imbalanced, multidimensional data classification using ensembles of exposers. In: First International Workshop on Learning with Imbalanced Domains: Theory and Applications, LIDTA@PKDD/ECML 2017, 22 September 2017, Skopje, Macedonia, pp. 164–175 (2017)Google Scholar
- 17.del Río, S., Benítez, J.M., Herrera, F.: Analysis of data preprocessing increasing the oversampling ratio for extremely imbalanced big data classification. In: 2015 IEEE TrustCom/BigDataSE/ISPA, Helsinki, Finland, 20–22 August 2015, vol. 2, pp. 180–185 (2015)Google Scholar
- 18.Skryjomski, P., Krawczyk, B.: Influence of minority class instance types on SMOTE imbalanced data oversampling. In: First International Workshop on Learning with Imbalanced Domains: theory and applications, LIDTA@PKDD/ECML 2017, 22 September 2017, Skopje, Macedonia, pp. 7–21 (2017)Google Scholar
- 19.Thai-Nghe, N., Gantner, Z., Schmidt-Thieme, L.: Cost-sensitive learning methods for imbalanced data. In: International Joint Conference on Neural Networks, IJCNN 2010, Barcelona, Spain, 18–23 July 2010, pp. 1–8 (2010)Google Scholar
- 20.Wang, S., Li, Z., Chao, W., Cao, Q.: Applying adaptive over-sampling technique based on data density and cost-sensitive SVM to imbalanced learning. In: The 2012 International Joint Conference on Neural Networks (IJCNN), Brisbane, Australia, 10–15 June 2012, pp. 1–8 (2012)Google Scholar