Abstract
Imbalanced learning problems, in which a data set contains a large number of normal samples and only a small percentage of abnormal ones, arise in many everyday applications. Machine learning models such as Decision Tree and Logistic Regression are widely applied to these problems, but their performance is degraded by severe class imbalance. Sampling methods address this by rebalancing the data set. This work combines random undersampling with SMOTE (Synthetic Minority Over-sampling Technique) to synthetically modify the data set before training, which achieves a better recall_score in experiments. We also correct a mistake found in other work on sampling methods, which evaluates models on the transformed data set; this runs counter to the original purpose of resampling. Finally, we use this data-set-based technique to improve the Logistic Regression algorithm, allowing it to perform better on imbalanced data.
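The combination described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract gives no parameters, so the function names, the target majority size `n_majority`, the neighbour count `k`, the toy data, and the nearest-centroid classifier (a stand-in for Logistic Regression) are all assumptions made for illustration. The point it shows is the order of operations: the test set is held out first, only the training set is undersampled and then oversampled with SMOTE-style interpolation, and recall is measured on the untransformed test set.

```python
import math
import random

def random_undersample(rows, n_keep, rng):
    """Keep n_keep rows chosen uniformly at random, without replacement."""
    return rng.sample(rows, min(n_keep, len(rows)))

def smote(minority, n_new, k, rng):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

def undersample_then_smote(X, y, n_majority, k=5, seed=0):
    """Shrink the majority class (label 0) to n_majority rows, then grow
    the minority class (label 1) with SMOTE until both classes match."""
    rng = random.Random(seed)
    majority = [x for x, label in zip(X, y) if label == 0]
    minority = [x for x, label in zip(X, y) if label == 1]
    majority = random_undersample(majority, n_majority, rng)
    n_new = max(0, len(majority) - len(minority))
    synthetic = smote(minority, n_new, k, rng)
    X_res = majority + minority + synthetic
    y_res = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
    return X_res, y_res

def recall(y_true, y_pred):
    """Fraction of true positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0

def centroid(rows):
    """Mean of each feature column."""
    return [sum(col) / len(rows) for col in zip(*rows)]

# Hypothetical toy data: 200 normal samples and 10 abnormal ones.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
X += [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(10)]
y = [0] * 200 + [1] * 10

# Hold out the test set BEFORE any resampling (every fifth sample here).
X_train = [x for i, x in enumerate(X) if i % 5 != 0]
y_train = [t for i, t in enumerate(y) if i % 5 != 0]
X_test = [x for i, x in enumerate(X) if i % 5 == 0]
y_test = [t for i, t in enumerate(y) if i % 5 == 0]

# Resample the training set only.
X_res, y_res = undersample_then_smote(X_train, y_train, n_majority=40)

# Stand-in classifier (assumption): predict the class of the nearer centroid.
c0 = centroid([x for x, label in zip(X_res, y_res) if label == 0])
c1 = centroid([x for x, label in zip(X_res, y_res) if label == 1])
y_pred = [1 if math.dist(x, c1) < math.dist(x, c0) else 0 for x in X_test]

# Recall is computed on the original, untransformed test set.
test_recall = recall(y_test, y_pred)
print(f"recall on untouched test set: {test_recall:.2f}")
```

Scoring the model on `X_res` instead of `X_test` would measure performance on a balanced, partly synthetic distribution that the model never meets in deployment, which is the evaluation mistake the abstract refers to.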
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, J. (2023). USMOTE: A Synthetic Data-Set-Based Method Improving Imbalanced Learning. In: Xu, Z., Alrabaee, S., Loyola-González, O., Cahyani, N.D.W., Ab Rahman, N.H. (eds) Cyber Security Intelligence and Analytics. CSIA 2023. Lecture Notes on Data Engineering and Communications Technologies, vol 173. Springer, Cham. https://doi.org/10.1007/978-3-031-31775-0_57
DOI: https://doi.org/10.1007/978-3-031-31775-0_57
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-31774-3
Online ISBN: 978-3-031-31775-0
eBook Packages: Intelligent Technologies and Robotics (R0)