Abstract
The class imbalance problem is widely studied in the machine learning community and arises in many real-world applications, such as spam filtering, anomaly detection and medical diagnosis. In this paper, we propose a density weighted fuzzy outlier clustering approach for class imbalanced learning. The method uses a novel fuzzy neighborhood relation with local density information to assign weights to the samples during clustering, and it is hybridized with the fuzzy outlier clustering approach to yield a novel fuzzy clustering method. In this way, the most representative majority class samples are chosen while outlier samples are eliminated. The proposed method is tested on synthetic and real-world datasets, where it demonstrates superior performance compared to other clustering-based resampling schemes. Thus, the density weighted fuzzy outlier clustering approach can be used for real-life imbalanced problems.
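The abstract does not give the formulas of the method, so the following Python sketch only illustrates the general idea of density-weighted fuzzy clustering for undersampling a majority class; it is not the authors' algorithm. Everything in it is an assumption made for illustration: the Gaussian k-nearest-neighbor density, the quantile rule for discarding outliers, the ranking score, and the hypothetical function names local_density, weighted_fcm and undersample_majority.

```python
import numpy as np

def local_density(X, k=5):
    """Assumed density: Gaussian kernel summed over each point's k nearest neighbours."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    knn = np.sort(d, axis=1)[:, 1:k + 1]          # column 0 is the zero self-distance
    sigma = knn.mean() + 1e-12                    # global bandwidth (assumption)
    return np.exp(-(knn / sigma) ** 2).sum(axis=1)

def memberships(X, centers, m):
    """Standard fuzzy c-means memberships; each row sums to 1."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def weighted_fcm(X, w, c=3, m=2.0, iters=100, seed=0):
    """Fuzzy c-means whose centre update is weighted by per-sample weights w."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(iters):
        u = memberships(X, centers, m)
        wu = w[:, None] * u ** m                  # density-weighted fuzzy memberships
        centers = (wu.T @ X) / wu.sum(axis=0)[:, None]
    return centers, memberships(X, centers, m)

def undersample_majority(X_maj, n_keep, k=5, c=3, outlier_quantile=0.1):
    """Return indices of n_keep majority samples judged most representative."""
    rho = local_density(X_maj, k)
    keep = rho > np.quantile(rho, outlier_quantile)   # lowest-density points treated as outliers
    Xc, wc = X_maj[keep], rho[keep]
    _, u = weighted_fcm(Xc, wc, c=c)
    score = wc * u.max(axis=1)                    # dense and firmly assigned to one cluster
    order = np.argsort(-score)[:n_keep]
    return np.flatnonzero(keep)[order]

# Toy majority class: two blobs plus one far-away outlier (index 100).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (50, 2)),
                    rng.normal(6, 1, (50, 2)),
                    [[30.0, 30.0]]])
selected = undersample_majority(X_demo, n_keep=20, c=2)
```

In this sketch the retained indices in selected would be combined with all minority samples to form a balanced training set. The quantile threshold and the score are the simplest choices that show the two roles density plays in the abstract's description, weighting the clustering and exposing outliers, and would need to be replaced by the paper's own definitions for a faithful implementation.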
Acknowledgements
The work is supported by the National Natural Science Foundation of China (Grant No. 71420107025). The authors would like to thank the associate editor and anonymous referees for their helpful and constructive comments.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Wang, X., Wang, H. & Wang, Y. A density weighted fuzzy outlier clustering approach for class imbalanced learning. Neural Comput & Applic 32, 13035–13049 (2020). https://doi.org/10.1007/s00521-020-04747-4
DOI: https://doi.org/10.1007/s00521-020-04747-4