Abstract
The k-means algorithm is widely used as an effective clustering method, but it usually performs poorly on imbalanced datasets. To address this problem, this paper proposes density-kmeans++, an algorithm based on density distance. The proposed method incorporates density distance into the traditional Euclidean distance-based k-means algorithm when clustering imbalanced datasets. Experimental results on UCI datasets and the Case Western Reserve University bearing data indicate that density-kmeans++ handles imbalanced datasets better than k-means++.
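For orientation, the baseline that the abstract compares against, k-means++, chooses its initial centers by sampling each new center with probability proportional to its squared Euclidean distance from the nearest center already chosen. The following is a minimal sketch of that standard seeding step only. The abstract does not define the density distance, so the proposed density-kmeans++ is not reproduced here. The function names and the toy data are hypothetical, chosen purely for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centers with standard k-means++ (D^2) sampling.

    This is the baseline seeding, not the paper's density-based variant.
    `points` is a list of coordinate tuples; `rng` is an optional
    random.Random instance for reproducibility.
    """
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # First center: uniform at random.
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            idx = rng.randrange(len(points))
        else:
            # Sample an index with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
        # Refresh nearest-center distances with the new center.
        d2 = [min(old, squared_distance(p, points[idx]))
              for p, old in zip(points, d2)]
    return centers


if __name__ == "__main__":
    # Hypothetical imbalanced toy data: 95 points near (0, 0), 5 near (10, 10).
    gen = random.Random(42)
    majority = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(95)]
    minority = [(gen.gauss(10, 0.3), gen.gauss(10, 0.3)) for _ in range(5)]
    print(kmeanspp_seeds(majority + minority, k=2, rng=gen))
```

Because the sampling weights depend only on Euclidean distance, this seeding carries no information about how many points each region holds; that is the gap the abstract says density distance is meant to address.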
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Fan, L., Chai, Y., Li, Y. (2020). A Density-Based k-Means++ Algorithm for Imbalanced Datasets Clustering. In: Jia, Y., Du, J., Zhang, W. (eds) Proceedings of 2019 Chinese Intelligent Systems Conference. CISC 2019. Lecture Notes in Electrical Engineering, vol 594. Springer, Singapore. https://doi.org/10.1007/978-981-32-9698-5_5
DOI: https://doi.org/10.1007/978-981-32-9698-5_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-32-9697-8
Online ISBN: 978-981-32-9698-5
eBook Packages: Intelligent Technologies and Robotics (R0)