
A Density-Based k-Means++ Algorithm for Imbalanced Datasets Clustering

  • Conference paper
Proceedings of 2019 Chinese Intelligent Systems Conference (CISC 2019)

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 594)

Abstract

The k-means algorithm is widely used as an effective clustering method. However, it usually performs poorly on imbalanced datasets. To address this problem, this paper proposes density-kmeans++, an algorithm based on density distance. The proposed method incorporates density distance into the traditional Euclidean distance-based k-means algorithm when clustering imbalanced datasets. Experimental results on UCI datasets and the Case Western Reserve University bearing data indicate that density-kmeans++ handles imbalanced datasets better than k-means++.
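The abstract does not define "density distance", so the authors' formulation cannot be reproduced from this page. As a rough illustration of the general idea only, the sketch below combines standard k-means++ seeding (squared Euclidean distance sampling) with an optional per-point weight. The weight used here, the inverse of the mean distance to the nearest neighbors, is an assumption made for illustration, and the names `knn_density`, `seed_centers`, and `lloyd` are hypothetical rather than taken from the paper.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_density(points, n_neighbors=5):
    """ASSUMPTION: density = 1 / mean distance to the n_neighbors nearest points.

    This is a common local density estimate, not the paper's definition.
    """
    densities = []
    for i, p in enumerate(points):
        dists = sorted(
            math.sqrt(sq_dist(p, q)) for j, q in enumerate(points) if j != i
        )
        nearest = dists[:n_neighbors]
        densities.append(1.0 / (sum(nearest) / len(nearest) + 1e-12))
    return densities


def seed_centers(points, k, rng, weights=None):
    """k-means++ seeding; with weights=None this is plain D^2 sampling.

    ASSUMPTION: multiplying the D^2 term by a per-point weight is one
    hypothetical way to bring density into the seeding step.
    """
    n = len(points)
    w = [1.0] * n if weights is None else list(weights)
    centers = [points[rng.choices(range(n), weights=w)[0]]]
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        probs = [d * wi for d, wi in zip(d2, w)]
        idx = rng.choices(range(n), weights=probs)[0]
        centers.append(points[idx])
        # Keep, for every point, the squared distance to its closest center.
        d2 = [min(d, sq_dist(p, points[idx])) for d, p in zip(d2, points)]
    return centers


def lloyd(points, centers, n_iter=50):
    """Standard Lloyd iterations: assign to nearest center, then recompute means."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(n_iter):
        labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        new_centers = []
        for j, c in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(c)  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```

A typical call would build the weights with `knn_density(points)`, pass them to `seed_centers(points, k, random.Random(0), weights=...)`, and hand the resulting seeds to `lloyd`. Whether such a weighting helps on a given imbalanced dataset is an empirical question that this sketch does not settle.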




Corresponding author

Correspondence to Yi Chai.


Copyright information

© 2020 Springer Nature Singapore Pte Ltd.


Cite this paper

Fan, L., Chai, Y., Li, Y. (2020). A Density-Based k-Means++ Algorithm for Imbalanced Datasets Clustering. In: Jia, Y., Du, J., Zhang, W. (eds) Proceedings of 2019 Chinese Intelligent Systems Conference. CISC 2019. Lecture Notes in Electrical Engineering, vol 594. Springer, Singapore. https://doi.org/10.1007/978-981-32-9698-5_5
