Subset K-Means Approach for Handling Imbalanced-Distributed Data

Kumar, Ch. N. Santhosh; Rao, K. Nageswara; Govardhan, A.; Sandhya, N.

doi:10.1007/978-3-319-13731-5_54

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 338)

Abstract

The effectiveness of cluster analysis depends not only on the assumed number of clusters but also on the class distribution of the data. This paper takes a further step toward overcoming a drawback of K-means: its lack of robustness to imbalanced data distributions. K-means is a well-known partitional clustering technique, widely used for its low computational cost. However, its performance tends to degrade on skewed, i.e., imbalanced, data distributions. It often produces clusters of relatively uniform size even when the true clusters in the input data vary in size, a behavior called the “uniform effect.” In this paper, we analyze the causes of this effect and illustrate that it is particularly likely to arise in the k-means clustering process. As the minority class decreases in size, the “uniform effect” becomes more evident. To counter the “uniform effect”, we revisit the well-known K-means algorithm and provide a general method for properly clustering imbalanced data.
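
The “uniform effect” described above can be reproduced on synthetic data. The following is a minimal sketch, not taken from the paper: the two Gaussian classes, their sizes (1000 and 20), their locations, and the helper name `kmeans` are all assumptions chosen for illustration. It runs plain Lloyd-style k-means with k = 2, starting from the true class means, and compares the resulting cluster sizes with the true class sizes.

```python
import numpy as np

def kmeans(X, init_centroids, max_iter=200):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Synthetic imbalanced data (assumed setup): a broad majority class
# and a small, compact minority class located nearby.
rng = np.random.default_rng(0)
majority = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(1000, 2))
minority = rng.normal(loc=[4.0, 0.0], scale=0.3, size=(20, 2))
X = np.vstack([majority, minority])

# Start from the true class means, the most favorable initialization.
labels, _ = kmeans(X, [majority.mean(axis=0), minority.mean(axis=0)])
sizes = np.bincount(labels, minlength=2)
print("true class sizes:     ", [len(majority), len(minority)])
print("k-means cluster sizes:", sizes.tolist())
```

Even with this favorable start, the smaller k-means cluster ends up far larger than the 20-point minority class, because splitting the broad majority class lowers the within-cluster sum of squares more than isolating the minority does.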

The proposed algorithm consists of a novel random under-sampling subset generation technique, in which the number of subsets is defined according to the unique properties of the dataset. We conduct experiments on ten UCI datasets from various application domains, with five algorithms used for comparison on eight evaluation metrics. The experimental results show that the proposed approach has several distinctive advantages over the original k-means and other clustering methods.
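
The abstract does not spell out how the subsets are generated, so the following is only one plausible reading, offered as a hedged sketch and not as the authors' algorithm. It assumes that class labels are available, that the number of subsets is set by the imbalance ratio (majority size divided by minority size, rounded up), and that each subset pairs a random, disjoint slice of the majority class with the whole minority class. The function name `make_balanced_subsets` and all data sizes are hypothetical.

```python
import math
import numpy as np

def make_balanced_subsets(y, minority_label, rng):
    """Return index arrays, one per subset (hypothetical rule, see above).

    Each subset holds a random, disjoint chunk of the majority class
    together with every minority-class sample.
    """
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    # Assumed rule: one subset per minority-sized slice of the majority.
    n_subsets = max(1, math.ceil(len(majority_idx) / len(minority_idx)))
    chunks = np.array_split(rng.permutation(majority_idx), n_subsets)
    return [np.concatenate([chunk, minority_idx]) for chunk in chunks]

# Illustrative labels: 1000 majority samples (0) and 20 minority samples (1).
rng = np.random.default_rng(0)
y = np.array([0] * 1000 + [1] * 20)
subsets = make_balanced_subsets(y, minority_label=1, rng=rng)
print(len(subsets), "subsets, each with", len(subsets[0]), "samples")
```

Under these assumptions each subset is roughly balanced, so k-means could be run on each one separately without the majority class dominating the centroid updates. How the per-subset results would be combined is not stated in the abstract and is left open here.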

Author information

Authors and Affiliations

Dept. of CSE, JNTU, Hyderabad, Telangana, India
Ch. N. Santhosh Kumar
PSCMR college of Engineering and Technology, Kothapet, Vijayawada, A.P., India
K. Nageswara Rao
CSE, SIT, JNTU, Hyderabad, Telangana, India
A. Govardhan
CSE Department, VNR Vignana Jyothi Institute of Engineering & Technology, Hyderabad, India
N. Sandhya

Corresponding author

Correspondence to Ch. N. Santhosh Kumar.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Anil Neerukonda Institute of Technology and Sciences, Vishakapatnam, India
Suresh Chandra Satapathy
School of Information Technology, Jawaharlal Nehru Technological University Hyderabad, Hyderabad, India
A. Govardhan
Department of CSE, Computer Society of India, Hyderabad, India
K. Srujan Raju
University of Kalyani, Kalyani, West Bengal, India
J. K. Mandal

About this paper

Cite this paper

Kumar, C.N.S., Rao, K.N., Govardhan, A., Sandhya, N. (2015). Subset K-Means Approach for Handling Imbalanced-Distributed Data. In: Satapathy, S., Govardhan, A., Raju, K., Mandal, J. (eds) Emerging ICT for Bridging the Future - Proceedings of the 49th Annual Convention of the Computer Society of India CSI Volume 2. Advances in Intelligent Systems and Computing, vol 338. Springer, Cham. https://doi.org/10.1007/978-3-319-13731-5_54

DOI: https://doi.org/10.1007/978-3-319-13731-5_54
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13730-8
Online ISBN: 978-3-319-13731-5
eBook Packages: Engineering, Engineering (R0)
