Undersampled $K$-means approach for handling imbalanced distributed data

Kumar, N. Santhosh; Rao, K. Nageswara; Govardhan, A.; Reddy, K. Sudheer; Mahmood, Ali Mirza

doi:10.1007/s13748-014-0045-6

Regular Paper
Published: 08 April 2014

Volume 3, pages 29–38 (2014)
Progress in Artificial Intelligence

N. Santhosh Kumar¹,
K. Nageswara Rao²,
A. Govardhan³,
K. Sudheer Reddy⁴ &
…
Ali Mirza Mahmood⁵

Abstract

$K$-means is a partitional clustering technique that is well known and widely used for its low computational cost. However, the performance of the $K$-means algorithm tends to be affected by skewed data distributions, i.e., imbalanced data. It often produces clusters of relatively uniform size even when the input data have clusters of varied size, which is called the “uniform effect”. In this paper, we analyze the causes of this effect and illustrate that it is more likely to occur in the $K$-means clustering process. As the minority class decreases in size, the “uniform effect” becomes more evident. To counter the “uniform effect”, we revisit the well-known $K$-means algorithm and provide a general method to properly cluster imbalanced distributed data. The proposed algorithm consists of a novel undersampling technique implemented by intelligently removing noisy and weak instances from the majority class. We conduct experiments on twelve UCI datasets from various application domains, using five algorithms for comparison on eight evaluation metrics. Experimental results show the effectiveness of the proposed clustering algorithm in clustering balanced and imbalanced data.
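The “uniform effect” described in the abstract can be illustrated with a minimal sketch. The code below is not the algorithm proposed in the paper, whose details are not given in the abstract. It assumes synthetic two-dimensional Gaussian data (950 majority and 50 minority points), a plain Lloyd-style $K$-means initialized at the true class means, and a hypothetical stand-in undersampling rule that keeps only the majority points closest to the majority centroid. The names `kmeans`, `undersample`, `cluster` and `cv` are illustrative. The coefficient of variation (CV) of the cluster sizes is used here as a simple measure of how uniform the resulting clusters are.

```python
import math
import random


def mean_point(pts):
    """Component-wise mean of a list of 2-D points."""
    return tuple(sum(x) / len(pts) for x in zip(*pts))


def kmeans(points, centroids, iters=100):
    """Plain Lloyd iterations from the given initial centroids."""
    centroids = list(centroids)
    assign = []
    for _ in range(iters):
        new_assign = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = mean_point(members)
    return assign, centroids


def cv(sizes):
    """Coefficient of variation of cluster sizes (sample std / mean)."""
    mean = sum(sizes) / len(sizes)
    var = sum((s - mean) ** 2 for s in sizes) / (len(sizes) - 1)
    return math.sqrt(var) / mean


def undersample(maj, keep):
    """ASSUMED stand-in rule: keep the `keep` majority points nearest
    to the majority centroid, treating far-away points as weak/noisy."""
    center = mean_point(maj)
    return sorted(maj, key=lambda p: math.dist(p, center))[:keep]


def cluster(maj, mino):
    """Run 2-means from the true class means; cluster 1 starts at the
    minority mean. Returns cluster sizes and the number of majority
    points that end up in cluster 1."""
    assign, _ = kmeans(maj + mino, [mean_point(maj), mean_point(mino)])
    sizes = [assign.count(0), assign.count(1)]
    absorbed = sum(1 for a in assign[:len(maj)] if a == 1)
    return sizes, absorbed


rng = random.Random(0)  # fixed seed so the run is repeatable
majority = [(rng.gauss(0, 1.5), rng.gauss(0, 1.5)) for _ in range(950)]
minority = [(rng.gauss(4, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]

true_cv = cv([len(majority), len(minority)])
sizes_before, absorbed_before = cluster(majority, minority)

core = undersample(majority, 200)
sizes_after, absorbed_after = cluster(core, minority)

print("true class sizes:", [len(majority), len(minority)], "CV:", round(true_cv, 3))
print("K-means sizes on full data:", sizes_before, "CV:", round(cv(sizes_before), 3))
print("majority points absorbed before undersampling:", absorbed_before)
print("K-means sizes after undersampling:", sizes_after)
print("majority points absorbed after undersampling:", absorbed_after)
```

Even with the most favourable initialization, the cluster seeded at the minority mean tends to pull in a large share of the majority class, so the size CV of the clustering falls well below that of the true classes. Thinning the majority class first reduces this absorption in this synthetic setting. Whether a comparable gain holds on real data depends on the actual removal criterion, which this sketch does not reproduce.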

Author information

Authors and Affiliations

Department of CSE, JNTU, Hyderabad, Andhra Pradesh, India
N. Santhosh Kumar
PSCMR College of Engineering and Technology, Kothapet, Vijayawada, Andhra Pradesh, India
K. Nageswara Rao
CSE and SIT, JNTU, Hyderabad, Andhra Pradesh, India
A. Govardhan
Infosys, Hyderabad, Andhra Pradesh, India
K. Sudheer Reddy
DMS SVH College of Engineering, Machilipatnam, Andhra Pradesh, India
Ali Mirza Mahmood

Corresponding author

Correspondence to N. Santhosh Kumar.

About this article

Cite this article

Kumar, N.S., Rao, K.N., Govardhan, A. et al. Undersampled $K$-means approach for handling imbalanced distributed data. Prog Artif Intell 3, 29–38 (2014). https://doi.org/10.1007/s13748-014-0045-6

Received: 04 January 2014
Accepted: 16 March 2014
Published: 08 April 2014
Issue Date: August 2014
DOI: https://doi.org/10.1007/s13748-014-0045-6
