An Empirical Comparative Study of Novel Clustering Algorithms for Class Imbalance Learning

  • Ch. N. Santhosh Kumar
  • K. Nageswara Rao
  • A. Govardhan
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 380)


Data mining is the process of discovering knowledge from large data sources. Classification and clustering are its two broad branches of study. In clustering, the K-means algorithm is one of the benchmark algorithms and is used in numerous applications; its popularity is due to its efficiency and low memory usage. One shortcoming of K-means is that its performance degrades when it is applied to data with an imbalanced distribution. K-means tends to generate clusters of relatively uniform size even when the true clusters in the input data differ in size, a behavior referred to in the literature as the “uniform effect.” This paper proposes several novel algorithms to address this problem and compares them with each other. Experiments on eleven UCI datasets, assessed with evaluation metrics, show that the proposed algorithms are effective in countering the “uniform effect.”
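The uniform effect can be reproduced on synthetic data. The following is a minimal sketch only, not one of the paper's proposed algorithms and not run on its UCI datasets. It assumes two overlapping one-dimensional Gaussian classes of 900 and 100 points and a plain Lloyd-style K-means; the function name `kmeans_1d` and all parameter values are hypothetical choices made for illustration.

```python
import random


def kmeans_1d(points, k, iters=100, seed=0):
    """Plain Lloyd's K-means on 1-D data (illustrative helper, not from the paper)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k randomly chosen data points
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: abs(p - centroids[j])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = sum(members) / len(members)
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centroids, labels


# Assumed synthetic data: a 900-point majority class and a 100-point minority class.
data_rng = random.Random(42)
majority = [data_rng.gauss(0.0, 1.0) for _ in range(900)]
minority = [data_rng.gauss(3.0, 1.0) for _ in range(100)]
points = majority + minority

centroids, labels = kmeans_1d(points, k=2)
sizes = sorted(labels.count(j) for j in range(2))

print("true class sizes:   ", [len(minority), len(majority)])
print("K-means cluster sizes:", sizes)
```

Under these assumptions the boundary between the two clusters sits at the midpoint of the centroids, which pulls part of the majority class into the smaller cluster, so the reported cluster sizes are more even than the true 100/900 split.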


Imbalanced data; K-means clustering algorithms; Oversampling; Uniform effect



Copyright information

© Springer India 2016

Authors and Affiliations

  • Ch. N. Santhosh Kumar (1)
  • K. Nageswara Rao (2)
  • A. Govardhan (3)
  1. Department of CSE, JNTU-Hyderabad, Hyderabad, India
  2. PSCMR College of Engineering and Technology, Vijayawada, India
  3. CSE & SIT, JNTU Hyderabad, Hyderabad, India
