A hybrid approach using genetic algorithm and the differential evolution heuristic for enhanced initialization of the k-means algorithm with applications in text clustering

  • Methodologies and Application
  • Soft Computing

Abstract

In this paper, we propose a heuristic-based algorithm to improve the initial seeding of the k-means clustering algorithm. The proposed algorithm aims primarily to improve the initial choice of centroids used by k-means, and also to ensure that the requisite number of clusters is returned in every run. Its use thus significantly reduces the possibility of k-means converging to a locally optimal solution. The paper uses the genetic algorithm framework to obtain the initial seed points and couples it with the differential evolution heuristic to obtain the requisite number of clusters. We examine the performance of the proposed algorithm on the clustering of text documents, since such corpora often contain a very large number of data points and also require the formation of a large number of clusters. The results are compared with basic implementations of the k-means algorithm using standard parameters.
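
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a genetic algorithm whose chromosomes are sets of k data-point indices scored by within-cluster squared error, and a differential-evolution-style repair step that replaces any centroid owning no points with a mutant of the form a + F(b - c). All function names (ga_seed, de_repair, kmeans), the parameter values, and the choice of data points as donors are hypothetical.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Index of the nearest centroid for every point (ties go to the lower index)."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def sse(points, centroids):
    """Within-cluster sum of squared errors; lower is better."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def ga_seed(points, k, rng, pop_size=20, generations=30, mutation_rate=0.2):
    """Assumed GA: each chromosome is a list of k distinct point indices."""
    n = len(points)
    cost = lambda chrom: sse(points, [points[i] for i in chrom])
    pop = [rng.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[:pop_size // 2]            # truncation selection, keeps the best
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = list(dict.fromkeys(a + b))    # union of parent genes, no duplicates
            child = rng.sample(pool, k)          # crossover: draw k genes from the union
            if rng.random() < mutation_rate:     # mutation: swap in a random point
                new = rng.randrange(n)
                if new not in child:
                    child[rng.randrange(k)] = new
            children.append(child)
        pop = parents + children
    best = min(pop, key=cost)
    return [list(points[i]) for i in best]

def de_repair(points, centroids, rng, F=0.5, max_tries=100):
    """Assumed repair: replace an empty cluster's centroid with a + F*(b - c).

    Modifies centroids in place. The number of attempts is bounded, so this
    sketch does not strictly guarantee that every cluster ends up non-empty.
    """
    for _ in range(max_tries):
        owned = set(assign(points, centroids))
        empty = [j for j in range(len(centroids)) if j not in owned]
        if not empty:
            break
        a, b, c = rng.sample(points, 3)          # three data points act as donors
        centroids[empty[0]] = [x + F * (y - z) for x, y, z in zip(a, b, c)]
    return centroids

def kmeans(points, k, seed=0, iters=50):
    """Lloyd-style k-means started from GA seeds, with the repair step each pass."""
    rng = random.Random(seed)
    centroids = ga_seed(points, k, rng)
    for _ in range(iters):
        centroids = de_repair(points, centroids, rng)
        labels = assign(points, centroids)
        new = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new.append([sum(col) / len(members) for col in zip(*members)]
                       if members else centroids[j])
        if new == centroids:                     # no centroid moved: converged
            break
        centroids = new
    return centroids, assign(points, centroids)
```

Calling kmeans(points, k) on a list of equal-length numeric tuples returns the final centroids and one cluster label per point. For text clustering, the points would be document vectors (for example tf-idf weights), and a cosine-based distance would likely be a better fit than the squared Euclidean distance assumed here.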

Author information

Corresponding author

Correspondence to D. Mustafi.

Ethics declarations

Conflict of interest

The authors hereby declare that they have no conflict of interest.

Human participants or animals

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mustafi, D., Sahoo, G. A hybrid approach using genetic algorithm and the differential evolution heuristic for enhanced initialization of the k-means algorithm with applications in text clustering. Soft Comput 23, 6361–6378 (2019). https://doi.org/10.1007/s00500-018-3289-4

  • DOI: https://doi.org/10.1007/s00500-018-3289-4
