Advertisement

Springer Nature is making SARS-CoV-2 and COVID-19 research free. View research | View latest news | Sign up for updates

Harmony K-means algorithm for document clustering

Abstract

Fast and high quality document clustering is a crucial task in organizing information, search engine results, enhancing web crawling, and information retrieval or filtering. Recent studies have shown that the most commonly used partition-based clustering algorithm, the K-means algorithm, is more suitable for large datasets. However, the K-means algorithm can generate a local optimal solution. In this paper we propose a novel Harmony K-means Algorithm (HKA) that deals with document clustering based on Harmony Search (HS) optimization method. It is proved by means of finite Markov chain theory that the HKA converges to the global optimum. To demonstrate the effectiveness and speed of HKA, we have applied HKA algorithms on some standard datasets. We also compare the HKA with other meta-heuristic and model-based document clustering approaches. Experimental results reveal that the HKA algorithm converges to the best known optimum faster than other methods and the quality of clusters are comparable.

This is a preview of subscription content, log in to check access.

References

  1. Aggarwal CC, Gates SC, Yu PS (1999) On the merits of building categorization systems by supervised clustering. In: Proceedings of the fifth ACM SIGKDD Int’l conference on knowledge discovery and data mining, pp 352–356

  2. Anderberg MR (1973) Cluster analysis for applications. Academic Press Inc, New York, NY

  3. Boley D, Gini M, Gross R, Han EH, Hastings K, Karypis G et al (1999) Partitioning-based clustering for web document categorization. Decis Support Syst 27(3): 329–341. doi:10.1016/S0167-9236(99)00055-X

  4. Cios K, Pedrycs W, Swiniarski R (1998) Data mining methods for knowledge discovery. Kluwer Academic Publishers

  5. Coello CAC (2000) Constraint-handling using an evolutionary multi objective optimization technique. Civ Eng Environ Syst 17: 319–346. doi:10.1080/02630250008970288

  6. Cui X, Potok TE, Palathingal P (2005) Document clustering using particle swarm optimization. In: IEEE swarm intelligence symposium, pp 185–191

  7. Cutting DR, Pedersen JO, Karger DR, Tukey JW (1992) Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the ACM SIGIR. Copenhagen, pp 318–329

  8. Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Knowledge discovery and data mining, pp 269–274

  9. Everitt B (1980) Cluster analysis, 2nd edn. Halsted Press, New York

  10. Geem ZW, Kim JH, Loganathan GV (2002) Harmony search optimization: application to pipe network design. Int J Model Simul 22: 125–133

  11. Geem ZW, Tseng C, Park Y (2005) Harmony search for generalized orienteering problem: best touring in China, Springer. Lect Notes Comput Sci 3412: 741–750

  12. Grira N, Crucianu M, Boujemaa N (2005) Unsupervised and semi-supervised clustering: a brief survey. In: 7th ACM SIGMM international workshop on multimedia information retrieval, pp 9–16

  13. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 264–323. CSUR. doi:10.1145/331499.331504

  14. Jones G, Robertson AM, Santimetvirul C, Willett P (1995) Non-hierarchic document clustering using a genetic algorithm. Inform Res 1(1)

  15. Kennedy J, Eberhart RC, Shi Y (2001) Swarm intelligence. Morgan Kaufmann, New York

  16. Klein RW, Dubes RC (1989) Experiments in projection and clustering by simulated annealing. Pattern Recognit 22: 213–220. doi:10.1016/0031-3203(89)90067-8

  17. Labroche N, Monmarche’ N, Venturini G (2003) AntClust: ant clustering and web usage mining. In: Genetic and evolutionary computation conference, pp 25–36

  18. Larsen B, Aone C (1999) Fast and effective text mining using linear-time document clustering. In: Proceedings of the fifth ACM SIGKDD Int’l conference on knowledge discovery and data mining, pp 16–22

  19. Lee KS, Geem ZW (2004) A new meta-heuristic algorithm for continues engineering optimization: harmony search theory and practice. Comput Method Appl Mech Eng 194: 3902–3933. doi:10.1016/j.cma.2004.09.007

  20. Mahdavi M, Fesanghary M, Damangir E (2007) An improved harmony search algorithm for solving optimization problems. Appl Math Comput 188: 1567–1579

  21. McQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth berkeley symposium on mathematical statistics and probability, pp 281–297

  22. Merwe VD, Engelbrecht AP (2003) Data clustering using particle swarm optimization. In: Proceedings of IEEE congress on evolutionary computation (CEC 2003), Australia, pp 215–220

  23. Omran M, Salman A, Engelbrecht AP (2002) Image classification using particle swarm optimization. In: Proceedings of the 4th Asia-Pacific conference on simulated evolution and learning (SEAL 2002), Singapore, pp 370–374

  24. Quinlan RJ (1993) C4.5: programs for machine learning. Morgan Kaufmann

  25. Raghavan VV, Birchand K (1979) A clustering strategy based on a formalism of the reproductive process in a natural system. In: Proceedings of the second international conference on information storage and retrieval, pp 10–22

  26. Rudolph G (1994) Convergence analysis of canonical genetic algorithms. IEEE Trans Neural Netw 5(1): 96–101. doi:10.1109/72.265964

  27. Salton G (1989) Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley

  28. Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manage 24: 513–523. doi:10.1016/0306-4573(88)90021-0

  29. Selim SZ, Ismail MA (1984) K-means type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell 6: 81–87

  30. Steinbach M, Karypis G, Kumar V (2000a) A comparison of document clustering techniques. KDD’2000. Technical report of University of Minnesota

  31. Steinbach M, Karypis G, Kumar V (2000b) A comparison of document clustering techniques. In: KDD workshop on text mining

  32. Stumme G, Hotho A, Berendt B (2001) Semantic Web Mining. Freiburg, September 3rd, 12th Europ. Conf. on Machine Learning (ECML’01)/5th Europ. Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD’01)

  33. Stumme G, Hotho A, Berendt B (2006) Semantic Web Mining State of the art and future directions. J Web Semantics: Sci Serv Agents World Wide Web 124–143

  34. TREC (1999) Text REtrieval conference. http://trec.nist.gov

  35. TRECQ (1999) Text REtrieval conference relevance judgments. http://trec.nist.gov/data/qrels_eng/index.html

  36. Zhao Y, Karypis G (2004) Empirical and theoretical comparisons of selected riterion functions for document clustering. Mach Learn 55(3): 311–331. doi:10.1023/B:MACH.0000027785.44527.d6

  37. Zhao Y, Karypis G (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10: 141–168. doi:10.1007/s10618-005-0361-3

  38. Zhong S (2006) Semi-supervised model-based document clustering: a comparative study. Mach Learn 65(1). doi:10.1007/s10994-006-6540-7

  39. Zhong S, Ghosh J (2003) A unified framework for model-based lustering. J Mach Learn Res 4: 1001–1037. JMLR doi:10.1162/jmlr.2003.4.6.1001

  40. Zhong S, Ghosh J (2005) Generative model-based clustering of documents: a comparative study. Knowl Inf Syst 8: 374–384. KAIS doi:10.1007/s10115-004-0194-1

Download references

Author information

Correspondence to Mehrdad Mahdavi.

Additional information

Responsible editor: Charu Aggarwal.

Rights and permissions

Reprints and Permissions

About this article

Cite this article

Mahdavi, M., Abolhassani, H. Harmony K-means algorithm for document clustering. Data Min Knowl Disc 18, 370–391 (2009). https://doi.org/10.1007/s10618-008-0123-0

Download citation

Keywords

  • Document clustering
  • Markov chain
  • Harmony search
  • Global optimization
  • K-means algorithm