High performance genetic algorithm based text clustering using parts of speech and outlier elimination

Applied Intelligence

Abstract

Among typical clustering methods, the K-means algorithm plays the most important role because of its simplicity and efficiency. However, it is sensitive to the choice of initial points and prone to falling into local optima. To avoid this flaw, this paper presents a patented text clustering algorithm, Clustering by Genetic Algorithm Model (CGAM). Because a genetic algorithm (GA) simulates the natural evolutionary process and can handle a larger search space, CGAM constructs a GA fitness function and a convergence criterion for the K-means algorithm. To handle the rich semantics of Chinese texts, CGAM introduces a new method for selecting the initial centers of the GA and accounts for the differing contributions of features from different parts of speech. The impact of outliers is also addressed. Performance is demonstrated by a series of experiments on both Reuters-21578 and a Chinese text corpus. Experimental results show that CGAM achieves better clustering results than other GA-based K-means algorithms, and it has been successfully applied in a national business intelligence system program handling a very large body of content in both Chinese and English.
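
The abstract does not spell out CGAM's fitness function, its part-of-speech weighting, or its outlier treatment, so the following is only a minimal generic sketch of how a GA can be wrapped around K-means to reduce sensitivity to initial centers. It is not the patented CGAM. All names and parameters (`ga_kmeans`, `pop_size`, `mutation_rate`, the use of the within-cluster sum of squared distances as the fitness measure, truncation selection, one-point crossover) are assumptions chosen for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def sse(points, centers):
    """Within-cluster sum of squared distances (assumed fitness; lower is better)."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def kmeans_step(points, centers):
    """One K-means iteration: reassign points, then recompute each center."""
    labels = assign(points, centers)
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centers.append(old)  # an empty cluster keeps its previous center
    return new_centers

def ga_kmeans(points, k, pop_size=12, generations=20, mutation_rate=0.2, seed=0):
    """Hypothetical GA wrapper: each chromosome is a list of k candidate centers."""
    rng = random.Random(seed)
    population = [rng.sample(points, k) for _ in range(pop_size)]
    for _ in range(generations):
        # Refine every chromosome with one K-means step, then rank by fitness.
        population = [kmeans_step(points, c) for c in population]
        population.sort(key=lambda c: sse(points, c))
        survivors = population[: pop_size // 2]  # truncation selection keeps the best half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randint(1, k - 1) if k > 1 else 0
            child = a[:cut] + b[cut:]  # one-point crossover on the center list
            if rng.random() < mutation_rate:
                # Mutation: reset one center to a randomly chosen data point.
                child[rng.randrange(k)] = rng.choice(points)
            children.append(child)
        population = survivors + children
    best = min(population, key=lambda c: sse(points, c))
    return best, assign(points, best)

if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
    centers, labels = ga_kmeans(data, k=2)
    print(centers, labels)
```

Keeping several candidate center sets alive at once is what lets this kind of hybrid escape a poor initialization that would trap a single K-means run. A text clustering system would replace the toy 2-D points with document feature vectors and would likely use a similarity measure suited to text.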

Acknowledgements

This research is partially supported by the National Natural Science Foundation under grant numbers 60970107 and 61073150. The real-world text clustering application, an outline-based theme report system for business intelligence (www.wasuo.com), is sponsored by the National Incubation Center. We would like to express our sincere thanks.

Author information

Corresponding author

Correspondence to Kansheng Shi.

About this article

Cite this article

Shi, K., Li, L. High performance genetic algorithm based text clustering using parts of speech and outlier elimination. Appl Intell 38, 511–519 (2013). https://doi.org/10.1007/s10489-012-0382-8

  • DOI: https://doi.org/10.1007/s10489-012-0382-8
