Abstract
Among typical clustering methods, the K-means algorithm plays a central role because of its simplicity and efficiency. However, it is sensitive to the choice of initial points and prone to converging to a local optimum. To avoid this flaw, this paper presents a patented text clustering algorithm, Clustering by Genetic Algorithm Model (CGAM). CGAM constructs a genetic algorithm (GA) fitness function and a convergence criterion for the K-means algorithm, because a GA simulates the natural evolutionary process and can explore a larger search space. To handle the rich semantics of Chinese texts, CGAM introduces a new method for selecting the GA's initial centers and accounts for the differing contributions of features from different parts of speech. The impact of outliers is also addressed. The algorithm's performance is demonstrated by a series of experiments on both Reuters-21578 and a Chinese text corpus. Experimental results show that CGAM achieves better clustering results than other GA-based K-means algorithms. It has also been successfully applied in a national business intelligence system program that handles a very large body of content in both Chinese and English.
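The general idea of using a GA to choose K-means starting centers can be illustrated with a minimal sketch. This is not the patented CGAM: it omits part-of-speech weighting, outlier handling, and the paper's specific fitness function and convergence criterion, none of which are given in the abstract. The function names, the inverse-SSE fitness, tournament selection, and all parameter values are assumptions chosen for illustration.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sse(points, centers):
    # Within-cluster sum of squared distances to the nearest center.
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def kmeans(points, centers, max_iter=50):
    # Plain Lloyd iterations starting from the given centers.
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def fitness(points, chromosome):
    # Assumed fitness: inverse SSE of the centers the chromosome encodes.
    return 1.0 / (1.0 + sse(points, [points[i] for i in chromosome]))

def ga_kmeans(points, k, pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    # A chromosome is a list of k indices into `points` (candidate initial centers).
    rng = random.Random(seed)
    n = len(points)
    population = [rng.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=lambda ch: fitness(points, ch), reverse=True)
        next_gen = scored[:2]  # elitism: carry over the two fittest chromosomes
        while len(next_gen) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            a = max(rng.sample(scored, 3), key=lambda ch: fitness(points, ch))
            b = max(rng.sample(scored, 3), key=lambda ch: fitness(points, ch))
            cut = rng.randrange(1, k) if k > 1 else 0  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: replace a gene with a random point index.
            child = [rng.randrange(n) if rng.random() < mutation_rate else g for g in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=lambda ch: fitness(points, ch))
    # Refine the best candidate centers with ordinary K-means.
    return kmeans(points, [points[i] for i in best])
```

Calling `ga_kmeans(points, k=2)` on a list of coordinate tuples returns the refined centers. The GA only searches over starting points here; K-means still performs the final refinement, which is one common way of combining the two methods.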
Acknowledgements
This research is partially supported by the National Natural Science Foundation under grant numbers 60970107 and 61073150. The real-world text clustering application, an outline-based theme report system for business intelligence (www.wasuo.com), is sponsored by the National Incubation Center. We would like to express our sincere thanks.
Cite this article
Shi, K., Li, L. High performance genetic algorithm based text clustering using parts of speech and outlier elimination. Appl Intell 38, 511–519 (2013). https://doi.org/10.1007/s10489-012-0382-8