High performance genetic algorithm based text clustering using parts of speech and outlier elimination

Shi, Kansheng; Li, Leming

doi:10.1007/s10489-012-0382-8

Published: 09 September 2012

Volume 38, pages 511–519, (2013)

Applied Intelligence

Kansheng Shi¹ &
Leming Li²

516 Accesses
18 Citations

Abstract

Among typical clustering methods, the K-means algorithm plays the most important role because of its simplicity and efficiency. However, it is sensitive to the choice of initial points and prone to converging to a local optimum. To avoid this flaw, this paper presents a patented text clustering algorithm, Clustering by Genetic Algorithm Model (CGAM). CGAM constructs a fitness function for the genetic algorithm (GA) and a convergence criterion for the K-means algorithm, because a GA simulates the natural evolutionary process and can explore a larger search space. To handle the rich semantics of Chinese texts, CGAM introduces a new method for selecting the initial centers of the GA and accounts for the different contributions of features from different parts of speech. The impact of outliers is also addressed. Performance is demonstrated in a series of experiments on both Reuters-21578 and a Chinese text corpus. The experimental results show that CGAM achieves better clustering results than other GA-based K-means algorithms, and it has been successfully applied in a national business intelligence system program that handles very large collections of Chinese and English content.
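The general idea of wrapping K-means in a genetic algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not the CGAM algorithm itself: the abstract does not give the fitness function, convergence criterion, initial-center selection, part-of-speech weighting or outlier handling, so none of them is modeled here. The function names (`ga_kmeans`, `kmeans_step`, `sse`), the fitness `1 / (1 + SSE)`, the uniform crossover over whole centers and all parameter defaults are hypothetical choices made for the example.

```python
import numpy as np

def assign(X, centers):
    """Return the index of the nearest center for every row of X."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def sse(X, centers):
    """Within-cluster sum of squared distances (lower is better)."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

def kmeans_step(X, centers):
    """One K-means update; an empty cluster keeps its old center."""
    labels = assign(X, centers)
    new = centers.copy()
    for j in range(len(centers)):
        members = X[labels == j]
        if len(members):
            new[j] = members.mean(axis=0)
    return new

def ga_kmeans(X, k, pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    """Generic GA over sets of k centers (an assumed design, not CGAM)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Each chromosome is a (k, d) array of centers drawn from the data.
    pop = [X[rng.choice(n, size=k, replace=False)].astype(float)
           for _ in range(pop_size)]
    best, best_cost = None, np.inf
    for _ in range(generations):
        # Local refinement: one K-means step per chromosome.
        pop = [kmeans_step(X, c) for c in pop]
        costs = np.array([sse(X, c) for c in pop])
        i = costs.argmin()
        if costs[i] < best_cost:
            best, best_cost = pop[i].copy(), costs[i]
        # Assumed fitness: smaller SSE gives larger selection probability.
        fitness = 1.0 / (1.0 + costs)
        probs = fitness / fitness.sum()
        children = [best.copy()]  # elitism: carry the best solution forward
        while len(children) < pop_size:
            a, b = rng.choice(pop_size, size=2, p=probs)
            # Uniform crossover: each center comes from one of the parents.
            mask = rng.random(k) < 0.5
            child = np.where(mask[:, None], pop[a], pop[b])
            # Mutation: replace one center with a random data point.
            if rng.random() < mutation_rate:
                child[rng.integers(k)] = X[rng.integers(n)]
            children.append(child)
        pop = children
    return best, assign(X, best), best_cost
```

Calling `ga_kmeans(X, k)` on a document-by-feature matrix `X` returns the best centers found, the cluster label of each row and the corresponding sum of squared distances. Because several candidate sets of centers evolve in parallel, a single poor initialization is less likely to decide the final result than in plain K-means, which is the motivation the abstract gives for using a GA.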

Acknowledgements

This research is partially supported by the National Natural Science Foundation under grant numbers 60970107 and 61073150. The real-world text clustering application, an outline-based theme report system for business intelligence (www.wasuo.com), is sponsored by the National Incubation Center. We express our sincere thanks.

Author information

Authors and Affiliations

Shanghai Jiaotong University, Shanghai, 200240, China
Kansheng Shi
Chinese Academy of Engineering, Beijing, 100088, China
Leming Li

Corresponding author

Correspondence to Kansheng Shi.

About this article

Cite this article

Shi, K., Li, L. High performance genetic algorithm based text clustering using parts of speech and outlier elimination. Appl Intell 38, 511–519 (2013). https://doi.org/10.1007/s10489-012-0382-8

Issue Date: June 2013
DOI: https://doi.org/10.1007/s10489-012-0382-8
