An Improved Bisecting K-Means Text Clustering Method
The bisecting K-means method is a hierarchical algorithm for text clustering in which the choice of K and of the initial centroids affects the final clustering result. Chinese word segmentation is further complicated by vague words and ambiguous word boundaries. We transform the corpus into word vectors with word2vec, reduce the dimensionality of the data through ontology modeling, and clean the data with jieba word segmentation and TF-IDF to improve data accuracy. We propose an improved algorithm, based on hierarchical clustering and bisecting K-means, that clusters the data repeatedly until it converges. Experiments show that the clustering results of this method are better than those of the K-means and bisecting K-means clustering algorithms.
Keywords: Text clustering, Bisecting K-means, Ontology theory, Hierarchical clustering
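For readers unfamiliar with the baseline, generic bisecting K-means can be sketched as follows. This is a minimal pure-Python illustration of the standard algorithm, not the improved method proposed in this paper. The function names, the `trials` and `seed` parameters, and the rule of splitting the cluster with the largest sum of squared errors (SSE) are assumptions made for illustration. The toy 2-D points stand in for document vectors.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean (centroid) of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = mean(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, rng, iters=20):
    """Split points into two groups with plain 2-means (Lloyd iterations)."""
    centers = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            idx = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[idx].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split; the caller discards it
        new_centers = [mean(g) for g in groups]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return groups


def bisecting_kmeans(points, k, trials=5, seed=0):
    """Repeatedly bisect the highest-SSE cluster until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) >= 2]
        if not candidates:
            break
        target = max(candidates, key=sse)  # assumed selection rule
        best = None
        for _ in range(trials):
            a, b = two_means(target, rng)
            if not a or not b:
                continue
            cost = sse(a) + sse(b)
            if best is None or cost < best[0]:
                best = (cost, a, b)
        if best is None:
            break  # the chosen cluster could not be split
        clusters.remove(target)
        clusters.extend(best[1:])
    return clusters


# Toy data: three well-separated groups of 2-D points.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0), (10.1, 0.0), (10.0, 0.1),
]
clusters = bisecting_kmeans(points, k=3)
print(sorted(len(c) for c in clusters))
```

Running several 2-means trials per split and keeping the lowest-SSE result reduces, but does not remove, the sensitivity to initial centroids noted in the abstract.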
This work was partially supported by NSFC (No. 61807024).