A Text Clustering Approach of Chinese News Based on Neural Network Language Model

Fan, Zhaoxin; Chen, Shuoying; Zha, Li; Yang, Jiadong

doi:10.1007/s10766-014-0329-2

Published: 16 October 2014

International Journal of Parallel Programming, Volume 44, pages 198–206 (2016)

Zhaoxin Fan¹, Shuoying Chen¹, Li Zha² & Jiadong Yang³

Abstract

Text clustering plays an important role in data mining and machine learning. After years of development, clustering technology has produced a series of theories and methods. However, for text clustering of Chinese news, the mainstream LDA method suffers from high time complexity. To improve speed, this paper proposes a new method in which a neural network language model is applied to text clustering for the first time. Text clustering is first converted to its dual problem, word clustering. The neural network language model yields word vectors, which are used in fuzzy k-means clustering of the Chinese news keyword set. From the keyword clustering result, the text clustering result for Chinese news is obtained by a single transition. Experiments show that this method runs five times faster than LDA. The method has been successfully deployed in the Sohu news recommendation system.
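
The pipeline named in the abstract (word vectors, fuzzy k-means over keywords, then a single step from keyword clusters to document clusters) can be illustrated with a minimal sketch. The abstract does not give the authors' implementation, so everything below is an assumption made for illustration: the toy 2-D vectors stand in for real word vectors from a neural network language model, the English keywords are placeholders for Chinese news keywords, the fuzzy k-means is the textbook fuzzy c-means update, and the document step (summing the memberships of a document's keywords and taking the largest) is only one plausible reading of the "single transition".

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Textbook fuzzy c-means (not the paper's code).

    Returns (centers, U), where U[i, k] is the membership of
    point i in cluster k and every row of U sums to 1.
    """
    rng = np.random.default_rng(seed)
    # Start from random memberships, normalised per point.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Centers are membership-weighted means of the points.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every point to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


def cluster_document(doc_keywords, word_membership):
    """Assumed document step: sum keyword memberships, take the argmax.

    Returns -1 when none of the keywords has a known membership.
    """
    rows = [word_membership[w] for w in doc_keywords if w in word_membership]
    if not rows:
        return -1
    return int(np.argmax(np.sum(rows, axis=0)))


# Hypothetical keyword set with toy vectors (two obvious groups).
keywords = ["match", "goal", "coach", "stock", "market", "bank"]
vectors = np.array([
    [1.0, 0.1], [0.9, 0.2], [1.1, 0.0],   # sports-like words
    [0.0, 1.0], [0.1, 0.9], [0.2, 1.1],   # finance-like words
])

centers, U = fuzzy_c_means(vectors, n_clusters=2)
word_membership = dict(zip(keywords, U))

# Hypothetical documents, each represented only by its keywords.
docs = {
    "doc_a": ["match", "goal"],
    "doc_b": ["stock", "bank", "market"],
    "doc_c": ["coach", "goal", "market"],
}
labels = {name: cluster_document(kw, word_membership) for name, kw in docs.items()}
print(labels)
```

In this sketch the clustering cost depends on the size of the keyword vocabulary rather than on the number of documents, and each document is then labelled with one pass over its keywords. Whether this is the source of the speedup reported in the abstract cannot be confirmed from the abstract alone.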

Acknowledgments

We would like to thank the members of the ICT big data system team and the SOHU mobile research development team, especially Jian Lin and Wei Liu, for their helpful suggestions on this paper. This research was supported in part by the Hi-Tech Research and Development (863) Program of China (Grant No. 2013AA01A213).

Author information

Authors and Affiliations

School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
Zhaoxin Fan & Shuoying Chen
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Li Zha
Sohu.com Inc, Beijing, China
Jiadong Yang

Corresponding author

Correspondence to Zhaoxin Fan.

About this article

Cite this article

Fan, Z., Chen, S., Zha, L. et al. A Text Clustering Approach of Chinese News Based on Neural Network Language Model. Int J Parallel Prog 44, 198–206 (2016). https://doi.org/10.1007/s10766-014-0329-2

Received: 27 June 2014
Accepted: 07 October 2014
Published: 16 October 2014
Issue Date: February 2016
DOI: https://doi.org/10.1007/s10766-014-0329-2
