A Text Clustering Approach of Chinese News Based on Neural Network Language Model

International Journal of Parallel Programming

Abstract

Text clustering plays an important role in data mining and machine learning. After years of development, clustering research has produced a range of theories and methods. However, for clustering Chinese news text, the mainstream LDA method suffers from high time complexity. To improve speed, this paper proposes a new method that applies a neural network language model to text clustering for the first time. Text clustering is first converted into its dual problem, word clustering. The neural network language model yields word vectors, which are used to run fuzzy k-means on the Chinese news keyword set. From the keyword clustering result, the text clustering result for Chinese news is obtained in a single transition step. Experiments show that this method runs five times faster than LDA. The method has been deployed successfully in the Sohu news recommendation system.
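
The pipeline outlined in the abstract (word vectors, then fuzzy clustering of keywords, then a single step from keyword clusters to document clusters) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy 2-D word vectors, the English keywords, the helper names `fuzzy_c_means` and `cluster_documents`, and the averaging rule used for the keyword-to-document step are all hypothetical, since the abstract does not specify them. In practice the vectors would come from a trained neural network language model.

```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Plain fuzzy c-means (fuzzy k-means).

    Returns (centers, U), where U[i, j] is the membership of point i
    in cluster j and every row of U sums to 1.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], k))
    U /= U.sum(axis=1, keepdims=True)          # random soft assignment
    for _ in range(n_iter):
        W = U ** m                              # fuzzified weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)          # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

def cluster_documents(docs, membership):
    """Assumed transition step: average the cluster memberships of a
    document's keywords and pick the strongest cluster."""
    labels = {}
    for doc_id, words in docs.items():
        known = [membership[w] for w in words if w in membership]
        if not known:
            continue                            # no clustered keyword found
        labels[doc_id] = int(np.argmax(np.mean(known, axis=0)))
    return labels

# Hypothetical keyword vectors standing in for language-model embeddings.
keywords = ["football", "goal", "league", "stock", "market", "shares"]
vectors = np.array([
    [1.0, 0.1], [0.9, 0.2], [1.1, 0.0],        # sports-like region
    [0.0, 1.0], [0.1, 0.9], [0.2, 1.1],        # finance-like region
])

centers, U = fuzzy_c_means(vectors, k=2)
membership = dict(zip(keywords, U))

# Hypothetical documents, each reduced to its keyword list.
docs = {
    "doc_sport": ["football", "goal", "league"],
    "doc_finance": ["stock", "market", "shares"],
    "doc_mixed": ["goal", "league", "market"],
}
labels = cluster_documents(docs, membership)
print(labels)
```

Because the clustering runs over the keyword vocabulary rather than over the documents themselves, the document step reduces to a lookup and an average per document, which is one plausible reading of why the approach can be faster than fitting a topic model over the whole corpus.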

Acknowledgments

We would like to thank the members of the ICT big data system team and the SOHU mobile research and development team, especially Jian Lin and Wei Liu, for their helpful suggestions on this paper. This research is supported in part by the Hi-Tech Research and Development (863) Program of China (Grant No. 2013AA01A213).

Author information

Corresponding author

Correspondence to Zhaoxin Fan.

Cite this article

Fan, Z., Chen, S., Zha, L. et al. A Text Clustering Approach of Chinese News Based on Neural Network Language Model. Int J Parallel Prog 44, 198–206 (2016). https://doi.org/10.1007/s10766-014-0329-2
