KNN-Based Pseudo-supervised RCNN Framework for Text Clustering

  • Zhi Chen
  • Wu Guo
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1075)


This paper explores the application of recurrent convolutional neural networks (RCNN) to text clustering, an unsupervised task in natural language processing (NLP). The RCNN is trained with pseudo-labels that are generated by pre-clustering on unsupervised document representations. To enhance the quality of pseudo-labels, the K-Nearest Neighbors (KNN) algorithm is used to select training samples for the neural network. After the deep feature representations of all documents have been obtained using the trained RCNN, the agglomerative hierarchical clustering (AHC) algorithm is used to cluster them. The experimental results on two public databases show that the proposed approach significantly boosts the performance of text clustering.
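The pipeline described above can be sketched end to end. The snippet below is a minimal illustration, not the paper's implementation: it substitutes k-means for the pre-clustering step, a neighbor-agreement filter for the paper's KNN-based sample selection, and random vectors for both the unsupervised document representations and the RCNN's deep features, which are not specified in this excerpt.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-ins for unsupervised document representations
# (e.g. bag-of-words or topic-model vectors in the real pipeline).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 16)),
               rng.normal(4, 1, (50, 16))])

# Step 1: pre-cluster to generate pseudo-labels.
pseudo = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: KNN-based sample selection -- keep only documents whose
# k nearest neighbors unanimously share the pseudo-label, a simple
# purity criterion standing in for the paper's selection rule.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, pseudo)
confident = knn.predict_proba(X).max(axis=1) >= 1.0
X_train, y_train = X[confident], pseudo[confident]

# Step 3 (omitted): train the RCNN on (X_train, y_train), then extract
# deep feature representations for all documents; the raw vectors are
# reused here as a placeholder for those features.
features = X

# Step 4: final clustering with agglomerative hierarchical clustering.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(features)
print(X_train.shape[0], sorted(set(labels)))
```

The purity filter in step 2 is the key idea: documents whose neighborhoods disagree with their pseudo-label are likely to sit near cluster boundaries, so excluding them yields cleaner supervision for the network.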


Keywords: Text clustering · Pseudo-supervised · K-nearest neighbors



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China
