Constructing Pseudo Documents with Semantic Similarity for Short Text Topic Discovery

  • Heng-yang Lu
  • Yun Li
  • Chi Tang
  • Chong-jun Wang
  • Jun-yuan Xie
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11305)


With the popularity of the Internet, short texts become common in our daily life. Data like tweets and online Q&A pairs are quite valuable in application domains such as question retrieval and personalized recommendation. However, the sparsity problem of short text brings huge challenges for learning topics with conventional topic models. Recently, models like Biterm Topic Model and Word Network Topic Model alleviate the sparsity problem by modeling topics on biterms or pseudo documents. They are encouraged to put words with higher semantic similarity into the same topic by using word co-occurrence. However, there exist many semantically similar words which rarely co-occur. To address this limitation, we propose a model named SEREIN which exploits word embeddings to find more comprehensive semantic representations. Compared with existing models, we improve the performance of topic discovery significantly. Experiments on two open-source and real-world short text datasets also show the effectiveness of involving word embeddings.


Topic model Word embeddings Short text 



This paper is supported by the National Key Research and Development Program of China (Grant No. 2016YF- B1001102), the National Natural Science Foundation of China (Grant Nos. 61502227, 61876080), the Fundamental Research Funds for the Central Universities No.020214380040, the Collaborative Innovation Center of Novel Software Technology and Industrialization at Nanjing University. We also would like to thank machine learning repository of UCI [9] and Yahoo! Research for the datasets.


  1. 1.
    Baroni, M., Dinu, G., Kruszewski, G.: Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: ACL, vol. 1, pp. 238–247 (2014)Google Scholar
  2. 2.
    Bengio, Y.: Neural net language models. Scholarpedia 3(1), 3881 (2008)CrossRefGoogle Scholar
  3. 3.
    Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)zbMATHGoogle Scholar
  4. 4.
    Chen, D., Manning, C.D.: A fast and accurate dependency parser using neural networks. In: EMNLP, pp. 740–750 (2014)Google Scholar
  5. 5.
    Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57. ACM (1999)Google Scholar
  6. 6.
    Hong, L., Davison, B.D.: Empirical study of topic modeling in Twitter. In: Proceedings of the First Workshop on Social Media Analytics, pp. 80–88. ACM (2010)Google Scholar
  7. 7.
    Jin, O., Liu, N.N., Zhao, K., Yu, Y., Yang, Q.: Transferring topical knowledge from auxiliary long texts for short text clustering. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 775–784. ACM (2011)Google Scholar
  8. 8.
    Li, C., Duan, Y., Wang, H., Zhang, Z., Sun, A., Ma, Z.: Enhancing topic modeling for short texts with auxiliary word embeddings. ACM Trans. Inf. Syst. (TOIS) 36(2), 11 (2017)CrossRefGoogle Scholar
  9. 9.
    Lichman, M.: UCI machine learning repository (2013).
  10. 10.
    Lin, T., Tian, W., Mei, Q., Cheng, H.: The dual-sparse topic model: mining focused topics and focused terms in short text. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 539–550. ACM (2014)Google Scholar
  11. 11.
    Liu, Y., Liu, Z., Chua, T.S., Sun, M.: Topical word embeddings. In: AAAI, pp. 2418–2424 (2015)Google Scholar
  12. 12.
    Lu, H., Xie, L.Y., Kang, N., Wang, C.J., Xie, J.Y.: Don’t forget the quantifiable relationship between words: using recurrent neural network for short text topic discovery. In: AAAI, pp. 1192–1198 (2017)Google Scholar
  13. 13.
    Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., Khudanpur, S.: Recurrent neural network based language model. In: Interspeech, vol. 2, p. 3 (2010)Google Scholar
  14. 14.
    Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)Google Scholar
  15. 15.
    Nguyen, D.Q., Billingsley, R., Du, L., Johnson, M.: Improving topic models with latent feature word representations. Trans. Assoc. Comput. Linguist. 3, 299–313 (2015)Google Scholar
  16. 16.
    Rumelhart, D.E., Hinton, G.E., Williams, R.J., et al.: Learning representations by back-propagating errors. Cognit. Model. 5(3), 1 (1988)zbMATHGoogle Scholar
  17. 17.
    Wang, P., Xu, B., Xu, J., Tian, G., Liu, C.L., Hao, H.: Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification. Neurocomputing 174, 806–814 (2016)CrossRefGoogle Scholar
  18. 18.
    Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 1445–1456. ACM (2013)Google Scholar
  19. 19.
    Yin, J., Wang, J.: A Dirichlet multinomial mixture model-based approach for short text clustering. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 233–242. ACM (2014)Google Scholar
  20. 20.
    Zamani, H., Croft, W.B.: Relevance-based word embedding. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 505–514. ACM (2017)Google Scholar
  21. 21.
    Zuo, Y., Zhao, J., Xu, K.: Word network topic model: a simple but general solution for short and imbalanced texts. Knowl. Inf. Syst. 48(2), 379–398 (2016)CrossRefGoogle Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Heng-yang Lu
    • 1
  • Yun Li
    • 1
  • Chi Tang
    • 1
  • Chong-jun Wang
    • 1
  • Jun-yuan Xie
    • 1
  1. 1.National Key Laboratory for Novel Software TechnologyNanjing UniversityNanjingChina

Personalised recommendations