Constructing Pseudo Documents with Semantic Similarity for Short Text Topic Discovery

  • Conference paper
Neural Information Processing (ICONIP 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11305)


Abstract

With the growing popularity of the Internet, short texts have become common in daily life. Data such as tweets and online Q&A pairs are valuable in application domains such as question retrieval and personalized recommendation. However, the sparsity of short texts poses serious challenges for conventional topic models. Recently, models such as the Biterm Topic Model and the Word Network Topic Model have alleviated the sparsity problem by modeling topics over biterms or pseudo documents; both rely on word co-occurrence to encourage semantically similar words to fall into the same topic. However, many semantically similar words rarely co-occur. To address this limitation, we propose SEREIN, a model that exploits word embeddings to obtain more comprehensive semantic representations. Experiments on two open-source, real-world short text datasets show that SEREIN significantly improves topic discovery over existing models and demonstrate the effectiveness of incorporating word embeddings.
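
As a rough illustration of the idea (a hypothetical sketch, not the paper's actual SEREIN algorithm), the following Python snippet expands a short text into a pseudo document by appending vocabulary words whose embedding cosine similarity exceeds a threshold, so that a downstream topic model observes co-occurrences between similar words even when they never co-occur in the corpus. The toy embeddings, the threshold, and the helper names are illustrative assumptions; in practice the vectors would come from a pretrained model such as word2vec [14].

import numpy as np

# Toy pretrained embeddings (word -> vector); real ones would be loaded
# from a word2vec or GloVe file (an assumption, not the paper's setup).
embeddings = {
    "phone":  np.array([0.90, 0.10, 0.00]),
    "mobile": np.array([0.85, 0.15, 0.05]),
    "tweet":  np.array([0.10, 0.90, 0.20]),
    "post":   np.array([0.15, 0.85, 0.25]),
    "soccer": np.array([0.00, 0.20, 0.95]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def expand_to_pseudo_document(short_text, top_k=2, threshold=0.8):
    """For each word, append its most similar vocabulary words (by cosine
    similarity) that pass the threshold, even if they never co-occur."""
    words = [w for w in short_text.lower().split() if w in embeddings]
    pseudo_doc = list(words)
    for w in words:
        sims = sorted(((v, cosine(embeddings[w], embeddings[v]))
                       for v in embeddings if v != w),
                      key=lambda x: x[1], reverse=True)
        pseudo_doc.extend(v for v, s in sims[:top_k] if s >= threshold)
    return pseudo_doc

print(expand_to_pseudo_document("phone tweet"))
# -> ['phone', 'tweet', 'mobile', 'post']: the expanded pseudo document gives
#    a topic model richer co-occurrence statistics than the raw short text.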


Notes

  1. https://webscope.sandbox.yahoo.com/catalog.php?datatype=l&did=10.

  2. http://archive.ics.uci.edu/ml/datasets/News+Aggregator.

  3. https://github.com/xiaohuiyan/BTM.

  4. https://github.com/NobodyWHU/GPUDMM.

References

  1. Baroni, M., Dinu, G., Kruszewski, G.: Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: ACL, vol. 1, pp. 238–247 (2014)

  2. Bengio, Y.: Neural net language models. Scholarpedia 3(1), 3881 (2008)

  3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)

  4. Chen, D., Manning, C.D.: A fast and accurate dependency parser using neural networks. In: EMNLP, pp. 740–750 (2014)

  5. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57. ACM (1999)

  6. Hong, L., Davison, B.D.: Empirical study of topic modeling in Twitter. In: Proceedings of the First Workshop on Social Media Analytics, pp. 80–88. ACM (2010)

  7. Jin, O., Liu, N.N., Zhao, K., Yu, Y., Yang, Q.: Transferring topical knowledge from auxiliary long texts for short text clustering. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 775–784. ACM (2011)

  8. Li, C., Duan, Y., Wang, H., Zhang, Z., Sun, A., Ma, Z.: Enhancing topic modeling for short texts with auxiliary word embeddings. ACM Trans. Inf. Syst. (TOIS) 36(2), 11 (2017)

  9. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml

  10. Lin, T., Tian, W., Mei, Q., Cheng, H.: The dual-sparse topic model: mining focused topics and focused terms in short text. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 539–550. ACM (2014)

  11. Liu, Y., Liu, Z., Chua, T.S., Sun, M.: Topical word embeddings. In: AAAI, pp. 2418–2424 (2015)

  12. Lu, H., Xie, L.Y., Kang, N., Wang, C.J., Xie, J.Y.: Don't forget the quantifiable relationship between words: using recurrent neural network for short text topic discovery. In: AAAI, pp. 1192–1198 (2017)

  13. Mikolov, T., Karafiát, M., Burget, L., Černocký, J., Khudanpur, S.: Recurrent neural network based language model. In: Interspeech, vol. 2, p. 3 (2010)

  14. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)

  15. Nguyen, D.Q., Billingsley, R., Du, L., Johnson, M.: Improving topic models with latent feature word representations. Trans. Assoc. Comput. Linguist. 3, 299–313 (2015)

  16. Rumelhart, D.E., Hinton, G.E., Williams, R.J., et al.: Learning representations by back-propagating errors. Cognit. Model. 5(3), 1 (1988)

  17. Wang, P., Xu, B., Xu, J., Tian, G., Liu, C.L., Hao, H.: Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification. Neurocomputing 174, 806–814 (2016)

  18. Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 1445–1456. ACM (2013)

  19. Yin, J., Wang, J.: A Dirichlet multinomial mixture model-based approach for short text clustering. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 233–242. ACM (2014)

  20. Zamani, H., Croft, W.B.: Relevance-based word embedding. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 505–514. ACM (2017)

  21. Zuo, Y., Zhao, J., Xu, K.: Word network topic model: a simple but general solution for short and imbalanced texts. Knowl. Inf. Syst. 48(2), 379–398 (2016)

Acknowledgments

This work was supported by the National Key Research and Development Program of China (Grant No. 2016YFB1001102), the National Natural Science Foundation of China (Grant Nos. 61502227, 61876080), the Fundamental Research Funds for the Central Universities (No. 020214380040), and the Collaborative Innovation Center of Novel Software Technology and Industrialization at Nanjing University. We would also like to thank the UCI Machine Learning Repository [9] and Yahoo! Research for the datasets.

Author information

Corresponding author

Correspondence to Chong-jun Wang.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Lu, Hy., Li, Y., Tang, C., Wang, Cj., Xie, Jy. (2018). Constructing Pseudo Documents with Semantic Similarity for Short Text Topic Discovery. In: Cheng, L., Leung, A., Ozawa, S. (eds) Neural Information Processing. ICONIP 2018. Lecture Notes in Computer Science, vol 11305. Springer, Cham. https://doi.org/10.1007/978-3-030-04221-9_39

  • DOI: https://doi.org/10.1007/978-3-030-04221-9_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04220-2

  • Online ISBN: 978-3-030-04221-9

  • eBook Packages: Computer Science, Computer Science (R0)
