World Wide Web, Volume 21, Issue 2, pp 487–513

A topic model for co-occurring normal documents and short texts

  • Yang Yang
  • Feifei Wang
  • Junni Zhang
  • Jin Xu
  • Philip S. Yu


User comments, as a large group of online short texts, are becoming increasingly prevalent with the development of online communications. These short texts are characterized by their co-occurrence with normal documents, which are usually longer. For example, there could be multiple user comments following one news article, or multiple reader reviews following one blog post. The co-occurring structure inherent in such text corpora is important for efficient learning of topics, but is rarely captured by conventional topic models. To capture such structure, we propose a topic model for co-occurring documents, referred to as COTM. In COTM, we assume there are two sets of topics: formal topics and informal topics, where formal topics can appear in both normal documents and short texts, whereas informal topics can appear only in short texts. Each normal document has a probability distribution over the set of formal topics; each short text is composed of two topics, one from the set of formal topics, whose selection is governed by the topic probabilities of the corresponding normal document, and the other from the set of informal topics. We also develop an online algorithm for COTM to deal with large-scale corpora. Extensive experiments on real-world datasets demonstrate that COTM and its online algorithm outperform state-of-the-art methods by discovering more prominent, coherent and comprehensive topics.
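The generative structure described in the abstract can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how the words of a short text are split between its two topics, so the per-word coin flip with `informal_weight`, the corpus-level `informal_prior`, the symmetric Dirichlet hyperparameters `alpha` and `beta`, and all function and parameter names are hypothetical choices made for illustration.

```python
import random


def dirichlet(rng, alpha, dim):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def categorical(rng, probs):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(n_docs=3, n_comments=4, n_formal=5, n_informal=2,
                    vocab_size=50, doc_len=40, comment_len=8,
                    alpha=0.5, beta=0.5, informal_weight=0.5, seed=0):
    """Simulate normal documents, each followed by several short texts."""
    rng = random.Random(seed)

    # Word distributions for the two topic sets (words are integer ids).
    formal_topics = [dirichlet(rng, beta, vocab_size) for _ in range(n_formal)]
    informal_topics = [dirichlet(rng, beta, vocab_size) for _ in range(n_informal)]

    # ASSUMPTION: a single corpus-level distribution over informal topics.
    informal_prior = dirichlet(rng, alpha, n_informal)

    corpus = []
    for _ in range(n_docs):
        # Each normal document has its own distribution over formal topics
        # and uses formal topics only.
        theta = dirichlet(rng, alpha, n_formal)
        doc_words = []
        for _ in range(doc_len):
            z = categorical(rng, theta)
            doc_words.append(categorical(rng, formal_topics[z]))

        comments = []
        for _ in range(n_comments):
            # One formal topic, chosen using the parent document's theta,
            # and one informal topic.
            formal_z = categorical(rng, theta)
            informal_z = categorical(rng, informal_prior)
            words = []
            for _ in range(comment_len):
                # ASSUMPTION: each word picks one of the two topics by coin flip.
                if rng.random() < informal_weight:
                    words.append(categorical(rng, informal_topics[informal_z]))
                else:
                    words.append(categorical(rng, formal_topics[formal_z]))
            comments.append({"formal": formal_z, "informal": informal_z,
                             "words": words})

        corpus.append({"theta": theta, "words": doc_words, "comments": comments})
    return corpus
```

Calling `generate_corpus()` returns a list of documents, each carrying its formal-topic distribution, its word ids, and its attached short texts. The point of the sketch is the dependency it makes visible: a short text's formal topic is drawn from its parent document's topic distribution, which is how the co-occurring structure ties the two kinds of text together.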


Co-occurring structure; Online algorithm; Short texts; Topic model



This work is funded by the State Key Development Program of Basic Research of China (973) under Grant No. 2013cb329600 and the National Natural Science Foundation of China under Grant Nos. 61672050, 61372191, 61472433, and 61572492.



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. School of Electrical Engineering and Computer Science, Peking University, Beijing, China
  2. School of Statistics, Renmin University of China, Beijing, China
  3. Guanghua School of Management, Peking University, Beijing, China
  4. Department of Computer Science, University of Illinois at Chicago, Chicago, USA
