Domain Dictionary-Based Topic Modeling for Social Text

  • Bo Jiang
  • Jiguang Liang
  • Ying Sha (corresponding author)
  • Rui Li
  • Lihong Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10041)

Abstract

Online social networks are becoming increasingly popular, and their users post large volumes of unstructured social text every day. Inferring topics from large-scale social text is an important but challenging task for many text mining applications. Conventional topic models have produced unsatisfactory results because the content of short texts is sparse and noisy. Moreover, the semantics of the learned topics are difficult to interpret from the top-weighted terms alone. In this paper, we propose a novel topic modeling method for social text that addresses these problems. The proposed model uses a topic domain dictionary to construct a weakly supervised reference matrix, which encourages the learned topic matrix to be similar to the reference matrix. Experimental results on a social text dataset constructed from Twitter demonstrate that the proposed method significantly outperforms state-of-the-art baselines and also improves the semantic relevance of the learned topics.
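The abstract does not give the model's objective function, so the following is only a minimal sketch of one way a dictionary-derived reference matrix could steer a matrix-factorization topic model. It assumes a non-negative factorization X ≈ UV with an added penalty λ‖U − R‖² that pulls the term-topic matrix U toward a binary reference matrix R built from the dictionary. The objective, the multiplicative update rules, the parameter `lam`, the function names, and the toy vocabulary are all illustrative assumptions, not the authors' formulation.

```python
import numpy as np


def objective(X, U, V, R, lam):
    """Assumed loss: reconstruction error plus distance of U from the reference R."""
    return np.linalg.norm(X - U @ V) ** 2 + lam * np.linalg.norm(U - R) ** 2


def dictionary_guided_nmf(X, R, lam=1.0, n_iter=200, eps=1e-9, seed=0):
    """Hypothetical dictionary-guided NMF with multiplicative updates.

    X: (n_terms, n_docs) non-negative term-document matrix.
    R: (n_terms, n_topics) non-negative reference matrix from a domain dictionary.
    Returns U (terms x topics), V (topics x docs), and the loss history.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_docs = X.shape
    n_topics = R.shape[1]
    U = rng.random((n_terms, n_topics)) + eps
    V = rng.random((n_topics, n_docs)) + eps
    losses = [objective(X, U, V, R, lam)]  # loss before any update
    for _ in range(n_iter):
        # Lee-Seung style update for U, with the lam * R term pulling U toward R.
        U *= (X @ V.T + lam * R) / (U @ (V @ V.T) + lam * U + eps)
        # Standard multiplicative update for V.
        V *= (U.T @ X) / (U.T @ U @ V + eps)
        losses.append(objective(X, U, V, R, lam))
    return U, V, losses


# Toy data (made up for illustration): 6 terms, 4 short documents, 2 topics.
vocab = ["goal", "match", "team", "vote", "election", "party"]
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
    [0, 1, 1, 1],
], dtype=float)

# Hypothetical dictionary: topic 0 = sports, topic 1 = politics.
dictionary = {0: {"goal", "match"}, 1: {"vote", "election"}}
R = np.zeros((len(vocab), len(dictionary)))
for topic, words in dictionary.items():
    for w in words:
        R[vocab.index(w), topic] = 1.0

U, V, losses = dictionary_guided_nmf(X, R, lam=1.0)
```

In this sketch, a larger `lam` ties each learned topic more tightly to its dictionary terms, while `lam=0` reduces the procedure to plain non-negative matrix factorization. Inspecting the largest entries in each column of `U` shows which terms a topic weights most heavily.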

Keywords

Social text, Topic modeling, Domain dictionary, Matrix factorization

Notes

Acknowledgments

This work was supported by the National Key Technology R&D Program (No. 2012BAH46B03) and the Strategic Leading Science and Technology Projects of the Chinese Academy of Sciences (No. XDA06030200).

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Bo Jiang (1)
  • Jiguang Liang (1)
  • Ying Sha (1, corresponding author)
  • Rui Li (1)
  • Lihong Wang (1)
  1. National Engineering Laboratory for Information Security Technologies, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
