A Hierarchical Topic Modelling Approach for Tweet Clustering

  • Bo Wang
  • Maria Liakata
  • Arkaitz Zubiaga
  • Rob Procter
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10540)

Abstract

While social media platforms such as Twitter can provide rich and up-to-date information for a wide range of applications, manually digesting such large volumes of data is difficult and costly. It is therefore important to infer coherent and discriminative topics from tweets automatically. Conventional topic models and document clustering approaches fail to achieve good results because tweets are noisy and sparse. In this paper, we explore various ways of tackling this challenge and propose a two-stage hierarchical topic modelling system that is efficient and effective in alleviating the data sparsity problem. We present an extensive evaluation on two datasets and report that the proposed system achieves the best results in both document clustering performance and topic coherence.

Keywords

  • Tweet clustering
  • Topic model
  • Twitter topic detection
  • Social media

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Bo Wang (1)
  • Maria Liakata (1, 2)
  • Arkaitz Zubiaga (1)
  • Rob Procter (1, 2)

  1. Department of Computer Science, University of Warwick, Coventry, UK
  2. The Alan Turing Institute, London, UK
