Entity Grouping for Accessing Social Streams via Word Clouds

  • Martin Leginus
  • Leon Derczynski
  • Peter Dolog
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 246)


Word clouds have proven to be an effective tool for information access in different domains. As social media is a main driver of the large increase in available user-generated content, means for accessing information in such content are needed. We study word clouds as a means for information access in social media. Currently used clouds generated from social media data include redundant and mis-ranked entries, harming their utility. We propose a method for generating improved word clouds over social streams. In this method, named entities are detected, disambiguated and aggregated into clusters, which in turn inform cloud construction. We show that word clouds using named entity clusters attain broader coverage and decreased content duplication. Further, an extrinsic evaluation shows improved access to data, with word clouds that group named entities being rated more relevant and diverse. Additionally, we find that word clouds with higher Mean Average Precision (MAP) tend to be more relevant to underlying concepts. Critically, this supports MAP as a tool for predicting cloud quality without human evaluation.
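For readers unfamiliar with the metric, the following is a minimal sketch of Mean Average Precision as it is commonly defined in information retrieval. It is not the evaluation code used in this work: the function names, the example item identifiers, and the assumption that each ranked list contains no duplicates are all illustrative, and the abstract does not specify how ranked lists and relevance judgments are derived from a word cloud.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Average precision of one ranked list against a set of relevant items.

    Assumes `ranked` contains no duplicate items (illustrative assumption).
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant item occurs.
            precision_sum += hits / rank
    # Normalise by the total number of relevant items, retrieved or not.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Mean of average precision over (ranked list, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical example: two ranked lists with hand-picked relevance sets.
example_runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "a"], {"a"}),                 # AP = (1/2) / 1
]
map_score = mean_average_precision(example_runs)
```

A higher score means relevant items tend to appear earlier in each ranked list, which is the property the abstract relates to perceived cloud relevance.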


Keywords: Word clouds, Recognized named entities, User evaluation, Social media, Social stream access



This work was partially supported by the European Union under grant agreement No. 611233 Pheme.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Department of Computer Science, Aalborg University, Aalborg, Denmark
  2. Department of Computer Science, University of Sheffield, Sheffield, UK
