Mining Newsworthy Topics from Social Media

  • Carlos Martin
  • David Corney
  • Ayse Goker
Part of the Studies in Computational Intelligence book series (SCI, volume 602)


Newsworthy stories are increasingly shared through social networking platforms such as Twitter and Reddit, and journalists now use these platforms to rapidly discover stories and eyewitness accounts. We present a technique, designed for a real-time topic-detection system, that detects “bursts” of phrases on Twitter. We describe a time-dependent variant of the classic tf-idf approach and group together bursty phrases that often appear in the same messages in order to identify emerging topics. We demonstrate our methods by analysing tweets corresponding to events drawn from the worlds of politics and sport, as well as more general mainstream news. To evaluate our methods, we created a user-centred “ground truth” based on mainstream media accounts of the events, which helps ensure that the methods remain practical. We compare several clustering and topic-ranking methods to discover the characteristics of news-related collections, and show that different strategies are needed to detect emerging topics within them. We show that our methods successfully detect a range of different topics for each event and can retrieve messages (for example, tweets) that represent each topic for the user.
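The two steps named in the abstract, scoring terms with a time-dependent tf-idf variant and grouping bursty terms that share messages, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the chapter's implementation: the scoring form `(df_now + 1) / (log(mean_past_df + 1) + 1)`, the use of single words instead of phrases, the whitespace tokenizer, the greedy overlap grouping (a stand-in for the clustering methods the chapter compares), and all function and parameter names.

```python
import math
from collections import Counter


def tokenize(message):
    """Hypothetical tokenizer: lower-case and split on whitespace."""
    return message.lower().split()


def document_frequency(messages):
    """Count how many messages in one time slot contain each term."""
    df = Counter()
    for message in messages:
        df.update(set(tokenize(message)))
    return df


def burst_scores(current_slot, previous_slots):
    """Score each term in the current slot against its average past frequency.

    Assumed form: (df_now + 1) / (log(mean_past_df + 1) + 1).
    Terms that are frequent now but rare earlier receive high scores.
    """
    df_now = document_frequency(current_slot)
    past = [document_frequency(slot) for slot in previous_slots]
    scores = {}
    for term, count in df_now.items():
        mean_past = sum(p[term] for p in past) / len(past) if past else 0.0
        scores[term] = (count + 1) / (math.log(mean_past + 1) + 1)
    return scores


def group_cooccurring(terms, messages, min_overlap=0.5):
    """Greedily group terms that mostly appear in the same messages.

    Each term joins the first group whose first member shares at least
    `min_overlap` of the smaller term's messages; otherwise it starts a group.
    """
    containing = {
        t: {i for i, m in enumerate(messages) if t in set(tokenize(m))}
        for t in terms
    }
    groups = []
    for term in terms:
        for group in groups:
            representative = group[0]
            shared = len(containing[term] & containing[representative])
            smaller = min(len(containing[term]), len(containing[representative]))
            if smaller and shared / smaller >= min_overlap:
                group.append(term)
                break
        else:
            groups.append([term])
    return groups


if __name__ == "__main__":
    previous = [
        ["the match starts soon", "the weather is nice"],
        ["the match is on", "the crowd is loud"],
    ]
    current = ["goal for rovers", "what a goal rovers", "the match is great"]
    scores = burst_scores(current, previous)
    top = sorted(scores, key=scores.get, reverse=True)[:2]
    print(top, group_cooccurring(top, current))
```

In this toy data, "goal" and "rovers" are absent from the earlier slots, so they outscore a common word such as "the", and because they occur in the same two messages they end up in one group. A real system would score phrases, use longer histories, and apply one of the clustering methods evaluated in the chapter.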


Keywords: Gaussian Mixture Model; News Story; Apriori Algorithm; Twitter Account; Topic Detection



This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975. We wish to thank Nic Newman and Steve Schifferes of the Department of Journalism, City University London and Andrew MacFarlane of the Department of Computer Science, City University London, for their invaluable advice.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. IDEAS Research Institute, School of Computing & Digital Media, Robert Gordon University, Aberdeen, Scotland, UK
