Data Mining and Knowledge Discovery, Volume 32, Issue 3, pp 764–786

Mining urban events from the tweet stream through a probabilistic mixture model

  • Joan Capdevila (corresponding author)
  • Jesús Cerquides
  • Jordi Torres
Part of the following topical collections:
  1. Special Issue on Data Mining for Smart Cities


The geographical identification of content in social networks has made it possible to bridge the gap between online social platforms and the physical world. Although vast amounts of data in such networks are due to breaking news or global occurrences, local events witnessed by users in situ are also present in these streams and are of great importance to many city entities. Nowadays, unsupervised machine learning techniques, such as Tweet-SCAN, are able to retrospectively detect these local events from tweets. However, these approaches have limited ability to reason about unseen observations in a principled way because they lack a proper probabilistic foundation. Probabilistic models have also been proposed for the task, but their event identification capabilities are far from those of Tweet-SCAN. In this paper, we identify two key factors which, when combined, boost the accuracy of such models. First, we notice that the large amount of meaningless social data requires explicitly modeling non-event observations. Therefore, we propose to incorporate a background model that captures spatio-temporal fluctuations of non-event tweets. Second, we observe that the shortness of tweets hampers the application of traditional topic models. Thus, we integrate event detection and topic modeling, assigning topic proportions to events instead of to individual tweets. As a result, we propose Warble, a new probabilistic model and learning scheme for retrospective event detection that incorporates these two key factors. We evaluate Warble on a data set of tweets located in Barcelona during its festivities. The empirical results show that the model outperforms other state-of-the-art techniques in detecting various types of events while relying on a principled probabilistic framework that enables reasoning under uncertainty.
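To illustrate the first key factor, the following is a minimal sketch of how an explicit background component can absorb non-event tweets in a spatio-temporal mixture. It is not the Warble model itself: the uniform background density, the axis-aligned Gaussian event components, the function names (`gaussian_pdf`, `assign`) and all numeric values are assumptions made for illustration only, whereas the abstract describes a background model that captures spatio-temporal fluctuations.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of an axis-aligned Gaussian evaluated at point x."""
    density = 1.0
    for xi, mi, si in zip(x, mean, std):
        z = (xi - mi) / si
        density *= math.exp(-0.5 * z * z) / (si * math.sqrt(2.0 * math.pi))
    return density


def assign(tweet, events, weights, volume):
    """Posterior probability of each component for one (x, y, t) tweet.

    events  : list of (mean, std) tuples, one per event component
    weights : mixing weights, len(events) + 1; the last one is the background
    volume  : size of the space-time box covered by the uniform background
              (a simplifying assumption, not the paper's background model)
    """
    scores = [w * gaussian_pdf(tweet, m, s) for w, (m, s) in zip(weights, events)]
    scores.append(weights[-1] / volume)  # uniform background component
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical setup: one event in a 10 x 10 area observed over 24 hours.
events = [((2.0, 2.0, 10.0), (0.1, 0.1, 1.0))]
weights = [0.3, 0.7]  # most tweets are assumed to be background
volume = 10.0 * 10.0 * 24.0

near = assign((2.0, 2.0, 10.0), events, weights, volume)  # at the event centre
far = assign((8.0, 8.0, 3.0), events, weights, volume)    # far from any event
```

In this toy setting, a tweet at the event centre is assigned almost entirely to the event, while a distant tweet falls to the background component instead of being forced into the nearest event.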


Event detection, Social networks, Probabilistic models, Variational inference



This work is partially supported by Obra Social “la Caixa”, by the Spanish Ministry of Science and Innovation under contract TIN2015-65316, by the Severo Ochoa Program (SEV2015-0493), by SGR programs of the Catalan Government (2014-SGR-1051, 2014-SGR-118), Collectiveware (TIN2015-66863-C2-1-R) and BSC/UPC NVIDIA GPU Center of Excellence. We would also like to thank the reviewers for their constructive feedback.



Copyright information

© The Author(s) 2017

Authors and Affiliations

  • Joan Capdevila (1), corresponding author
  • Jesús Cerquides (2)
  • Jordi Torres (1)
  1. Barcelona Supercomputing Center (BSC), Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
  2. Artificial Intelligence Research Institute (IIIA), Spanish National Research Council (CSIC), Bellaterra, Spain
