Data Mining and Knowledge Discovery, Volume 30, Issue 3, pp 681–710

Syndromic surveillance of Flu on Twitter using weakly supervised temporal topic models

  • Liangzhe Chen
  • K. S. M. Tozammel Hossain
  • Patrick Butler
  • Naren Ramakrishnan
  • B. Aditya Prakash


Surveillance of epidemic outbreaks and their spread through social media is an important tool for governments and public health authorities. Machine learning techniques for nowcasting the Flu have made significant inroads into correlating social media trends with case counts and the prevalence of epidemics in a population. However, there is a disconnect between data-driven methods for forecasting Flu incidence and epidemiological models that adopt a state-based understanding of transitions, and this disconnect can lead to sub-optimal predictions. Furthermore, models of epidemiological activity and models of social activity, such as activity on Twitter, predict different shapes and differ in important ways. In this paper, we propose two temporal topic models (one unsupervised and one improved weakly supervised) that capture the hidden states of a user from his or her tweets and aggregate these states over a geographical region for better estimation of trends. We show that our approaches help fill the gap between phenomenological methods for disease surveillance and epidemiological models. We validate our approaches by modeling the Flu using Twitter data from multiple countries in South America. We demonstrate that our models can consistently outperform plain vocabulary assessment in Flu case-count predictions while also giving better Flu-peak predictions than competitors. We also show that our fine-grained modeling can reconcile some contrasting behaviors between epidemiological and social models.
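As a rough illustration of the idea of inferring a hidden state per user and then aggregating states over a region, the sketch below decodes each user's tweet sequence with a plain hidden Markov model and counts the users decoded as infected at each time step. It is not the temporal topic model proposed in the paper: the state names (S, E, I), the tweet categories ("other", "symptom", "sick"), and every probability are hand-set assumptions chosen only for illustration.

```python
import math

# Hypothetical hidden health states (assumed, not taken from the paper).
STATES = ["S", "E", "I"]

# Hand-set illustrative probabilities; a real model would learn these.
START = {"S": 0.8, "E": 0.1, "I": 0.1}
TRANS = {
    "S": {"S": 0.80, "E": 0.15, "I": 0.05},
    "E": {"S": 0.10, "E": 0.50, "I": 0.40},
    "I": {"S": 0.30, "E": 0.10, "I": 0.60},
}
# Each tweet is assumed to be pre-labelled with one coarse category.
EMIT = {
    "S": {"other": 0.90, "symptom": 0.08, "sick": 0.02},
    "E": {"other": 0.30, "symptom": 0.60, "sick": 0.10},
    "I": {"other": 0.10, "symptom": 0.30, "sick": 0.60},
}


def viterbi(observations):
    """Return the most likely hidden-state path for one user's tweets."""
    scores = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this step.
            best_prev = max(
                STATES, key=lambda p: scores[p] + math.log(TRANS[p][s])
            )
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][obs])
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    path = [max(STATES, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def infected_counts(user_observations):
    """Count users decoded as 'I' at each time step in one region.

    Assumes every user has the same number of time steps.
    """
    paths = [viterbi(obs) for obs in user_observations.values()]
    n_steps = len(paths[0])
    return [sum(path[t] == "I" for path in paths) for t in range(n_steps)]


# Two hypothetical users in one region, three time steps each.
region = {"u1": ["other"] * 3, "u2": ["sick"] * 3}
counts = infected_counts(region)
```

The resulting per-step counts are the kind of aggregate signal that could then be compared against official case counts; how the paper itself performs this aggregation and comparison is not specified in the abstract.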


Keywords: Syndromic surveillance, Social media, Topic model, Hidden Markov model



This material is based upon work supported by the National Science Foundation under Grant No. IIS-1353346, by the Maryland Procurement Office under Contract H98230-14-C-0127, by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) Contract Number D12PC000337, and by the VT College of Engineering. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the respective funding agencies.



Copyright information

© The Author(s) 2015

Authors and Affiliations

  1. Department of Computer Science, Virginia Tech, Blacksburg, USA
