Language Resources and Evaluation, Volume 47, Issue 1, pp 217–238

Lightweight methods to estimate influenza rates and alcohol sales volume from Twitter messages

  • Aron Culotta
Original Paper


We analyze over 570 million Twitter messages from an eight-month period and find that tracking a small number of keywords allows us to estimate influenza rates and alcohol sales volume with high accuracy. We validate our approach against government statistics and find strong correlations with influenza-like illnesses reported by the U.S. Centers for Disease Control and Prevention (r(14) = .964, p < .001) and with alcohol sales volume reported by the U.S. Census Bureau (r(5) = .932, p < .01). We analyze the robustness of this approach to spurious keyword matches, and we propose a document classification component to filter these misleading messages. We find that this document classifier can reduce error rates by over half in simulated false alarm experiments, though more research is needed to develop methods that are robust in cases of extremely high noise.
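The general idea of keyword tracking can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy messages, and the reported rates below are all hypothetical, and the paper's actual regression model and keyword sets are not reproduced here. The sketch only computes, for each time period, the fraction of messages containing a keyword, and then the Pearson correlation between that series and a reference series.

```python
import math


def keyword_fraction(messages, keywords):
    """Fraction of messages containing at least one keyword (case-insensitive substring match)."""
    if not messages:
        return 0.0
    hits = sum(1 for m in messages if any(k in m.lower() for k in keywords))
    return hits / len(messages)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: one list of messages per week (not real tweets).
weekly_messages = [
    ["i have the flu", "nice weather today", "lunch time", "coffee"],
    ["flu season is here", "got the flu", "watching a movie", "cough and fever"],
    ["flu again", "flu shot line", "so sick with flu", "going for a run"],
]
# Hypothetical reference series (stands in for an officially reported rate).
reported_rate = [1.0, 2.0, 3.0]

fractions = [keyword_fraction(week, ["flu"]) for week in weekly_messages]
r = pearson_r(fractions, reported_rate)
print(fractions, r)
```

Plain substring matching is deliberately naive here: the keyword "flu" would also match a word such as "influence", which is one simple way the spurious keyword matches mentioned above can arise.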


Social media, Regression, Classification



We would like to thank Brendan O’Connor from Carnegie Mellon University for providing access to the Twitter data and Troy Kammerdiener of Southeastern Louisiana University for helpful discussions in early stages of this work. This work was supported in part by a grant from the Research Competitiveness Subprogram of the Louisiana Board of Regents, under contract #LEQSF(2010-13)-RD-A-11.



Copyright information

© Springer Science+Business Media B.V. 2012

Authors and Affiliations

  1. Department of Computer Science & Industrial Technology, Southeastern Louisiana University, Hammond, USA
