Lightweight methods to estimate influenza rates and alcohol sales volume from Twitter messages

  • Original Paper
  • Language Resources and Evaluation

Abstract

We analyze over 570 million Twitter messages from an eight-month period and find that tracking a small number of keywords allows us to estimate influenza rates and alcohol sales volume with high accuracy. We validate our approach against government statistics and find strong correlations with influenza-like illnesses reported by the U.S. Centers for Disease Control and Prevention (r(14) = .964, p < .001) and with alcohol sales volume reported by the U.S. Census Bureau (r(5) = .932, p < .01). We analyze the robustness of this approach to spurious keyword matches, and we propose a document classification component to filter these misleading messages. We find that this document classifier can reduce error rates by more than half in simulated false alarm experiments, though more research is needed to develop methods that are robust in cases of extremely high noise.
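As a rough illustration of the keyword-tracking idea in the abstract, the following Python sketch computes the fraction of messages in a time window that mention any tracked keyword, and the Pearson correlation between such a series and a reference series. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the tokenization rule, and the example keyword are all hypothetical, and the paper's actual keyword lists and estimation models are not reproduced here. Under the usual reporting convention, r(14) would denote 14 degrees of freedom, that is, 16 paired observations.

```python
import math
import re


def keyword_fraction(messages, keywords):
    """Fraction of messages containing at least one tracked keyword.

    Hypothetical helper: matching is on lowercase word tokens, which is
    an assumption of this sketch rather than the paper's stated method.
    """
    if not messages:
        return 0.0
    wanted = {k.lower() for k in keywords}
    hits = 0
    for text in messages:
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        if tokens & wanted:
            hits += 1
    return hits / len(messages)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

In use, one would call `keyword_fraction` once per week of messages to build a series, then pass that series and the matching government figures to `pearson_r`. Token-level matching still counts messages that mention a keyword without reporting an illness, which is the kind of spurious match the abstract's document classifier is meant to filter.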

Notes

  1. http://www.cdc.gov/flu/weekly/fluactivity.htm.

  2. Twitter older than it looks. Reuters MediaFile blog, March 30th, 2009.

  3. http://mallet.cs.umass.edu.

  4. We use LibSVM (Chang and Lin 2011) with a linear kernel and the default parameter settings.

  5. We use MALLET’s (http://mallet.cs.umass.edu) implementation with the default parameter settings.

  6. http://www.census.gov/retail/.

  7. http://www.cdc.gov/brfss.

  8. The complementary Youth Risk Behavior Surveillance System only partially solves this problem, since it is restricted to surveys of high school students (http://www.cdc.gov/yrbs).

References

  • Brownstein, J., Freifeld, C., Reis, B., & Mandl, K. (2008). Surveillance sans frontieres: Internet-based emerging infectious disease intelligence and the HealthMap project. PLoS Medicine, 5, 1019–1024.

  • Chang, C., & Lin, C. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

  • Chew, C., & Eysenbach, G. (2010). Pandemics in the age of Twitter: Content analysis of tweets during the 2009 H1N1 outbreak. PLoS ONE, 5(11).

  • Collier, N., Doan, S., Kawazoe, A., Goodwin, R., Conway, M., Tateno, Y., Ngo, H.Q., Dien, D., Kawtrakul, A., Takeuchi, K., Shigematsu, M., & Taniguchi, K. (2008). BioCaster: detecting public health rumors with a web-based text mining system. Bioinformatics, 24, 2940–2941.

  • Corley, C., Cook, D., Mikler, A., & Singh, K. (2010). Text and structural data mining of influenza mentions in web and social media. International Journal of Environmental Research and Public Health, 7(2), 596–615.

  • Culotta, A. (2010). Towards detecting influenza epidemics by analyzing Twitter messages. In: Workshop on social media analytics at the 16th ACM SIGKDD conference on knowledge discovery and data mining.

  • de Quincey, E., & Kostkova, P. (2009). Early warning and outbreak detection using social networking websites: The potential of Twitter, electronic healthcare. In: eHealth 2nd international conference. Istanbul, Turkey.

  • Drucker, H., Burges, C., Kaufman, L., Smola, A., & Vapnik, V. (1996). Support vector regression machines. In: Advances in Neural Information Processing Systems 9, pp. 155–161.

  • Eysenbach, G. (2006). Infodemiology: Tracking flu-related searches on the web for syndromic surveillance. In: AMIA: Annual symposium proceedings, pp. 244–248.

  • Giampiccolo, D., Magnini, B., Dagan, I., & Dolan, B. (2007). The third PASCAL recognizing textual entailment challenge. In: Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pp. 1–9. Prague.

  • Gilbert, E., & Karahalios, K. (2010). Widespread worry and the stock market. In: Proceedings of the 4th international AAAI conference on weblogs and social media. Washington, DC.

  • Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2009). Detecting influenza epidemics using search engine query data. Nature, 457, 1012–1014.

  • Grishman, R., Huttunen, S., & Yangarber, R. (2002). Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, 35(4), 236–246.

  • Gruhl, D., Guha, R., Kumar, R., Novak, J., & Tomkins, A. (2005). The predictive power of online chatter. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 78–87. ACM, New York, NY, USA.

  • Johnson, H., Wagner, M., Hogan, W., Chapman, W., Olszewski, R., Dowling, J., & Barnas, G. (2004). Analysis of web access logs for surveillance of influenza. MEDINFO, pp. 1202–1206.

  • Kanny, D., Liu, Y., & Brewer, R. (2011). Binge drinking: United States, 2009. Morbidity and Mortality Weekly Report, 60(01), 101–104.

  • Lampos, V., & Cristianini, N. (2010). Tracking the flu pandemic by monitoring the social web. In: 2nd IAPR workshop on cognitive information processing (CIP 2010), pp. 411–416.

  • Lavrenko, V., Schmill, M. D., Lawrie, D., Ogilvie, P., Jensen, D., & Allan, J. (2000). Language models for financial news recommendation. In: Proceedings of the ninth international conference on information and knowledge management (CIKM). Washington, DC.

  • Linge, J., Steinberger, R., Weber, T., Yangarber, R., van der Goot, E., Khudhairy, D., & Stilianakis, N. (2009). Internet surveillance systems for early alerting of health threats. Eurosurveillance, 14(13).

  • Liu, D. C., & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(3, (Ser. B)), 503–528.

  • Mawudeku, A., & Blench, M. (2006). Global public health intelligence network (GPHIN). In: 7th conference of the association for machine translation in the Americas.

  • McGinnis, J., & Foege, W. (1993). Actual causes of death in the United States. Journal of the American Medical Association, 270, 2207–2212.

  • Mishne, G., Balog, K., de Rijke, M., & Ernsting, B. (2007). MoodViews: Tracking and searching mood-annotated blog posts. In: International conference on weblogs and social media. Boulder, CO.

  • O’Connor, B., Balasubramanyan, R., Routledge, B. R., & Smith, N. A. (2010). From Tweets to polls: Linking text sentiment to public opinion time series. In: International AAAI conference on weblogs and social media. Washington, DC.

  • Oreskovic, A. (2010). Twitter snags over 100 million users, eyes money-making. London: Reuters.

  • Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2), 1–135.

  • Polgreen, P., Chen, Y., Pennock, D., & Forrest, N. (2008). Using internet searches for influenza surveillance. Clinical Infectious Diseases, 47, 1443–1448.

  • Reilly, A., Iarocci, E., Jung, C., Hartley, D., & Nelson, N. (2008). Indications and warning of pandemic influenza compared to seasonal influenza. Advances in Disease Surveillance, 5, 190.

  • Ritterman, J., Osborne, M., & Klein, E. (2009). Using prediction markets and Twitter to predict a swine flu pandemic. In: 1st international workshop on mining social media.

  • Signorini, A., Segre, A. M., & Polgreen, P. M. (2011). The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PLoS ONE, 6(5), e19467.

Acknowledgments

We would like to thank Brendan O’Connor from Carnegie Mellon University for providing access to the Twitter data and Troy Kammerdiener of Southeastern Louisiana University for helpful discussions in early stages of this work. This work was supported in part by a grant from the Research Competitiveness Subprogram of the Louisiana Board of Regents, under contract #LEQSF(2010-13)-RD-A-11.

Author information

Correspondence to Aron Culotta.

Cite this article

Culotta, A. Lightweight methods to estimate influenza rates and alcohol sales volume from Twitter messages. Lang Resources & Evaluation 47, 217–238 (2013). https://doi.org/10.1007/s10579-012-9185-0
