SPDF: Set Probabilistic Distance Features for Prediction of Population Health Outcomes via Social Media
Abstract
Measuring population health outcomes is critical to understanding the health status of communities and thus to developing appropriate health-care programmes for them. This task requires the prediction of population health status to be fast, accurate, and scalable to different population sizes. To satisfy these requirements, this paper proposes a method for automatic prediction of population health outcomes from social media using Set Probabilistic Distance Features (SPDF). SPDF are mid-level features built upon the similarity in posting patterns between populations. They hold several advantages. First, they can be applied to various low-level features. Second, they are well suited to problems with weakly labelled data, i.e., problems where only the labels of sets are available and the labels of the sets' elements are not explicitly provided. We thoroughly evaluate our approach on the task of predicting health indices of US counties, using a large-scale dataset collected from Twitter. We apply SPDF to two different textual features: latent topics and linguistic styles. We conduct two case studies, across-year and across-county prediction, and validate the performance of the approach against the Behavioral Risk Factor Surveillance System surveys. Experimental results show that the proposed approach achieves state-of-the-art performance on linguistic style features in the prediction of all health indices and in both case studies.
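To make the idea of set-level distance features concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not specify the density model or the distance, so this sketch assumes a diagonal Gaussian fitted to each set and a symmetric Kullback-Leibler divergence; the function names (`fit_diag_gaussian`, `set_distance_features`) and the synthetic data are hypothetical. Each population is treated as a set of low-level feature vectors, and its mid-level feature vector is its distance to each reference set.

```python
import numpy as np

def fit_diag_gaussian(X, eps=1e-6):
    """Summarise a set of feature vectors by a diagonal Gaussian (assumed model)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    var = X.var(axis=0) + eps  # eps keeps variances strictly positive
    return mean, var

def kl_diag_gaussian(p, q):
    """KL(p || q) between two diagonal Gaussians given as (mean, var) pairs."""
    mp, vp = p
    mq, vq = q
    return 0.5 * np.sum(np.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)

def symmetric_kl(p, q):
    """Symmetrised KL divergence, used here as the probabilistic distance."""
    return kl_diag_gaussian(p, q) + kl_diag_gaussian(q, p)

def set_distance_features(sets, reference_sets):
    """Map each set to a vector of distances to every reference set."""
    ref_models = [fit_diag_gaussian(R) for R in reference_sets]
    feats = np.empty((len(sets), len(ref_models)))
    for i, S in enumerate(sets):
        model = fit_diag_gaussian(S)
        for j, ref in enumerate(ref_models):
            feats[i, j] = symmetric_kl(model, ref)
    return feats

# Synthetic stand-in for three populations, each a set of 200 posts
# described by 5 low-level features (e.g. topic proportions).
rng = np.random.default_rng(0)
counties = [rng.normal(loc=m, scale=1.0, size=(200, 5)) for m in (0.0, 0.1, 2.0)]
features = set_distance_features(counties, counties)
```

Only set-level labels are needed downstream: each row of `features` could be paired with its population's health index and passed to any standard regressor, which is how the weakly labelled setting is accommodated in this sketch.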
Keywords
Population health, Social media