SPDF: Set Probabilistic Distance Features for Prediction of Population Health Outcomes via Social Media

  • Hung Nguyen
  • Duc Thanh Nguyen
  • Thin Nguyen
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1127)


Measuring population health outcomes is critical to understanding the health status of communities and, in turn, to developing appropriate health-care programmes for them. This task requires the prediction of population health status to be fast, accurate, and scalable to different population sizes. To satisfy these requirements, this paper proposes a method for automatic prediction of population health outcomes from social media using Set Probabilistic Distance Features (SPDF). The proposed SPDF are mid-level features built upon the similarity in posting patterns between populations. They hold two advantages. Firstly, they can be applied to various low-level features. Secondly, they are well suited to problems with weakly labelled data, i.e., problems in which only the labels of sets are available while the labels of the sets' elements are not explicitly provided. We thoroughly evaluate our approach on the task of predicting health indices of counties in the US using a large-scale dataset collected from Twitter. We apply the proposed SPDF to two different textual features: latent topics and linguistic styles. We conduct two case studies, across-year and across-county prediction, and validate the performance of the approach against the Behavioral Risk Factor Surveillance System surveys. Experimental results show that the proposed approach achieves state-of-the-art performance on linguistic style features in the prediction of all health indices and in both case studies.
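The idea of set-level distance features can be illustrated with a minimal sketch. The abstract does not state which probabilistic distance or density model SPDF uses, so everything below is an assumption made for illustration: each population is treated as a set of low-level feature vectors (for example, per-post topic proportions), each set is summarised by a diagonal Gaussian, and the distance is a symmetric Kullback-Leibler divergence. The function names (`fit_diag_gaussian`, `set_distance_features`, `make_set`) and the toy data are hypothetical and are not taken from the paper.

```python
import math
import random


def fit_diag_gaussian(points, eps=1e-6):
    """Summarise a set of vectors by per-dimension mean and variance.

    Assumption: a diagonal Gaussian stands in for the set's distribution.
    """
    n = len(points)
    dim = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dim)]
    variances = [
        sum((p[d] - means[d]) ** 2 for p in points) / n + eps
        for d in range(dim)
    ]
    return means, variances


def kl_diag_gaussian(p, q):
    """KL(p || q) for two diagonal Gaussians given as (means, variances)."""
    (mu_p, var_p), (mu_q, var_q) = p, q
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def symmetric_kl(p, q):
    """Symmetrised KL divergence, used here as the set-to-set distance."""
    return kl_diag_gaussian(p, q) + kl_diag_gaussian(q, p)


def set_distance_features(target_set, reference_sets):
    """Mid-level feature vector: distance from one set to each reference set.

    Only set-level labels are needed downstream; the individual
    elements (posts) are never labelled.
    """
    target = fit_diag_gaussian(target_set)
    return [
        symmetric_kl(target, fit_diag_gaussian(ref)) for ref in reference_sets
    ]


def make_set(rng, center, n=200, dim=3):
    """Hypothetical toy data: n vectors scattered around a common center."""
    return [[rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
references = [make_set(rng, 0.0), make_set(rng, 2.0)]  # two reference populations
query = make_set(rng, 0.1)                             # resembles the first one
features = set_distance_features(query, references)
print(features)  # the first distance should be much smaller than the second
```

Under these assumptions, the resulting distance vector for each population would be passed, together with that population's set-level health index, to any standard regressor.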


Keywords: Population health, Social media



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Applied Artificial Intelligence Institute, Deakin University, Geelong, Australia
  2. School of Information Technology, Deakin University, Geelong, Australia
