Journal of Healthcare Informatics Research, Volume 3, Issue 2, pp 159–183

Contextual Word Embeddings and Topic Modeling in Healthy Dieting and Obesity

  • Vijaya Kumari Yeruva
  • Sidrah Junaid
  • Yugyung Lee
Research Article


An alarming proportion of the US population is overweight, and obesity increases the risk of illnesses such as diabetes and cardiovascular disease. In this paper, we propose the Contextual Word Embeddings (ContWEB) framework, which aims to build contextual word embeddings that capture the relationship between obesity and healthy eating from the crowd domain (Twitter) and the expert domain (PubMed). The framework is a pipeline of six processing steps: (1) applying term frequency-inverse document frequency (TF-IDF) and Word2Vec to the data collected from the crowd and expert domains; (2) applying natural language processing (NLP) algorithms to the corpus; (3) constructing social word embeddings through sentiment analysis; (4) discovering contextual word embeddings using co-occurrence and conditional probability; (5) finding an optimal number of topics for topic modeling of the obesity and healthy dieting corpus; and (6) extracting latent features using Latent Dirichlet Allocation (LDA). The ContWEB framework has been implemented on the Apache Spark and TensorFlow platforms. We have evaluated the framework in terms of the effectiveness of the contextual word embeddings constructed from the crowd and expert domains. We conclude that the ContWEB framework would be useful in enhancing the decision-making process for healthy eating and obesity prevention.
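To make step (4) concrete, the following is a minimal sketch of how co-occurrence counts can be turned into conditional probabilities between terms. It assumes document-level co-occurrence and whitespace tokenization; the function name `conditional_cooccurrence` and the toy documents are hypothetical and are not taken from the ContWEB implementation, which may define context and probability differently.

```python
from collections import Counter
from itertools import combinations


def conditional_cooccurrence(docs):
    """Estimate P(b | a) as (#docs containing both a and b) / (#docs containing a).

    Assumption for illustration: co-occurrence is counted once per document,
    and tokens are obtained by lowercasing and splitting on whitespace.
    """
    term_df = Counter()   # number of documents containing each term
    pair_df = Counter()   # number of documents containing each ordered pair
    for doc in docs:
        terms = set(doc.lower().split())
        term_df.update(terms)
        for a, b in combinations(sorted(terms), 2):
            pair_df[(a, b)] += 1
            pair_df[(b, a)] += 1
    return {(a, b): n / term_df[a] for (a, b), n in pair_df.items()}


# Hypothetical toy corpus standing in for tweets or abstracts.
docs = [
    "sugar soda obesity",
    "sugar obesity diabetes",
    "salad exercise health",
    "soda sugar",
]
probs = conditional_cooccurrence(docs)
print(probs[("sugar", "obesity")])   # P(obesity | sugar)
print(probs[("obesity", "sugar")])   # P(sugar | obesity)
```

The measure is asymmetric: in this toy corpus every document that mentions "obesity" also mentions "sugar", but not the other way around. Pairs that never co-occur are simply absent from the returned dictionary.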


Natural language processing; Word embeddings; Topic modeling; Sentiment analysis; Obesity and healthy dieting




Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Vijaya Kumari Yeruva (1)
  • Sidrah Junaid (1)
  • Yugyung Lee (1)
  1. School of Computing and Engineering, University of Missouri-Kansas City, Kansas City, USA
