Topic Modeling for Exploring Cancer-Related Coverage in Journalistic Texts

  • Naomi Hariman
  • Marjolein de Vries
  • Ionica Smeets
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1021)


Topic modeling has been used in many applications but has not yet been applied to science and health communication research. This paper explores topic modeling in this new domain by investigating the coverage of cancer in New York Times news items since 1970 with a Latent Dirichlet Allocation (LDA) model. Content analyses of cancer in print media have been performed before, but on a much smaller scale and with manual rather than computational analysis. We collected 45,684 cancer-related articles via the New York Times API and built the LDA model on them.

Our results show that breast cancer predominates in news articles compared with other types of cancer, in line with previous studies. In addition, our topic model yields six distinct topics: research on cancer, lifestyle and mortality, the healthcare system, business and insurance issues regarding cancer treatment, environmental politics, and American politics on cancer-related policies.

Because topic modeling is a purely computational technique, the model has more difficulty understanding the meaning of the analyzed text than (most) humans do. Future research will therefore be set up to let the public contribute to the analysis of a topic model.
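To make the LDA step concrete, the following is a minimal sketch of fitting a topic model to tokenized documents. It is not the authors' pipeline: the abstract does not describe their preprocessing, hyperparameters, or tooling. The inference method here (collapsed Gibbs sampling), the function names `fit_lda` and `top_words`, the values of `alpha` and `beta`, and the toy documents are all assumptions chosen for illustration.

```python
import random


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling (illustrative, not the paper's code).

    docs is a list of token lists. Returns the vocabulary, the per-document
    topic counts, and the per-topic word counts.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Resample the topic in proportion to
                # (document-topic count + alpha) * smoothed topic-word probability.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    return vocab, doc_topic, topic_word


def top_words(vocab, topic_word, n=3):
    """Return the n highest-count words for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]


# Invented toy documents, already tokenized; not data from the paper.
TOY_DOCS = [
    ["trial", "drug", "tumor", "study", "drug"],
    ["study", "tumor", "trial", "researchers"],
    ["insurance", "cost", "coverage", "company"],
    ["cost", "insurance", "company", "coverage", "cost"],
]

if __name__ == "__main__":
    vocab, doc_topic, topic_word = fit_lda(TOY_DOCS, num_topics=2)
    for k, words in enumerate(top_words(vocab, topic_word)):
        print(f"topic {k}: {', '.join(words)}")
```

On a corpus of tens of thousands of articles, an optimized library implementation would be the practical choice. The sketch also shows why human interpretation is still needed: the model outputs only ranked word lists per topic, and labels such as "business and insurance issues" have to be assigned by a reader.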


Topic modeling · Cancer · Content analysis



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Naomi Hariman (1, 2)
  • Marjolein de Vries (1, 3)
  • Ionica Smeets (1)
  1. Science Communication and Society, Faculty of Science, Leiden University, Leiden, The Netherlands
  2. Bio-Pharmaceutical Sciences, Leiden University, Leiden, The Netherlands
  3. Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands
