Topic Modeling for Exploring Cancer-Related Coverage in Journalistic Texts
Topic modeling has been used in many applications but has not yet been applied to science and health communication research. In this paper, we explore topic modeling in this new domain by using the Latent Dirichlet Allocation (LDA) model to investigate the coverage of cancer in New York Times news items published since 1970. Content analyses of cancer in print media have been performed before, but on a much smaller scale and with manual rather than computational analysis. We collected 45,684 cancer-related articles via the New York Times API and built the LDA model on them.
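To make the modeling step concrete, the following is a minimal sketch of LDA fitted with collapsed Gibbs sampling, using only the Python standard library. It is not the authors' pipeline: the function names, the hyperparameter values (`alpha`, `beta`, `iterations`), and the toy documents are all assumptions chosen for illustration, and a corpus of tens of thousands of articles would call for an optimized library rather than this loop.

```python
import random


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=100, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    alpha and beta are symmetric Dirichlet priors (illustrative values).
    Returns the vocabulary, per-document topic counts, and per-topic word counts.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return vocab, doc_topic, topic_word


def top_words(vocab, topic_word, n=3):
    """Return the n most frequent words assigned to each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]


# Toy, hand-written documents (not data from the study).
toy_docs = [
    ["cancer", "research", "trial", "drug", "research"],
    ["insurance", "cost", "treatment", "business", "cost"],
    ["research", "drug", "trial", "cancer", "study"],
    ["business", "insurance", "cost", "treatment", "cancer"],
]
vocab, doc_topic, topic_word = fit_lda(toy_docs, num_topics=2)
for topic_index, words in enumerate(top_words(vocab, topic_word)):
    print(topic_index, words)
```

Each topic is then read off from its highest-count words, which is how labels such as "research on cancer" are typically assigned by a human analyst after the model has been fitted.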
Our results show that breast cancer predominates in news articles compared with other types of cancer, which is consistent with previous studies. Additionally, our topic model shows six distinct topics: research on cancer, lifestyle and mortality, the healthcare system, business and insurance issues regarding cancer treatment, environmental politics, and American politics on cancer-related policies.
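One simple way to compare coverage across cancer types is to count how many articles mention each type at least once. The sketch below illustrates that idea; the list of cancer types, the function name, and the sample texts are assumptions, and the matching rule (the type name followed by the word "cancer") is a simplification that may differ from the method used in the study.

```python
import re
from collections import Counter

# Illustrative list of cancer types (an assumption, not taken from the study).
CANCER_TYPES = ["breast", "lung", "prostate", "colon", "skin"]


def count_cancer_types(articles, cancer_types=CANCER_TYPES):
    """Count the articles that mention '<type> cancer' at least once."""
    patterns = {
        cancer_type: re.compile(
            rf"\b{re.escape(cancer_type)}\s+cancer\b", re.IGNORECASE
        )
        for cancer_type in cancer_types
    }
    counts = Counter()
    for text in articles:
        for cancer_type, pattern in patterns.items():
            if pattern.search(text):
                counts[cancer_type] += 1
    return counts


# Hand-written sample texts (not articles from the corpus).
sample_articles = [
    "Breast cancer screening guidelines were revised this week.",
    "A new drug for lung cancer was tested; breast cancer trials will follow.",
    "The senate debated a new budget.",
]
print(count_cancer_types(sample_articles).most_common())
```

Counting articles rather than raw mentions keeps a single long article from dominating the comparison.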
Because topic modeling is a computational technique, the model has more difficulty understanding the meaning of the analyzed text than (most) humans do. Future research will therefore be set up to let the public contribute to the analysis of a topic model.
Keywords: Topic modeling, Cancer, Content analysis