INTRODUCTION

Given the rapidly changing nature of the coronavirus disease 2019 (COVID-19) pandemic, real-time monitoring of COVID-19 cases and deaths has been widely embraced.1 The pandemic has also been accompanied by an “infodemic,” an overabundance of information and misinformation.2 Public response to the pandemic and infodemic is important, but undermeasured.3 Real-time analysis of public response could lead to earlier recognition of changing public priorities, fluctuations in wellness, and uptake of public health measures, all of which carry implications for individual- and population-level health.3 To test this hypothesis, we measured daily changes in the frequency of topics of discussion across 94,467 COVID-19-related comments on an online public forum in March, 2020.

METHODS

Reddit is the 19th most popular website in the world, with 420 million monthly active users.4 Between March 3 and March 31, 2020, we obtained all comments from the “Daily Discussion Post” on “r/Coronavirus,” the most popular COVID-19 subreddit, with 1.9 million members. We defined 50 discussion topics, each a group of commonly co-occurring words, using latent Dirichlet allocation (LDA), a machine learning-based approach to natural language processing.5

For each of the 50 topics, we reviewed the ten words and the comments most associated with it.6 We identified topics that fell into three categories of interest: response to public health measures, impact on daily life, and sense of pandemic severity. We tracked daily variations in the average prevalence of topics across all comments. To improve visualization of patterns of topic change, we used locally estimated scatterplot smoothing (LOESS) lines. To quantify the degree of change in prevalence, we compared 4-day periods using the two-proportion z-test. We used R version 3.6.1 for all analyses. All data were publicly available, and the study was considered exempt under University of Pennsylvania Institutional Review Board guidelines.
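The averaging and smoothing steps described above can be illustrated with a minimal Python sketch. It is not the code used in this study (analyses were run in R), and several details are assumptions made for illustration: the function names `daily_mean_prevalence` and `loess`, the span of 0.75, the use of a local linear (degree 1) fit, and the toy records, which are invented and are not study data. The standard error band shown in Figure 1 is not computed here.

```python
import math
from collections import defaultdict


def daily_mean_prevalence(records):
    """Average one topic's per-comment weight for each day.

    records: iterable of (day, weight) pairs, one pair per comment.
    Returns (sorted days, mean weight per day).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, weight in records:
        totals[day] += weight
        counts[day] += 1
    days = sorted(totals)
    return days, [totals[d] / counts[d] for d in days]


def tricube(u):
    """Tricube kernel: weight falls to 0 once |u| reaches 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def loess(xs, ys, span=0.75):
    """Local linear regression with tricube weights (assumed settings).

    For each x0, fit a weighted least-squares line to the nearest
    span * n points and evaluate that line at x0.
    """
    n = len(xs)
    k = min(n, max(2, math.ceil(span * n)))  # neighbourhood size
    fitted = []
    for x0 in xs:
        h = sorted(abs(x - x0) for x in xs)[k - 1]  # local bandwidth
        if h == 0:
            ws = [1.0 if x == x0 else 0.0 for x in xs]
        else:
            ws = [tricube((x - x0) / h) for x in xs]
        sw = sum(ws)
        mx = sum(w * x for w, x in zip(ws, xs)) / sw
        my = sum(w * y for w, y in zip(ws, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Invented toy input: (day of March, topic weight in one comment).
records = [(1, 0.04), (1, 0.02), (2, 0.03), (2, 0.01),
           (3, 0.02), (4, 0.01), (5, 0.02), (5, 0.00)]
days, means = daily_mean_prevalence(records)
smoothed = loess(days, means)
```

A local linear fit keeps the sketch short. Neither the span nor the polynomial degree is reported here, so the smoothed values from this sketch should not be expected to match the published curves.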

RESULTS

In the 29 days between March 3 and March 31, we collected 94,467 comments from r/Coronavirus daily discussion threads, with peak activity between March 15 and 17 (16% of comments). Of the 50 LDA topics (available by request), ten pertained to the three categories of interest. Other topics included those related to news sharing, political discussions, and discussions about the science of COVID-19. Table 1 shows key topic words and representative comments, and Figure 1 displays the change in topic frequency over time by category. In the “public health measures” category, for instance, “hand washing” became less prevalent throughout March (2.7% from March 3 to March 6 vs 1.9% from March 28 to March 31, p < .001; two-proportion z-test). “Impact on daily life” topics showed “travel” peaking early and dropping throughout the month (3.2% March 3–March 6 vs 1.0% March 28–March 31, p < .001) and concern regarding “personal finances” increasing (1.5% March 3–March 6 vs 2.1% March 28–March 31, p = .003). “Sense of pandemic severity” evolved over the month, with fewer comments comparing COVID-19 with the flu (2.3% March 3–March 6 vs 1.8% March 28–March 31, p = .04) and mid- to late-month growth in comments reporting numbers of cases and deaths (2.1% March 12–March 15 vs 2.7% March 28–March 31, p = .001).
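For readers unfamiliar with the two-proportion z-test behind these comparisons, the arithmetic can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's analysis code. The function name `two_proportion_z_test` is hypothetical, and the per-period denominators (10,000 comments each) are invented because the period-level counts are not reported here. Only the 2.7% and 1.9% proportions come from the text. The sketch uses the pooled standard error with no continuity correction.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions.

    x1, x2: number of "successes" in each period.
    n1, n2: number of observations in each period.
    Returns (z statistic, two-sided p value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided normal tail area: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts chosen only to reproduce 2.7% vs 1.9%;
# the denominators are NOT study data.
z, p = two_proportion_z_test(270, 10_000, 190, 10_000)
print(f"z = {z:.2f}, p = {p:.5f}")
```

Because the p value depends on the number of comments in each period, the value printed for these invented denominators should not be read as a reproduction of the reported result.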

Table 1 Latent Dirichlet Allocation Topics from a Coronavirus Subreddit Throughout March, 2020, with a Collection of Top Words Used to Define the Topic and a Redacted Representative Reddit Comment
Fig. 1 The change in prevalence over the month of March, 2020, in Reddit comment content related to (a) public health measures, (b) daily life impact, and (c) sense of pandemic severity. Lines show locally estimated scatterplot smoothing (LOESS) for the daily average prevalence of the topic across all comments; shaded grey area represents the standard error of the LOESS estimation.

DISCUSSION

This analysis indicates that longitudinal topic modeling of Reddit content is effective in identifying patterns of public dialogue and could be used to guide targeted interventions. For instance, comparisons to the flu were embraced by the public. Early recognition of this reality could have led to more specific information dissemination campaigns and earlier public acknowledgement of disease severity. Questions about safely spending time outdoors peaked in mid-March, representing a missed opportunity for public guidance. Tracking and responding proactively to common questions, such as what material is best used for a homemade mask, may minimize the spread of misinformation. Notably missing from these Reddit topics were discussions of contact tracing, a growing area of public concern. Limitations of this study include that Reddit users are not representative of all segments of the population and that Reddit data are not associated with a geographic location. Real-time monitoring of online COVID-19 dialogue holds promise for more dynamically understanding and responding to needs in public health emergencies.