The first year of the Covid-19 pandemic through the lens of r/Coronavirus subreddit: an exploratory study

Tan, Zachary; Datta, Anwitaman

doi:10.1007/s12553-023-00734-6

Original Paper
Published: 21 February 2023

Volume 13, pages 301–326, (2023)

Health and Technology


Abstract

Data

This study examines the content of Reddit’s COVID-19 community, r/Coronavirus, to capture and understand the main themes and discussions around the global pandemic and their evolution over its first year. It studies 356,690 submissions (posts) and the 9,413,331 comments associated with them, covering the period from 20th January 2020 to 31st January 2021.

Methodology

On each of these datasets we carried out analyses based on lexical sentiment and on topics generated by unsupervised topic modelling. The study found that negative sentiments outnumbered positive ones in submissions, whereas negative and positive sentiments occurred in roughly equal proportions in comments. Terms associated more positively or negatively were identified. By assessing upvotes and downvotes, this study also uncovered contentious topics, particularly “fake” or misleading news.

Results

Through topic modelling, 9 distinct topics were identified from submissions while 20 were identified from comments. Overall, this study provides a clear overview of the dominant topics and popular sentiments pertaining to the pandemic during its first year.

Conclusion

Our methodology provides an invaluable tool for governments and health decision makers and authorities to obtain a deeper understanding of the dominant public concerns and attitudes, which is vital for understanding, designing and implementing interventions for a global pandemic.


1 Introduction

The outbreak of the 2019 novel coronavirus (COVID-19) has dominated global news since 2020 and continued to do so as of the original writing of this paper, well past mid-2021. With its first reported cases in Wuhan, China, the highly infectious virus spread rapidly and was declared a global pandemic on 11 March 2020 [1]. To ‘flatten the curve’ and reduce the burden on overwhelmed healthcare systems, many countries worldwide implemented a series of measures, such as travel bans and temporary closures of public spaces and businesses. In a time of reduced in-person social interactions, social media became the best way to stay connected with other people. It also provided an outlet for people to express the frustrations and anxieties caused by the virus and by the impacts of the restrictive measures on individuals and on society at large.

Apart from the social element, social media has also become a popular way to get the latest news. One such platform is Reddit, which supports both sharing news (links/content) and carrying out lengthy discussions. Users can express themselves and debate topics in great detail, since comments on Reddit have a 10,000-character limit (as opposed to the 280-character limit on Twitter), and Reddit allows pseudonymous participation. Home to more than 400 million monthly users [2], Reddit is a social news aggregator that features a collection of news articles, text posts and visual content submitted by users. Users then curate the submissions and their comments with upvotes and downvotes. The forum is divided into communities called subreddits, each revolving around a central theme or topic. This research focuses on one such subreddit, r/Coronavirus, which discusses issues related to COVID-19. Two aspects that distinguish Reddit from conventional social media platforms such as Facebook are user anonymity and user curation. These distinctions have fostered an environment where truthful, unfiltered sentiments might be more easily shared, and popular ones are upvoted and showcased.

As more people use social media to find and share news [3], social media content has become an invaluable data source for mining trends and sentiments. Recent works have focused on extracting and analysing online sentiments with respect to COVID-19. For instance, Yin et al. [4] proposed a framework to analyse topic and sentiment dynamics due to COVID-19 based on tweets from Twitter over a span of two weeks. Another study by Low et al. [5] utilised natural language processing (NLP) techniques to characterise changes in fifteen subreddits focused on mental health before and during the pandemic. However, more research is necessary to analyse the shifts in popular topics and sentiments over a longer timespan to identify patterns and trends.

This study is exploratory in nature and aims to gain insight into the public’s sentiments and attitudes towards COVID-19, both for general understanding and because such insight may help policymakers and health officials understand public opinions and perceptions and respond more effectively to people’s concerns. We do so by identifying popular sentiments and topics on the subreddit r/Coronavirus and evaluating how these elements evolved over the span of a year, from the virus outbreak to the rollout of vaccines.

Data comprising submissions and comments was first crawled from r/Coronavirus and preprocessed. Subsequently, the Valence Aware Dictionary and Sentiment Reasoner (VADER) [6] was employed to extract and understand sentiments from the data. A series of topics was also mined using the latent Dirichlet allocation (LDA) model for further analysis. These methods help to achieve a more comprehensive overview of the various sentiments and trends in the year following the COVID-19 outbreak. Our findings revealed that while there are 9 distinct themes in Reddit submissions, the comments in the subreddit are more diverse, with 20 distinct topics. The sentiment of submissions is generally more negative than positive due to the nature of COVID-19 news, while negative and positive sentiments occur in fairly equal proportions in comments. We also observed through upvoted and downvoted comments that the r/Coronavirus community generally disapproved of comments that did not treat the COVID-19 virus seriously.

Overall, the contribution of this work is exploratory in nature, and its value lies in (i) establishing summary insights from r/Coronavirus subreddit data, (ii) creating, in the process, a curated dataset that can serve as a valuable resource for future studies by the research community, and (iii) providing the accompanying code (also available at the same link) and a methodology that establish a baseline approach, which can be reused and extended to continue gathering similar insights over time, if the pandemic persists.

Next, we review past studies of COVID-19 on social media, followed by this study’s methods of data collection and preparation, and then we delve deeper with the analysis and evaluation of our findings.

2 Literature review

In the early stages of COVID-19, there were several studies on the sentiments present on social media. Twitter was amongst the social media platforms for which such research was carried out. One study analysed the topic and sentiment dynamics surrounding COVID-19 based on a compilation of 13 million tweets over two weeks [4]. Another study mined a collection of 107,990 tweets related to COVID-19 in the first three months of the outbreak and identified topics using the LDA model [7]. Xue et al. [8] also utilised the LDA model to analyse 1.9 million tweets and discovered 11 topics related to COVID-19.

Despite Twitter’s popularity, it is greatly limited in the length of its average content. Given that the most common length of a tweet is 33 characters [9], the data may be limited, which may consequently affect the comprehensiveness of topics and evaluation. The studies also focused on data within a maximum span of three months, which is a relatively short duration compared to the length of the pandemic. Therefore, there is a need to collect longer text data over an extended period for a more complete analysis. To address this gap, we have chosen a period of approximately a year, from the time the pandemic became public knowledge until the rollout of the first vaccines, which in retrospect can be viewed as the early (pre-vaccines) stage of the pandemic. While the pandemic is still an ongoing event, and vaccination is happening across the globe in a very heterogeneous manner, we believe that in future, the social discourse on COVID-19 may be decomposed into three logical phases: pre-vaccination, the initial vaccination phase, and eventually the period when society has started to view COVID-19 as endemic with its impact reasonably under control. In that context, the current study captures the first (pre-vaccines) phase.

3 Methods

3.1 Reddit dataset

3.1.1 Data collection

Reddit is organized into theme- or topic-centric subreddits. Users post submissions within a given subreddit; a submission comprises a title, often along with a URL referencing some online content, and possibly a further body of text. Users then post comments within a submission thread, in response to the original post or to other comments. Users can also upvote or downvote the original submission, as well as individual comments.
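The thread structure described above can be sketched as a small data model. This is only an illustration: the class and field names (`Submission`, `Comment`, `parent_id`, `selftext`, and the helper `reply_depth`) are assumptions chosen for readability and are not taken from the study's code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Comment:
    comment_id: str
    parent_id: str  # id of the submission or of another comment (assumed naming)
    body: str
    score: int = 0  # aggregate of upvotes and downvotes


@dataclass
class Submission:
    submission_id: str
    title: str
    url: Optional[str] = None       # link to external content, if any
    selftext: Optional[str] = None  # optional body text
    score: int = 0
    comments: List[Comment] = field(default_factory=list)


def reply_depth(submission: Submission, comment: Comment) -> int:
    """Count hops from a comment up to the submission (1 = direct reply)."""
    by_id = {c.comment_id: c for c in submission.comments}
    depth = 1
    current = comment
    # Walk up the reply chain until the parent is the submission itself.
    while current.parent_id != submission.submission_id:
        current = by_id[current.parent_id]
        depth += 1
    return depth
```

A comment replying directly to the post has depth 1, and a reply to that comment has depth 2, which mirrors the two kinds of response mentioned in the text.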

Data was retrieved from the subreddit r/Coronavirus using the Pushshift API. A total of 356,690 submissions and 9,413,331 comments posted between 20th January 2020 (the beginning of the subreddit) and 31st January 2021 were extracted. The features of the data are shown in Table 1.

Table 1 Attributes available with the data collected. The aggregate votes are represented by the ‘score’ attribute
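As a minimal sketch of how already-downloaded records could be restricted to the study window, the snippet below filters by timestamp. It assumes each record is a dictionary carrying a Unix timestamp under `created_utc` and that the window is interpreted in UTC; both are assumptions for illustration, and the function names are hypothetical rather than part of the study's code.

```python
from datetime import datetime, timezone

# Study window from the text: 20 January 2020 to 31 January 2021 (inclusive).
# The end bound is exclusive, so 1 February 2021 keeps all of 31 January.
START = datetime(2020, 1, 20, tzinfo=timezone.utc)
END = datetime(2021, 2, 1, tzinfo=timezone.utc)


def in_study_window(record: dict) -> bool:
    """Return True if the record's timestamp falls inside the study window."""
    # "created_utc" is an assumed field name holding seconds since the epoch.
    created = datetime.fromtimestamp(record["created_utc"], tz=timezone.utc)
    return START <= created < END


def filter_records(records: list) -> list:
    """Keep only the submissions or comments posted within the window."""
    return [r for r in records if in_study_window(r)]
```

Using a half-open interval avoids having to spell out the last second of 31st January 2021.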
