1 Introduction

The outbreak of the 2019 novel coronavirus (COVID-19) has dominated global news since 2020 and continued to do so as of the original writing of this paper, well past mid-2021. With its first reported cases in Wuhan, China, the highly infectious virus spread rapidly and was declared a global pandemic on 11 March 2020 [1]. To ‘flatten the curve’ and reduce the burden on overwhelmed healthcare systems, many countries worldwide implemented a series of measures, such as travel bans and temporary closures of public spaces and businesses. In a time of reduced in-person social interaction, social media became a primary way to stay connected with other people. It also provided an outlet for people to express the frustrations and anxieties caused by the virus and by the impact of the restrictive measures on individuals and on society at large.

Apart from the social element, social media has also become a popular way to get the latest news. One such platform is Reddit, which supports both sharing news (links/content) and carrying out lengthy discussions. Comments on Reddit have a 10,000-character limit (as opposed to the 280-character limit on Twitter), so users can discuss and debate topics in great detail, and Reddit allows pseudonymous participation. Home to more than 400 million monthly users [2], Reddit is a social news aggregator that features a collection of news articles, text posts and visual content submitted by users. Users then curate the submissions and their comments with upvotes and downvotes. The forum is divided into communities called subreddits, each revolving around a central theme or topic. This research focuses on one such subreddit, r/Coronavirus, which discusses issues related to COVID-19. Two aspects that distinguish Reddit from conventional social media platforms such as Facebook are user anonymity and user curation. These distinctions have fostered an environment where truthful, unfiltered sentiments might be more easily shared, and popular ones are upvoted and showcased.

As more people use social media to find and share news [3], social media content has become an invaluable data source for mining trends and sentiments. Recent works have focused on extracting and analysing online sentiments with respect to COVID-19. For instance, Yin et al. [4] proposed a framework to analyse topic and sentiment dynamics due to COVID-19 based on tweets from Twitter over a span of two weeks. Another study by Low et al. [5] utilised natural language processing (NLP) techniques to characterise changes in fifteen subreddits focused on mental health before and during the pandemic. However, more research is necessary to analyse the shifts in popular topics and sentiments over a longer timespan to identify patterns and trends.

This study is exploratory in nature and aims to gain insight into the public’s sentiments and attitudes towards COVID-19, both for general understanding and because such insight may help policymakers and health officials understand public opinions and perceptions and respond more effectively to people’s concerns. We do so by identifying popular sentiments and topics on the subreddit r/Coronavirus and evaluating how these elements evolved over the span of a year, from the virus outbreak to the rollout of vaccines.

Data comprising submissions and comments was first crawled from r/Coronavirus and preprocessed. Subsequently, the Valence Aware Dictionary and Sentiment Reasoner (VADER) [6] was employed to extract sentiments from the data. Topics were also mined using the latent Dirichlet allocation (LDA) model for further analysis. Together, these methods give a more comprehensive overview of the sentiments and trends in the first year of COVID-19. Our findings revealed that while there are 9 distinct themes in Reddit submissions, the comments in the subreddit are more diverse, with 20 distinct topics. The sentiment of submissions is generally more negative than positive due to the nature of COVID-19 news, while negative and positive sentiments in comments are fairly balanced. We also observed through upvoted and downvoted comments that the r/Coronavirus community generally disapproved of comments that did not treat COVID-19 seriously.

Overall, the contribution of this work is exploratory in nature, and its value lies in (i) the summary insights established from r/Coronavirus subreddit data; (ii) the curated dataset created in the process, which would serve as a valuable resource for future studies by the research community; and (iii) the accompanying code (also available at the same link) and methodology, which establish a baseline approach that can be reused and extended to continue gathering similar insights over time, if and as the pandemic persists.

Next, we review past studies of COVID-19 on social media, followed by this study’s methods of data collection and preparation, and then we delve deeper with the analysis and evaluation of our findings.

2 Literature review

In the early stages of COVID-19, there were several studies on the sentiments present on social media. Twitter was amongst the social media platforms for which such research was carried out. One study analysed the topic and sentiment dynamics surrounding COVID-19 based on a compilation of 13 million tweets over two weeks [4]. Another study mined a collection of 107,990 tweets related to COVID-19 in the first three months of the outbreak and identified topics using the LDA model [7]. Xue et al. [8] also utilised the LDA model to analyse 1.9 million tweets and discovered 11 topics related to COVID-19.

Despite Twitter’s popularity, it is greatly limited in the length of its average content. Given that the most common length of a tweet is 33 characters [9], the data may be limited, which may consequently affect the comprehensiveness of topics and evaluation. The studies also focused on data within a maximum span of three months, which is a relatively short duration compared to the length of the pandemic. Therefore, there is a need to collect longer text data over an extended period for a more complete analysis. To address this gap, we have chosen a period of approximately a year, from the time the pandemic became public knowledge until the first vaccine rollout, which in retrospect can be viewed as the early (pre-vaccine) stage of the pandemic. While the pandemic is still ongoing and vaccination is progressing very unevenly across the globe, we believe that in future the social discourse on COVID-19 may be decomposed into three logical phases: pre-vaccination, the initial vaccination phase, and the phase in which society has come to view COVID-19 as endemic with its impact reasonably under control. In that context, the current study captures the first (pre-vaccine) phase.

3 Methods

3.1 Reddit dataset

3.1.1 Data collection

Reddit is organized into theme- or topic-centric subreddits. Users post submissions within a given subreddit; a submission comprises a title, often along with a URL referencing some online content, and possibly a further body of text. Users then post comments within a submission thread, in response to the original post or to other comments. Users can also upvote or downvote the original submission, as well as individual comments.

Data was retrieved from the subreddit r/Coronavirus using the Pushshift API. A total of 356,690 submissions and 9,413,331 comments posted between 20th January 2020 (the beginning of the subreddit) and 31st January 2021 were extracted. The features of the data are shown in Table 1.

Table 1 Attributes available with the data collected. The aggregate votes are represented by the ‘score’ attribute

3.1.2 Data preprocessing

In the submission dataset, discussion threads submitted by the subreddit’s moderators were removed. Submissions with titles that were not in English were also removed using Python’s langdetect library. In the comment dataset, deleted, removed and duplicated comments, as well as comments posted by bots, were removed. Subsequently, URLs, email addresses and special characters (see Footnote 1) were stripped from the remaining submissions and comments as they did not contribute meaning to the text. This process improved the quality of tokens extracted in the later stages for sentiment analysis and topic modelling. Data that were reduced to empty strings by text cleaning were also removed. Figure 1 shows the composition of comments that were filtered out. The final cleaned data comprised 90.3% of the comments originally retrieved.
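The cleaning steps for comments can be sketched as follows. This is a minimal illustration rather than the study's actual code: the regular expressions, the definition of "special characters" and the function names are assumptions, and bot detection and language filtering are omitted.

```python
import re

# Hypothetical patterns; the exact cleaning rules are not given in the text.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
SPECIAL_RE = re.compile(r"[^A-Za-z0-9\s.,!?'-]")  # assumed meaning of "special characters"
REMOVED_MARKERS = {"[deleted]", "[removed]"}  # placeholder bodies Reddit uses


def clean_text(text: str) -> str:
    """Strip URLs, email addresses and special characters, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_comments(comments):
    """Drop deleted, removed and duplicate comments, clean the rest, drop empty results."""
    seen = set()
    cleaned = []
    for body in comments:
        if body.strip() in REMOVED_MARKERS or body in seen:
            continue
        seen.add(body)
        text = clean_text(body)
        if text:  # cleaning can leave an empty string, e.g. a URL-only comment
            cleaned.append(text)
    return cleaned
```

URLs are removed before email addresses and special characters so that punctuation inside a link does not leave stray fragments behind.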

Fig. 1 Distribution of comment dataset

In preparation for the topic modelling phase, the submission and comments datasets were sanitised further. Both datasets were pre-processed by the lemmatisation of tokens, removal of stop words and phrase extraction. Terms that were redundant in submission texts, such as ‘coronavirus’ and ‘covid’ were also removed from the submission dataset.

This was followed by the filtering of submissions with fewer than 5 tokens and comments with fewer than 7 tokens, as these did not contribute substantially useful information. The difference in minimum token count was due to the headline-like nature of submissions (and thus submission texts being inherently shorter) and the conversational nature of comments.

A total of 268,109 submissions constitute the submission dataset for sentiment analysis and topic modelling; 8,499,244 comments (short comments preserved) constitute the comment dataset for sentiment analysis, while a smaller subset of 5,243,419 comments was used for topic modelling.

3.2 Sentiment analysis

3.2.1 Comments and submissions with VADER

Sentiment analysis involves the contextual mining of text to extract its sentiment polarity. In this study, VADER (Valence Aware Dictionary and Sentiment Reasoner, [6]) is utilised to identify whether each submission or comment is positive, negative, or neutral. VADER is a lexicon- and rule-based sentiment analysis tool that is specifically adapted to sentiments expressed on social media. It takes into consideration negation, punctuation, sentiment-related initialisms and commonly used slang [6], making it more applicable for sentiment analysis of informal content on Reddit than other popular tools such as TextBlob and NLTK [10].

VADER’s sentiment analyser was applied to each submission and comment individually. Each submission or comment is thus given a set of scores, of which the most important is the compound score, which lies between -1 and 1. The score represents the overall sentiment of the given text, where -1 is the extreme negative and 1 is the extreme positive. A text with a score strictly between -0.05 and 0.05 is classified as neutral; a score greater than or equal to 0.05 is classified as “positive”, while a score less than or equal to -0.05 is classified as “negative”.
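The thresholding rule can be written down directly. The sketch below takes a compound score as input, so it runs without the VADER package; the function names are illustrative. In practice the compound score would come from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` in the `vaderSentiment` package.

```python
from collections import Counter


def classify_compound(compound: float) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


def tally_sentiments(compound_scores):
    """Count how many texts fall into each sentiment class."""
    return Counter(classify_compound(score) for score in compound_scores)
```

Both boundary values, 0.05 and -0.05, fall into the polar classes, so only scores strictly inside the interval are labelled neutral.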

Note that a phrase such as “tested positive” is scored as positive because of the word “positive”, but is in fact negative in this context. Therefore, the terms “positive” and “negative” were swapped prior to sentiment analysis to improve the precision of sentiment scores, and restored afterwards. An example of the impact of the swapping is illustrated in Table 2.
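One way to implement the swap is a single-pass regular-expression substitution, so that a word swapped once is not swapped back within the same call. This is a sketch under the assumption that only the whole words "positive" and "negative" are exchanged; the helper name is hypothetical.

```python
import re

_SWAP = {"positive": "negative", "negative": "positive"}
_SWAP_RE = re.compile(r"\b(positive|negative)\b", flags=re.IGNORECASE)


def swap_polarity_terms(text: str) -> str:
    """Exchange the words 'positive' and 'negative', preserving simple capitalisation."""

    def _replace(match):
        word = match.group(0)
        swapped = _SWAP[word.lower()]
        if word.isupper():
            return swapped.upper()
        if word[0].isupper():
            return swapped.capitalize()
        return swapped

    return _SWAP_RE.sub(_replace, text)
```

Because the mapping is its own inverse, applying the same function again after scoring restores the original text.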

Table 2 Example of sentiment analysis with the term 'positive'

To capture the differences between the submissions with positive sentiments and those with negative sentiments, word clouds were generated using the wordcloud library in Python. Positive and negative submissions were treated as two separate documents and converted to a bag-of-words model using Scikit-Learn’s CountVectorizer with “ngram_range = (1,3)”, which generates unigrams, bigrams and trigrams. The result is a 2 × 2,151,767 sparse matrix, an excerpt of which is shown in Table 3.

Table 3 Bag-of-words model

To better capture the differences in submissions of opposing polarity, log odds ratio was calculated for each token in the matrix. This method helps to identify tokens that are more common in the positive submissions than in the negative submissions, and vice versa. The formula below is applied to each value in the matrix:

$$\log_2\left[\frac{\left(\frac{1+\,{frequency}\;\mathrm{of}\;X\;\mathrm{in}\;A}{1+\,{total}\;\mathrm{word}\;\mathrm{count}\;\mathrm{in}\;A}\right)}{\left(\frac{1+\,{frequency}\;\mathrm{of}\;X\;\mathrm{in}\;B}{1+\,{total}\;\mathrm{word}\;\mathrm{count}\;\mathrm{in}\;B}\right)}\right]$$
(1)

where X is a token, and A and B are different corpora. Applying this formula to the matrix yields a matrix of log odds ratios, an excerpt of which is shown in Table 4. Subsequently, word clouds were generated using both matrices to visually identify and interpret the most prominent words and corresponding themes.
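Equation (1) can be illustrated with a few lines of plain Python. The sketch assumes the token counts of the two corpora are held in dictionaries; the function names are illustrative and the study itself worked on a sparse matrix rather than dictionaries.

```python
import math


def log_odds_ratio(freq_a, total_a, freq_b, total_b):
    """Eq. (1): add-one smoothed log2 ratio of a token's relative frequency in A versus B."""
    return math.log2(((1 + freq_a) / (1 + total_a)) / ((1 + freq_b) / (1 + total_b)))


def log_odds_table(counts_a, counts_b):
    """Apply Eq. (1) to every token that occurs in either corpus."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    vocabulary = set(counts_a) | set(counts_b)
    return {
        token: log_odds_ratio(counts_a.get(token, 0), total_a, counts_b.get(token, 0), total_b)
        for token in vocabulary
    }
```

A positive value means the token is relatively more common in corpus A, a negative value means it is more common in B, and the add-one terms keep the ratio finite for tokens absent from one corpus.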

Table 4 Matrix with log odds ratio

3.2.2 Score-weighted sentiment score

To reflect the sentiment of the subreddit’s community more accurately, the “score” attribute of each submission was used to compute a score-weighted sentiment. The score indicates the net popularity of the content; taking it into consideration may therefore allow clearer comparisons of the popularity of negative and positive content on the subreddit.

Table 5 describes the score in the data, which ranges from 0 to 77,296. The minimum score is zero because a submission can only be downvoted to zero, irrespective of the true number of downvotes it received. Nonetheless, the score still reveals the popular submissions on the subreddit. Reddit’s practice of making popular content more visible explains the strongly right-skewed distribution observed in Fig. 2. To reduce the skew, a log transformation is applied to the score and the resulting value is used as a multiplier. The range of the multiplier is shown in Fig. 2. The score-weighted sentiment is then derived using the following equation:

Table 5 Statistics of submission score
Fig. 2 Multiplier values after log transformation of score

$$\text{Score-weighted sentiment}={\mathrm{log}}_{10}\left(score+1\right) \times \text{compound score}$$
(2)
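Equation (2) translates into a one-line function. The sketch below is illustrative only; the function name is an assumption.

```python
import math


def score_weighted_sentiment(score: int, compound: float) -> float:
    """Eq. (2): weight a compound sentiment score by log10(score + 1)."""
    if score < 0:
        raise ValueError("submission scores are assumed to be non-negative")
    return math.log10(score + 1) * compound
```

Adding 1 inside the logarithm keeps the multiplier defined for a score of 0, where it equals 0. A submission left at its default score of 1 gets a multiplier of log10(2), roughly 0.30, while the maximum observed score of 77,296 gives a multiplier just under 5.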

By design, submissions have a score of 1 at the time of creation. Since 85.9% of submissions in the dataset have a score of 1, they were likely not seen or upvoted by the community. A small but nevertheless significant portion of the submissions attracted moderate popularity, while very few submissions attained extremely high popularity. Both of these latter groups are of interest. Those attaining moderate popularity capture subjects that are generally of concern to the public, while the extremely popular (outlier) posts help identify topics of mass appeal. For instance, an animated cartoon explaining the effect of flattening the curve, how it relates to the healthcare system’s capacity to handle a peak load, and its impact on eventual mortality became one of the most popular submissions. Other informative posts on prevention, and posts about acts of selfless service and sacrifice, were among the prominently upvoted posts, indicating that the public was using this forum for information that both helped with practicalities and boosted morale.

3.2.3 Comment score

The score of a comment indicates the net number of upvotes and downvotes it has received from the community. Table 6 shows the range of the score in the comment dataset. As each comment is given a default score of 1, it is unsurprising that 67% of comments have a score of 1. Comments with a negative score are interpreted as unpopular in the subreddit, while those with a positive score may be viewed as popular. The Pearson correlation coefficient between the sentiment score of a comment and its net score (votes) is 0.000326, indicating that there is no linear relationship between these two variables. This can be observed in the scatter plot displayed in Fig. 3.
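For reference, the Pearson coefficient used here can be computed as below. This is a plain-Python sketch of the standard formula, not the code used in the study; in practice a library routine would normally be used.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A coefficient near zero rules out a linear association only: for example, `pearson_r([1, 2, 3, 4], [1, -1, -1, 1])` is 0 even though the second sequence is a deterministic, U-shaped function of the first.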

Table 6 Statistics of comment score
Fig. 3 Scatter plot with sentiment score on the y-axis and log-transformed score on the x-axis. The blue dots represent downvoted comments while the red dots represent upvoted comments

As such, whether a comment was positive or negative in nature did not determine its (un)popularity; instead, the score was more representative of the subreddit’s opinion of the comment’s veracity and utility. Hence, based on the statistics shown in Table 7, upvoted comments with a score in the upper 25% and downvoted comments with a score in the lower 25% were selected to provide insights into the popular and unpopular opinions on r/Coronavirus. These two sets of data were subsequently used to build two corresponding corpora of unigrams and bigrams. The bag-of-words model and log odds ratio method were applied, followed by the generation of word clouds, which we revisit later in the paper.

Table 7 Score statistics of upvoted and downvoted comments

3.3 Topic modelling

Python’s gensim library was used to create LDA (latent Dirichlet allocation) models for each of the submission and comment datasets. LDA is an unsupervised, generative statistical model that assumes each document (in this case, a submission or comment) to be a mixture of topics and each topic a mixture of words. To find the topic representation of each document and the words that contribute to each topic, LDA goes through each document and randomly assigns each word to one of the K topics. It then iteratively refines these assignments over multiple passes to produce a final set of topics and their associated words.

To achieve the optimal number of topics for each model, multiple models with varying numbers of topics were generated. The gensim topic coherence pipeline [11] was implemented on each model to calculate its coherence value ‘c_v’ to measure the extent of semantic similarity between high scoring words of topics. The model that gave the highest coherence value was selected for further study.
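The model selection loop can be sketched with gensim as follows. The helper names, the `passes` and `seed` defaults and the input format (a list of token lists) are assumptions for illustration; the gensim imports are placed inside the function so that the selection helper can be used on its own.

```python
def coherence_for_k(tokenised_docs, num_topics, passes=1, seed=0):
    """Train one LDA model and return its c_v coherence (requires gensim)."""
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(tokenised_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenised_docs]
    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=passes,
        random_state=seed,  # fixed seed so repeated runs are comparable
    )
    coherence_model = CoherenceModel(
        model=lda, texts=tokenised_docs, dictionary=dictionary, coherence="c_v"
    )
    return coherence_model.get_coherence()


def best_num_topics(coherence_by_k):
    """Return the number of topics with the highest coherence value."""
    return max(coherence_by_k, key=coherence_by_k.get)
```

A typical use would build `coherence_by_k = {k: coherence_for_k(docs, k) for k in candidate_ks}` and pass the result to `best_num_topics`.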

3.3.1 Submissions

A total of 268,109 pre-processed submissions were used to create a dictionary of 16,069 unique tokens and bigrams, retaining only tokens that appeared in more than 5 submissions. The dictionary was then applied to the submission dataset to create a bag-of-words corpus, which was used to generate six LDA models with 3, 5, 7, 9, 11 and 13 topics. Evaluation of their coherence is shown in Fig. 4, where the model with 9 topics had the highest coherence value. The final LDA model with 9 topics was regenerated with 20 passes.
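The dictionary filtering and bag-of-words conversion can be illustrated without gensim. The sketch below uses hypothetical helper names and counts document frequency, that is, the number of submissions in which a token appears at least once.

```python
from collections import Counter


def build_vocabulary(tokenised_docs, min_docs_exclusive=5):
    """Keep tokens that appear in more than `min_docs_exclusive` documents."""
    doc_freq = Counter()
    for doc in tokenised_docs:
        doc_freq.update(set(doc))  # count each token once per document
    return {token for token, df in doc_freq.items() if df > min_docs_exclusive}


def to_bow(doc, vocabulary):
    """Bag-of-words counts for one document, restricted to the vocabulary."""
    return dict(Counter(token for token in doc if token in vocabulary))
```

In gensim, a comparable filter is `Dictionary.filter_extremes`; note that its `no_below` argument is inclusive, so "more than 5" would correspond to `no_below=6`, and its other defaults (`no_above`, `keep_n`) apply additional filtering unless overridden.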

Fig. 4 Coherence values of LDA models for submissions

3.3.2 Comments

5,243,419 pre-processed comments were used to create a filtered dictionary of 62,017 unique tokens and bigrams. The dictionary was then applied to the comment dataset to create a bag-of-words corpus that generated six separate LDA models with 4, 8, 12, 16, 20 and 24 topics. As illustrated in Fig. 5, the model with 20 topics produced the highest coherence value, hence it was selected for topic identification and analysis.

Fig. 5 Coherence values of LDA models for comments

4 Results and discussion

4.1 Submissions

4.1.1 Sentiment analysis

Figure 6 shows the countries with confirmed cases of COVID-19 as of 28 February 2020. Submissions with positive, negative, and neutral sentiments were tallied for each week and plotted over the whole year of data, as shown in Fig. 7. We see a huge spike in posts during the early months, when it was realized that COVID-19 was becoming a global pandemic. Throughout the whole period, the number of negative submissions was consistently higher than the number of positive submissions. This reflects, unsurprisingly, that the news tended to be negative.

Fig. 6 Countries, territories, or areas with reported confirmed cases of COVID-19, 28 February 2020. Italy, China, South Korea, and Japan were amongst the first few countries to report high number of COVID-19 cases

Fig. 7 Weekly count of submissions of varying sentiments. The orange, blue, and green lines represent the trend of positive, negative, and neutral submissions across the year. The red line represents the trend in the number of new reported COVID-19 cases worldwide [12]. The peak in submissions aligns with the sudden uptick (red circle) in new reported COVID-19 cases globally, which led to a declaration of a global pandemic in March 2020 by WHO

Submissions on r/Coronavirus began in the 3rd week of 2020, when the first confirmed cases of the novel virus were reported outside China [13]. The subreddit gradually received more submissions during February (weeks 5 to 8), when outbreaks occurred in several countries such as Italy and South Korea and first cases surfaced in numerous other countries, as shown in Fig. 6. From week 8 onwards, the number of submissions increased rapidly and reached its peak in week 10, when the WHO (World Health Organization) declared the COVID-19 outbreak a pandemic on 11th March [1]. Following this, the number of submissions fell quickly back to pre-March levels despite the continuous rise in new cases. This suggests that March was the point at which the virus began to be taken more seriously owing to the influx of negative news, after which a form of ‘normalization’ set in. This influx also resulted in the biggest gap between positive and negative submissions in March.

To evaluate the extent of the positivity and negativity of the subreddit, the score-weighted sentiment described in Section 3.2.2 is employed. As the votes received by each submission signify its popularity on the subreddit, the score-weighted sentiment gives greater weight to the sentiment polarity of submissions with a higher number of votes.

Figure 8 shows the score-weighted sentiment scores of each submission in every month. It can be observed that in most months, more submissions have extreme negative scores than extreme positive ones. A sample of submissions is shown in Table 8. This reinforces our previous observation that the popularity of a submission is not directly tied to the sentiment it carries, but rather is aligned with the nature of the content itself. In particular, in this subreddit, factual information (e.g., news of quarantine violations, projected death toll) and scientifically sound ideas (e.g., pro-mask, pro-vaccine) were promoted, while baseless fear mongering and unscientific conspiracy theories were censored by public consensus through downvotes. We note that some other subreddits (not captured in our study) are particular hotbeds of conspiracy theories and fringe unscientific ideas, where the situation would likely be flipped.

Fig. 8 Score-weighted sentiment score of submissions

Table 8 Sample of positive and negative submissions with high score-weighted sentiment score

The word clouds in Figs. 9 and 10 allow comparisons of positive and negative submissions based on word frequency. In Fig. 9, words such as ‘help’ and ‘vaccine’ refer to more resources and to an expected end to, or relief from, the pandemic, and are hence associated with positive sentiments. In Fig. 10, ‘death’ and ‘positive’ mainly refer to COVID-19-related deaths and positive COVID-19 cases.

Fig. 9 Word cloud of positive submissions (word frequency)

Fig. 10 Word cloud of negative submissions (word frequency)

The differences between positive and negative submissions become more distinct when comparing Figs. 11 and 12, which are word clouds generated using the log odds ratio of tokens in positive and negative submissions, respectively. The positive association of the phrases “health authority update” and “negative case” suggests that they are often used in positive headlines, such as negative test results for COVID-19; the same holds for “safe effective”, which refers to vaccines. On the other hand, the distinctive phrases in negative submissions are “total cases dead”, “daily death toll” and “UK death toll”. This strongly indicates that negative news on r/Coronavirus is predominantly about COVID-19 deaths.

Fig. 11 Word cloud of positive submissions (log odds ratio)

Fig. 12 Word cloud of negative submissions (log odds ratio)

The Kendall Tau correlation scores were also computed between the top N most frequently appearing tokens in positive and negative submissions, as shown in Table 9. The low Tau scores indicate that there is little monotonic relation between the ranks of tokens in the positive and negative submissions. Some tokens appeared frequently in both positive and negative submissions (as shown in Figs. 9 and 10), but as N increases, the Tau score decreases, indicating that there is little correlation between tokens in negative submissions and those in positive submissions. This implies that the popular topics captured with positive sentiment are very distinct from those with negative sentiment, and thus we again see that the upvotes on this subreddit are not only about the topic itself, but (implicitly) also about its adherence to factual correctness.
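A rank comparison of this kind can be sketched as follows. The text does not specify how the two token lists were aligned, so the sketch makes an explicit assumption: it takes the top N tokens of the positive corpus and compares their frequency ordering in both corpora, using the tau-a variant in which tied pairs count as neither concordant nor discordant. All names are illustrative.

```python
from collections import Counter
from itertools import combinations


def kendall_tau(values_a, values_b):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(values_a)
    if n != len(values_b) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (values_a[i] - values_a[j]) * (values_b[i] - values_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def tau_for_top_n(pos_counts, neg_counts, top_n):
    """Tau between frequencies of the top-N positive tokens in both corpora (assumed alignment)."""
    top_tokens = [token for token, _ in Counter(pos_counts).most_common(top_n)]
    pos = [pos_counts[token] for token in top_tokens]
    neg = [neg_counts.get(token, 0) for token in top_tokens]
    return kendall_tau(pos, neg)
```

Because tau depends only on ordering, raw frequencies can be compared directly without first converting them to ranks. The pairwise loop is quadratic in N, which is acceptable for the small N used in a table of this kind.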

Table 9 Kendall Tau score of token ranks in positive and negative submissions

Besides the immediate issues that naturally dominate in frequency, such as the death toll directly attributed to COVID-19, the analysis also brings out some secondary issues, particularly suicide and workplace issues (with the term “employee” appearing in the word cloud). These secondary themes are of particular interest since they are easily overlooked in the state of emergency response to COVID-19; yet the disruption of lives and livelihoods, mental health issues and increased suicide have all been lingering collateral effects, signals of which were already present on social media from the early stages of the pandemic.

4.1.2 Topic model

The objective of topic modelling was to examine the emergent themes and discourse surrounding COVID-19. Nine topics, presented in Table 10 along with their most frequent words, were identified based on the model with the highest topic coherence. The words were subsequently used to manually interpret and label each topic’s theme.

Table 10 Submission topics identified using LDA model

Python’s pyLDAvis library also enables the observation of topic segregation through the inter-topic distance map shown in Fig. 13, which can be used to examine possible overlaps. The final model has mostly distinctive topics that contain different tokens at various frequencies. However, some topics overlap or, even if distinct, are closely related to each other. For instance, face masks and preventive measures lie close to the spread of COVID-19 even though there is no immediate overlap, and both of these topics overlap with medical supplies and vaccines. On the one hand, this demonstrates the robustness of the natural language processing tools and topic modelling in identifying relevant topics and their interdependence. On the other hand, it exposes what we know in retrospect to be some of the major themes of discussion and controversy (e.g., people’s refusal to use masks and conspiracy theories discrediting vaccines, which led to a huge volume of social media discussion devoted to these topics) and the connection of these issues to the spread of the pandemic. The themes identified in our model share some similarities with the 10 themes discovered by Xue et al. [8] using COVID-19-related tweets published between 23 January 2020 and 7 March 2020. The common themes are highlighted by the same-coloured cells in Table 11.

Fig. 13 Visualising the fit of the submissions’ LDA model to the submission corpus. Each of the 9 circles represents one topic, and its area is proportional to the share of tokens in the corpus attributed to that topic. The circles are labelled in descending order of their areas. The centres of the topic circles are laid out in two dimensions according to a multidimensional scaling algorithm that is run on the inter-topic distance matrix. The distances between topics are computed using Jensen-Shannon [14] divergence
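The Jensen-Shannon divergence between two topic-word distributions can be computed as in the sketch below. It is a plain-Python illustration, not the pyLDAvis implementation; the base-2 logarithm is an assumption chosen so that the divergence lies between 0 and 1.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q from their midpoint."""
    if len(p) != len(q):
        raise ValueError("distributions must be defined over the same vocabulary")
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)
```

Unlike the KL divergence, this measure is symmetric and always finite, which makes it suitable as an input to multidimensional scaling: identical topics have divergence 0, and topics with no shared vocabulary reach the maximum of 1 (in base 2).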

Table 11 Comparison of themes with the study by Xue et al. [8]

Despite the difference in volume of data (356,690 submissions on r/Coronavirus compared to 1.9 M tweets), it can be observed that the submissions encompass a more diverse set of themes as the dataset was collected over the duration of a year. The informative nature of submissions on the subreddit also meant that there was more content associated with the science behind COVID-19 and its vaccines.

The proportions of topics in the submissions are presented in Fig. 14. It reveals that the most dominant theme of all submissions is “infection rate and death toll of COVID-19”. An example of a submission on this topic is “New York state reports 3,832 new cases of coronavirus and 58 new deaths, adding to the 2,461 new cases and 67 new deaths reported 1 h ago”. This suggests that the r/Coronavirus community regularly monitored the latest news on new cases and deaths. The second most dominant theme is “spread of COVID-19”, for which a representative example is “University of Hong Kong study finds eyes are ‘important route’ for coronavirus, up to 100 times more infectious than SARS”. This topic relates to new studies and discoveries on the transmission of COVID-19, occasionally in comparison to other respiratory diseases such as SARS. The third most dominant topic amongst submissions is “face masks and other preventive measures”. This topic may be popular because the wearing of face masks was at the centre of a political divide in the United States [15]; many submissions were therefore intended to educate and inform the community. For example, a submission was titled “Masks don't merely help to protect the wearer, they limit the amount of virus exposure, which in turn limits the severity of symptoms should the wearer get sick”.

Fig. 14 Proportion of submission topics in a year

At the start of 2020 when COVID-19 first emerged, most submissions were related to “face masks & other preventive measures”, as illustrated in Fig. 15. This suggests that a majority of submissions in the early phase of COVID-19 was about staying safe as individuals. Submissions of this category decreased in ratio as the situation evolved over time.

Fig. 15 Fraction of submissions related to “Face masks & other preventive measures” and “Medical supplies & vaccines”

The development of vaccines, along with the availability of medical supplies, was a prominent topic in the submission dataset as well. As shown in Fig. 15, it rose quickly in the last quarter of 2020 to become the most dominant topic on the subreddit. This trend coincides with the approval of the first COVID-19 vaccine by the U.S. Food and Drug Administration on 14 December 2020 [16]. It can therefore be inferred that the dominant subject in submissions evolved from preventive measures to vaccinations in the span of one year.

The full graphs of each topic’s distribution over time are shown in Fig. 26 in the appendix.

4.2 Comments

4.2.1 Sentiment Analysis with VADER

Figure 16 presents the overall sentiment distribution of the subreddit’s comments in the one-year period. The number of comments follows a trend similar to the number of submissions as shown in Fig. 7, likely due to the chain of events described in Section 4.1.1. Similarly, comment activity was the highest in the month of March when COVID-19 was officially classified a pandemic. Nearly one-third of comments in the dataset were from March.

Fig. 16 Sentiment polarity of comments on r/Coronavirus

However, unlike the sentiment trend seen in submissions, the level of negative and positive comments remained consistently balanced. The number of positive comments actually exceeded that of negative comments in March, while the opposite was observed in the submission dataset. This indicates that even in the shadow of negative news on COVID-19, the community remained positive. This is consistent with the findings in other studies [4, 17] where it was found that people remained hopeful that government restrictions and proper personal hygiene measures would end the pandemic.

To get an overview of the sentiments and their corresponding words, a word cloud was generated for each sentiment. The size of each word in a word cloud is proportional to its log odds ratio, which indicates how regularly it appears in positive comments relative to negative ones.
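As an illustration of how such a score can be obtained, the following is a minimal sketch of a smoothed log odds ratio. The exact formulation used in this study is not restated here, so the add-one smoothing, the function name `log_odds_ratio` and the toy token lists are assumptions made purely for the example.

```python
import math
from collections import Counter


def log_odds_ratio(pos_tokens, neg_tokens, smoothing=1.0):
    """Map each token to its log odds ratio (positive vs. negative comments).

    Assumes add-`smoothing` counts so that tokens absent from one class do
    not produce log(0), and a vocabulary of at least two distinct tokens.
    """
    pos_counts, neg_counts = Counter(pos_tokens), Counter(neg_tokens)
    vocab = set(pos_counts) | set(neg_counts)
    pos_total = sum(pos_counts.values()) + smoothing * len(vocab)
    neg_total = sum(neg_counts.values()) + smoothing * len(vocab)

    scores = {}
    for token in vocab:
        p = (pos_counts[token] + smoothing) / pos_total  # share in positive class
        q = (neg_counts[token] + smoothing) / neg_total  # share in negative class
        scores[token] = math.log(p / (1 - p)) - math.log(q / (1 - q))
    return scores


# Hypothetical toy data; real input would be the tokenised comments.
pos = ["cough_sneeze", "spanish_flu", "mask", "cough_sneeze"]
neg = ["dry_cough", "wash_hand", "mask", "dry_cough"]
scores = log_odds_ratio(pos, neg)

# Tokens with the largest scores would be drawn largest in the positive cloud.
for token, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{token:>14s} {score:+.3f}")
```

Under this definition, a token used equally often in both classes scores zero, while tokens skewed towards positive or negative comments receive positive or negative scores respectively.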

Figure 17 is a word cloud containing words with a high log odds ratio. Many of the tokens that stood out are related to the mild symptoms of COVID-19, such as coughing and sneezing in “cough_sneeze”, as well as to earlier outbreaks such as the Spanish flu and the swine flu. Further inspection suggests that the comments containing “spanish flu” and “swine flu” were often reassuring other people that there had been pandemics in the past, and that the situation would eventually return to normal. Users in the community reassured one another that showing mild symptoms such as a cough was not conclusive evidence of infection. They also encouraged others to wear masks to cover their mouth and nose. The phrase “icu bed” is also positively associated when it refers to sufficient medical resources. On the other hand, comments containing “spring break” were classified as positive because they mostly described fun activities common during spring break, even though the commenters in fact viewed these activities negatively. Some representative comments are shown in Table 12.

Fig. 17

Word cloud for comments with positive sentiments

Table 12 Sample of positive comments containing specified tokens

Figure 18 visualises the word cloud of words that appear more frequently in comments with a negative sentiment score. The biggest token is “wash_hand”, an action commonly encouraged as a way to avoid being infected with the virus. The phrase “wild animal” is commonly cited in negative comments as the source of COVID-19, while pre-existing conditions (“pre_exist”) are generally linked to COVID-19-related deaths. Lastly, “dry_cough”, which refers to the COVID-19 symptom of dry cough, is also viewed negatively. In contrast with “cough_sneeze” in Fig. 17, which is associated more positively, dry cough is commonly mentioned as a more severe effect of COVID-19. It is also often associated with other serious conditions such as pneumonia, and is therefore seen in a highly negative light. Some representative comments are shown in Table 13.

Fig. 18

Word cloud for comments with negative sentiments

Table 13 Sample of negative comments containing specified tokens

Kendall Tau scores were also computed between the top N most frequently appearing tokens in positive and negative comments, as shown in Table 14. The relatively low scores indicate the distinctiveness of the phrases carrying positive or negative connotations, even though the large vocabulary shared across such comments keeps the scores from being too low.

Table 14 Kendall Tau score of token ranks in positive and negative comments
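For readers unfamiliar with the measure, the sketch below computes Kendall's tau between two ranked token lists. How tokens that appear in only one of the two top-N lists were handled in this study is not specified here, so restricting the comparison to shared tokens is an assumption, as are the function name `kendall_tau` and the toy rankings.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a over tokens present in both ranked lists.

    Each list holds unique tokens ordered from most to least frequent, so
    there are no ties. Tokens missing from either list are ignored, which
    is an assumption of this sketch.
    """
    pos_a = {token: i for i, token in enumerate(rank_a)}
    pos_b = {token: i for i, token in enumerate(rank_b)}
    shared = [token for token in rank_a if token in pos_b]
    if len(shared) < 2:
        raise ValueError("need at least two shared tokens")

    concordant = discordant = 0
    for x, y in combinations(shared, 2):
        # A pair is concordant when both lists order x and y the same way.
        same_order = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0
        if same_order:
            concordant += 1
        else:
            discordant += 1

    n_pairs = len(shared) * (len(shared) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical top-4 lists for positive and negative comments.
top_positive = ["people", "mask", "hope", "test"]
top_negative = ["people", "test", "mask", "death"]
tau = kendall_tau(top_positive, top_negative)
print(round(tau, 3))
```

A score of 1 means the shared tokens are ranked identically in both lists, and a score of -1 means the order is fully reversed. Intermediate values, such as those reported in Table 14, indicate partial agreement.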

4.2.2 Analysis of upvoted and downvoted comments

A total of 650,753 upvoted comments and 151,199 downvoted comments were analysed for their topics. This subset of comments was selected as their scores were deemed more representative of the subreddit.

Figure 19 illustrates the word cloud based on the log odds ratio of tokens in upvoted comments. Comments that contained “spring break” or “shake hand” and derided people for flouting social distancing rules were widely upvoted by the community. The token “intensive_care” generally appears in comments discussing the appropriate allocation of medical resources, while “false_negative” appears in discussions on COVID-19 tests that gave false negative results, for example: “It is unlikely a new infection would show symptoms that quickly. That strongly suggests this is an existing infection that lingered and showed a false negative.” Among comments that contain the bigram “false negatives”, there were 50% more comments with negative sentiments than positive ones. Given that COVID-19 tests carried a risk of producing false negatives [18], this suggests that false negative cases were a topic of concern on the subreddit.

Fig. 19

Word cloud for upvoted comments

On the other hand, Fig. 20 presents the word cloud for downvoted comments. The most prominent tokens are “long_term” and “fake_news”. The token “long_term” appears in many downvoted comments that were unsupportive of COVID-19 restrictions, citing reasons such as “long term unemployment” or arguing that the restrictions do not work “long term”. Similarly, comments that mentioned “mental health” as a reason to disregard restrictions were also downvoted by the community. Additionally, downvoted comments that included “nursing home” minimised COVID-19-related deaths in nursing homes in the United States, and some of them accused New York State Governor Andrew Cuomo of “sending COVID patients into nursing homes”. This was a point of controversy as, at one point, one-third of COVID-19-related deaths in the United States were nursing home employees or residents [19].

Fig. 20

Word cloud for downvoted comments

Many comments that dismissed information as “fake news” were also unpopular; many of these expressed mistrust of science or the media.

Comparing the attitudes adopted in upvoted and downvoted comments, it can be observed that the majority of the r/Coronavirus community treated COVID-19 restrictions and recommendations with great importance, even remaining wary of false negative results that might hint at undetected cases. The community largely did not tolerate the underestimation of COVID-19’s impacts and the undermining of government orders, nor the outright dismissal of information using phrases such as “fake news”.

4.2.3 Topic model

The comment dataset was used to train the LDA model. Twenty topics were identified based on the model with the highest topic coherence; these are presented in Table 15, along with their most frequent associated words. The words were subsequently used to manually determine and label each topic’s theme. Examples of comments are shown in Table 16 in the appendix.

Table 15 Comment topics identified using LDA model

Figure 21 illustrates the inter-topic distance of the LDA model containing 20 topics, and it can be observed that the topics are generally distinct from one another. We notice that, unlike the submission topics shown in Fig. 13, the comment topics are more densely packed. This is partly an artefact of the larger number of topics, but it also reflects the presence of many distinct yet related themes, particularly in the lower right quadrant. For example, scientific studies, infection death rates and comparison with other strains of flu and viruses are recognised as closely related yet distinct topics. Infection death rate is also close to the topic of travel restrictions & lockdowns. This figure thus captures a gradient of relationship across topics.

Fig. 21

Visualising the fit of the comments’ LDA model to the comment corpus. Each of the 20 circles represents one topic, and its area is proportional to the topic’s share of the tokens in the corpus. The circles are labelled in descending order of area. The centres of the topic circles are laid out in two dimensions according to a multidimensional scaling algorithm that is run on the inter-topic distance matrix. The distances between topics are computed using the Jensen-Shannon divergence [14], a measure based on (dis)similarity between probability distributions
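To make the distance measure in the caption concrete, the following minimal sketch computes the Jensen-Shannon divergence between topic-word distributions and assembles the pairwise distance matrix on which a multidimensional scaling step would operate. The three toy distributions and the helper names are hypothetical, the base-2 logarithm is a choice made for the example (it bounds the divergence between 0 and 1), and the scaling step itself is not shown.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; skips zero-mass terms."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic-word distributions over a three-word vocabulary.
topics = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]

# Symmetric inter-topic distance matrix with zeros on the diagonal.
dist = [[js_divergence(p, q) for q in topics] for p in topics]
for row in dist:
    print(" ".join(f"{value:.3f}" for value in row))
```

In this toy example the first two topics place most of their probability on the same word and are therefore much closer to each other than either is to the third, which is the kind of relationship the layout in Fig. 21 conveys.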

Figure 22 charts the percentage of occurrence of each topic in the comments on r/Coronavirus across the year. The topics identified by the LDA model provide insight into the discussion themes that occurred over the span of the year.

Fig. 22

Proportion of comment topics across the year

The most notable themes were as follows: (6) Negative outlook towards the situation, (2) Angry and hateful comments, and (8) Scientific studies and sources. The first largely reflects people’s worry and scepticism towards the pandemic, with a common belief that the situation would become worse. This can be inferred from the tokens shown in Fig. 23.

Fig. 23

Top 30 most relevant terms for Topic (6) Negative outlook towards the situation. The relative term frequency provides a ‘signature’ of the topic, distinguishing it from other topics. These frequencies are also used to measure distances across topics

The second most dominant theme is “Angry and hateful comments” (Fig. 24), which is characterised by tokens that include “hate” and “dumb”, alongside many expletives. These were intentionally left in the text as they represented outbursts of human emotions such as frustration, anger, and contempt. These terms are also unique to the comment dataset due to its conversational nature. The high prevalence of this theme suggests that COVID-19 was a divisive subject that invited uncivil discourse online and amplified political and other societal divides.

Fig. 24

Top 30 most relevant terms for Topic (2) Angry and hateful comments

The third largest theme is “Scientific studies and sources”. Figure 25 shows that the relevant terms include “article”, “news”, “datum”, “study”, etc., which implies that sources of information were widely discussed on r/Coronavirus and were an important factor in whether a piece of information should be trusted. This also suggests that the community was cautious and critical about the accuracy of claims or posts that appeared on the subreddit.

Fig. 25

Top 30 most relevant terms for Topic (8) Scientific studies and sources

Topic distribution was assessed at each quarter of the year to capture the changes in topic prevalence. Inspection of the topic distribution across all timeframes indicated that there was no significant shift in any topic that aligned with significant events revolving around COVID-19. The graphs of each topic’s distribution over time are shown in Fig. 27 in the appendix.
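A quarterly assessment of this kind can be sketched as follows. The sketch assumes that each comment has already been assigned a single dominant topic by the LDA model; the record layout, the function name `quarterly_topic_shares` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date


def quarterly_topic_shares(records):
    """Return {(year, quarter): {topic_id: share of comments}}.

    `records` is an iterable of (comment_date, dominant_topic_id) pairs.
    """
    counts = defaultdict(Counter)
    for day, topic in records:
        quarter = (day.year, (day.month - 1) // 3 + 1)
        counts[quarter][topic] += 1

    shares = {}
    for quarter, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[quarter] = {t: n / total for t, n in topic_counts.items()}
    return shares


# Hypothetical records using the topic numbers of Table 15.
records = [
    (date(2020, 3, 12), 6),
    (date(2020, 3, 20), 6),
    (date(2020, 2, 1), 2),
    (date(2020, 11, 5), 8),
]
shares = quarterly_topic_shares(records)
for quarter in sorted(shares):
    print(quarter, shares[quarter])
```

Comparing these shares from one quarter to the next is one simple way to check whether any topic gained or lost prominence around a particular event.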

4.3 Comparison of submissions and comments

4.3.1 Sentiments

The sentiment polarity of each submission is greatly dependent on the nature of its subject matter. For example, news about vaccines and medical aid is often positive, while news about positive COVID-19 cases and the COVID-19 death toll is negative. As the submissions generally come from news media and healthcare authorities, the subjects become quite one-dimensional. In contrast, Reddit comments are discussion-based and conversational in nature. Hence it is unsurprising to see that the topics found in comments with positive or negative sentiment are more diverse, ranging from flu symptoms to pre-existing medical conditions and wild animals. The vote feature on Reddit also allows the identification of popular and unpopular comments, and consequently, the popularity of the associated themes. Evaluation of these polarised comments revealed that both popular and unpopular comments carried negative sentiments towards a diverse set of topics, such as spring break (popular) and fake news (unpopular). It is evident that comments offer deeper insights into people’s perceptions and sentiments towards topics that are more societal in nature, while submissions tend to be confined to a smaller set of themes, which are often more scientific and authoritative.

4.3.2 Topics

The sets of topics generated from the submission dataset and the comment dataset share a small degree of overlap, such as “face masks and other preventive measures”, “infection rate and death toll”, and “travel restrictions and lockdown”.

However, as noted previously, comments are discussion-based and tend to relate to personal concerns and experiences; the topics found in comments are therefore largely different from, and much more diverse than, those found in submissions. For example, the most dominant topic in comments is about people’s outlook towards the COVID-19 situation. This is followed by emotional debates between users on the subreddit, while the third is associated with comments scrutinising news sources. Given these observations, it can be noted that submission topics represent the categories of news related to COVID-19 across the year, while comment topics correspond to the list of concerns that the public had. A prominent example of this is the issue of suicides, which was not captured in the headlines of submissions but could be gleaned from the comments.

5 Conclusion

5.1 Principal findings and practical implications

This study examined the themes that have emerged around the central topic of COVID-19 through the examination of sentiments in submissions and comments, as well as the popular and controversial views that emerged in the form of upvotes and downvotes. From the submissions, it was noted that death tolls dominated the negative news and authority updates dominated the positive ones, as they were perceived as progress, or hope thereof. The topics extracted from the submission data also clearly show that information on the transmission of COVID-19 and the proper preventive measures was widely reported on r/Coronavirus. Vaccine-related news became the most prevalent topic at the end of 2020. These reflected the dominant focus and narratives in the mainstream media.

On the other hand, the comments presented a more vivid understanding of the sentiments and concerns of the populace at large, and our analysis captures a varied picture, in which the data better revealed a divide in attitudes towards COVID-19. Analysis of the sentiments showed that there were two main groups of people on the subreddit: one that was cautious and vigilant about COVID-19, and another that was dismissive and sceptical about its severity. The comment section of Reddit provided a platform for the expression of ideas and support for each of these groups. In spite of the pervasiveness of misinformation, the community remained prudent with the news they received, as reflected in the upvotes and downvotes.

This study has demonstrated that Reddit can be employed to investigate levels of public awareness and sentiments which follow real-time events of the pandemic. Governments and authorities can use these to understand the common misinformation, talking points and misunderstandings, and explore various channels to proactively advise and educate the public to fight against “fake news”. Likewise, they can also use such a forum to understand people's expectations and anxieties, and respond to these more promptly. Timely intervention in public discourse may also help limit the politicisation of the virus, prevent mixed messaging, and quell the public’s concerns with authoritative information and rigorous scientific studies.

5.2 Limitations and future work

In this work we study the social media discourse over an entire year, in contrast to most prior works, which were restricted to shorter time windows. The nature of the platform (with no length limit for most practical purposes, and pseudo-anonymity) also provided us access to in-depth and candid discussions. Nevertheless, there are several limitations to this study that should be addressed in future by the broader research community, including ourselves.

Firstly, this study only focused on submissions and comments of r/Coronavirus. While it is the largest COVID-19-related community on Reddit, there are other subreddits that contained COVID-19 news and may provide additional sources of data. The methods used in this study can be extended to geography-based subreddits to mine the sentiments of the corresponding demographic more accurately, since government restrictions varied in time across geographic areas.

This leads to the second point, which is that research data may be expanded to cover sources in other languages to gain broader insights, particularly in a global pandemic unbounded by borders.

Furthermore, even though multiple vaccines have emerged, they vary in their efficacy, and this concern is particularly aggravated by the advent of mutated versions of the virus itself. Moreover, there is considerable heterogeneity and latency in rolling out the vaccine across different countries. As such, there have been major new waves in many countries beyond the time frame studied in this work. While our methodology would apply, the concerns and thus the themes could be different during the second year of the pandemic. We provide the data curated in this work, so that future studies can pick up where this study terminated, and can readily use our data for extension or comparison.

Lastly, although the natural language processing tools we have used, e.g., the LDA model, have been shown to be effective in topic extraction, one may want to apply other tools to gain further or more precise insight. For example, the scientific quality and consistency of the themes could be improved by reducing human bias in the labelling of topics. Future studies may also perform analysis on the secondary themes in the documents for a finer granularity of detail.