1 Introduction

Over the past 5 years, the number of social media users has increased by more than a billion, which is greater than the increase in new Internet users and unique mobile phone subscribers over the same period, as seen in Fig. 1.

Fig. 1 Growth and digital development trends 2019–2023: world population development, unique mobile phone subscribers, internet usage, social media users. Infographic is based on information collected from the Global Digital Reports series, from DataReportal by Simon Kemp (2023)

YouTube, Facebook (Meta), Twitter (X), and Reddit are ranked among the ten most visited websites based on website traffic, according to Semrush’s ranking as of April 2023 (Simon Kemp, 2023). Over the past 5 years, from 2019 to 2023, these platforms have consistently maintained their positions in the top ten, with some small variations. Notably, Twitter (X) did not appear in the list in 2020, while Reddit was not among the top ten in 2019.

The total monthly average visits by Internet users are as follows: YouTube (2nd place with 94.8 billion visits after Google), Facebook (Meta) (3rd place with 13.8 billion visits), Twitter (X) (6th place with 8.52 billion visits) and Reddit (9th place with 5.41 billion visits) (Simon Kemp, 2023).

Over the past 4 years, Facebook (Meta) has been the world’s most used social media platform, followed by YouTube (Simon Kemp, 2023). Facebook’s monthly active users account for a 37.2% share of all people on Earth, totalling 2.989 billion active users. However, this share increases if users under thirteen (whose access is restricted) and users from China (where Meta is blocked) are excluded. The top five countries with the most Facebook users, according to data from April 2023, are India (370 million), the United States (186 million), Indonesia (135 million), Brazil (114 million) and Mexico (93 million) (Simon Kemp, 2023). Similarly, the share of the Facebook audience is highest in the respective regions, with 22.1% in Southern Asia, 18.5% in South-Eastern Asia, 11.6% in Southern America, 9.3% in North America and 5.4% in Central America (Simon Kemp, 2023). The share of male users on Facebook is 56.8%, while female users account for 43.2%, which is a more balanced distribution than the predominantly male audiences on the Reddit and Twitter (X) platforms. The median age of Facebook users is 32 years; the 25–34 age group constitutes the largest share (29.6%), followed by the 18–24 age group (22.6%) and the 35–44 age group (19.0%) (Simon Kemp, 2023). While there is a considerable share of young people using Facebook, it is mostly the older population that considers Facebook their favourite platform; the share of the older audience is also growing. Engagement by a ‘typical’ user on Facebook is eleven “likes”, five “comments”, and one “share” per month. Female users tend to be more active on social media, liking more content and writing more comments, according to Meta advertising resources data from July 2021 (Simon Kemp, 2023).

Country of origin determines the level of engagement to a greater extent than gender. In Greece, a single post received an average of sixteen likes in 2021, while in South Korea a post received only two likes (Simon Kemp, 2023). People from the Faroe Islands appeared to be more active, liking twenty-four posts per month (median), whereas people in Japan liked only two posts, according to data from June 2021. Since the vaccine data in the current study are in English, they will most likely be responded to by an audience from the United States, one of the countries with the most registered users on the platform, with an average of ten likes per post, per month for a ‘typical’ user, according to a July 2021 report.

When comparing the average number of likes on the platform, both pro-vaccine and anti-vaccine discussions on Facebook (Meta) might fall below the engagement level of a ‘typical’ user from the United States. Notably, pro-vaccine discourse tends to attract higher engagement, especially if the sentiment contains stigma (Table 1). However, the main interest of the current study is engagement with lay discussions (comments), rather than the frequency of liking by a ‘typical’ user. Another aspect is the difference in average engagement between post and comment data. Engagement with a comment is usually lower than with a post; Table 1 illustrates that the average engagement varies, with 9 likes for pro-vaccine and 4 likes for anti-vaccine discussions.

The Reddit platform is the 9th most visited website according to Semrush’s latest ranking as of December 2023, with 7.23 billion visits (Semrush, 2023). On average, users spend just under eight minutes on the site per visit (Mercado, 2023). Most Reddit users come from the United States (48.69%), followed by the UK (7.05%), Canada (6.99%), Australia (4.19%) and Germany (3.14%), according to Skillademia data from March 2023 (Mercado, 2023), with 52 million daily active users (Simon Kemp, 2023). Among the platforms in the current research, Reddit is the most male-dominated, with 65.02% male and 34.98% female users (Mercado, 2023). The largest age group on Reddit is users aged 18 to 29, followed by the 30 to 49 range (22% of the traffic) (Lin, 2023).

Roughly 4.6% of all people on Earth use Twitter (X); the share is higher if users younger than thirteen, whose access to the platform is limited, are not counted (Simon Kemp, 2023). Most Twitter (X) users are from the United States (64.9 million), followed by Japan (51.8 million), Brazil (16.6 million), and India (15 million) (Simon Kemp, 2023). The Twitter (X) platform is predominantly used by male users (64.3%), while female users make up more than one-third (35.7%) (Simon Kemp, 2023).

YouTube is the second most popular social media platform in the world according to Q4 2023 data (Statista, 2023), with roughly 31.5% of people on Earth using YouTube (Simon Kemp, 2023). Most YouTube users come from the same top five countries as users of Facebook (Meta): India, the United States, Brazil, Indonesia and Mexico. According to April 2023 data, 54.4% of YouTube’s users are male and 45.6% are female (Simon Kemp, 2023). Another report, based on a survey, reveals that 48.6% of YouTube viewers are men and 51.4% are women. Users aged 25–34 constitute 20.7% of YouTube’s audience, followed by those aged 35–44 (16.7%) and 18–24 (15.0%).

The current research does not track the age, gender and place of origin of the social media audience. However, the discussions are conducted in English and will attract users who consume content in that language. Since the United States and India have a large user base across all discussed platforms, one can expect users from those countries to have an impact on the engagement figures in the current research. Gender distribution on social media platforms might also be reflected in engagement figures (Table 2). YouTube has the highest share of female users, who are also more actively engaged in sharing and liking content, which might explain its much higher engagement rate in Table 2. The largest share of the audience on Facebook (Meta), Reddit, Twitter (X), and YouTube is young, in the age range up to 35 years. One of the top three reasons for individuals aged 24 to 35 to use social media is to read news.

To a large extent, Facebook (Meta) users post and share photos/videos with friends and family and then keep up to date with the news on the platform, according to data from October 2023 (Simon Kemp, 2023). The audience on Reddit looks for funny/entertaining content and subsequently keeps up to date with the news (Simon Kemp, 2023), whereas Twitter (X) is used primarily to keep up to date with the news (Simon Kemp, 2023). Most YouTube subscribers gather information and knowledge on the platform, with one-third using the platform to watch news (Mahajan, 2023).

As the social media channels in the current research are primarily used for receiving information and news, users need to place a level of trust in the information received, since their decisions will most likely be based on it.

Therefore, it is quite important to study engagement with health information shared on social media platforms, especially regarding vaccines, as it can impact health-care choices, such as the decision to vaccinate, and subsequently determine the future course of a pandemic. The literature review showed that only a small group of studies looked into engagement with health content on social media. One third of researchers do not consider information from social media to be of value for their research efforts (Keller et al., 2014). Even though researchers might recognise the potential of social media data, a majority are either sceptical of or opposed to obtaining information from social media platforms (Keller et al., 2014). In the interim, when researchers are not professionally engaged in the social media space, public discourse on important health-related issues is shaped by various lay opinions, which are often treated as factual information.

1.1 Social media and engagement with health information

Before the widespread use of social media channels, patients increasingly turned to the internet and reported that the health-related information they found online was as useful as, if not more useful than, advice received from their doctors (Keller et al., 2014). One-way communication from the official channels to the public, coupled with a lack of discussion about the content, might contribute to a greater divide between patients seeking information on social media platforms and health-care experts who provide the content without prior insight into the patient’s perspective.

As previously discussed, the largest demographic using social media platforms falls within the age range of 18–35, with one-third of this group utilising social media to seek content and read news/stories (Simon Kemp, 2023). There are reports that young people demonstrate greater engagement with health-care content on social media due to improved learning, socialisation, and emotional support, and that such information impacts them even if they quickly scroll through it (Goodyear et al., 2018). Given that official organisations were reported to have the most impact on the behaviour of young people (Goodyear et al., 2018), one needs to examine communication strategies between official channels and the public, leading to greater participation of health professionals in social media discussions. Wong et al. (2014) discuss collaboration with health professionals and patients as a potential remedy against misinformation on social media. They suggest looking for cues beyond social media platforms, such as health rating sites, to understand what different age groups expect from health-care providers. While Wong et al. (2014) report that a better understanding of the patient’s perspective on health is needed, Keller et al. (2014) note that there is very little evidence on how public health professionals use social media and communicate information to patients. All known studies on social media engagement with health information have reported the need for a better understanding of the relationship between different agents and health content: Keller et al. (2014), Wong et al. (2014), Goodyear et al. (2018) and Pérez-Escoda et al. (2020) discuss the need to better understand the relationship between patients, health content and official agents on social media. Straton et al. (2017a, b, c) discuss the engagement of people with health information based on big data from Facebook (Meta); however, these studies only look into characteristics of the posted content rather than sentiment analysis.

The ongoing research aims to bridge this gap in comprehension by exploring the content and its attributes that effectively captivate the public within health-care conversations. Its objective is to provide guidance to health experts on the optimal methods for communicating with the general public. The type of content, and especially the meaning assigned to it (Miller et al., 2016), can have an overwhelming impact. The current study attempts to explore whether prejudiced sentiment about vaccines can impact engagement with the content. Additionally, it seeks to explore other features in a text that can capture attention and trigger a response. Health experts on platforms like YouTube, for example, attempt to disprove prejudiced beliefs about COVID vaccines. However, the information is often presented through one-way communication in videos, where official channels aim to disprove conspiracies by addressing prejudices about vaccines in general and COVID vaccines in particular. Discussions among laypeople are often left with no further involvement from public health professionals or official channels. Various quantitative methodologies are employed in the present study to ascertain the efficacy of textual discourse on social media platforms. This investigation is anticipated to yield insights that will enhance our understanding of communication strategies directed at the general public, thereby facilitating adherence to evidence-based health-care practices and informed decision-making regarding personal health and preventive measures such as vaccines.

To the best of the author's knowledge, the primary contributions of the study can be summarised as follows:

1. This is the first study of engagement with health-care content across a number of social media platforms with different length and structure of the comments: Facebook (Meta), Reddit, Twitter (X), YouTube.

2. This is the first research that studies engagement with stigmatised sentiment and other textual features in the context of general vaccine discussions and COVID vaccine discussions in particular.

3. The study provides meaningful insight into engagement with socially shared health information and its variance over time, and discusses demographics and why it is important to study the concept within a narrow context.

4. The findings are supported with a publicly shared dataset.

The first contribution is partially addressed in Sects. 2 (The dataset), 3 (Methods) and 4 (Discussion). The second contribution is discussed in Sects. 3 (Methods) and 4 (Discussion). The third contribution is addressed in Sects. 4 (Discussion) and 5 (Conclusion), and partially in the Abstract and Sects. 1 (Introduction) and 1.1. The fourth contribution is discussed in Sect. 2 (The dataset).

2 The dataset

The data were collected before the COVID-19 pandemic from the largest Facebook (Meta) pages discussing general vaccines. One Facebook (Meta) wall exhibited an anti-vaccine stance, while another presented a pro-vaccine sentiment, as shown in Table 1. The data were collected from January 2018 to February 2019.

Despite the public nature of the pages, they exhibit strong in-group support, with anti-vaccine and pro-vaccine sentiment prevailing on their respective pages. This in-group support and alignment of views lead to homogeneous opinions, seldom welcoming open discussions or opposing views on the topic. Notably, stigmatised content triggers responses and results in higher engagement on average, particularly evident in stigmatised anti-vaccine discussions.

Table 1 Engagement with stigmatised sentiment, general vaccination discourse

The COVID vaccine data outlined in Table 2 were collected after the outbreak of the COVID-19 pandemic and subsequent endeavours aimed at eliminating anti-vaccination pages from social media platforms. The data span the period from April 2020 to March 2021, covering roughly nine months before the COVID vaccine roll-out and three months after. Out of the 117 main pages on Facebook (Meta) discussing vaccines, 26 were removed (Straton, 2022), and some pages also changed their view settings from public to private. As a result of de-platforming the most influential pages, the number of followers of anti-vaccine pages decreased on average, while the number of followers of pro-vaccination pages increased (Straton, 2022).

After the removal of the most influential pages from Facebook (Meta), followers of anti-vaccination pages shifted their discussions to express their stance on pro-vaccination pages (Straton, 2022). This behaviour is not typical for the anti-vaccine movement, as observed in general vaccine discussions before the COVID pandemic outbreak. The theories of cognitive dissonance (Festinger, 1962) and selective exposure (Freedman and Sears, 1965) suggest that dealing with contrasting ideas can be emotionally and psychologically exhausting. As a result, the logical reaction is either to avoid circumstances with contradictory information or to make a decision not to question prior beliefs when confronted with opposing views (Festinger, 1962). Confirmation bias, which is part of cognitive dissonance theory, describes the tendency to interpret information in a way that supports one’s own beliefs while ignoring alternatives. Group members also tend to believe they have no bias and perceive themselves as morally superior to the out-group (Festinger, 1962).

On the other hand, Abrams et al. (1990) and Abrams and Hogg (2010) have discussed the positive social identity linked to self-esteem, which is consequently enhanced through a sense of identification with a specific group and ‘intergroup’ discrimination. Therefore, the sense of belonging to an anti-vaccine group and the experience of cognitive dissonance when confronted with opposing views can fuel conflict and discrimination, hindering constructive discussion between the groups. Allport, in his contact hypothesis, identified four preconditions that should be met in order to facilitate inter-group understanding: equal status of the groups in a given situation, common goals, absence of intergroup competition, and support from authorities or social norms (Allport et al., 1954). The first three preconditions can be challenging to achieve in the context of vaccine stigma. Elliot Aronson added two additional preconditions related to frequent interaction between the in-group and out-group, as well as mutual interdependence (Aronson et al., 1994). When popular anti-vaccine pages were removed from social media, the movement became somewhat dependent on pro-vaccine pages, resulting in frequent interaction between the vaccine groups. However, anti-vaccine and pro-vaccine discussions did not show signs of intergroup cooperation during the pandemic. Authorities and society were interested in limiting anti-vaccine bias and erroneous information about the vaccines, making the possibility of equal status unrealistic. Even if the authorities had not imposed limits on the anti-vaccine presence, cooperation between the groups would be highly unlikely, as the anti-vaccine and pro-vaccine groups pursued different goals even during the pandemic, whereas common goals are one of the necessary preconditions for finding common ground.

Table 2 Engagement with stigmatised sentiment, COVID vaccination discourse

During the COVID-19 pandemic, social media users responded to vaccine information by liking, disliking, or re-sharing the content, discussing events in the comments. Engagement levels with ‘not stigma’ and ‘stigma’ content are about the same with small variations (Table 2). Response to disproving COVID conspiracy sentiment shows slightly higher engagement with ‘not stigma’ sentiment (Table 2).

Engagement behaviour within general vaccine discourse shows a distinctly higher trend of engagement with stigmatised sentiment among those with an anti-vaccine stance (Table 1). Before the pandemic, both anti-vaccine and pro-vaccine groups primarily discussed vaccines within their respective groups, with minimal to no participation from the out-of-group. In such environments, strong biases towards vaccines in anti-vaccine groups could be expressed among like-minded group members without much reservation. This environment promotes confirmation bias and allows the avoidance of alternative points of view and contradictory ideas.

The higher engagement with stigmatised sentiment among anti-vaccine group members could be attributed to the homogeneous nature of discussions, which reinforces their existing beliefs about vaccines and views targeted against out-group members, such as governments and pharmaceutical companies, and hence leads to higher endorsement of stigmatised sentiment. Another reason could be the different structure of the social media platforms, even though all information analysed in the research is publicly accessible.

Within Facebook (Meta), a convenient mechanism exists for facilitating discussions within a user's own in-group, whether public or private. This platform features a singular route for incoming and outgoing traffic, along with restrictions on the number of friends permitted to follow an account. The environment is characterised by a symmetric structure, which may foster heightened in-group engagement compared to alternative social media platforms such as Twitter (X). According to Paul and Friginal (2019), discussions on Facebook (Meta) are observed to be more interactive than those on Twitter (X). This observation is predominantly corroborated in relation to stigmatised sentiment, as indicated in Tables 1 and 2.

On Twitter (X), information is received asymmetrically through various channels (Peters et al., 2013), which may create opportunities for inter-group interaction. Additionally, there are no constraints on the number of followers a particular account may have, and there is no obligation to reciprocally follow those who follow the account. 

During the COVID-19 pandemic, there was increased interaction between anti-vaccine and pro-vaccine groups, often responding to the same content. Whether engaging with posts perpetuating stigma surrounding COVID vaccines or those debunking conspiracy theories related to COVID vaccines, a prevalent anti-vaccination stance and distinct arguments in favour of vaccinations are frequently observed.

When the pro-vaccine movement becomes involved in discussions, reactions to stigmatised anti-vaccine sentiment become diverse. This diversity may account for the relatively consistent engagement levels, on average, across stigma, non-stigma, and undefined sentiment in response to COVID vaccine posts. However, there is a slightly elevated level of engagement observed with non-stigmatised sentiment, particularly on Twitter (X) and YouTube platforms (Table 2). Further investigation is conducted using ANOVA F-score regression to determine if stigmatised sentiment can be a good predictor of engagement, if the difference in engagement with non-stigma sentiment is significant for COVID vaccine data, and if there are noticeable variations in the features across different levels of engagement prior to and during the COVID-19 pandemic.

3 Methods

3.1 Main language features with LIWC

The concept of engagement holds significant importance in this study, as it serves as an indicator of participation in vaccine discussions and elucidates the factors that contribute to increased involvement in online debates. Diverse attributes are gathered from multiple social media platforms to measure engagement, as specified in Table 3. The engagement attribute in the propagated dataset is derived from upvotes/downvotes (Reddit), likes (Twitter (X), YouTube), dislikes (YouTube), and retweets (Twitter (X)).

Table 3 Engagement measures on various social media domains

The views feature on social media platforms, such as YouTube, is not directly related to engagement with the content, as it is unclear whether the person viewed the content intentionally or opened it by mistake. Therefore, it was disregarded in the study. Negative engagement values are converted into positive values due to the recognition that negative engagement, like its positive counterpart, signifies a response to the content. 

The engagement feature is log-normalised to eliminate skewness from the highly variable data and is presented as different engagement levels for better visualisation of the results. The engagement levels are established based on the distribution of values in the dataset, where \(E\) denotes the log-normalised engagement value. The levels differ slightly between general anti-vaccination discussions (low engagement: \(0 \le E < 2\); medium engagement: \(2 \le E < 4\); high engagement: \(E \ge 4\)) and pro-vaccination discussions (low engagement: \(0 \le E < 2.5\); medium engagement: \(2.5 \le E < 5\); high engagement: \(E \ge 5\)). Anti-COVID vaccine discussions have the following engagement levels: low engagement: \(0 \le E < 2\); medium engagement: \(2 \le E < 5\); high engagement: \(E \ge 5\). Discussions that aim to disprove COVID anti-vaccine sentiment have the following levels: low engagement: \(0 \le E < 3\); medium engagement: \(3 \le E < 6\); high engagement: \(E \ge 6\).
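The conversion from raw engagement counts to engagement levels can be sketched as follows. This is a minimal illustration rather than the study's implementation: the text does not state the logarithm base or how zero counts are handled, so the natural logarithm with a +1 offset (`math.log1p`) is an assumption, and the function names and dictionary keys are hypothetical. The thresholds are the ones quoted above.

```python
import math

# Upper bounds of the "low" and "medium" levels for each discussion type,
# taken from the thresholds quoted in the text. The keys are hypothetical names.
THRESHOLDS = {
    "general_anti": (2.0, 4.0),
    "general_pro": (2.5, 5.0),
    "covid_anti": (2.0, 5.0),
    "covid_disprove": (3.0, 6.0),
}


def log_normalise(raw_engagement):
    """Fold negative engagement to positive, then apply log(1 + x).

    Assumption: natural log with a +1 offset so that zero maps to zero.
    """
    return math.log1p(abs(raw_engagement))


def engagement_level(value, low_upper, medium_upper):
    """Map a log-normalised value to 'low', 'medium' or 'high'."""
    if value < low_upper:
        return "low"
    if value < medium_upper:
        return "medium"
    return "high"


def classify(raw_engagement, discussion):
    """Return the engagement level of a raw count for a discussion type."""
    low_upper, medium_upper = THRESHOLDS[discussion]
    return engagement_level(log_normalise(raw_engagement), low_upper, medium_upper)
```

Under these assumptions, a comment with 20 dislikes is treated in the same way as a comment with 20 likes, which mirrors the decision to count negative engagement as a response to the content.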

Table 4 General vaccine (data from Meta 2018 to 2019): top 30 features that predict Engagement

In order to understand which features are highly dependent on the response variable, or in other words which features are relevant for the analysis of engagement, the ANOVA F-score (F-value) is calculated using a slightly simplified version from Ding et al. (2014) and Kumar et al. (2015):

$$\begin{aligned} F = \frac{SSA/df_b}{SSW/df_w} = \frac{\sum _{j=1}^{k} n_j\left( \bar{X_j} - \bar{X}\right) ^2/df_b}{\sum _{j=1}^{k}\sum _{i=1}^{n_j}\left( X_{ij}-\bar{X_j}\right) ^2/df_w} = \frac{MSA}{MSW} \end{aligned}$$
(1)

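Each feature comprises a group of data points, and the dispersion between data points is measured with sums of squares. Here SSA is the sum of squares among groups (features), SSW is the sum of squares within groups (features), MSA is the mean square among groups, MSW is the mean square within groups, and df denotes the degrees of freedom (\(df_b\) between groups, \(df_w\) within groups). Further, \(k\) is the number of groups, \(n_j\) is the number of observations in group \(j\), \((\bar{X_j} - \bar{X})\) is the distance between each feature mean \(\bar{X_j}\) and the grand mean \(\bar{X}\), and \((X_{ij} - \bar{X_j})\) is the distance between each observed value \(X_{ij}\) within the feature and the feature mean \(\bar{X_j}\).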

If \(H_0\) is true, the variance among groups equals the variance within groups and the ANOVA F-ratio will be close to 1; the feature then has no impact on engagement and can be disregarded from the model. However, if \(P \le 0.05\), the null hypothesis is rejected and the relevant features are selected. The implementation is performed in Python with the SelectKBest and f_classif functions imported from the feature_selection module of the sklearn package. Out of the 84 features in Table 6, 30 features were selected (Tables 4, 5).
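To make Eq. (1) concrete, the F-value can be computed by hand as in the sketch below. It uses only the Python standard library and is not the SelectKBest/f_classif pipeline used in the study. It assumes, as f_classif does, that the values of one feature are grouped by class label (here the engagement level), with \(df_b = k - 1\) and \(df_w = N - k\) for \(N\) observations in total. The function names and the feature names in the usage example are hypothetical.

```python
def anova_f(groups):
    """One-way ANOVA F-value for a list of groups of numeric observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # SSA: dispersion of the group means around the grand mean.
    ssa = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # SSW: dispersion of the observations around their own group mean.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_b = k - 1
    df_w = n_total - k
    if ssw == 0:
        # No within-group spread; the ratio is undefined, reported here as infinity.
        return float("inf")
    return (ssa / df_b) / (ssw / df_w)


def rank_features(features, labels, top_k):
    """Return the names of the top_k features with the highest F-value.

    features: dict mapping a feature name to its list of values.
    labels:   list of class labels (e.g. engagement levels), one per value.
    """
    scores = {}
    for name, values in features.items():
        by_label = {}
        for value, label in zip(values, labels):
            by_label.setdefault(label, []).append(value)
        scores[name] = anova_f(list(by_label.values()))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

For example, `anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])` gives SSA = 26 and SSW = 6 with 2 and 6 degrees of freedom, hence F = 13; a feature whose group means barely differ yields an F-value close to or below 1.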

While the ANOVA F-value measures the dependency between different features and the continuous engagement variable, the Z-score provides insight into the type of dependency at different engagement levels, where engagement is converted into a categorical variable (high, medium, low). The Z-score is calculated based on Wang and Chen (2012):

$$\begin{aligned} Z = (x - \mu ) / \sigma \end{aligned}$$

where x is an individual value, \(\mu\) is the population mean, and \(\sigma\) is the population standard deviation. Before the COVID-19 pandemic, messages with stigmatised sentiment generated higher engagement compared to messages without stigma, particularly stigmatised anti-vaccine discussions, which achieved higher engagement (HighEng. \(-\)0.5340; MedEng. 1.1103; LowEng. 16.7038 as presented in Table 4). Higher Z-score levels indicate lower stigma in the text. The ratio of variance (ANOVA F-score regression) between engagement values and stigmatised sentiment demonstrates that stigma can be a relevant feature and predictor of engagement (Table 4) based on data from general vaccine discussions. Stigmatised sentiment is identified as part of a semi-automated process of manual annotation of a smaller subset of data and propagation of the labels to a larger social media dataset with supervised learning techniques, as described in Straton (2022). The annotated and propagated data are stored in the Figshare repository (Straton, 2023).
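The Z-score formula above can be illustrated with a short sketch. How the scores are aggregated per engagement level is not spelled out in the text, so the averaging step below (standardise a feature over all messages, then take the mean Z-score within each level) is an assumption made for illustration, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardise values with the population mean and standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation, as in the formula
    return [(x - mu) / sigma for x in values]


def mean_z_by_level(values, levels):
    """Average Z-score of a feature within each engagement level.

    Assumption: this aggregation is one plausible reading of the
    per-level scores reported in the tables, not the confirmed procedure.
    """
    grouped = {}
    for score, level in zip(z_scores(values), levels):
        grouped.setdefault(level, []).append(score)
    return {level: mean(scores) for level, scores in grouped.items()}
```

With the values `[2, 4, 4, 4, 5, 5, 7, 9]` the population mean is 5 and the population standard deviation is 2, so the first value maps to a Z-score of -1.5 and the last to 2.0.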

The greedy k-means\(++\) algorithm helps to visualise the connection between the engagement attribute and various features, not only stigmatised sentiment in a text. It is an unsupervised way to show hidden structure in the input data and the distribution of data points into clusters, where one can possibly see a connection between the features, if any (Bhattacharya et al., 2019). With each iteration, the greedy k-means\(++\) algorithm samples several new centers and then greedily chooses the one that decreases the objective the most (Arthur et al., 2007):

Input: \(X\), \(k\), \(l\)

1: Uniformly independently sample \(c_1^1,\dots,c_1^l \in X\);

2: Let \(c_1 = \arg\min_{c \in \{c_1^1,\dots,c_1^l\}} \varphi (X,c)\) and set \(C_1=\{c_1\}\).

3: for \(i \leftarrow 1,2,3,\dots,k-1\) do

4: Sample \(c_{i+1}^1,\dots,c_{i+1}^l \in X\) independently, sampling \(x\) with probability \(\varphi (x,C_i)/\varphi (X,C_i)\);

5: Let \(c_{i+1} = \arg\min_{c \in \{c_{i+1}^1,\dots,c_{i+1}^l\}} \varphi (X, C_i \cup \{c\})\) and set \(C_{i+1} = C_i \cup \{c_{i+1}\}\).

6: return \(C:= C_k\) (Grunau et al., 2023)

In every step, \(l\) candidate centers \(c^1_{i+1},\dots, c^l_{i+1}\) are sampled from the constructed distribution. For each candidate center \(c^j_{i+1}\), the new cost of the solution \(\varphi (X, C \cup \{c^j_{i+1}\})\) is computed. Subsequently, the candidate center that minimises the cost is selected (Grunau et al., 2023; Arthur et al., 2007).
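The seeding steps listed above can be written out as a small standard-library sketch. It covers only the choice of the initial centers, not the subsequent k-means iterations, and it is an illustration rather than the Scikit-learn implementation. The cost \(\varphi\) is taken to be the sum of squared Euclidean distances to the nearest center; the function names and the `seed` parameter are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost(points, centers):
    """phi(X, C): sum over points of the squared distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def greedy_kmeans_pp(points, k, l, seed=0):
    """Greedy k-means++ seeding: pick k centers, trying l candidates per step."""
    rng = random.Random(seed)

    # Steps 1-2: sample l candidates uniformly and keep the cheapest one.
    candidates = [rng.choice(points) for _ in range(l)]
    centers = [min(candidates, key=lambda c: cost(points, [c]))]

    # Steps 3-5: sample l candidates with probability proportional to the
    # squared distance to the current centers, then keep the candidate that
    # lowers the total cost the most.
    # Note: if every point coincides with a center, all weights are zero and
    # rng.choices raises ValueError; the sketch does not handle that case.
    for _ in range(k - 1):
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        candidates = rng.choices(points, weights=weights, k=l)
        best = min(candidates, key=lambda c: cost(points, centers + [c]))
        centers.append(best)

    # Step 6: return the k selected centers.
    return centers
```

Because points far from the current centers carry most of the sampling weight, well-separated groups of points are very likely to each receive one initial center, and the greedy choice among the \(l\) candidates makes this more likely still.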

The greedy k-means\(++\) algorithm is implemented using the KMeans function imported from the cluster module in the Scikit-learn library (Pedregosa et al., 2011) in Python. Four clusters, ten initialisations, and a maximum of three hundred iterations are utilised. The figures in 2D (Figs. 2, 3) depict the relationship between engagement, stigma, and one additional feature.

Figure 2a, b show that highly engaging messages contain a higher number of function words. In contrast, messages with more positive emotions do not elicit the same level of response as messages without them: highly engaging messages show a low level of positive emotions. This trend is consistent with the patterns observed in COVID vaccine discussions.

Fig. 2 General vaccine data, prior to the pandemic

Some features that predict engagement with COVID vaccines can also predict engagement with general vaccines; however, there are also differences. Stigmatised sentiment does not affect engagement in COVID vaccine discussions, whereas it is associated with higher engagement in general vaccine discussions. Stigmatised sentiment prevailed in conversations within both anti-vaccine and pro-vaccine groups prior to the COVID-19 pandemic (Fig. 2). The anti-vaccine movement openly expressed stigmatised views regarding vaccines and the out-group without encountering opposing opinions in the discussion stream.

In Straton (2022), it is observed that stigmatised sentiment does not influence engagement with the content. This is evidenced by the absence of significant variances in engagement across stigma and non-stigma annotation labels, as indicated by z-score and ANOVA F-score. Users exhibited similar reactions to discussions on COVID vaccines that contained high levels of stigma and those that were free from prejudice towards vaccines. 

While COVID vaccine discussions devoid of stigma exhibit slightly higher engagement, the Z-score for different engagement levels is not significant, as evidenced in Table 6 (HighEng. 2.0312; MedEng. \(-\)0.3827; LowEng. \(-\)0.3515).

Table 5 COVID vaccines (data from Twitter (X), YouTube, Reddit 2020–2021): top 30 features that predict Engagement
Table 6 Features of the model and their performance within each engagement cluster (Z-score, Standard Error of the MEAN)

Furthermore, the ratio of variance (ANOVA F-score regression) of 270 between engagement values and the stigma/not-stigma feature is relatively small compared to the other top 30 features in Table 5. The stigma feature is not a strong predictor of engagement based on COVID vaccine discussion data and is one of the features least correlated with engagement.

This is further supported by visualisations produced with unsupervised K-means clustering on the engagement feature, the annotated class, and one of the top features that predict engagement. The examples in Fig. 3a, b depict the positive emotion and function words features.
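The clustering step described above can be illustrated with a minimal sketch. It is not the study's implementation: the toy messages, the choice of three clusters, the deterministic initialisation, and the use of log(1 + count) as the engagement scale are all assumptions made for illustration, and the annotated class is cross-tabulated against the clusters rather than used as a clustering dimension.

```python
import math

def kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm with a deterministic start; returns (centroids, labels)."""
    ordered = sorted(points)
    n = len(ordered)
    # Deterministic initialisation (an assumption): k points spread evenly along the sorted data.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(pt, centroids[c]))
            for pt in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

# Hypothetical toy messages: (engagement count, function-word ratio, annotated class).
messages = [
    (0, 0.30, "stigma"), (1, 0.32, "not stigma"), (2, 0.35, "undefined"),
    (20, 0.45, "stigma"), (25, 0.48, "not stigma"), (30, 0.50, "undefined"),
    (400, 0.55, "stigma"), (500, 0.60, "not stigma"), (600, 0.58, "undefined"),
]
# Log-scale the engagement counts so that a few viral messages do not dominate the distances.
points = [(math.log1p(count), ratio) for count, ratio, _ in messages]
centroids, labels = kmeans(points, k=3)

# Cross-tabulate the annotated class against the engagement cluster.
crosstab = {}
for (_, _, annotation), cluster in zip(messages, labels):
    crosstab.setdefault(cluster, {}).setdefault(annotation, 0)
    crosstab[cluster][annotation] += 1
print(labels)
print(crosstab)
```

In this toy data the share of each annotated class is identical in every cluster, which is the kind of pattern the text reports for stigma in COVID vaccine discussions, while the function-word ratio rises with the engagement cluster.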

Fig. 3

COVID vaccine data

Users react similarly to stigmatised and non-stigmatised sentiment, as observed in Fig. 3a, b. Across different engagement values and engagement clusters, the ratio of stigma, non-stigma, and undefined sentiment remains largely consistent, as indicated by Z-scores, ANOVA F-score regression, and unsupervised learning K-means clustering. However, messages with a high ratio of function words typically demonstrate medium to high engagement. Conversely, discussions characterised by a high ratio of positive emotions tend to garner lower engagement with the content.

4 Discussion

Data regarding general vaccines have been collected from Facebook (Meta) and displayed in Table 1, whereas comments related to COVID vaccines (Straton, 2023) were gathered from Reddit, YouTube, and Twitter (X) (Table 2).

Different engagement levels in Tables 4 and 5 are presented for better visualisation and are derived from the log-normalised engagement attribute. The relationship between features and engagement values, however, is discerned through regression analysis. Engagement with stigmatised content, especially on anti-vaccination pages, was quite high before the pandemic. The anti-vaccination and pro-vaccination movements predominantly shared information within their own groups, rarely interacting with out-groups in order to avoid cognitive dissonance. Amidst the COVID-19 pandemic, the elimination of anti-vaccination pages (Straton, 2022) resulted in increased interactions with pro-vaccine groups in the comment sections. This transition might have influenced engagement with vaccine stigma content.

Throughout the COVID-19 pandemic, there is consistently higher engagement with not stigmatised sentiment on the Twitter (X) and YouTube platforms (Table 2). The Z-score (standard error of the mean) shows slightly higher engagement with not stigmatised sentiment than with stigmatised sentiment based on COVID vaccine data in Table 6 (a lower Z-score indicates stigmatised sentiment, while a higher Z-score indicates not stigmatised sentiment).
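How a per-cluster Z-score and standard error of the mean might be obtained can be sketched as follows. The text does not spell out the exact computation, so this is one plausible reading rather than the study's method: the feature is standardised over the whole dataset and then averaged within each engagement cluster. The coding (1 for not stigmatised, 0 for stigmatised), the cluster names, and the toy values are assumptions.

```python
import math
import statistics

def zscores(values):
    """Standardise values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def cluster_summary(feature, clusters):
    """Mean Z-score and standard error of the mean for each engagement cluster."""
    z = zscores(feature)
    result = {}
    for name in sorted(set(clusters)):
        members = [zi for zi, c in zip(z, clusters) if c == name]
        mean_z = statistics.fmean(members)
        # Standard error of the mean: sample standard deviation / sqrt(n).
        if len(members) > 1:
            sem = statistics.stdev(members) / math.sqrt(len(members))
        else:
            sem = float("nan")
        result[name] = (mean_z, sem)
    return result

# Hypothetical toy annotation: 1 = not stigmatised, 0 = stigmatised.
not_stigma = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
clusters = ["high"] * 4 + ["medium"] * 4 + ["low"] * 4

summary = cluster_summary(not_stigma, clusters)
for name, (mean_z, sem) in summary.items():
    print(f"{name}: mean Z = {mean_z:+.3f}, SEM = {sem:.3f}")
```

Under this coding, a higher mean Z-score in a cluster corresponds to a larger share of not stigmatised messages, which matches the reading given in the text.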

Nonetheless, an ANOVA F-score regression of 270 indicates that the variance between the stigma feature and engagement is insignificant compared to other features in COVID vaccine discussions (Table 5), suggesting that the stigma feature is not a particularly strong predictor of engagement.

The significance of the stigmatised feature as a predictor of engagement in general vaccine discussions before the pandemic, juxtaposed with its diminished relevance after the COVID-19 pandemic, suggests that the concept of engagement should be studied within a specific context. Efforts have been undertaken previously to analyse the interaction with health-related content on social media platforms. These studies have utilised large datasets extracted from Facebook (Meta) in order to discern the potential attributes of comments and posts that contribute to increased engagement. Straton et al. (2017a, 2017b) identified the primary factors influencing engagement as post type, the time span between post creation and post update, and the time of day the post was created. Post types on Facebook (Meta) included short status updates, other textual posts, photos, videos, and link-type posts. Although visual content was identified as the most engaging in the study (Straton et al., 2017b), highly engaging textual messages, such as short texts, were also observed. The discoveries from Straton et al. (2017c) indicate that the sentiment conveyed in social media messages can influence engagement.

Thus, when delving deeper into textual features, it becomes evident that certain language features contribute more to engagement than others. Nevertheless, these features vary in vaccine discussions before and after the pandemic, which could be attributed to specific events. The removal of anti-vaccine pages from social media platforms during the COVID-19 pandemic resulted in both anti-vaccination and pro-vaccination debates occurring within the same comment threads of COVID vaccine-related posts. This shift certainly changed the engagement dynamics with vaccine content on social media. Whereas stigmatised language against the out-group was encouraged before the pandemic through in-group discussions, mixed in-group and out-group debates during the COVID pandemic no longer show any engagement variance for this type of sentiment. Various factors may contribute to alterations in engagement dynamics, and the features of the text preceding and following the pandemic may offer additional insight into these fluctuations.

ANOVA F-score regression analysis was utilised to identify the most relevant features impacting engagement, with the Z-score (representing the standard error of the mean) employed as the measure of engagement variation. Engagement levels were subsequently categorised as high, medium, and low, predicated upon log-normalised engagement values and dataset features.
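The feature-ranking step can be illustrated with a short sketch. It assumes that 'ANOVA F-score regression' refers to the usual univariate linear F-test, in which the F statistic for one feature is r²/(1 − r²) × (n − 2), with r the Pearson correlation between the feature and the target; the text does not confirm the exact procedure. The feature names, the toy values, and the use of log(1 + count) as the log-normalised engagement are assumptions.

```python
import math

def f_score(x, y):
    """Univariate F statistic for a linear fit of y on a single feature x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r2 = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    if r2 >= 1.0:
        return math.inf  # perfect linear fit
    return r2 / (1.0 - r2) * (n - 2)

# Hypothetical toy data: raw engagement counts and two candidate features.
engagement_counts = [0, 1, 3, 7, 15, 31, 63, 127]
features = {
    "words_per_sentence": [8, 10, 11, 13, 16, 17, 19, 22],
    "stigma": [1, 0, 1, 0, 0, 1, 0, 1],
}

# Log-normalise engagement before scoring, since raw counts are heavily skewed.
log_engagement = [math.log1p(count) for count in engagement_counts]

f_scores = {name: f_score(values, log_engagement) for name, values in features.items()}
ranking = sorted(f_scores, key=f_scores.get, reverse=True)
for name in ranking:
    print(f"{name}: F = {f_scores[name]:.2f}")
```

A feature with a larger F score explains more of the variance in log-normalised engagement, so sorting by F yields a top-N list of the kind reported in Tables 4 and 5.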

Prior to the pandemic, twelve features emerged as robust predictors of engagement (Table 4): number of words per sentence, word count, number of characters in the message, stigma/not stigma sentiment, prepositions (with, above), function words (of, and), affect as process (ugly, bitter), family references, clout (power), health references (flu, cough), article, and the pronoun ‘they’.

During the COVID-19 pandemic, the features demonstrating the greatest influence on engagement, as determined by ANOVA F-score regression and Z-score analyses (Table 5), encompass: article, punctuation mark (comma), pronouns ‘she’ and ‘he’, function words (of, and), time (hour, day), number of words per sentence, first-person pronoun ‘I’, prepositions (with, above), conjunctions (but, whereas), positive emotions (happy, good), affect as process (ugly, bitter), and second-person pronoun ‘you’.

There is an overlap among some of the top thirty features in both datasets, yet their significance, as assessed through ANOVA F-score regressions, demonstrates variations across the datasets. These common features include: number of words per sentence, prepositions (with, above), function words (of, and), affect as process (ugly, bitter), clout (power words), health references (clinic, flu), article, positive emotions (happy, good), assent (agree, yes), exclamations, personal pronouns (them, itself), netspeak (lol, thx), and conjunctions (but, whereas).

Even though 43% of the features exhibit significant correlations in both datasets, the majority of features do not overlap and hold predictive capacity concerning engagement within the specific context of discussions concerning COVID vaccines or general vaccines. This observation suggests that a nuanced understanding and an effective measurement of engagement with content on social media platforms are best attained by considering specific features of the text or by exploring engagement within a specific context. It aligns with previous findings in Straton et al. (2017b) and Straton et al. (2017c), indicating that textual features can influence engagement. Exploring engagement through the analysis of big data offers broad insights into the phenomenon; however, the multitude of factors influencing engagement can yield varied outcomes contingent upon shifts in the contextual framework under study.
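The 43% figure appears to follow from the thirteen common features listed earlier, taken as a share of the top thirty features in each dataset. A quick check, assuming that thirty is the intended denominator:

```python
# The thirteen features named in the text as common to both top-thirty lists.
common_features = [
    "number of words per sentence", "prepositions", "function words",
    "affect as process", "clout", "health references", "article",
    "positive emotions", "assent", "exclamations", "personal pronouns",
    "netspeak", "conjunctions",
]
top_n = 30  # size of each top-feature list (Tables 4 and 5)

share = len(common_features) / top_n
print(f"{len(common_features)} of {top_n} features overlap: {share:.0%}")
```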

Features prevalent in general vaccine data before the pandemic (Table 4) suggest that high engagement is associated with an increased number of words per sentence, as well as higher word and character counts in a message. The latter can be linked to verbal fluency, cognitive complexity, or dominance in the conversation (Tausczik and Pennebaker, 2010).

In a cooperative context, a higher word count feature indicates better coordination in communication. However, in a discussion context, a higher word count may suggest a breakdown in the discussion (Tausczik and Pennebaker, 2010). Vaccine discussions resemble conversations with elements of debate and diatribe, suggesting a need for lengthier argumentation to convince and persuade.

High engagement in general vaccine discussions is strongly linked to prejudice, stereotypes, and stigmatised messages, eliciting responses from users. Such behaviour can be expected in highly disputed topics, especially considering that the discussions primarily took place within in-group settings. Prepositions also play a prominent role in highly engaging comments. According to Hartley et al. (2003) in Tausczik and Pennebaker (2010), prepositions signify complex and detailed information, typically found in the reasoning and discussion parts of written text. The prevalence of function words in highly engaging comments indicates how individuals communicate and allocate their attention, as reflected in their use of pronouns. The frequent use of ‘they’ pronouns in engaging texts suggests an awareness of the out-group. As mentioned in Gunsch et al. (2000) and cited in Tausczik and Pennebaker (2010), an increase in references to the other party (in this instance, ‘they’) suggests a concentration on an adversary. In diatribes, this focus is likely to be negative. Family references can be attributed to discussions about parents’ vaccine choices for their children, while clout/power words may suggest control and the authority of government institutions in administering vaccines. Health references, such as ‘clinic’ and ‘flu’, are present in engaging comments in both general and COVID vaccine discussions, and they are also found in low engagement discussions. Health references are commonly observed in discussions about health and vaccines. The prevalence of articles in highly engaging posts suggests a tendency to be very specific through the utilisation of concrete nouns, as discussed in Pennebaker and Lay (2002) and Tausczik and Pennebaker (2010).

According to Newman et al. (2003) and Pennebaker and Lay (2002), articles, pronouns, and prepositions can reveal important information about a person, similar to relevant nouns or verbs. The observation that the use of articles and prepositions is linked to fewer emotion words, as discussed in Tausczik and Pennebaker (2010), finds support in the current research. Highly engaging posts contain fewer positive and other emotion words, including swear words, which are considered negative words according to Rassin and Muris (2005) in Oliver et al. (2008). Affective processes are associated with emotionality, as discussed in Blonder et al. (2005), Djikic et al. (2006), and Gill et al. (2008), as cited in Tausczik and Pennebaker (2010). The findings in Table 4 indicate a negative correlation between highly engaging discussions and the presence of affect words and emotions. This suggests that highly engaging posts are less likely to be written impulsively and are written with more self-control.

Highly engaging general vaccine discussions appear to be more planned and less emotional, featuring reasoning and differentiation typical of more complex discussion sections, also incorporating simultaneous references to authority and family concerns. The engaging sentiment aligns with the stigmatised sentiment described in Straton (2022). Stigma can be intentionally shared to deceive, or the person may genuinely believe in biased information being shared.

Engaging content in COVID vaccine discussions, similar to general vaccine discourse, often lacks emotional elements. This is evident based on the negative correlation observed between positive emotion features, affective processes, and engagement categories. Moreover, the prevalence of articles and prepositions in the text suggests the use of less emotional language, as discussed in Tausczik and Pennebaker (2010). The presence of articles may also indicate higher linguistic complexity. The number of words per sentence feature, also observed in general vaccine discussions, implies that more words are employed to persuade, as vaccine discussions often resemble diatribes. However, as discussed in Hancock et al. (2007), this could also indicate deception when comparing the message characteristics of liars and truth tellers. Contrary to expectations, the use of the pronoun ‘you’ in the current research is linked with low engagement, as indicated by ANOVA F-score regression in Table 5. As discussed in Tausczik and Pennebaker (2010), pronouns can signal the quality of the relationship. Excessive use of the pronoun ‘you’ can indicate blaming or distancing attitudes, as mentioned in Hahlweg et al. (1984) and cited in Simmons et al. (2005). Simmons et al. (2005) elaborated that the ‘you’ pronoun suggests negative interaction behaviour. The finding could elucidate the lower engagement observed in discussions characterised by in-group and out-group dynamics, when messages contain a predominance of second-person pronouns.

The use of the first-person singular pronoun ‘I’ is correlated with high engagement and, as mentioned in Tausczik and Pennebaker (2010), it is more commonly used by discussion participants of lower status. Additionally, the first-person singular pronoun is associated with honesty; less frequent use of it is associated with deception, according to Newman et al. (2003) and later confirmed by Hancock et al. (2007). Since the first-person pronoun signifies ‘taking ownership of the statement’, liars tend to avoid using it in a discussion (Hancock et al., 2007).

There is a positive correlation between high engagement and the use of she/he pronouns. According to Bond and Lee (2005), third-person pronouns (he, she, her) are less common in deceptive statements and more frequent in truthful messages.

Messages that exhibit high levels of engagement also feature an increased frequency of references to temporal indicators. According to the findings in Vrij et al. (2007) and Vrij (2005), contextual references, including mentions of time, appear more frequently in truthful discussions than in deceptive ones.

Conjunctions are pervasive in engaging comments and play a role in logically organising words, thereby contributing to the coherence of the overall statement, as observed in Graesser et al. (2004) and Tausczik and Pennebaker (2010). Additionally, highly engaging comments tend to have more punctuation marks, such as commas, which can be attributed to the relatively longer sentences used in these discussions.

Emotionally neutral content tends to be highly engaging when discussing COVID vaccines. While a high number of words might suggest deception, other features, such as references to time, the use of first-person singular pronouns, and the use of third-person singular pronouns, indicate honesty in the discussion. Blaming and distancing attitudes, often conveyed through the pronoun ‘you’, are infrequently encountered in comments that exhibit medium to high levels of engagement. Moreover, highly engaging COVID-19 vaccine discussions exhibit coherence and include features such as the use of articles, which may indicate the complexity of the language (Pennebaker and King, 1999).

The extent to which the results of the analyses encompass demographics remains uncertain, given the absence of demographic data collection in the current study. As social media channels mature, the user cohort using them also ages. However, despite the extensive range of demographics already using social media platforms, there are many groups who do not have access to them. It is primarily young adults and middle-aged groups who receive their health information from social media, so it is possible that the analysis under-represents older age groups. Furthermore, the data sourced from Facebook (Meta), Reddit, and Twitter (X) might exhibit ‘gender bias’ owing to the predominance of male audiences on these platforms.

5 Conclusion

Previous studies (Keller et al., 2014; Wong et al., 2014; Goodyear et al., 2018; Pérez-Escoda et al., 2020) advocate for increased involvement in examining interactions with health-care discourse, researcher engagement, youth participation, and regional interest in health-care content. Additionally, several quantitative inquiries have explored engagement with health information disseminated on social media (Straton et al., 2017a, b, c). Nevertheless, these subsequent studies have not delved into the sentiment of the discourse. The COVID-19 pandemic emphasised the pivotal significance of socially shared health discourse, particularly due to the prevalent use of social media for obtaining health-related information. This study not only reaffirms the significance of previous research but also sheds light on how discussions among the general public rapidly morph into perceived expert viewpoints, possibly influenced by the relatively low involvement of the research community. Filtering out pages containing highly stigmatised opinions about health can help foster a more balanced perspective; however, it may not completely eradicate the issue. Stigmatised narratives about health on social media are fuelled by uncertainty and dichotomisation of topics, such as vaccines, especially newly developed ones. Despite the efforts to remove misleading content, it finds its way back.

In order to gain insights into the features that capture the attention of the general public regarding health-care information, as well as to comprehend their reactions to stigmatising narratives, this study utilised a combination of quantitative techniques, such as ANOVA F-score regression, Z-score, and K-means clustering. The integration of all three approaches aids in providing a holistic understanding of engagement with stigmatised sentiment, alongside other sentiment features, within health discussions shared across social media platforms.

Prior to the COVID-19 pandemic, vaccine discussions featuring stigmatised, prejudiced, and stereotyped content often garnered high engagement. Consequently, there was an increase in references to blame, attributable to the in-group nature of such discussions. There was little to no reservation in stigmatising the out-group, since stigmatised discussions are easily echoed among like-minded group members. In-group communication on social media platforms can serve as echo-chambers, perpetuating fear and consequently creating challenges for health organisations and practitioners in convincing those within the group to make informed decisions about their health. High engagement with stigmatised content suggests an imbalance in discussions, indicating a lack of fact-checking with health authorities.

Studying the public's engagement with vaccine discourse or any health-care information is pivotal, as it provides health authorities with valuable insights into the most impactful sentiment features to utilise when communicating with the general public or participating in pertinent discussions. Engaging discussions reference authority and family, whereas stigma is primarily directed against pharmaceutical companies, government institutions, and members of the out-group movement. Such content is more structured, complex, less emotional, and ‘polished’ compared with low-engaging content. The sentiment features indicate that comments were not written spontaneously but with a certain degree of premeditation, which is a significant finding in the study.

Certain events, such as the removal of anti-vaccination pages after the COVID-19 pandemic began, can alter the dynamics of a discussion and the content that is liked and shared. Additionally, stigmatised sentiment, alongside other features, appears to exert a limited influence on engagement in vaccine discourse during the COVID pandemic. The conclusions drawn in the current research underscore the significance of investigating engagement with social media content within a narrowly defined context, given the challenges of accounting for evolving dynamics over time.

The current research serves as a reminder to health authorities about the importance of avoiding stereotyped and prejudiced sentiment when communicating with the public about health care and prevention measures, vaccines in particular. Persistent polarisation surrounding vaccines, fuelled by stigma and prejudice, can result in decisions with broader implications beyond individual vaccine hesitancy.

One potential direction for future research is to build upon the findings of this study and investigate engagement with socially shared health discourse within narrower demographic or regional contexts. This would entail extracting data from social media platforms that are relevant to specific regions and analysing the language features within that context.

Future research on engagement with health-care content should foster more collaboration between computer scientists, social scientists, patients, and health-care providers. This might require undertaking longitudinal studies, such as monitoring patients’ exposure to stigmatised health-related content (including vaccines) and analysing their subsequent health-care decisions, taking into account their demographic traits.

Closer collaboration between public health experts, researchers, doctors, and laypeople discussing vaccine or other health-related topics on social media can serve as a fact-checking mechanism. When expert opinions are part of the discussion, such collaboration can lead to a reduction in stigmatised sentiment and more effective communication.