Background

In the light of increased vaccine hesitancy in various countries, consistent monitoring of public beliefs and opinions about the national immunization program is important. Besides performing qualitative research and surveys, real-time monitoring of social media data about vaccination is a valuable tool to this end. Its advantages are that it enables timely detection of and response to possible vaccine concerns, that it generates continuous data, and that it consists of unsolicited, voluntary user-generated content.

Several studies that analyse tweets have already been conducted, providing insight into the content that was tweeted most during the 2009 H1N1 outbreak [1], the information flow between users with a certain sentiment during this outbreak [2], or trends in tweets that convey, for example, worries about the efficacy of HPV vaccines [3, 4]. While human coders are best at deploying world knowledge and interpreting the intention behind a text, manual coding of tweets is laborious. The above-mentioned studies therefore aimed at developing and evaluating a system to code tweets automatically. Several systems are in place that make use of such automatic coding. The Vaccine Confidence Project [5] is a real-time worldwide internet monitor for vaccine concerns. The Europe Media Monitor (EMM) [6] was installed to support EU institutions and Member State organizations with, for example, the analysis of real-time news for medical and health-related topics and with early warning alerts per category and country. MEDISYS, derived from the EMM and developed by the Joint Research Centre of the European Commission [7], is a media monitoring system providing event-based surveillance to rapidly identify potential public health threats based on information from media reports.

These systems cannot be used directly for the Netherlands because they do not contain search words in Dutch, lack an opinion-detection functionality, or do not include categories of the proper specificity. Furthermore, opinions towards vaccination are contextualized by national debates rather than a multinational debate [8], which implies that a system for monitoring vaccination stance on Twitter should ideally be trained on and applied to tweets of the same language and nationality. Finally, by creating an automatic system for mining public opinions on vaccination concerns, one can continue training and adapting the system. We therefore believe it will be valuable to build our own system. Besides analysing the content of tweets, several other applications that use social media with regard to vaccination have been proposed. They, for example, use data about internet search activity and numbers of tweets as a proxy for (changes in) vaccination coverage or for estimating epidemiological patterns. Huang et al. [9] found a high positive correlation between reported influenza attitude and behavior on Twitter and influenza vaccination coverage in the US. In contrast, Aquino et al. [10] found an inverse correlation between Mumps, Measles, Rubella (MMR) vaccination coverage and tweets, Facebook posts and internet search activity about autism and the MMR vaccine in Italy. This outcome was possibly due to a decision of the Court of Justice in one of the regions to award vaccine-injury compensation for a case of autism. Wagner, Lampos, Cox and Pebody [11] assessed the usefulness of geolocated Twitter posts and Google search as source data to model influenza rates, by measuring their fit to the traditional surveillance outcomes and analyzing the data quality. They find that Google search could be a useful alternative to the regular means of surveillance, while Twitter posts do not correlate well due to a lower volume and a bias in demographics. Lampos, de Bie and Cristianini [12] also make use of geolocated Twitter posts to track epidemics, and present a monitoring tool with a daily flu-score based on weighted keywords.

Various studies [13–15] show that estimates of influenza-like illness symptoms mentioned on Twitter can be exploited to track reported disease levels relatively accurately. However, other studies [16, 17] showed that this was only the case when looking at severe cases (e.g. hospitalizations, deaths) or only for the start of the epidemic, when interest from journalists was still high.

Other research focuses on detecting discussion communities on vaccination in Twitter [18] or analysing semantic networks [19] to identify the most relevant and influential users as well as to better understand complex drivers of vaccine hesitancy for public health communication. Tangherlini et al. [20] explore what can be learned about the vaccination discussion from the realm of "mommy blogs": parents posting messages about children’s health care on forum websites. They aim to obtain insights into the underlying narrative frameworks, and analyse the topics of the messages using Latent Dirichlet Allocation (LDA) [21]. They find that the most prominent frame is a focus on the exemption of one’s child from receiving a vaccination in school. The motivation against vaccination is most prominently based on personal beliefs about health, but could also be grounded in religion. Surian et al. [22] also apply topic modeling to distinguish dominant opinions in the discussion about vaccination, and focus on HPV vaccination as discussed on Twitter. They find a common distinction between tweets reporting on personal experience and tweets that they characterize as ‘evidence’ (statements of having had a vaccination) or ‘advocacy’ (statements that support vaccination).

Most similar to our work is the study by Du, Xu, Song, Liu and Tao [3]. With the ultimate aim of improving vaccine uptake, they applied supervised machine learning to analyse the stance towards vaccination as conveyed on social media. Messages were labeled as either related to vaccination or unrelated, and, when related, as ‘positive’, ‘negative’ or ‘neutral’. The ‘negative’ category was further broken down into several considerations, such as ‘safety’ and ‘cost’. After having annotated 6,000 tweets, they trained a classifier on different combinations of features, obtaining the highest macro F1-score (the average of the separate F1-scores for each prediction category) of 0.50 and micro F1-score (the F1-score over all predictions) of 0.73. Tweets with a negative stance that point to safety risks could best be predicted, at an optimal F1-score of 0.75, while the other five sub-categories with a negative stance were predicted at an F1-score below 0.5 or even 0.0.

Like Du et al. [3], we focus on analysing sentiment about vaccination using Twitter as a data source and applying supervised machine learning approaches to extract public opinion from tweets automatically. In contrast, in our evaluation we focus on detecting messages with a negative stance in particular. Accurately monitoring such messages helps to recognize discord at an early stage and take appropriate action. We do train machine learning classifiers to model categories other than the negative stance, and evaluate whether this is beneficial for detecting tweets with a negative stance. For example, we study whether it is beneficial to this task to model tweets with a positive and neutral stance as well. We also inquire whether a more fine-grained categorization of sentiment (e.g.: worry, relief, frustration and informing) offers an advantage. Apart from comparing performance in the context of different categorizations, we compare different machine learning algorithms and compare data with different levels of annotation reliability. Finally, the performance of the resulting systems is compared to the regular sentiment analysis common to social media monitoring dashboards. At the public health institute in the Netherlands, we make use of the social media monitoring tools offered by Coosto. For defining whether a message is positive, negative or neutral with regard to vaccination, this system makes use of the presence or absence of positive or negative words in the messages. We believe that we could increase the sensitivity and specificity of the sentiment analysis by using supervised machine learning approaches trained on a manually coded dataset. The performance of our machine learning approaches is therefore compared to the sentiment analysis that is currently applied in the Coosto tool.

Implementation

We set out to curate a corpus of tweets annotated for their stance towards vaccination, and to employ this corpus to train a machine learning classifier to distinguish tweets with a negative stance towards vaccination from other tweets. In the following, we will describe the stages of data acquisition, from collection to labeling.

Data collection

We queried Twitter messages that refer to a vaccination-related key term from TwiNL, a database with IDs of Dutch Twitter messages from January 2012 onwards [23]. In contrast to the open Twitter Search API, which only allows one to query tweets posted within the last seven days, TwiNL makes it possible to collect a much larger sample of Twitter posts, spanning several years.

We queried TwiNL for different key terms that relate to the topic of vaccination in a five-year period, ranging from January 1, 2012 until February 8, 2017. Query terms that we used were the word ‘vaccinatie’ (Dutch for ‘vaccination’) and six other terms closely related to vaccination, with and without a hashtag (‘#’). Among the six words is ‘rijksvaccinatieprogramma’, which refers to the vaccination programme in The Netherlands. An overview of all query terms along with the number of tweets that could be collected based on them is displayed in Table 1.

Table 1 Overview of the number of Twitter messages that were queried from TwiNL and filtered, from the period between January 2012 and February 2017

We collected a total of 96,566 tweets from TwiNL, which we filtered in a number of ways. First, retweets were removed, as we wanted to focus on unique messages. This led to a removal of 31% of the messages. Second, we filtered out messages that contain a URL. Such messages often share a news headline and include a URL to refer to the complete news message. As a news headline does not reflect the stance of the person who posted the tweet, we decided to apply this filtering step. It is likely that some of the messages with a URL do include a message composed by the sender, but this step helps to filter out many unwanted messages. Third, we removed messages that include a word related to animals and traveling (‘dier’, animal; ‘landbouw’, agriculture; and ‘teek’, tick), as we strictly focus on messages that refer to vaccination that is part of the governmental vaccination program. After filtering, 27,534 messages remained. This is the data set that is used for experimentation.
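As an illustration of these filtering steps, the sketch below shows how such a filter could be implemented; the function, example tweets and exact matching rules are hypothetical, and the actual filtering applied to the corpus may differ in detail.

```python
# Minimal sketch of the filtering steps described above (hypothetical
# implementation; the exclusion terms are the ones mentioned in the text).
EXCLUDE_TERMS = {"dier", "landbouw", "teek"}

def keep_tweet(text: str) -> bool:
    lowered = text.lower()
    if lowered.startswith("rt @"):                      # 1. drop retweets
        return False
    if "http://" in lowered or "https://" in lowered:   # 2. drop tweets with a URL
        return False
    tokens = lowered.split()
    if any(term in tokens for term in EXCLUDE_TERMS):   # 3. drop off-topic terms
        return False
    return True

tweets = [
    "Vaccinatie is belangrijk voor iedereen",
    "RT @user: nieuws over vaccinatie https://example.org",
]
filtered = [t for t in tweets if keep_tweet(t)]
print(filtered)  # only the first tweet remains
```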

Data annotation

The stance towards vaccination was categorized into ‘Negative’, ‘Neutral’, ‘Positive’ and ‘Not clear’. The latter category was essential, as some posts do not convey enough information about the stance of the writer. In addition to the four-valued stance classes, we included separate annotation categories for relevance, subject and sentiment. With these additional categorizations we aimed to obtain a precise grasp of all possibly relevant tweet characteristics in relation to vaccination, which could help in a machine learning setting.

The relevance categories were divided into ‘Relevant’, ‘Relevant abroad’ and ‘Irrelevant’. Despite our selection of vaccination-related keywords, tweets that mention these words might not refer to vaccination at all. A word like ‘vaccine’ might be used in a metaphorical sense, or the tweet might refer to vaccination of animals.

The subject categorization was included to describe what the tweet is primarily about: ‘Vaccine’, ‘Disease’ or ‘Both’. We expected that a significant part of the tweets would focus on the severity of a disease when discussing vaccination. Distinguishing these tweets could also help the detection of the stance.

Finally, the sentiment of tweets was categorized into ‘Informative’, ‘Angry/Frustration’, ‘Worried/Fear/Doubts’, ‘Relieved’ and ‘Other’, where the latter category lumps together occasional cases of humor, sarcasm, personal experience, and questions raised. These categories were based on [1], where they emerged from the analysis of H1N1-related tweets. The ‘Informative’ category refers to a typical type of message in which information is shared, potentially in support of a negative or positive stance towards vaccination. If the message contained more than one sentiment, the first sentiment identified was chosen. Table 2 shows examples of tweets for the above-mentioned categories.

Table 2 Specification of the annotation categories

We aimed to annotate a sufficient number of tweets to train a machine learning classifier. The majority of tweets were annotated twice. We built an annotation interface tailored to the task. Upon being presented with the text of a Twitter post, the annotator was first asked whether the tweet was relevant. In case it was deemed relevant, the tweet could be annotated for the other categorizations. Otherwise, the user could click ‘OK’, after which he or she was directly presented with a new Twitter post. The annotator was presented with sampled messages that were either not annotated yet or annotated once. We ensured a fairly equal distribution of these two types, so that most tweets would be annotated twice.

As annotators, we hired four student assistants and additionally made use of the Radboud Research Participation System. We asked participants to annotate for the duration of an hour, in exchange for a voucher worth ten Euros or one course credit. Before starting the annotation, the participants were asked to read the annotation manual, with examples and an extensive description of the categories, and were presented with a short training round in which feedback on their annotations was given. The annotation period lasted for six weeks. We stopped when the number of applicants dropped.

A total of 8,259 tweets were annotated, of which 6,472 were annotated twice (78%). In total, 65 annotators took part in the study, with an average of 229.5 annotated tweets per person. The number of annotations per person varied considerably, with 2,388 tweets coded by the most active annotator. This variation is due to the different ways in which annotators were recruited: student assistants were recruited for several days, while participants recruited through the Radboud Research Participation System could only join for the duration of an hour.

We calculated inter-annotator agreement using Krippendorff’s alpha [24], which accounts for different annotator pairs and empty values. To also zoom in on the agreement per category, we calculated mutual F-scores for each of the categories. This metric is typically used to evaluate system performance by category on gold standard data, but can also be applied to annotation pairs by alternating the roles of the two annotators between classifier and ground truth. A summary of the agreement by categorization is given in Table 3. While the Relevance and Subject categorizations are annotated at percent agreements of 0.71 and 0.70, their agreement scores are only fair, at α=0.27 and α=0.29. The percent agreement on Stance and Sentiment, which carry more categories than the former two, is 0.54 for both. Their agreement scores are also fair, at α=0.35 and α=0.34. The mutual F-scores show marked differences in agreement by category, where the categories that were annotated most often typically yield a higher score. This holds for the Relevant category (0.81), the Vaccine category (0.79) and the Positive category (0.64). The Negative category yields a mutual F-score of 0.42, which is higher than the scores of the more frequently annotated categories Neutral (0.23) and Not clear (0.31). We found that these two categories are often confused; after combining their annotations, the stance agreement would increase to α=0.43.

Table 3 Agreement scores for all four categorizations; mutual F-score is reported by category
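As an illustration, the mutual F-score per category could be computed as follows (a minimal sketch using scikit-learn; the annotation pairs shown are hypothetical). Because precision and recall simply swap when the roles of the two annotators are reversed, the per-category F1-score is the same in both directions.

```python
from sklearn.metrics import f1_score

# Hypothetical paired stance annotations for the same five tweets.
annotator_a = ["Negative", "Positive", "Neutral", "Positive", "Not clear"]
annotator_b = ["Negative", "Neutral",  "Neutral", "Positive", "Positive"]

labels = ["Negative", "Neutral", "Positive", "Not clear"]
# Treat one annotator as the 'gold standard' and the other as the 'classifier'.
mutual_f = f1_score(annotator_a, annotator_b, labels=labels, average=None)
for label, score in zip(labels, mutual_f):
    print(f"{label}: {score:.2f}")
```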

The rather low agreement over the annotation categories indicates the difficulty of interpreting stance and sentiment in tweets that discuss the topic of vaccination. We therefore proceed with caution in categorizing the data for training and testing our models. The agreed-upon tweets will form the basis of our experimental data, as proposed by Kovár, Rychlý and Jakubíček [25], while the other data is added as additional training material to see if the added quantity is beneficial to performance. We will also annotate a sample of the agreed-upon tweets, to make sure that these data are reliable in spite of the low agreement rate.

Data categorization

The labeled data that we composed based on the annotated tweets are displayed in Table 4. We combined the Relevant and Relevant abroad categories into one category (‘Relevant’), as only a small part of the tweets was annotated as Relevant abroad. We did not make use of the subject annotations, as only a small minority of the relevant tweets referred to a disease only. For the most important categorization, stance, we included all annotated labels. Finally, we combined part of the more frequent sentiment categories with Positive.

Table 4 Overview of data set (the cells indicate the number of examples per label and data type)

We distinguish three types of labeled tweets: ‘strict’, ‘lax’ and ‘one’. The strictly labeled tweets were labeled by both annotators with the same label. The lax labels describe tweets that were only annotated with a certain category by one of the coders. The categories were ordered by importance to decide on the lax labels. For instance, in case of the third categorization, Negative was preferred over Positive, followed by Neutral, Not clear and Irrelevant. If one of the annotators labeled a tweet as Positive and the other as Neutral, the lax label for this tweet is Positive. In Table 4, the categories are ordered by the preference imposed on the lax labeling. The ‘one’ labeling applies to all tweets that were annotated by only one annotator. Note that the total counts can differ between label categorizations due to the lax labeling: the counts for Positive labels in the Polarity + Sentiment labeling (Positive + Frustration, Positive + Information and Positive + Other) do not add up to the count of the Positive label in the Polarity labeling.
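The lax-labeling rule can be summarized by the following sketch; the function is our own, and the preference order shown is the one described above for the polarity categorization.

```python
# Preference order for the polarity categorization, most important first.
PREFERENCE = ["Negative", "Positive", "Neutral", "Not clear", "Irrelevant"]

def lax_label(label_a: str, label_b: str) -> str:
    """Resolve two annotations of the same tweet into a single label."""
    if label_a == label_b:
        return label_a                               # strict (agreed) label
    # Otherwise keep the label that ranks highest in the preference order.
    return min(label_a, label_b, key=PREFERENCE.index)

assert lax_label("Positive", "Neutral") == "Positive"
assert lax_label("Neutral", "Negative") == "Negative"
```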

With the ‘strict’, ‘lax’ and ‘one’ labeling, we end up with four variants of data to experiment with: only strict, strict + lax, strict + one and strict + lax + one. The strict data, which are most reliable, are used in all variants. By comparing different combinations of training data, we test whether the addition of less reliably labeled data (lax and/or one) boosts performance.

The four labelings have increasing granularity, while the number of examples for the Negative category is stable across labelings. In the first labeling, these examples are contrasted with any other tweet. It hence comprises a binary classification task. In the second labeling, irrelevant tweets are indicated in a separate category. The Other class here represents all relevant tweets that do not convey a negative stance towards vaccination. In the third labeling, this class is split into the stance categories Positive, Neutral and Not clear. In the fourth labeling, the Positive category, which is the most frequent polarity class, is further split into ‘Positive + Frustration’, ‘Positive + Information’ and ‘Positive + Other’. Positivity about vaccination combined with a frustration sentiment reflects tweets that convey frustration about the arguments of people who are negative about vaccination (e.g.: "I just read that a 17 year old girl died of the measles. Because she did not want an inoculation due to strict religious beliefs. -.- #ridiculous"). The Positive + Information category reflects tweets that provide information in favor of vaccination, or that combine information with a positive stance towards vaccination (e.g.: "#shingles is especially common with the elderly and chronically diseased. #vaccination can prevent much suffering. #prevention").

In line with Kovár, Rychlý and Jakubíček [25], we evaluate system performance only on the reliable part of the annotations: the instances labeled with the same label by two annotators. As the overall agreement is not sufficient, with Krippendorff’s alpha ranging between 0.27 and 0.35, the first author annotated 300 tweets sampled from the strict data (without knowledge of the annotations) to rule out the possibility that these agreed-upon annotations are due to chance agreement. Comparing these new annotations to the original ones, the Negative category and the Positive category are agreed upon at mutual F-scores of 0.70 and 0.81. The percent agreement on the binary classification scheme (i.e. Negative versus Other) is 0.92, with α=0.67, which decreases to α=0.55 for the Relevance categorization, α=0.54 for the Polarity categorization and α=0.43 for the Polarity + Sentiment categorization. We find that instances of a negative and positive stance can be clearly identified by humans, while the labels Neutral and Not clear are less clear-cut. Since our focus is on modeling tweets with a negative stance, the agreement on the binary decision between Negative and Other is just sufficient for experimentation, based on Krippendorff’s [26] remark that "α ≥ .667 is the lowest conceivable limit" (p. 241). In our experimental set-up we will therefore only evaluate our system performance on distinguishing the Negative category from any other category in the strict data.

Experimental set-up

For each combination of labeling (four types of labeling) and training data (four combinations of training data) we train a machine learning classifier to best distinguish the given labels. Two different classifiers are compared: Multinomial Naive Bayes and Support Vector Machines (SVM). In total, this makes for 32 variants (4 labelings × 4 combinations of training data × 2 classifiers). All settings are tested through ten-fold cross-validation on the strict data and are compared against two sentiment analysis implementations, two random baselines and an ensemble system combining the output of the best machine learning system and a rule-based sentiment analysis system. All components of the experimental set-up are described in more detail below.

Preprocessing

To properly distinguish word tokens and punctuation we tokenized the tweets by means of Ucto, a rule-based tokenizer with good performance on the Dutch language and with a configuration specific for Twitter. Tokens were lowercased in order to focus on the content. Punctuation was maintained, as well as emoji and emoticons. Such markers could be predictive in the context of a discussion such as vaccination. To account for sequences of words and characters that might carry useful information, we extracted word unigrams, bigrams and trigrams as features. Features were coded binary, i.e. set to 1 if a feature is seen in a message and to 0 otherwise. During training, all features apart from the top 15,000 most frequent ones were removed.
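A close approximation of this feature extraction can be obtained with scikit-learn's CountVectorizer, as sketched below; we assume here that the tweets have already been tokenized with Ucto and rejoined with spaces, so a plain whitespace split suffices as tokenizer, and the example tweets are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already tokenized tweets (Ucto output joined with spaces).
train_tweets = [
    "vaccinatie is belangrijk !",
    "ik ben tegen het rijksvaccinatieprogramma 😠",
]

vectorizer = CountVectorizer(
    tokenizer=str.split,     # input is pre-tokenized
    token_pattern=None,      # suppress the default token pattern
    lowercase=True,          # lowercase all tokens
    ngram_range=(1, 3),      # word uni-, bi- and trigrams
    max_features=15000,      # keep only the 15,000 most frequent features
    binary=True,             # presence/absence rather than counts
)
X_train = vectorizer.fit_transform(train_tweets)
print(X_train.shape)
```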

System variants

We compare the performance of four types of systems on the data: machine learning, sentiment analysis, an ensemble of these two, and random baselines.

Machine Learning

We applied two machine learning algorithms with a different perspective on the data: Multinomial Naive Bayes and SVM. The former algorithm is often used on textual data. It models the Bayesian probability of features belonging to a class and makes predictions based on a linear calculation. Features are naively seen as independent of one another [27]. SVMs, in their simplest form, are binary linear classifiers, which can be extended with kernels to handle data that are not linearly separable. They search for the optimal hyperplane in the feature space that maximizes the geometric margin between the two classes. The advantage of SVMs is that they provide a solution to a global optimization problem, thereby reducing the generalization error of the classifier [28].

Both algorithms were applied by means of the scikit-learn toolkit, a Python library that offers implementations of many machine learning algorithms [29]. To cope with the imbalance in the number of instances per label, for Multinomial Naive Bayes we set the alpha parameter to 0.0 and disabled fitting of the class prior. For SVM, we used a linear kernel with the C parameter set to 1.0 and a balanced class weight.
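The corresponding scikit-learn configurations could look as follows (a sketch; enabling probability estimates for the SVM is our own addition, needed later when altering the prediction threshold):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Multinomial Naive Bayes without smoothing and without fitted class priors
# (depending on the scikit-learn version, alpha=0.0 may be clipped to a tiny
# positive value with a warning).
nb_clf = MultinomialNB(alpha=0.0, fit_prior=False)

# Linear-kernel SVM with C=1.0 and balanced class weights; probability=True
# enables the class-probability estimates used for threshold tuning later.
svm_clf = SVC(kernel="linear", C=1.0, class_weight="balanced", probability=True)
```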

Sentiment analysis

Two sentiment analysis systems for Dutch were included in this study. The first sentiment analysis system is Pattern, a rule-based off-the-shelf sentiment analysis system that makes use of a list of adjectives with a positive or negative weight, based on human annotations [30]. Sentences are assigned a score between −1.0 and 1.0 by multiplying the scores of their adjectives. Bigrams like ‘horribly good’ are seen as one adjective, where the adjective ‘horribly’ increases the positivity score of ‘good’. We translated the polarity score into the discrete labels ‘Negative’, ‘Positive’ and ‘Neutral’ by using the training data to infer which threshold leads to the best performance on the ‘Negative’ category.
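A sketch of how the Pattern polarity score might be mapped onto the three discrete labels is given below; the candidate thresholds, the tuning loop and the mapping of non-negative scores are our own assumptions rather than the exact procedure used.

```python
from pattern.nl import sentiment          # Dutch sentiment scoring in Pattern
from sklearn.metrics import f1_score

def pattern_label(text: str, threshold: float) -> str:
    """Map Pattern's polarity score (-1.0 to 1.0) onto a discrete label."""
    polarity, _subjectivity = sentiment(text)
    if polarity <= threshold:
        return "Negative"
    return "Positive" if polarity > 0.0 else "Neutral"

def tune_threshold(texts, gold_labels, candidates=(-0.5, -0.3, -0.1, 0.0, 0.1)):
    """Pick the threshold that maximizes F1 on the Negative category."""
    return max(
        candidates,
        key=lambda t: f1_score(gold_labels,
                               [pattern_label(x, t) for x in texts],
                               labels=["Negative"], average="micro"),
    )
```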

The second sentiment analysis system is the one offered by the aforementioned social media monitoring dashboard Coosto. We included this system as it is commonly used by organizations and companies for monitoring the public sentiment on social media regarding a given topic, and thereby is the main system to which our machine learning set-ups should be compared. As Coosto is a commercial product, there is no public documentation on their sentiment analysis tool.

Ensemble

Machine learning and Pattern’s rule-based sentiment analysis are two diverging approaches to detecting the stance towards vaccination on Twitter. We test whether they are beneficially complementary, in terms of precision or recall, by means of an ensemble system that combines their output. We include a precision-oriented ensemble system and a recall-oriented ensemble system, both focused on the binary task of classifying a tweet as ‘negative’ towards vaccination or as something else. These systems combine the predictions of the best ML system and Pattern: the precision-oriented variant labels a tweet as ‘negative’ if both systems have made this prediction, while the recall-oriented variant labels a tweet as ‘negative’ if at least one of the two has made this prediction.
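Concretely, the two variants combine the binary predictions as follows (a minimal sketch; the boolean flags are our own representation of whether each system predicted ‘Negative’):

```python
def ensemble_precision(ml_negative: bool, pattern_negative: bool) -> bool:
    # Precision-oriented: both systems must label the tweet as Negative.
    return ml_negative and pattern_negative

def ensemble_recall(ml_negative: bool, pattern_negative: bool) -> bool:
    # Recall-oriented: a Negative prediction by either system suffices.
    return ml_negative or pattern_negative
```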

Baselines

In addition to machine learning, sentiment analysis and an ensemble of the two, we applied two random baselines: predicting the negative class randomly for 50% of the messages and predicting the negative class randomly for 15% of the messages. The latter proportion relates to the proportion of vaccination-hesitant tweets in the strictly labeled data on which we test the systems. We regard these random baselines as a lower performance bound for this task.
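These baselines amount to flagging a fixed proportion of messages as Negative, for example as in the sketch below (the number of messages and the random seed are arbitrary choices of ours):

```python
import random

def random_baseline(n_messages: int, proportion: float, seed: int = 42):
    """Label each message 'Negative' with the given probability."""
    rng = random.Random(seed)
    return ["Negative" if rng.random() < proportion else "Other"
            for _ in range(n_messages)]

baseline_50 = random_baseline(2886, 0.50)   # roughly 50% of messages flagged
baseline_15 = random_baseline(2886, 0.15)   # roughly 15%, matching the Negative share
```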

Evaluation

We evaluate performance by means of ten-fold cross-validation on the strictly labeled data. In each of the folds, 90% of the strictly labeled data is used as training data, which are complemented with the laxly labeled data and/or the data labeled by one annotator, in three of the four training data variants. Performance is always tested on the strict data. As evaluation metrics we calculate the F1-score and the Area Under the ROC Curve (AUC) on predicting the negative stance towards vaccination in the test tweets.
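The evaluation procedure can be sketched as follows: the folds are created over the strict data only, the less reliable data are appended to the training portion of each fold, and the F1-score and AUC are computed for the Negative class. Variable names are hypothetical; `clf` is one of the classifiers configured above, the `X_*` objects are (sparse) feature matrices and the `y_*` objects are NumPy arrays of label strings.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, roc_auc_score

def cross_validate(clf, X_strict, y_strict, X_extra=None, y_extra=None):
    """Ten-fold CV on the strict data; extra data only ever joins the training set."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    f1s, aucs = [], []
    for train_idx, test_idx in skf.split(X_strict, y_strict):
        X_train, y_train = X_strict[train_idx], y_strict[train_idx]
        if X_extra is not None:
            X_train = vstack([X_train, X_extra])          # lax / single-annotator data
            y_train = np.concatenate([y_train, y_extra])  # never used for testing
        clf.fit(X_train, y_train)
        pred = clf.predict(X_strict[test_idx])
        proba = clf.predict_proba(X_strict[test_idx])
        neg_col = list(clf.classes_).index("Negative")
        y_true = (y_strict[test_idx] == "Negative").astype(int)
        f1s.append(f1_score(y_true, (pred == "Negative").astype(int)))
        aucs.append(roc_auc_score(y_true, proba[:, neg_col]))
    return float(np.mean(f1s)), float(np.mean(aucs))
```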

Results

With respect to the machine learning (ML) classifiers, we alternated three aspects of the system: the labels to train on, the composition of the training data and the ML algorithm. The results of all ML settings are presented in Table 5, as the F1-score and AUC of each setting on correctly predicting tweets with a negative stance. The rows of the table give the systems with specific combinations of ML classifier and training data composition; the four types of labelings are listed in the columns.

Table 5 Machine Learning performance of correctly predicting the label of tweets with a negative stance (Clf = Classifier, NB = Naive Bayes, SVM = Support Vector Machines, AUC = Area under the curve)

The results show a tendency for each of the three manipulations. Regarding the ML algorithm, SVM consistently outperforms Naive Bayes for this task. Furthermore, adding additional training data, albeit less reliable, generally improves performance. Training a model on all available data (strict + lax + one) leads to an improvement over using only the strict data, while adding only the laxly labeled data is generally better than using all data. Adding only the data labeled by one annotator often leads to a worse performance. With respect to the labeling, the Polarity + Sentiment labeling generally leads to the best outcomes, although the overall best outcome is yielded by training an SVM on the Polarity labeling with strict data appended by lax data, at an area under the curve score of 0.66.

Table 6 displays the performance of the best ML system (with an F1-score of 0.36 and an AUC of 0.66) in comparison to all other systems. The performance of the random baselines, with F1-scores of 0.18 (50%) and 0.13 (15%), indicates that the baseline performance on this task is rather low. The sentiment analysis systems yield better performance, at an F1-score of 0.20 for Pattern and 0.25 for Coosto. The scores of the best ML system are considerably higher. Nevertheless, there is room for improvement. The best precision that can be yielded by combining rule-based sentiment analysis with the best ML system (SVM trained on the Polarity labeling with strict data appended by lax data) is 0.34, while the best recall is 0.61.

Table 6 Performance of all systems on correctly predicting the label of tweets with a negative stance (for ML only the best performing system is displayed; Pr = Precision, Re = Recall, AUC = Area under the Curve)

To analyse the behavior of the best ML system, we present confusion tables of its classifications in Tables 7 (polarity labeling) and 8 (binary labeling). In the polarity predictions, the Irrelevant category is most often misclassified into one of the other categories, while the Positive and Negative categories are most often confused with one another. The classifier is possibly identifying features that denote a stance, but struggles to distinguish Positive from Negative. As for its performance on distinguishing the Negative label from any other label, Table 8 shows that the classifier mostly overshoots in its prediction of the Negative label, with 403 incorrect predictions, while the predictions of the Other category are mostly correct, with 182 predictions that were actually labeled as Negative.

Table 7 Confusion table of the classification of tweets in the best ML setting (SVM trained on Polarity labeling with strict data appended by lax data)
Table 8 Confusion table of the classification of tweets in the best ML setting (SVM trained on Polarity labeling with strict data appended by lax data), on the binary task of distinguishing negative tweets from any other tweet
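Confusion tables such as Tables 7 and 8 can be derived from the pooled cross-validation predictions; the sketch below illustrates this, where the short label arrays stand in for the collected gold and predicted labels.

```python
from sklearn.metrics import confusion_matrix

# Stand-ins for the pooled gold labels and system predictions.
y_true = ["Negative", "Positive", "Neutral", "Negative", "Irrelevant"]
y_pred = ["Negative", "Negative", "Neutral", "Positive", "Irrelevant"]

labels = ["Negative", "Positive", "Neutral", "Not clear", "Irrelevant"]
cm_polarity = confusion_matrix(y_true, y_pred, labels=labels)

# Binary view: Negative versus any other label.
to_binary = lambda ys: ["Negative" if y == "Negative" else "Other" for y in ys]
cm_binary = confusion_matrix(to_binary(y_true), to_binary(y_pred),
                             labels=["Negative", "Other"])
print(cm_binary)
```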

To gain insight into the potential of increasing the amount of training data, we applied the best ML system (an SVM trained on the strict and lax data with the polarity labels) to 10% of the strictly labeled data as a held-out test set, while training on increasingly large samples, from a small portion up to all available data (excluding the test data). The learning curve is presented in Fig. 1. It shows that performance still improves when the last portion of training data is added, indicating that more training data would likely yield better performance.

Fig. 1 Learning curve of the best ML system
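A learning curve like the one in Fig. 1 could be produced along the following lines (a sketch; `clf`, the feature matrices and the label arrays are the hypothetical objects from the earlier sketches, and AUC is used as the score):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def learning_curve_scores(clf, X_train, y_train, X_test, y_test, steps=10):
    """Train on increasingly large portions of the training data,
    always evaluating on the same held-out test set."""
    scores = []
    n = X_train.shape[0]
    for fraction in np.linspace(0.1, 1.0, steps):
        k = max(2, int(fraction * n))     # each slice must contain both classes
        clf.fit(X_train[:k], y_train[:k])
        proba = clf.predict_proba(X_test)
        neg_col = list(clf.classes_).index("Negative")
        y_true = (np.asarray(y_test) == "Negative").astype(int)
        scores.append((float(fraction), roc_auc_score(y_true, proba[:, neg_col])))
    return scores
```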

Comparison machine learning and rule-based sentiment analysis

Judging by the significantly increased precision or recall when combining ML and rule-based sentiment analysis in an ensemble system, the two approaches have a complementary view on tweets with a negative stance. To make this difference concrete, we present a selection of the messages predicted as Negative by either or both systems in Table 9. The first three are only predicted as Negative by the best ML system and not by Pattern, while the fourth to sixth examples are only seen as Negative by Pattern. Where the former give arguments (‘can not be compared...’, ‘kids are dying from it’) or take a stance (‘I’m opposed to...’), the latter examples display more intensified words and exclamations (‘that’s the message!!’, ‘Arrogant’, ‘horrific’) and aggression towards a person or organization. The last three tweets are seen by both systems as Negative. They are characterized by intensified words that link strongly to a negative stance towards vaccination (‘dangerous’, ‘suffering’, ‘get lost with your compulsory vaccination’).

Table 9 Examples of tweets that were classified by the best ML system and/or pattern as ‘Negative’ (for privacy reasons, user mentions are replaced with ‘@USER’)

Table 9 also features tweets that were predicted as Negative by neither the best ML system nor Pattern, representing the most difficult instances of the task. The first two tweets include markers that explicitly point to a negative stance, such as ‘not been proven’ and ‘vaccinating is nonsense’. The third tweet manifests a negative stance by means of the sarcastic phrase ‘way to go’ (English translation). The use of sarcasm, where typically positive words are used to convey a negative valence, complicates this task of stance prediction. The last tweet advocates an alternative to vaccination, which can implicitly be interpreted as a negative stance towards vaccination. Such implicitly packaged viewpoints also hamper the prediction of negative stance. Both sarcasm and implicit stance could be addressed by specific modules.

Improving recall or precision

For monitoring the number of Twitter messages over time that are negative towards vaccination, one could choose to do this at the highest (possible) precision or at the highest (possible) recall. There are pros and cons to both directions, and choosing among them depends on the goal for which the system output is used.

Opting for a high precision would make it feasible to obtain an overview of the dominant themes that are referred to in tweets with a negative stance towards vaccination, for example by extracting the most frequent topical words in this set. Although part of the negative tweets is missed when focusing on precision, a high precision means that one does not have to manually check all tweets to ensure that the dominant topics that are discussed are actually related to the negative stance. Thus, if the dashboard that provides an overview of the tweets with a negative stance towards vaccination is used as a rough overview of the themes that spur a negative stance, and to subsequently monitor those themes, high precision would be the aim. The disadvantage, however, is the uncertainty whether a novel topic or theme is discussed in the negative tweets that were not identified by the system. There is no way to find out, other than to manually check all tweets.

The main advantage of optimizing system recall for messages with a negative stance is that it reduces the set of messages that are possibly negative in a certain time frame to a manageable size, such that it can be processed manually by the human end user. Manually filtering all false positives (i.e. messages incorrectly flagged as Negative) from this set will lead to a more or less complete overview of the set of tweets that refer negatively to vaccination at any point in time. The false negatives (messages with a negative stance that are not detected) would still be missed, but a high recall ensures that these are reduced to a minimum. A high recall is thus to be preferred when the aim is to achieve a rather complete overview of all negative tweets over time, provided that there is time and personnel available to manually filter the tweets classified as Negative by the system. The manual effort is the main disadvantage of this procedure, making the usage of the dashboard more time-intensive. The ensemble system optimized for recall identifies 1,168 tweets as Negative from a total of 2,886 (40%), which is a rather large chunk to process manually. On the other hand, the manual labeling could additionally be used to retrain the classifier and improve its ability to identify tweets with a negative stance, which might reduce the future effort to be spent on manual labeling.

Apart from the use cases that should be catered for, another consideration in optimizing for precision or recall is the gain and loss in terms of actual performance. We set out to inspect the trade-off between precision and recall on the strict data in our study, when altering the prediction threshold for the Negative category of the best-performing SVM classifier. For any given instance, the SVM classifier estimates the probability of all categories it was trained on. It will predict the Negative category for an instance if its probability exceeds the probabilities of the other categories. This prediction can be altered by changing the threshold above which a tweet is classified as Negative; setting the threshold higher generally means that fewer instances will be predicted as Negative (corresponding to a higher precision), whereas setting it lower means that more instances will be predicted as such (corresponding to a higher recall). Thus, the balance between precision and recall can be set as desired, to favor one or the other. However, in many cases, changing the threshold will not lead to a (strong) increase in overall performance.
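This threshold manipulation can be sketched as follows, using the probability that the SVM assigns to the Negative class; `clf`, `X_test` and `y_test` are the hypothetical objects from the earlier sketches, and the chosen recall constraint is only an example.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Probability of the Negative class for each test instance.
neg_col = list(clf.classes_).index("Negative")
proba_negative = clf.predict_proba(X_test)[:, neg_col]
y_true = (np.asarray(y_test) == "Negative").astype(int)

precision, recall, thresholds = precision_recall_curve(y_true, proba_negative)

# Example: the highest threshold that still reaches a recall of at least 0.60,
# i.e. the most precise operating point under that recall constraint.
candidates = [t for t, r in zip(thresholds, recall[:-1]) if r >= 0.60]
chosen = max(candidates) if candidates else thresholds[0]
y_pred = (proba_negative >= chosen).astype(int)
```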

Figure 2 presents the balance between recall and precision as a result of predicting the Negative category with the best ML system, when the threshold for this category is altered from lowest to highest. Compared to the standard recall of 0.43 at a precision of 0.29 for this classifier, increasing the recall to 0.60 would lead to a drop in precision to 0.21. The F1-score would then decrease to 0.31. Compared to the recall-optimized ensemble system, with a recall of 0.61 and a precision of 0.18, altering the classifier prediction threshold is thus less detrimental to precision when yielding a similar recall. In contrast, a workable precision of 0.6 would combine with a rather low recall of around 0.05. Hence, with regard to the gain and loss in performance, we find that it would be more feasible in this domain to optimize for recall than to optimize for precision.

Fig. 2 Balance between precision and recall of predicting tweets with a negative stance when applying the best ML system, alternating the prediction threshold for this category

Discussion

We set out to automatically classify Twitter messages with a negative stance towards vaccination so as to arrive at actionable insights for vaccination campaigns. In comparison to the sentiment analysis that is currently often used in dashboard environments, our system based on machine learning yields a considerable improvement. Although the optimal F1-score of 0.36 leaves much room for improvement, we show that the recall can be optimized to 0.60, which makes it feasible to use the system for pre-selecting negative messages to be reviewed manually by the human end user.

With an F1-score of 0.36, our system lags behind the 0.75 F1-score reported by Du et al. [3]. Several factors might have influenced this difference. A first factor is the low proportion of tweets with the label ‘Negative’ in our dataset. In the strict labeling condition, only 343 cases are labeled as negative by two annotators, against 2,543 labeled as positive; the negative cases only comprise 13% of all instances. In the study of Du et al., the anti-vaccination category comprises 24% of all instances (1,445 tweets). More (reliable) examples might have helped in our study to train a better model of negative tweets. Secondly, Du et al. [3] focused on the English language domain, while we worked with Dutch Twitter messages. The Dutch Twitter realm harbors less data to study than the English one, and might bring forward different discussions when it comes to the topic of vaccination. It could be that the senders’ stance towards vaccination is more difficult to pinpoint within these discussions. In line with this language difference, a third prominent factor that might have led to a higher performance in the study of Du et al. [3] is that they focus on a particular case of vaccination (i.e. HPV vaccination) and split the anti-vaccination category into several more specific categories that describe the motivation of this stance. The diverse motivations for being against vaccination are indeed reflected in several other studies that focus on identifying discussion communities and viewpoints [18, 20, 22]. While splitting the data into more specific categories will lead to fewer examples per category, it could boost performance on predicting certain categories due to a larger homogeneity. Indeed, the most dominant negative category in the study by Du et al. [3], dubbed ‘NegSafety’ and occurring in 912 tweets (63% of all negative tweets), yielded the highest F1-score of 0.75. While two less frequent categories were predicted at an F1-score of 0.0, this outcome shows the benefit of breaking down the motivations behind a negative stance towards vaccination.

A major limitation of our study is that the agreement rates for all categorizations are low. This is also the case in other studies, such as [9], who report an agreement of K = 0.40 on polarity categorization. Foremost, this reflects the difficulty of the task. The way in which the stance towards vaccination is manifested in a tweet depends on the author, his or her specific viewpoint, the moment in time at which a tweet was posted, and the possible conversation thread that precedes it. Making a judgment based solely on the text could be difficult without this context. Agreement could possibly be improved by presenting the annotator with the preceding conversation as context to the text. Furthermore, tweets could be coded by more than two annotators. This would give insight into the subtleties of the data, with a graded scale ranging from tweets that clearly manifest a negative stance towards vaccination to tweets that merely hint at such a stance. Such a procedure could likewise help to generate more reliable examples to train a machine learning classifier.

The low agreement rates also indicate that measuring stance towards vaccination in tweets is too difficult a task to assign to a machine alone. We believe that the human-in-the-loop could be an important asset in any monitoring dashboard that focuses on stance in particular discussions. The system will have an important role in filtering the bigger stream of messages, ideally leaving the human with a manageable set of messages to sift through to end up with reliable statistics on the stance that is seen in the discussion at any point in time. In the section on improving recall or precision, we showed that lowering the prediction threshold can effectively increase recall at the cost of only a small loss of precision.

Our primary aim in future work is to improve performance. We did not experiment with different types of features in our current study. Word embeddings might help to include more semantics in our classifier’s model. In addition, domain knowledge could be added by including word lists, and different components might be combined to address different features of the data (e.g.: sarcasm and implicit stance). We also aim to divide the negative category into the specific motivations behind a negative stance towards vaccination, like in the study of Du et al. [3], so as to obtain more homogeneous categories. Parallel to this new categorization of the data, adding more labeled data appears to be the most effective way to improve our model. The learning curve that we present in Fig. 1 shows that no performance plateau is reached with the current size of the data. An active learning setting [31], starting with the current system, could be applied to select additional tweets to annotate. Such a setting could be incorporated in the practical scenario where a human-in-the-loop judges the messages that were flagged as displaying a negative stance by the system. The messages that are judged as correctly and incorrectly predicted could be added as additional reliable training data to improve the model. We have installed a dashboard that is tailored to such a procedure, starting with the machine learning system that yielded the best performance in our current study.

Conclusions

We set out to train a classifier to distinguish Twitter messages that display a negative stance towards vaccination from other messages that discuss the topic of vaccination. Based on a set of 8,259 tweets that mention a vaccination-related keyword, annotated for their relevance, stance and sentiment, we tested a multitude of machine learning classifiers, alternating the algorithm, the reliability of training data and the labels to train on. The best performance, with a precision of 0.29, a recall of 0.43, an F1-score of 0.36 and an AUC of 0.66, was yielded by training an SVM classifier on strictly and laxly labeled data to distinguish irrelevant tweets and polarity categories. Sentiment analysis, with an optimal F1-score of 0.25, was considerably outperformed. The latter shows the benefit of machine-learned classifiers for domain-specific sentiment: despite being trained on a reasonably small amount of data, the machine learning approach outperforms general-purpose sentiment analysis tools.

Availability and requirements

Project name: Prikbord
Project home page: http://prikbord.science.ru.nl/
Operating system: Linux
Programming language: Python, JavaScript
Other requirements: Django 1.5.11 or higher, MongoDB 2.6.10, pymongo 2.7.2 or higher, requests 2.13.0 or higher
License: GNU GPL
Any restrictions to use by non-academics: licence needed