Stance Detection Benchmark: How Robust is Your Stance Detection?

Stance detection (StD) aims to detect an author's stance towards a certain topic and has become a key component in applications like fake news detection, claim validation, or argument search. However, while stance is easily detected by humans, machine learning (ML) models clearly fall short of this task. Given the major differences in dataset sizes and framings of StD (e.g. the number of classes and inputs), ML models trained on a single dataset usually generalize poorly to other domains. Hence, we introduce a StD benchmark that allows comparing ML models against a wide variety of heterogeneous StD datasets to evaluate them for generalizability and robustness. Moreover, the framework is designed for easy integration of new datasets and probing methods for robustness. Amongst several baseline models, we define a model that learns from all ten StD datasets of various domains in a multi-dataset learning (MDL) setting and present new state-of-the-art results on five of the datasets. Yet, the models still perform well below human capabilities, and even simple perturbations of the original test samples (adversarial attacks) severely hurt the performance of MDL models. Deeper investigation suggests overfitting on dataset biases as the main reason for the decreased robustness. Our analysis emphasizes the need to focus on robustness and de-biasing strategies in multi-task learning approaches. To foster research on this important topic, we release the dataset splits, code, and fine-tuned weights.


Introduction
Stance Detection (StD) is a well-established task in natural language processing and is often described by two inputs: (1) a topic of a discussion and (2) a comment made by an author. Given these two inputs, the aim is to find out whether the author is in favor of or against the topic. For instance, in SemEval-2016 Task 6 (Mohammad et al., 2016), the second input is a short tweet and the goal is to detect whether the author has made a positive or negative comment towards a given controversial topic:

Topic: Climate Change is a Real Concern
Tweet: Gone are the days where we would get temperatures of Min -2 and Max 5 in Cape Town #SemST
Stance: FAVOR

The task has a long tradition in the domain of political and ideological online debates (Mohammad et al., 2016; Walker et al., 2012a; Somasundaran and Wiebe, 2010; Thomas et al., 2006). In recent years, it has been brought into the focus of attention by the rising debates around fake news, where StD is an important pre-processing step (Pomerleau and Rao, 2017; Derczynski et al., 2017; Ferreira and Vlachos, 2016), as well as by other downstream tasks like argument search (Stab et al., 2018) and claim validation (Popat et al., 2017). As such, high performance in StD is a crucial step in successfully leveraging machine learning (ML) for argumentative information retrieval and fake news detection.
However, while humans are quite capable of assessing correct stances, ML models often fall short at this task (see Table 1). As there are numerous domains to which StD can be applied, definitions of this task vary considerably. For instance, the first input can be a short topic, a claim, or sometimes is not given at all, while the second input can be another claim, a piece of evidence, or even a full argument. Further, the second input can differ in length between a sentence, a short paragraph, and a whole document. The number of classes can also vary between 2-class problems (e.g. for/against) and more fine-grained 4-class problems (e.g. comment/support/query/deny). Moreover, the number of samples varies drastically between datasets (for our setup: from 2,394 to 75,385). While these differences are problematic for cross-domain performance, they can also be seen as an advantage, as they result in an abundance of datasets from different domains that can be integrated into transfer or multi-task learning approaches. Yet, given the decent human performance on this task, it is hard to grasp why ML models fall short of StD, while they are almost on par with humans at related tasks like Sentiment Analysis and Natural Language Inference (NLI). Within this work, we provide foundations for answering this question. We empirically assess whether the abundance of differently framed StD datasets from multiple domains can be leveraged by looking at them in a holistic way, i.e. training and evaluating them collectively in a multi-task fashion. However, as we only have one task but multiple datasets, we henceforth call this setting multi-dataset learning (MDL). And indeed, our model profits significantly from datasets of the same task via MDL with +4 percentage points (pp) on average, as well as from related tasks via transfer learning (TL) with +3.4pp on average.
However, while we gain significant performance improvements for StD by using TL and MDL, the expected robustness of these approaches is missing. We show this using a modified version of the Resilience score by Thorne et al. (2019), which reveals that TL and MDL models are even less robust than single-dataset learning (SDL) models. We investigate this phenomenon through low resource experiments and observe that less training data leads to an improved robustness for the MDL models, narrowing the gap to the SDL models. We thus assume that the lower robustness stems from dataset biases introduced by the vast amount of available training data for the MDL models, leading to overfitting. Consequently, adversarial attacks that target such biases have a more severe impact on models that had more biased training data and overfitted on these biases.
The contributions of this paper are as follows: (1) To the best of our knowledge, we are the first to combine learning from related tasks (via TL) and MDL, designed to capture all facets of StD tasks, and achieve new state-of-the-art results on five of ten datasets. (2) In an in-depth analysis with adversarial attacks, we show that TL and MDL for StD generally improve the performance of ML models, but also drastically reduce their robustness compared to SDL models. (3) To foster the analysis of this task, we publish the full benchmark system, including model training and evaluation, as well as the means to add and evaluate adversarial attack sets and low resource experiments. All datasets, the fine-tuned models, and the machine translation models can be automatically downloaded and preprocessed for consistent future usage.

Related Work
Stance Detection is a well-established task in natural language processing. Initial work focused on parliamentary debates (Thomas et al., 2006) and debating portals (Somasundaran and Wiebe, 2010), whereas more recent work has shifted to the domain of social media, where several shared tasks have been introduced (Gorrell et al., 2019; Derczynski et al., 2017; Mohammad et al., 2016). With the shift in domains, the definition of the task also shifted: more classes were added (e.g. query (Gorrell et al., 2019) or unrelated (Pomerleau and Rao, 2017)), the number of inputs changed (e.g. multiple topics for each sample (Sobhani et al., 2017)), as did the definition of the inputs themselves (e.g. from parliamentary speeches and debate portal posts to tweets (Gorrell et al., 2019), news articles (Pomerleau and Rao, 2017), or argument components (Stab et al., 2018; Bar-Haim et al., 2017)). In past years, the problem of StD has become a cornerstone for many downstream tasks like fake news detection (Pomerleau and Rao, 2017), claim validation (Popat et al., 2017), and argument search (Stab et al., 2018). Yet, recent work mainly focuses on individual datasets and domains. We, in contrast, concentrate on a higher level of abstraction by aggregating datasets of different domains and definitions to analyze them in a holistic way. To do so, we leverage the ideas of TL and multi-task learning (in the form of MDL), as they have not only shown increases in performance and robustness (Ruder, 2017; Weiss et al., 2016), but also significant support in low resource scenarios (Schulz et al., 2018). The latest frameworks for multi-task learning include the one by Liu et al. (2019), which scored a new state-of-the-art on the GLUE Benchmark (Wang et al., 2018a). In contrast to their work, we use their framework for MDL, i.e. we combine only datasets of the same task to analyze whether StD datasets can benefit from each other by transferring knowledge about their domains. Furthermore, we probe the robustness of the learned models to analyze whether performance increases gained through TL and MDL are in accordance with increased robustness for StD.

Adversarial attacks describe test sets aimed at discovering possible weak points of ML models. While much recent work on adversarial attacks aims to break NLI systems and is especially adapted to this problem (Glockner et al., 2018; Minervini and Riedel, 2018), such stress tests have been applied to a wide range of tasks from Question Answering (Wang and Bansal, 2018) to Neural Machine Translation (Belinkov and Bisk, 2017) and Fact Checking (Thorne et al., 2019). Unfortunately, preserving the semantics of a sentence while automatically generating these adversarial attacks is difficult, which is why some works have defined small stress tests manually (Isabelle et al., 2017; Mahler et al., 2017). As this is time (and money) consuming, other work has defined heuristics with controllable outcomes to modify existing datasets while preserving the semantics of the data (Naik et al., 2018). In contrast to previous work, we use and analyze some of these attacks for the task of StD to probe the robustness of our SDL and MDL models.

Stance Detection Benchmark: Setup and Experiments
We describe the datasets and models we use for the benchmark, the experimental setting, and the results of our experiments. For all experiments, we use and adapt the framework provided by Liu et al. (2019).

Datasets
We choose ten StD datasets from five different domains to represent a rich environment of different facets of StD. Datasets within one domain may still vary in their number of classes and sample sizes. All datasets are shown with an example and their domain in Table 2. In addition, Table 3 displays the split sizes and the class distributions of each dataset. All code to preprocess and split the datasets is available online. In the following, all datasets are introduced.

arc We take the version of the Argument Reasoning Corpus (Habernal et al., 2018) that was modified for StD by Hanselowski et al. (2018). A sample consists of a claim crafted by a crowdworker and a user post from a debating forum.

argmin The UKP Sentential Argument Mining Corpus (Stab et al., 2018) originally contains topic-sentence pairs labelled as argument for, argument against, and no argument. We remove all non-arguments and simplify the original split: we train on the data of five topics, develop on the data of one topic, and test on the data of two topics.

fnc1 The Fake News Challenge dataset (Pomerleau and Rao, 2017) contains pairs of headlines (claims) and news article bodies, labelled with agree, disagree, discuss, or unrelated.

ibmcs The IBM Debater Claim Stance dataset (Bar-Haim et al., 2017) contains topic-claim pairs. The topics are gathered from a debating database; the claims were manually collected from Wikipedia articles. We take the pre-defined train and test split and split an additional 10% off the train set for development.

perspectrum The PERSPECTRUM dataset (Chen et al., 2019) contains pairs of claims and related perspectives, which were gathered from debating websites. We only take the data they defined for the StD task in their work and keep the exact split.

semeval2016t6 The SemEval-2016 Task 6 corpus (Mohammad et al., 2016) contains tweets paired with controversial topics (see the example in the introduction).

semeval2019t7 The SemEval-2019 Task 7 dataset (Gorrell et al., 2019) contains rumours from reddit posts and tweets towards a variety of incidents like the Ferguson Unrest or the Germanwings crash. Similar to the scd dataset, the topics are not part of the actual dataset.

snopes The Snopes corpus (Hanselowski et al., 2019) contains data from a fact-checking website documenting (amongst others) rumours, evidence texts gathered by fact-checkers, and the documents from which the evidence originates. Besides labels for automatic fact-checking of the rumours, the corpus also contains stance annotations towards the rumours for some evidence sentences. We extract these pairs and generate a new data split.

Models
We experiment on all datasets in an SDL setup, i.e. training and testing on all datasets individually, and in an MDL setup, i.e. training on all ten StD datasets jointly.For this, we use the framework by Liu et al. (2019), as it provides the means to do both SDL and MDL.The SDL is based on the BERT architecture (Devlin et al., 2018) and simply adds a dense layer on top for the classification.
The MDL is also based on the BERT architecture, but each dataset has its own dataset-specific dense layer on top. While the layers of the BERT architecture are shared between all datasets, each of these dense layers is only trained on its respective dataset. All datasets are batched and fed through the architecture in a random order. As initial weights for SDL and MDL, we use either the pre-trained BERT (large, uncased) weights by Devlin et al. (2018) or the MT-DNN (large, uncased) weights by Liu et al. (2019). The latter uses the BERT weights and is fine-tuned on all datasets of the GLUE Benchmark (Wang et al., 2018a). By using the MT-DNN, we transfer knowledge from all datasets of the GLUE Benchmark to our models, i.e. we apply TL in the form of pre-training. Henceforth, we use SDL and MDL to refer to the model architecture, and BERT and MT-DNN to refer to the pre-trained weights of the model architecture. This leaves us with four combinations of models: BERT_SDL, BERT_MDL, MT-DNN_SDL, and MT-DNN_MDL (see Figure 1).
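The shared-encoder, dataset-specific-heads setup can be sketched as follows. This is a minimal PyTorch illustration, not our released implementation: the toy embedding encoder stands in for the pre-trained BERT, and all names and sizes are made up.

```python
import torch
import torch.nn as nn

class MultiDatasetModel(nn.Module):
    """Shared encoder with one classification head per dataset (MDL sketch)."""
    def __init__(self, encoder, hidden_size, num_classes_per_dataset):
        super().__init__()
        self.encoder = encoder  # shared across all datasets (BERT in our setup)
        self.heads = nn.ModuleDict({
            name: nn.Linear(hidden_size, n)
            for name, n in num_classes_per_dataset.items()
        })

    def forward(self, token_ids, dataset):
        # Pool token representations into one vector per sample, then apply
        # the dense layer belonging to the batch's dataset.
        hidden = self.encoder(token_ids)   # (batch, seq, hidden)
        pooled = hidden.mean(dim=1)        # (batch, hidden)
        return self.heads[dataset](pooled) # (batch, num_classes)

# Tiny stand-in encoder; the real model loads BERT or MT-DNN weights instead.
toy_encoder = nn.Embedding(1000, 32)
model = MultiDatasetModel(toy_encoder, 32, {"fnc1": 4, "argmin": 2})

batch = torch.randint(0, 1000, (8, 20))   # 8 samples of 20 token ids each
logits = model(batch, dataset="fnc1")     # shape: (8, 4)
```

During training, batches from all datasets are shuffled together, and each batch only updates the shared encoder plus its own dataset's head.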

Experimental Setting
For all experiments in this section, we set the batch size to 16 and the number of epochs to 5, and we truncate each input to 100 sub-words due to hardware limitations. Preliminary tests with the fnc1 dataset, which contains documents as one of the inputs, showed a minor drop in F1-macro of less than 2pp when reducing the sequence length from 300 to 100. To compensate for variations in the results, we train over five different fixed seeds and report the averaged results. We run all experiments on a Tesla P100 with 16 GB of memory. One epoch with all ten datasets takes around 1.5h. We use the splits for training, development, and testing as shown in Table 3. The table also lists the classes and class distribution for each dataset. We use the macro-averaged F1 score (F1-macro) as a general metric, since the class balance for most datasets is skewed. The dataset training sizes vary from approx. 42,500 down to 935 samples.

Results
We report the results of all models and datasets in Table 4. The last column shows the averaged F1-macro for each row. We make three observations: (1) TL from related tasks improves the overall performance, (2) MDL with datasets from the same task has an even larger positive impact, and (3) TL, followed by MDL, can further improve on the individual gains of (1) and (2). We show (1) by comparing the models BERT_SDL and MT-DNN_SDL, where a gain of 3.4pp due to TL from the GLUE datasets can be observed. While some datasets show a drop in performance, the average performance increases. We show (2) by comparing BERT_SDL to BERT_MDL (+4pp) and MT-DNN_SDL to MT-DNN_MDL (+1.8pp). The former comparison indicates that learning from similar datasets (i.e. MDL) has a higher impact than TL for StD. The latter comparison leads to observation (3): combining TL from related tasks (+3.4pp) and MDL on the same task (+4pp) can result in considerable performance gains (+5.1pp). However, as the individual gains from TL and MDL do not add up, this also indicates an information overlap between the datasets of the GLUE benchmark and the StD datasets. Lastly, while BERT_SDL already outperforms five out of six state-of-the-art results, our BERT_MDL and MT-DNN_MDL add significant performance increases on top.

Analysis
As the robustness of an ML model is crucial if it is applied to other domains or in downstream applications, we analyze this property in more detail. First, we define adversarial attacks to probe for weaknesses in the models. Second, we investigate the reasons for the detected weaknesses and a surprising anomaly in robustness between SDL and MDL models.

Adversarial Attacks: Definition
We investigate how robust the trained models are and whether TL from related tasks and MDL influence this property. Inspired by stress tests for NLI, we select three adversarial attacks to probe the robustness of the models and modify all samples of all test sets with the following configurations:

Paraphrase We paraphrase all samples of the test sets. For this, we lean on the work of Mallinson et al. (2017) and train two machine translation models with OpenNMT (Klein et al., 2017): one that translates English originals to German and another one that back-translates.

Spelling Spelling errors are quite common, especially in data from social media or debating forums. We add two errors to each input of a sample (Naik et al., 2018): (1) we swap two letters of a random word and (2) for a different word, we substitute a letter with another letter close to it on the keyboard. We only consider words with at least four letters, as shorter ones are mostly stopwords.

Negation We use the negation stress test proposed by Naik et al. (2018). They add the tautology "and false is not true" after each sentence, as they suspect that models might be confused by strong negation words like "not". We assume the same holds for StD. We add the tautology at the beginning of each sentence, since we truncate all inputs to a maximum length of 100 sub-words.
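The spelling and negation perturbations can be sketched as follows. This is an illustrative Python version of the rules above; the keyboard adjacency map is a made-up partial example, not the exact table used in our code.

```python
import random

KEYBOARD_NEIGHBORS = {  # partial QWERTY adjacency, illustration only
    "a": "sq", "e": "wr", "i": "uo", "o": "ip", "s": "ad", "t": "ry",
}

def swap_letters(word):
    """Swap two adjacent letters at a random position."""
    i = random.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def keyboard_substitute(word):
    """Replace one letter with a neighboring key, if one is known."""
    candidates = [i for i, c in enumerate(word) if c.lower() in KEYBOARD_NEIGHBORS]
    if not candidates:
        return word
    i = random.choice(candidates)
    chars = list(word)
    chars[i] = random.choice(KEYBOARD_NEIGHBORS[chars[i].lower()])
    return "".join(chars)

def spelling_attack(text):
    """Perturb two distinct words of >= 4 letters (one swap, one substitution)."""
    words = text.split()
    long_idx = [i for i, w in enumerate(words) if len(w) >= 4]
    if len(long_idx) < 2:
        return text
    i, j = random.sample(long_idx, 2)
    words[i] = swap_letters(words[i])
    words[j] = keyboard_substitute(words[j])
    return " ".join(words)

def negation_attack(text):
    """Prepend the tautology (prepended because inputs are truncated)."""
    return "and false is not true " + text

random.seed(0)
print(spelling_attack("School day should be extended for parents"))
print(negation_attack("School day should be extended for parents"))
```

Both perturbations preserve the word count of the input, which keeps the comparison to the original test sets straightforward.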
To measure the effectiveness of each adversarial attack a ∈ A, we calculate the potency score introduced by Thorne et al. (2019) as the average reduction from a perfect score across the systems s ∈ S:

Potency(a) = c_a · (1/|S|) · Σ_{s ∈ S} (1 − f(s, a))

with c_a being the ratio of correctly transformed samples (test to adversarial) and f a function that returns the performance score for a system s on an adversarial attack set a.
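A minimal sketch of this computation (function and variable names are illustrative):

```python
def potency(correct_rate, scores):
    """Potency of an attack a: the correct rate c_a times the average
    reduction from a perfect score across all evaluated systems s in S."""
    reduction = sum(1.0 - f for f in scores) / len(scores)
    return correct_rate * reduction

# Raw potency assumes every adversarial sample is correct (c_a = 1.0).
print(potency(1.0, [0.8, 0.6]))   # average reduction of 0.2 and 0.4 -> ~0.3
print(potency(0.63, [0.8, 0.6]))  # scaled down by the observed correct rate
```

An attack thus only scores high if it both hurts the systems and keeps most of its transformed samples semantically valid.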
The correct rate c_a is calculated by taking 25 randomly selected samples from all test sets and comparing them to their adversarial counterparts. For the paraphrase attack, the first author checked whether the paraphrased and original sentences are semantically equal. We find that this is the case for 63% of the samples. This low result is mostly due to the three outlier datasets fnc1 (36%), snopes (36%), and arc (44%). Leaving out these three, 82% of the sentences are semantically correct paraphrases. As the changes introduced by the spelling attack are minor and subjective to evaluate, we use the Flesch-Kincaid grade level (Kincaid et al., 1975) to compare the readability of the original and adversarial sentences and label a sample as incorrectly transformed if the readability of the adversarial sentence requires a higher U.S. grade level. For the negation attack samples, we assume a correctness of 100% (c_a = 1.0), as the perturbation adds a tautology and the semantics and grammar are preserved.
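The readability check can be sketched as follows. The Flesch-Kincaid grade formula is standard; the naive vowel-group syllable counter is an illustrative simplification, not necessarily the implementation we used.

```python
import re

def count_syllables(word):
    """Naive syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

def attack_is_correct(original, perturbed):
    """Label a spelling-attack sample as correct only if the perturbed
    sentence does not require a higher U.S. grade level to read."""
    return flesch_kincaid_grade(perturbed) <= flesch_kincaid_grade(original)
```

The heuristic flags a perturbed sample as incorrect whenever the typos push the estimated grade level above that of the original sentence.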

Adversarial Attacks: Results and Discussion
We limit the compared systems to BERT_SDL and MT-DNN_MDL, as the latter uses both TL from related tasks and MDL, whereas the former uses neither. The potencies for all attack sets are shown in Table 5 and ranked by the raw potency, which assumes all adversarial samples to be correct (i.e. c_a = 1). The results on the adversarial attack sets for both the SDL and MDL model are shown in Table 6.
The paraphrasing attack has the lowest raw potency of all adversarial sets, and the average scores only drop by about 2.8-4.7%. Interestingly, on the datasets that turned out to be difficult to paraphrase (fnc1, arc, snopes), the score of the MT-DNN_MDL only drops by about 5.7%, 6.4%, and 6.5% (see Appendix, Table 9), which is not much below average. This confirms the finding of Niven and Kao (2019) that the BERT architecture, despite contextualized word embeddings, primarily focuses on certain cue words, and that the semantics of the whole sentence is not the main criterion.
With raw potencies of 41.1% and 43.3%, the negation and spelling attacks have the highest negative influence on both SDL and MDL (4.3% to 13.9% performance loss). We assume this to be another indicator that the models rely on certain key words and fail if the statistical occurrence of these words in the seen samples is changed. This is easy to see for the negation attack, as it adds a strong negation word. For the spelling attack, we look at the following original example from the perspectrum dataset:

Claim: School Day Should Be Extended
Perspective: So much easier for parents!
Predict/Gold: support/support

And the same example as spelling attack:

Claim: School Day Sohuld Be Ectended
Perspective: So much esaier for oarents!
Predict/Gold: undermine/support

Since all words of the original sample are in the vocabulary, Google's sub-word implementation WordPiece (Wu et al., 2016) does not split the tokens into sub-words. However, this is different for the perturbed sentence, as, for instance, the tokens "esaier" and "oarents" are not in the vocabulary. Hence, we get [esa, ##ier] and [o, ##are, ##nts]. These pieces do not carry the same meaning as before the perturbation, and the model has not learned to handle them.
However, the most surprising observation is the much higher relative drop in scores between the test and adversarial attack sets for MT-DNN_MDL as compared to BERT_SDL. MDL should produce more robust models and support them in handling at least some of these attacks, as some of the datasets originate from social media and debating forums, where typos and other errors are quite common. On top of that, the model sees many more samples and should be more robust to paraphrased sentences. Hence, to further evaluate the robustness of the two systems, we leverage the resilience measure introduced by Thorne et al. (2019): it defines the robustness of a model against all adversarial attacks, scaled by the correctness of the attack sets. Surprisingly, the resilience of the MDL (59.9%) and SDL (58.5%) models is almost on par. The score, however, only considers the absolute performance on the adversarial sets, not the drop in performance when compared to the original test set. We therefore use a relative variant, Resilience_rel, based on each system's performance relative to its score on the unperturbed test set. We calculate the score for all adversarial attacks separately, as well as the overall Resilience_rel, and observe that the SDL model outperforms the MDL model in each case (see Table 7). For some datasets, the absolute F1-macro of the MDL model even drops below that of the SDL model (see Appendix, Table 9). Our experiments show that, performance-wise, we can benefit from MDL, but there is a high risk of a drastic loss in robustness, which can cancel out the performance gains or, even worse, render the model inferior in real-world scenarios.
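Both measures can be sketched as follows. This is a simplified reading of the scores: the relative variant shown here assumes each attack score is divided by the system's unperturbed test-set score, and the exact weighting in our released code may differ.

```python
def resilience(correct_rates, scores):
    """Resilience of a system s: correctness-weighted average performance
    across all adversarial attack sets (Thorne et al., 2019)."""
    total = sum(correct_rates)
    return sum(c * f for c, f in zip(correct_rates, scores)) / total

def resilience_rel(correct_rates, attack_scores, test_scores):
    """Relative variant: score each attack by the fraction of the original
    test-set performance that the system retains under the attack."""
    relative = [a / t for a, t in zip(attack_scores, test_scores)]
    return resilience(correct_rates, relative)
```

The relative variant is what exposes the anomaly: a model with high absolute scores can still lose a large fraction of its original performance under attack.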

Analysis of Robustness via Low Resource Experiments
To investigate the reasons why the MDL model shows a lower robustness than the SDL models on average, we conduct low resource experiments by training the MDL model and the SDL models on 10, 30, and 70% of the available training data.Dev and test sets are kept at 100% of the available data at all times and results are averaged over five seeds.
As is to be expected, the performance gap between BERT_SDL and MT-DNN_MDL on the test set grows with less training data (see Table 8). Here, the MDL shows its strength in low resource setups (Schulz et al., 2018). Even more so, while the MDL model showed discouraging performance with respect to the adversarial attacks when trained on 100% of the data, we observe that with less training data, the MT-DNN_MDL reduces the difference in overall Resilience_rel to the BERT_SDL from 3.8pp at 100% training data to 1.5pp at 10% training data (see Figure 2b). As shown in Figure 2a, this is due to the MT-DNN_MDL approaching the Resilience_rel of the BERT_SDL against the negation and paraphrase attacks.
Our analysis reveals that the amount of training data has a direct negative impact on model robustness. As most (if not all) datasets inevitably inherit the biases of their annotators (Geva et al., 2019), we assume this negative impact on robustness is due to overfitting on biases in the training data. Hence, less training data leads to less overfitting on these biases, which in turn leads to a higher robustness towards attacks that target these biases. For instance, the word "not" in the negation attack can be a bias that is associated with negative class labels (Niven and Kao, 2019). Likewise, an overall shift in the distribution of some words due to the paraphrase attack can interfere with a learned bias. We argue that spelling mistakes are unlikely to be learned as a bias for stance detection classes and that the actual reason for the performance drop of this attack is the split of ungrammatical tokens into several sub-words (see section 4.2).
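Such a lexical bias can be made visible with a simple cue-word statistic in the spirit of Niven and Kao (2019). The sketch below uses made-up toy samples; a real analysis would run over a full training split.

```python
from collections import Counter

def cue_word_distribution(samples, cue="not"):
    """For each stance label, compute the fraction of samples containing the
    cue word. A strongly skewed distribution hints at a dataset bias that a
    model may exploit instead of learning the actual task."""
    hits = Counter()
    totals = Counter()
    for text, label in samples:
        totals[label] += 1
        if cue in text.lower().split():
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}

# Toy samples, illustration only.
toy = [
    ("this is not a good idea", "against"),
    ("we should not allow this", "against"),
    ("a great proposal", "favor"),
    ("i fully support this", "favor"),
]
print(cue_word_distribution(toy))  # -> {'against': 1.0, 'favor': 0.0}
```

If a cue word appears almost exclusively with one label, an attack that injects or removes that word (like the negation attack) will hit a model that has overfitted on the correlation particularly hard.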

Discussion and Future Work
We introduced a StD benchmark system that combines TL and MDL and enables adding and evaluating adversarial attack sets and low resource experiments. We include ten StD datasets of different domains in the benchmark and found the combination of TL and MDL to have a significant positive impact on performance. On five of the ten datasets, we are able to show new state-of-the-art results. However, our analysis with three adversarial attacks reveals that, contrary to what is expected of TL and MDL, they result in a severe loss of robustness on our StD datasets, with scores often dropping well below SDL performance. We investigate the reasons for this observation by conducting low resource experiments and conclude that one major issue is overfitting on the biases of vast amounts of training data in our MDL approach.
Reducing the amount of training data for both SDL and MDL models narrows the robustness anomaly between these two setups, but also lowers the test set performance. Hence, we recommend developing methods that integrate de-biasing strategies into multi-task learning approaches, for instance by letting the models learn which samples contain biases and should be penalized or ignored (Clark et al., 2019) to enhance robustness, thus also being able to leverage more (or all) available training data to maintain performance. We foster this work by publishing our dataset splits, models, and experimental code.
In the future, we plan to combine methods that cope with biased data (Clark et al., 2019; He et al., 2019) with MDL and to experiment with sampling methods that aim to reduce the training data to the samples necessary to learn the task (Prabhu et al., 2019; Ruder and Plank, 2017). With regard to adversarial attacks, we also aim to concentrate on task-specific adversarial attacks and to use insights from adversarial attacks to build defences for the models (Pruthi et al., 2019; Wang et al., 2018b).

Figure 1: Models and their relations. Arrows symbolize training; their labels state the used training data.
Difference in Resilience_rel between BERT_SDL and MT-DNN_MDL for all train data ratios.

Figure 2: Resilience_rel over different train data ratios.

Table 2: All datasets, grouped by domain and with examples. Topics in parentheses signal implicit information.

Table 3: Splits, classes, and class distributions for all used datasets.

Table 5: Potency of all adversarial attacks.

Table 6: Influence of adversarial attacks, averaged over all datasets, on the BERT_SDL and MT-DNN_MDL models (in F1-macro and relative to the score on the test set).

Table 7: Resilience_rel of BERT_SDL and MT-DNN_MDL.

Table 8: Train data ratio performance on the test set.