Are Authorities Denying or Supporting? Detecting Stance of Authorities Towards Rumors in Twitter

Several studies examined leveraging the stance in conversational threads or news articles as a signal for rumor verification. However, none of these studies leveraged the stance of trusted authorities. In this work, we define the task of detecting the stance of authorities towards rumors in Twitter, i.e., whether a tweet from an authority supports the rumor, denies it, or neither. We believe the task is useful to augment the sources of evidence exploited by existing rumor verification models. We construct and release the first Authority STance towards Rumors (AuSTR) dataset, where evidence is retrieved from authority timelines in Arabic Twitter. The collection comprises 811 (rumor tweet, authority tweet) pairs relevant to 292 unique rumors. Due to the relatively limited size of our dataset, we explore the adequacy of existing Arabic datasets of stance towards claims for training BERT-based models for our task, and the effect of augmenting AuSTR with those datasets. Our experiments show that, despite its limited size, a model trained solely on AuSTR with a class-balanced focal loss exhibits performance comparable to the best studied combination of existing datasets augmented with AuSTR, achieving 0.84 macro-F1 and 0.78 F1 on debunking tweets. The results indicate that AuSTR can be sufficient for our task without the need to augment it with existing stance datasets. Finally, we conduct a thorough failure analysis to gain insights into future directions for the task.

This article presents a major extension of previous work published at ECIR 2023 [1].
Extensions include (1) expanding the dataset by doubling the number of examples, (2) proposing a new semi-automated approach for collecting the data, (3) studying the usefulness of two more Arabic stance datasets, (4) using in-domain data for training the models, (5) fine-tuning our BERT models over different hyper-parameters, and (6) investigating various loss functions to alleviate the class-imbalance issue.


Introduction
Social media platforms (e.g., Twitter) have become a medium for rapidly spreading rumors along with emerging events [2]. Those rumors may have a lasting effect on users' opinions even after they are debunked, and may continue to influence users if not replaced with convincing evidence [3]. Existing studies for rumor verification in social media exploited the propagation networks as a source of evidence, focusing on the stance of replies [4][5][6][7][8][9], the structure of replies [10][11][12][13][14][15], and the profile features of retweeters [16]. Recently, Dougrez-Lewis et al. [17] proposed augmenting the propagation networks with evidence from the Web, and Hu et al. [18] proposed exploiting both text and images retrieved from the Web as sources of evidence. A large body of existing studies in the broader literature has examined exploiting the stance of conversational threads [19,20] or news articles [21,22] towards claims as a signal for verification.
However, to our knowledge, no previous research has investigated exploiting evidence from the timelines of trusted authorities for rumor verification in social media. An authority is an entity with the real knowledge or power to verify or deny a specific rumor [1,23]. Therefore, we believe that detecting the stance of relevant authorities towards rumors can be a great asset to augment the sources of evidence utilized by existing rumor verification systems. It can also serve as a valuable tool for fact-checkers to automate their process of verifying rumors with authorities.
In this work, we address the problem of detecting the stance of authorities towards rumors in Twitter, defined as follows: given a rumor expressed in a tweet and a tweet posted by an authority of that rumor, detect whether the authority tweet supports (agrees with) the rumor, denies (disagrees with) it, or neither (other). Figure 1 presents our perception of the role of detecting the stance of authorities in a typical pipeline of rumor verification over Twitter. Given a rumor expressed in a tweet, both the reply thread and the corresponding authority Twitter accounts are retrieved. The reply structure, the reply stance, and the authorities' stance, in addition to other potential signals, are then exploited by the rumor verification model to decide the veracity of the rumor. In our work, we assume that the authorities for a given rumor are already retrieved [23], and we only target the detection of the stance of those authorities towards the rumors. In particular, our model is supposed to do so over the tweet timelines of the corresponding retrieved authorities. While being a very important source of evidence for rumor verification, it is worth mentioning that the stance of authorities can complement other sources, especially when authorities are automatically retrieved and thus not fully accurate.
A closer look at the literature on Arabic rumor verification in Twitter in particular reveals that utilizing such signals for verification is under-explored; most existing studies relied on the tweet textual content to detect its veracity [24][25][26][27][28][29]. Some notable exceptions are the work of Albalawi et al. [30], who exploited the images and videos embedded in the tweet, the study of Haouari et al. [14], who used the reply thread structure and reply network signals, and the work of Althabiti et al. [31], who proposed detecting sarcasm and hate speech in the replies for Arabic rumor verification in Twitter.
To fill this literature gap, we first introduce the problem of detecting the stance of authorities towards rumors in Twitter. We then construct the first dataset for the task, and release it along with its construction guidelines to facilitate future research. Moreover, we investigate the usefulness of existing Arabic datasets of stance towards claims for our task. Finally, we explore mitigating the traditional class-imbalance issue in stance datasets by experimenting with various loss functions. Our experiments show that training a model solely on our dataset, despite its relatively small size, exhibits a performance that is (at least) on par with training on other (combinations of) existing stance datasets, indicating that existing stance datasets are not really needed for the task. The contributions of this paper are as follows:
1. We introduce and define the task of detecting the stance of authorities towards rumors propagating in Twitter.
2. We release the first Authority STance towards Rumors (AuSTR) dataset for that specific task, targeting the Arabic language.
3. We explore the adequacy of existing Arabic datasets of stance towards claims for our task, and the effect of augmenting our in-domain data with those datasets on the performance of the model.
4. We investigate the performance of the models when adopting various loss functions to alleviate the class-imbalance issue, and we perform a thorough failure analysis to gain insights for future work on the task.
The rest of this paper is organized as follows. We present our literature review in Section 2 and define the problem we target in Section 3. In Section 4, we present our dataset construction approach. Our experimental design is presented in Section 5. We discuss the experimental setup in Section 6, then thoroughly analyze the results and answer the research questions in Section 7. We conduct a failure analysis to gain insights for future directions and discuss the limitations of our study in Section 8. Finally, we conclude and suggest some future directions in Section 9.

Related Work
In this section, we briefly review the studies related to our work. Specifically, we review studies on rumor debunking in social media in Section 2.1, give an overview of studies on stance detection for claim verification in Section 2.2, and review studies on authorities for rumor verification in Section 2.3.

Rumor Debunking in Social Media
Several studies on rumor debunking in Twitter suggested exploiting online debunkers, i.e., users who share fact-checking URLs to stop the propagation of a circulating rumor [32][33][34][35][36][37]. To encourage online debunkers in Twitter to remain engaged in correcting rumors, some studies proposed fact-checking URL recommender systems [32,36]. Vo and Lee [33,35] proposed a fact-checking response generator framework to stop the propagation of fake news, and exploited the replies of users who usually debunk rumors in Twitter to implement their model. Vo and Lee [34], on the other hand, introduced a multimodal framework to retrieve fact-checking articles to be incorporated into rumor spreaders' conversation threads to discourage propagating rumors in social media. Differently, in our work we consider authorities as credible debunkers who may post tweets supporting or debunking a specific rumor circulating in Twitter.

Stance Detection for Claim Verification
A myriad of studies have investigated detecting the stance towards claims to identify their veracity [38]. Some focus on detecting the stance of conversation threads in social media [19,20,39], and others on the stance of news articles [21,22,40,41]. Existing studies either considered stance detection as an isolated module in the verification system [19][20][21][39], or considered the stance of the evidence towards the claim as the veracity label [42][43][44][45]. Multiple approaches were proposed recently that consider verification as stance detection, mainly targeting the stance of articles towards claims, by exploiting either transformer-based models [22,45,46] or graph neural networks [47][48][49]. On the other hand, studies considering stance detection as a standalone component in the verification pipeline mainly target the stance of conversation threads towards rumors in social media. A plethora of models were proposed to detect the stance of conversation threads, such as the tree and hierarchical transformers proposed by Ma and Gao [50] and Yu et al. [7], respectively.
A few studies have recently addressed stance detection for Arabic claim verification, where the evidence is either news articles [22,41] or manually-crafted sentences from article headlines [46]. In contrast, in our work, we define the task of detecting the authorities' stance towards Arabic rumors, consider it as a standalone component in the rumor verification pipeline, and release the first dataset for the task. We study the usefulness of existing Arabic datasets of stance towards claims for the task, and evaluate the performance of the stance models when incorporating in-domain data for training. Finally, we investigate two loss functions that showed promising results in alleviating the class-imbalance issue, which was identified as a major challenge for stance detection for rumor verification [51].

Authorities for Rumor Verification
A closer look at the literature on rumor verification in social media reveals that no study to date has examined exploiting evidence from authorities. Existing studies for rumor verification in social media exploited evidence from the propagation networks [8,9,13,14,16], the Web [17], and the stance of conversational threads [19,20,39].
Recently, Haouari et al. [23] introduced the authority finding task in Twitter, which they define as follows: given a tweet stating a rumor, retrieve a ranked list of authority accounts from Twitter that can help verify the rumor, i.e., accounts that may tweet evidence supporting or denying the rumor. The authors released the first Arabic test collection for the task, and proposed a hybrid model that exploits lexical, semantic, and user-network signals to find authorities. The authority finding task was then introduced as part of the CheckThat! 2023 lab shared tasks [52,53], and it was deployed as a component of a live system for Arabic claim verification [54]. Differently, in our work we assume that the authority is already retrieved, and the task is to detect the stance of its tweets towards a given rumor.

Overview of Our Work
Figure 2 shows an example of a rumor about the establishment of a new railway connecting the Sultanate of Oman and the United Arab Emirates (UAE). We assume that the authorities for this rumor are retrieved by an "authority finding" model (here, some of the highly relevant authorities are the Ministry of Transport in Oman, the Omani government communication center, and both Oman's and the UAE's rail projects). The figure shows an example tweet from each of the authorities' timelines that actually supports the rumor.

In this work, we introduce the task of detecting the stance of authorities towards rumors in Twitter. Due to the lack of datasets for the task, we construct and release the first Authority STance towards Rumors (AuSTR) dataset (Section 4). We exploit both fact-checking articles and authority Twitter accounts to manually collect debunking, supporting, and other (rumor tweet, authority tweet) pairs. Additionally, we propose a semi-automated approach utilizing the Twitter search API to further expand our debunking pairs. Due to the limited size of our dataset, we investigate the usefulness of existing datasets of stance towards Arabic claims (Section 7.1 and Section 7.2). Adopting a BERT-based stance model, we perform extensive experiments using 5 different Arabic stance datasets, where the target is a claim but the context is either an article, an article headline, or a tweet, to investigate whether a stance model trained with each of them is able to generalize to our task. We then explore the effect of augmenting our in-domain data with each of the Arabic stance datasets on the performance of the model (Section 7.3). To mitigate the class-imbalance issue, we explore various loss functions replacing the cross-entropy loss (Section 7.4). Finally, we conduct a thorough error analysis to gain insights for future improvements (Section 8.1).

Fig. 2 An example of a rumor along with its corresponding authorities and a set of supporting tweets detected from the authorities' timelines (the example is from our constructed AuSTR dataset).

Constructing AuSTR Dataset
To address the lack of datasets of authority stance towards rumors, we introduce in this work the first Authority STance towards Rumors (denoted AuSTR) dataset. Our focus is on Arabic, as it is one of the most popular languages in Twitter [55], yet it is under-explored for rumor verification. Our dataset consists of 811 pairs of rumors (expressed in tweets) and authority tweets, related to 292 unique rumors. Tweets of authorities are labeled as either disagree, agree, or other, as defined earlier. To construct AuSTR, we collected the debunking pairs by exploiting fact-checking articles and adopting a semi-automated approach (details in Section 4.1). Supporting pairs were collected by manually exploring authority accounts and the Twitter search interface, in addition to utilizing the fact-checking articles (details in Section 4.2). Finally, to collect our other pairs, we manually examined the timelines of the authorities of our debunking and supporting pairs to select tweets that neither agree nor disagree with the rumor, in addition to exploiting fact-checking articles (details in Section 4.3).

Collecting Debunking Pairs
Figure 3 depicts an overview of our approach to constructing the debunking pairs of AuSTR. We leveraged both fact-checking articles and a semi-automated approach that we propose in this work.

Exploiting Fact-Checking Articles
Fact-checkers who attempt to verify rumors usually provide, in their fact-checking articles, some examples of social media posts (e.g., tweets) propagating the specific rumors, along with other posts from trusted authorities that constitute evidence supporting their verification decisions. For AuSTR, we exploit both types of tweets provided by those fact-checkers: tweets stating rumors and tweets showing evidence from authorities. Specifically, we used AraFacts [56], a large dataset of Arabic rumors collected from 5 fact-checking websites. From those rumors, we selected only the ones that are expressed in tweets and for which the fact-checkers provided evidence in tweets as well. For false rumors, we selected a single tweet example of the rumor and all provided evidence tweets for it, which are then labeled as having disagree stances.
Adopting this approach, we ended up with 118 debunking pairs.

Fig. 3 Our approach for collecting AuSTR debunking pairs.

Exploiting Twitter Search
Additionally, we adopted a semi-automated approach to collect more debunking pairs using Twitter search. First, we used the Twitter Academic API to collect potentially-debunking tweets, i.e., tweets with denying keywords and phrases such as "fake news," "fabricated," "rumors," and "denied the news." Specifically, we used 21 keywords/phrases to search Twitter to retrieve Arabic tweets from the period of July 1, 2022 to December 31, 2022. To narrow down our search and reduce noisy tweets, we excluded retweets and tweets of non-verified accounts. Given that fact-checkers usually use most of these keywords to debunk rumors, we also excluded tweets from verified Arabic fact-checking Twitter accounts. By adopting this approach, we were able to collect either debunking tweets from the authorities themselves, or pointer tweets from journalists or news agencies. For both types, we retrieved the rumor tweets by searching the Twitter user interface using the main keywords of the rumor debunked by the authorities. For the latter type, we additionally examined the timelines of the authorities manually to get the debunking tweets. Table 1 presents examples of debunking tweets from authorities along with the search keywords used to retrieve them. An example of an automatically-retrieved pointer tweet and the manually-collected disagree pair is presented in Table 2.
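The keyword-based retrieval step described above can be sketched as a query builder for the Twitter Academic full-archive search endpoint (API v2). This is a minimal illustration, not the paper's actual collection script: the phrase list is a sample of the examples quoted above (not the full 21 phrases), and the excluded fact-checking account name is hypothetical.

```python
# Illustrative sample of the denying phrases; the paper uses 21 keywords/phrases.
DENY_PHRASES = ["fake news", "fabricated", "rumors", "denied the news"]
# Hypothetical fact-checking account to exclude (placeholder, not from the paper).
FACT_CHECKERS = ["SomeFactChecker"]

def build_query(phrases, excluded_accounts):
    """Build a Twitter API v2 search query: Arabic tweets only,
    no retweets, verified authors, excluding fact-checking accounts."""
    keywords = " OR ".join(f'"{p}"' for p in phrases)
    exclusions = " ".join(f"-from:{a}" for a in excluded_accounts)
    return f"({keywords}) lang:ar -is:retweet is:verified {exclusions}".strip()

query = build_query(DENY_PHRASES, FACT_CHECKERS)
```

The resulting string would be passed as the `query` parameter of the v2 full-archive search endpoint, together with `start_time`/`end_time` bounds for the July-December 2022 window.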
Table 1 Examples of debunking authority tweets (and their English translations) collected using the semi-automated approach along with the search keywords.

Search keywords | Example of a collected tweet
Incorrect | @AymanNour: Statement from #Ghad El Thawra: One of the sites published incorrect news about the party's decision to call for the 11/11 movement ...
Fake news | @LebISF: Denying a fake news published by a Lebanese newspaper about the arrest of Major General Othman's brother
Untrue | @IraqiSpoxMOD: ... news about (the disappearance of an American citizen in central or southern Iraq, under mysterious circumstances, who works as a journalist). We confirm that this news is untrue ...
Fabricated | @AlAhlyTV: ... Al-Ahly's objection speech about Zamalek club uniforms in the super is fabricated ...
Rumors | @DGSGLB: #Statement: rumors are circulating that the General Directorate of General Security arrested Sally Hafez, who broke into a bank in Beirut ...
Table 2 An example of an automatically collected pointer debunking tweet along with its manually collected debunking pair (with their English translation).

Tweet type | Tweet text
Pointer | @naharkw: The Qatari Embassy in Tunisia: Incorrect.. A Qatari was killed in the ancient city of Bizerte. [11-

Collecting Supporting Pairs
To collect supporting pairs, we adopted two approaches, as presented in Figure 4. Given that fact-checkers focus more on false rumors than true ones, exploiting fact-checking articles was not sufficient to collect supporting tweets; adopting this approach, we were able to collect only 4 agree pairs as opposed to 118 disagree pairs. Thus, we manually collected a set of governmental Arabic Twitter accounts representing authorities related to health and politics, such as ministries and ministers, embassy accounts, and Arabic sports organization accounts (e.g., football associations and clubs). Starting from 172 authority accounts from multiple Arab countries, we manually checked the timelines of those authorities from the period of July 1, 2022 to December 31, 2022. We selected check-worthy tweets, i.e., tweets containing verifiable claims that we think would be of general interest [57], and considered them authority supporting tweets. We then used the main keywords in each claim to search Twitter through the user interface and selected a tweet propagating the same claim while avoiding near-duplicates. We ended up with 148 agree pairs in total. Table 3 shows an example of a supporting authority tweet along with a relevant rumor.

Collecting Other Pairs
For some rumors, fact-checkers provide the authority account in their fact-checking article but state that no evidence was found to support or deny the rumor. For this case, we selected one or two tweets from the authority timeline posted shortly before the rumor's time, and assigned the other label to those pairs. In reality, most tweets in authority timelines neither support nor deny a given rumor. To get closer to that real scenario, for each agree and disagree pair, we manually examined the timeline of the authority within the same time period as the rumor and selected at most two tweets, giving higher priority to tweets related to the rumor's topic or at least overlapping in some keywords with the rumor. Such a tweet is then labeled as other if it is either relevant to the rumor but neither disagrees nor agrees with it, or is completely irrelevant to it. We ended up with 466 other pairs.
It is worth noting that the evidence from authorities is not always expressed in the textual body of the tweet; we also considered cases where authorities post evidence as an announcement embedded in an image or video.

Data Quality
We present our dataset statistics in Table 4.Our data was annotated by one of the authors, a PhD candidate and native Arabic speaker working on rumor verification in Twitter.To measure the quality of our data, we randomly picked 10% of the pairs and asked a second annotator, a PhD holder and native Arabic speaker, to label them.The computed Cohen's Kappa for inter-annotator agreement [58] was found to be 0.86, which indicates "almost perfect" agreement [59].
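The inter-annotator agreement above can be reproduced with a small self-contained implementation of Cohen's Kappa; the labels below are illustrative, not the actual annotations.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

For the 10% sample described above, a value of 0.86 falls in the "almost perfect" band (0.81-1.00) of the Landis-and-Koch scale cited in [59].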

Experimental Design
Due to the limited size of AuSTR, one of the main objectives of this work is to study the adequacy of using existing datasets of stance towards claims in training models for our task. Specifically, the goal is to first study whether models trained with existing stance datasets perform well on detecting the stance of authorities in particular, and then to investigate whether augmenting them with AuSTR improves the performance of those models. Moreover, since a major challenge of stance classification is the class-imbalance problem in the data [51], we also aim to explore whether incorporating different loss functions can mitigate that issue to further improve the performance of the models. Accordingly, we aim to answer the following research questions:
• RQ1: To what extent will stance models trained with existing stance datasets be able to generalize to the task of detecting the stance of authorities?
• RQ2: What is the effect of combining all existing stance datasets for training?
• RQ3: Will training a stance model with AuSTR solely be sufficient? Will augmenting AuSTR with existing stance datasets for training improve the performance?
• RQ4: Will adopting different loss functions mitigate the class-imbalance problem and thus improve the performance?
To address those research questions, we design our experiments as follows:
• Cross-domain experiments denote the case where existing datasets of stance towards claims are exploited for training. Each of the stance datasets is first used solely for training our models; then all datasets are aggregated and used for training. We refer to the datasets of stance towards claims as cross-domain datasets in the rest of the paper.
• In-domain experiments denote the case where AuSTR is used solely for training. We refer to AuSTR as the in-domain dataset.
• In-domain augmented experiments denote the case where AuSTR is augmented with existing datasets of stance towards claims. In those experiments, we study the effect of augmenting AuSTR with each of the cross-domain datasets separately, in addition to augmenting it with all of them.
• Class-imbalance experiments denote the case where we adopt different loss functions, which showed promising results earlier in the literature, to alleviate the class-imbalance problem.

Experimental Setup
In this section, we present the setup we adopted to conduct our experiments.

Datasets
To study the adequacy of existing Arabic datasets of stance detection towards claims for the task of detecting the stance of authorities, we adopted the following five existing datasets in training:
• ArCOV19-Rumors [14] consists of 9,413 tweets relevant to 138 COVID-19 Arabic rumors collected from 2 Arabic fact-checking websites. We considered the tweets expressing the rumor as supporting (agree), the ones negating the rumor as denying (disagree), and the ones discussing the rumor but neither expressing nor negating it as other.
• STANCEOSAURUS [60] consists of 4,009 (rumor, tweet) pairs. The data covers 22 Arabic rumors collected from 3 Arabic fact-checking websites, along with tweets, collected by the authors, that are relevant to the rumors. The relevant tweets were annotated by their stance towards the rumor as either supporting (agree), refuting (disagree), discussing, querying, or irrelevant. In our work, we considered the last three labels as other.
• ANS [46] consists of 3,786 (claim, manipulated claim) pairs, where claims were extracted from news article headlines from trusted sources; annotators were then asked to generate true and false sentences towards them by adopting paraphrasing and contradiction, respectively. The sentences are annotated as either agree, disagree, or other.
• ArabicFC [41] consists of 3,042 (claim, article) pairs, where claims are extracted from a single fact-checking website verifying political claims about the war in Syria, and articles were collected by searching Google using the claim. The articles are annotated as either agree, disagree, discuss, or unrelated to the claim. In our work, we considered the last two labels as other.
• AraStance [22] consists of 4,063 (claim, article) pairs, where claims are extracted from 3 Arabic fact-checking websites covering multiple domains and Arab countries. The articles were collected and annotated similarly to ArabicFC.
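The label harmonization described in the list above can be sketched as a simple mapping into the unified {agree, disagree, other} scheme. The exact label strings in each released file are assumptions for illustration, not verified against the datasets' actual formats.

```python
# Assumed per-dataset label strings (illustrative), mapped to the unified scheme.
LABEL_MAPS = {
    "ArCOV19-Rumors": {"supporting": "agree", "denying": "disagree",
                       "discussing": "other"},
    "STANCEOSAURUS": {"supporting": "agree", "refuting": "disagree",
                      "discussing": "other", "querying": "other",
                      "irrelevant": "other"},
    "ANS": {"agree": "agree", "disagree": "disagree", "other": "other"},
    "ArabicFC": {"agree": "agree", "disagree": "disagree",
                 "discuss": "other", "unrelated": "other"},
    "AraStance": {"agree": "agree", "disagree": "disagree",
                  "discuss": "other", "unrelated": "other"},
}

def normalize(dataset, label):
    """Map a dataset-specific stance label to agree/disagree/other."""
    return LABEL_MAPS[dataset][label.lower()]
```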
Figure 5 presents the per-class statistics for each dataset (including AuSTR), and Table 5 shows an example of a debunking text from each of them.

Data Splits
Given that AuSTR constitutes only 811 pairs, we adopt cross-validation for evaluating our models.We randomly split it into 5 folds while assigning all pairs that are relevant to the same rumor to the same fold to avoid label leakage across folds.

ANS | The Moroccan judiciary issued a 20-year prison sentence for Zefzafi.
AraStance | The circulating video entitled "a mobile phone explosion in a person's pocket in a Dubai mall" is not true. Rather, it happened a few days ago in the city of Agadir in Morocco...

For all of our models, whether AuSTR is exploited for training or not, we both tune and test only on folds from AuSTR: a single AuSTR fold (dev fold) is used for tuning the models and another (test fold) for testing. If AuSTR is used for training, the remaining 3 folds (training folds) are used for that purpose. When the cross-domain datasets are used for training, they are fully used for that purpose (none of them is used for tuning or testing). For each experiment, we train 5 models to test on the 5 different folds of AuSTR, and finally report the average performance of the five models.
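The rumor-grouped split can be sketched as follows: all pairs sharing a rumor land in the same fold, so no rumor leaks across the train/dev/test boundary. This is a minimal sketch (round-robin over shuffled rumor ids), not the paper's actual splitting script.

```python
import random

def group_kfold(pairs, n_folds=5, seed=0):
    """pairs: list of (rumor_id, authority_tweet, label) tuples.
    Returns a fold index per pair, with all pairs of the same rumor
    assigned to the same fold to avoid label leakage across folds."""
    rumor_ids = sorted({rumor for rumor, _, _ in pairs})
    random.Random(seed).shuffle(rumor_ids)
    fold_of_rumor = {r: i % n_folds for i, r in enumerate(rumor_ids)}
    return [fold_of_rumor[rumor] for rumor, _, _ in pairs]
```

With 5 folds, each experiment then rotates which fold serves as dev and which as test, averaging the five test results.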

Stance Models
To train our stance models, we fine-tuned BERT [61], following recent studies that adopted transformer-based models for stance detection [22,60,62,63], to classify whether the evidence agrees with the claim, disagrees with it, or other. We feed BERT the claim text as sentence A and the evidence as sentence B (truncated if needed), separated by the [SEP] token. Finally, we use the representation of the [CLS] token as input to a single classification layer with three output nodes, added on top of the BERT architecture, to compute the probability of each stance class. Various Arabic BERT-based models were released recently [64][65][66][67][68]; we opted for ARBERT [68], as it was shown to achieve better performance on most of the stance datasets adopted in our work [22]. All models were trained for a maximum of 25 epochs with an early-stopping patience of 5 epochs. We tuned our models over three learning rates (1e-5, 2e-5, 3e-5). The sequence length and batch size were set to 512 and 16, respectively.
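The sentence-pair packing described above can be illustrated with a small token-level sketch: claim as sentence A, authority tweet as sentence B, with the evidence truncated so the packed sequence fits the 512-token budget. In practice this is handled by the model's tokenizer (e.g., a HuggingFace tokenizer with second-sequence truncation); the function below only illustrates the format, not the actual WordPiece tokenization.

```python
def pack_pair(claim_tokens, evidence_tokens, max_len=512):
    """Pack [CLS] claim [SEP] evidence [SEP], truncating only the
    evidence (sentence B) to respect the max sequence length."""
    budget = max_len - 3 - len(claim_tokens)  # 3 special tokens
    evidence_tokens = evidence_tokens[:max(budget, 0)]
    return ["[CLS]"] + claim_tokens + ["[SEP]"] + evidence_tokens + ["[SEP]"]
```

The final-layer classifier then reads the hidden state at the [CLS] position and maps it to the three stance classes.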

Preprocessing
We preprocessed all the textual content by removing non-Arabic text, special characters, URLs, diacritics, and emojis from the tweets. For STANCEOSAURUS, we extended the tweets with their context as suggested by the authors [60], who showed that extending tweets with the parent tweet text and/or embedded article titles can improve the performance of stance models.

Loss Functions

We adopted the Cross-Entropy (CE) loss in all our experiments. However, due to the imbalanced class distribution, we also experimented with the Weighted Cross-Entropy (WCE) loss and the Class-Balanced Focal (CBF) loss [69], adopted by Baheti et al. [70] and Zheng et al. [60] to mitigate the issue for stance detection. For CBF, we set the

Fig. 6 The performance of models trained using cross-domain vs. in-domain datasets.
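The Class-Balanced Focal loss mentioned above can be sketched for a single example as the effective-number class weighting of [69] combined with the standard focal modulation; the beta and gamma values below are common defaults used for illustration, not the paper's tuned settings.

```python
import math

def cb_focal_loss(probs, target, class_counts, beta=0.999, gamma=2.0):
    """Class-balanced focal loss for one example.
    probs: predicted class probabilities; target: gold class index;
    class_counts: number of training examples per class."""
    # Effective-number class weights: (1 - beta) / (1 - beta^n_c),
    # normalized to sum to the number of classes.
    weights = [(1 - beta) / (1 - beta ** n) for n in class_counts]
    norm = sum(weights)
    weights = [w * len(weights) / norm for w in weights]
    # Focal modulation down-weights easy, well-classified examples.
    p_t = probs[target]
    return -weights[target] * (1 - p_t) ** gamma * math.log(p_t)
```

Averaging this quantity over a batch gives the training loss; with gamma = 0 and beta approaching 0, it reduces to plain cross-entropy.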

Evaluation Measures
To evaluate our models, we report the average of macro-F1 scores across the 5 folds of AuSTR, in addition to the average per-class F1. Macro-F1 is recommended for evaluating stance models [71] due to the class-imbalanced nature of stance datasets.
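The reported metrics can be computed with a small self-contained routine (equivalent to scikit-learn's `f1_score` with `average='macro'`); shown here only to make the per-class and macro aggregation explicit.

```python
def f1_scores(gold, pred, classes=("agree", "disagree", "other")):
    """Return (per-class F1 dict, macro-F1) for two label sequences."""
    per_class = {}
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(per_class.values()) / len(classes)
    return per_class, macro
```

Macro-F1 averages the three per-class scores with equal weight, so the minority disagree class counts as much as the majority other class.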

Experimental Evaluation
In this section, we present and discuss the results of our experiments that address the research questions introduced in Section 5.

Leveraging Cross-domain Datasets for Training (RQ1)
To address RQ1, we used the five cross-domain datasets listed earlier for training.
For each of them, we train on the full cross-domain dataset and fine-tune 5 stance models; each is tuned on one fold of AuSTR and tested on another. We report the average performance over the 5 test folds of AuSTR in Figure 6.
The figure reveals several observations. First, the performance on the Disagree class is notably worse than that on the other two classes in four out of the five training datasets. This indicates that detecting disagreement is generally more challenging than detecting agreement or irrelevance.
Second, comparing the performance across the individual cross-domain datasets, it is clear that we have two categories of performance. The first, including AraStance and ArCOV19-Rumors, performs much better than the second, which includes the remaining three datasets. Within the superior category, the model trained on AraStance exhibits the best performance.
As for the inferior category, we speculate on the rationale behind its performance. We note that ArabicFC is severely imbalanced, where the disagree class represents only 2.86% of the data, yielding very poor performance on that class. Moreover, it covers claims related to only one topic, the Syrian war, making it hard to generalize; a similar conclusion was reached by previous studies that used ArabicFC [22,41]. As for ANS, the evidence was manually/artificially crafted, which is not as realistic as tweets from authorities. As for STANCEOSAURUS, it covers tweets relevant to only 22 claims.
As for the superior category, we observe that AraStance and ArCOV19-Rumors achieved the highest F1 on the disagree class compared to the other cross-domain datasets. ArCOV19-Rumors covers 138 COVID-19 claims across several topical categories. AraStance covers 910 claims, extracted from three fact-checking websites, covering multiple domains and Arab countries, similar to AuSTR; moreover, its evidence is represented in articles written by journalists, not manually crafted. To further investigate their performance, we manually examined 20% of the disagreeing training pairs of AraStance and ArCOV19-Rumors. We found that about 68% and 59% of the examined examples of AraStance and ArCOV19-Rumors, respectively, share common debunking keywords, such as "rumors," "not true," "denied," and "fake"; similar keywords appear in some disagreeing tweets of AuSTR.
To further investigate the relation between the datasets and the performance of the corresponding models, we analyzed the lexical similarity between the datasets. We first constructed a 2-gram vector representation for each dataset (including AuSTR) using the preprocessed context (excluding the claims), then computed the pairwise cosine similarity between the vectors to gain insights into the similarity between the corresponding datasets. Figures 7(a) and 7(b) present heatmaps of the similarity between the debunking contexts and the overall contexts of the datasets, respectively. It is clear that the performance of the cross-domain models is strongly related to the dataset similarities. In particular, AraStance has the highest similarity with AuSTR on both the debunking context (0.20) and the overall context (0.25), which results in the best-performing cross-domain model, achieving a macro-F1 of 0.771 and an F1 (disagree) of 0.687. Moreover, ArCOV19-Rumors has the second highest similarity with AuSTR on the debunking context (0.10) and yields the second best-performing cross-domain model, achieving an F1 (disagree) of 0.621. It is worth noting that although ArabicFC has the second highest similarity on the overall context, the model trained on it did not perform well, especially on the disagree class, with an F1 of 0.332, due to the severe imbalance mentioned earlier.
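This lexical-similarity analysis can be sketched as follows, using word-level 2-gram count vectors and cosine similarity; the whitespace tokenization and the toy corpora below are simplified stand-ins for our actual preprocessing pipeline:

```python
import math
from collections import Counter

def bigram_vector(texts):
    """Bag of word-level 2-grams over a collection of texts."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy corpora standing in for the (claim-excluded) contexts of two datasets
dataset_a = ["this claim is not true", "officials denied the claim"]
dataset_b = ["the claim is not true at all"]
sim = cosine_similarity(bigram_vector(dataset_a), bigram_vector(dataset_b))
```

Computing this similarity pairwise over all dataset contexts yields the heatmaps of Figure 7.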
In summary, we found that AraStance is the best existing stance dataset for training a model for our task, as it covers a large number of fact-checked claims spanning multiple Arab countries and topics compared to the other datasets. To answer RQ1, we conclude that some cross-domain stance datasets are somewhat useful for detecting the stance of authorities. However, motivated by the findings of Ng and Carley [63], who highlighted the potential benefit of aggregating datasets to enhance stance detection, we conducted subsequent experiments in which we combine all cross-domain datasets for training.

Combining Cross-domain Datasets for Training (RQ2)
To address RQ2, we combined all cross-domain datasets and adopted the same setup mentioned previously, where we tune and test on AuSTR folds. As presented in Figure 6, we note that, overall, the combined model achieved a very slightly better performance in terms of macro-F1 than the best individual model, i.e., the model trained on AraStance only. However, considering the individual classes, it exhibited the best performance on the agree class by a large margin compared to the AraStance model, but it fell short on the disagree class. We speculate the reason is that some of the datasets, namely ANS and ArabicFC, achieved low performance on the disagree class; thus, when combined with the other datasets, they negatively affected the overall performance on that class.
Finally, we observe a clear discrepancy in performance across the different classes; considering the combined model, F1 (agree) is 0.793, while F1 (disagree) is 0.653. Moreover, it is clear that detecting the disagree stance is still challenging, for which we expect to benefit from introducing our in-domain data. We believe one of the major reasons behind such results is the imbalanced nature of the combined data, where only 14.24% are disagree examples vs. 27.66% agree examples.
To answer RQ2, we found that combining all cross-domain datasets can slightly improve the overall performance compared to the best-performing individual model (AraStance), but it could not beat that model on detecting debunking tweets.

Introducing In-domain Data for Training (RQ3)
To address RQ3, we first trained a stance model with in-domain data only, i.e., AuSTR. We then trained a model with in-domain data augmented with each of the cross-domain datasets separately, and also with all cross-domain datasets combined.
As expected, the model trained on AuSTR only outperforms all models trained on cross-domain datasets across all evaluation measures, as shown in Figure 6. More specifically, it outperforms their best (i.e., the model trained on AraStance) by 15.3%, 7.1%, and 7.9% in F1 (disagree), F1 (agree), and macro-F1, respectively, showing a clear need for in-domain data.
What if we augment AuSTR with the cross-domain datasets in training? Figure 8 illustrates that effect. For every single cross-domain dataset, the model trained on it augmented with AuSTR outperforms the model trained on the cross-domain data alone by a large margin, ranging from 6.8% to 35.6% in macro-F1. This re-emphasizes the effect of in-domain data. However, only the model trained on AuSTR+AraStance was able to outperform the AuSTR-only model in macro-F1 and F1 (agree), but not in F1 (disagree). It turned out that augmenting AuSTR with AraStance made the disagree class a minority, constituting only 13.3% of the training examples compared to 24.3% of AuSTR training examples, which negatively affected the performance on that class.
Contrary to the results presented in Figure 6, augmenting AuSTR with all cross-domain datasets achieved the lowest macro-F1 compared to augmenting AuSTR with the individual cross-domain datasets. In fact, the combined training data is clearly dominated by the cross-domain data (24,313 vs. 811 examples), leaving the in-domain data with a negligible effect.
To answer RQ3, we conclude that in-domain data is needed to better detect the stance of authorities. Moreover, augmenting AuSTR with AraStance improved the overall performance, but at the expense of degrading the performance on detecting debunking tweets, which, we argue, is more crucial for the task.

Addressing the Class-Imbalance Problem (RQ4)
To address RQ4, we selected the best two models presented in Figure 8, namely the one trained on AuSTR only and the one trained on AuSTR augmented with AraStance. We then fine-tuned the stance models with the same previous setup but with two other loss functions, WCE and CBF, as described in Section 6.
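The class-balanced focal (CBF) loss can be illustrated with the following pure-Python sketch; the exact formulation we adopt is the one given in Section 6, and this sketch assumes the widely used effective-number class weighting, (1-β)/(1-β^n_y), combined with focal modulation, (1-p)^γ. The probabilities and class counts below are toy values:

```python
import math

def class_balanced_focal_loss(probs, label, class_counts, beta=0.999, gamma=2.0):
    """Class-balanced focal loss for a single example.

    probs: predicted probability per class (softmax output)
    label: index of the gold class
    class_counts: number of training examples per class
    """
    # Effective-number class weight: rare classes get larger weights
    weight = (1.0 - beta) / (1.0 - beta ** class_counts[label])
    p = probs[label]
    # Focal modulation down-weights easy, well-classified examples
    return -weight * (1.0 - p) ** gamma * math.log(p)

# Toy example: class index 1 (e.g., disagree) is the minority class
probs = [0.2, 0.7, 0.1]
counts = [500, 120, 300]
loss = class_balanced_focal_loss(probs, label=1, class_counts=counts)
```

The class weight counters the imbalance discussed above, while the focal term concentrates training on hard examples; both effects matter for the minority stance classes in our data.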
As presented in Table 6, we observe that adopting the WCE loss function could not improve the performance of the models compared to adopting CE. However, for the model trained on AuSTR, adopting CBF notably improved the performance over CE by about 4.2% on the agree class, which is the minority class in the AuSTR data, though it slightly degraded the performance on the disagree class. Overall, it improved the macro-F1 performance, bringing it closer to that of the model trained on AuSTR augmented with AraStance (0.843 vs. 0.845).
Surprisingly, that positive effect of CBF did not extend to the model trained on AuSTR augmented with AraStance; in fact, the performance degraded on all measures. We leave the investigation of this result to future work. To answer RQ4, we conclude that adopting CBF while training on AuSTR solely is on par with the model trained on both AuSTR and AraStance, nullifying the need to augment AuSTR with any cross-domain data for training.

Discussion
In this section, we discuss our evaluation results in terms of failure cases (Section 8.1) and limitations (Section 8.2).

Failure Analysis
We conducted a detailed error analysis on the 113 examples (constituting 14% of the data) that were predicted incorrectly by the model trained on AuSTR with the CBF loss. We categorize the reasons behind these errors based on a thorough examination of the failed pairs. We found that the failures can be attributed to six main reasons, which we discuss below. Some failed examples are presented in Table 7.
1. Implicit stance: when an authority indirectly agrees or disagrees with the rumor. For example, P1 concerns a rumor that Mahmoud Al-Khatib, the director of the Al-Ahly Egyptian football club, was infected with COVID-19, and an authority tweet implicitly debunks the rumor by mentioning that he is attending the team's training session in the stadium. This failure type is the cause of 30.09% of all failures, which motivates the need for stance models that take implicit stance into consideration.
2. Writing style: when an authority speaks about itself in the first person, e.g., P2. Based on our examination, 12.39% of the failures are due to this reason.
3. Misleading debunking keywords: when an authority is either debunking another rumor relevant to the topic of the target rumor, or simply includes debunking keywords in its tweet even when supporting a rumor. For example, in P3, the authority tweet mentions that the "information being posted on it today is false," although it is agreeing with the rumor. We found that this constitutes 10.62% of the failures.
4. Misleading relevant keywords: when an authority posts tweets relevant to the topic of the rumor, the model may fail to predict the stance correctly, e.g., in P4. This constitutes 25.66% of the failed examples.
5. Lack of context: when an authority debunks or supports a rumor through an announcement embedded in an image or a video, e.g., in P5. This motivates the need to consider tweet multi-modality [30,72] at the processing step. Moreover, some rumors may need additional context to be considered relevant to the authority tweet. We observed that 6.19% of the failures are of this type.
6. Arabic MSA by authorities vs. dialects by normal users: as opposed to English, working with the Arabic language is very challenging, as different dialects, i.e., informal varieties, are used in different Arab countries [73]. These dialects may have different vocabulary from Modern Standard Arabic (MSA), which is usually used in formal communication [74]. Authority tweets are usually formal and written in MSA, while normal users may write informal Arabic in various dialects, e.g., in P6, which makes detecting the stance more challenging.
We also observed other reasons, such as having multiple claims in the same tweet, which causes the stance model to predict the authority tweet as other. Moreover, we noticed that some failures can be attributed to more than one of the reasons mentioned above. These challenges motivate further work on tweet pre-processing to consider content embedded within the tweets, and on proposing stance models specific to the task.

Limitations of our study
The limitations of our work are related to both our data and the adopted stance models.We discuss these limitations below.

Data
For a portion of our data, we adopted a semi-automated approach, where we collected the disagree pairs starting from a collection of tweets containing debunking keywords. Although most of the automatically collected debunking tweets were used only as pointers to collect implicit debunking tweets, some were already posted by authorities themselves and hence were included in our data. This may introduce some bias towards these keywords. Moreover, although AuSTR, despite its relatively small size, yielded good performance, we believe enlarging the data with more rumors covering more topics can help the models generalize better to newly emerging rumors.

Stance Models
In our work, we adopted a BERT-based stance model, but we did not experiment with other models, e.g., [75], which might improve the performance we achieved. Moreover, we only experimented with ARBERT [68], as it was shown to perform well for Arabic stance detection on most of our adopted cross-domain datasets [22]; however, we did not experiment with other Arabic BERT models [76].
[Agree] @malkassabi: Today, I had the pleasure of meeting with the Moroccan Prime Minister, Aziz Akhannouch, and we discussed strengthening our economic and commercial cooperation to meet the aspirations of the leadership of our two countries and our two brotherly peoples. [04-10-2022]
[P3] @USER: Hacking of the account of the Libyan Ministry of Foreign Affairs on Twitter. [22-12-2022]
[Agree] @USEmbassyLibya: The US Embassy understands that the Twitter account of the Libyan Ministry of Foreign Affairs has been hacked, and we confirm that the information being posted on it today is false. [20-12-2022]
[P4] @USER: A railway network to connect the port of Sohar in the Sultanate of Oman with the city of Abu Dhabi in the UAE. [15-10-2022]
[Other] @Etihad Rail: Etihad Rail has made significant progress in expanding the network by successfully connecting the emirates of Sharjah and Ras Al Khaimah to the main line of the UAE National Rail Network. With this achievement, the network will extend from Sharjah and Ras Al Khaimah to Al Ghuwaifat. [12-10-2022]
[P5] @USER: World Cup 2022: Morocco officially protests the arbitration in the semi-finals against France. [15-12-2022]
[Agree] @FRMFOFFICIEL: Announcement from the Royal Moroccan Football Federation [Embedded image with the content of the announcement]. [15-12-2022]
[P6] @USER: The first person to have monkeypox in Egypt is 39 years old .. we need two nuclear bombs to close the game. [09-
[Agree] @mohpegypt: The Ministry of Health and Population announces a positive case of monkeypox virus (Mpox) in a 39-year-old person, taking preventive measures for the infected person and his close contacts, and transferring the patient to receive treatment in one of the hospitals affiliated with the Ministry... [08-12-2022]

Conclusion
In this work, we introduced the task of detecting the stance of authorities towards rumors in Twitter, which can be leveraged by automated systems and fact-checkers for rumor verification. We constructed (and released) the first Arabic dataset for that task, AuSTR, using a language-independent approach, which we share to encourage the construction of similar datasets in other languages. Due to the relatively limited size of our dataset, we explored the adequacy of existing Arabic datasets of stance towards claims in training models for our task, and the effect of augmenting our data with those datasets. Moreover, we tackled the class-imbalance issue by incorporating various loss functions into our BERT-based stance model. Our experimental results suggest that adopting existing stance datasets is somewhat useful but clearly insufficient for detecting the stance of authorities. Moreover, when augmenting AuSTR with existing stance datasets, only the model trained on AuSTR augmented with AraStance outperformed the model trained on AuSTR solely, except on detecting debunking tweets. However, when adopting the class-balanced focal loss instead of the cross-entropy loss, the model trained on AuSTR solely achieved comparable results to that augmented model, indicating that AuSTR alone, despite its limited size, can be sufficient for detecting the stance of authorities.
Finally, based on our extensive failure analysis, we recommend further work on tweet pre-processing to consider context expansion, and on exploring other stance models that can detect implicit stance and take authorities' writing style into consideration. Since our study focused on Arabic data, examining the task in other languages is clearly a potential path for future work.

Fig. 1 Positioning the task of detecting the stance of authorities (highlighted in yellow) in the rumor verification pipeline.

Fig. 5 Per-class statistics of the cross-domain datasets adopted in our work, as well as AuSTR for comparison.

Fig. 8 Performance of models trained using in-domain vs. in-domain-augmented data.

[Pair] Rumor tweet [Post date] [Gold stance] Authority tweet [Post date]
[P1] @USER: Mahmoud Al-Khatib was infected with Corona! Is the Al-Ahly administration still insisting on completing the league? Or will it change its mind after Al-Khatib was infected... [24-06-2020]
[Disagree] @AlAhlyTV: Captain Mahmoud Al-Khatib is watching our morning team's training session at the Tetch Stadium. [25-06-2020]
[P2] @USER: On an official visit of 4 days. Commerce Minister Majid bin Abdullah Al-Kassabi heads a Saudi government delegation to the Kingdom of Morocco to discuss strengthening trade and investment relations. With the participation of officials from the government sector for 12 government agencies and representatives of the private sector for more than 60 Saudi companies. [03-10-2022]
The Embassy of the State of Qatar in the Republic of Tunisia denies what was reported by the media that the victim in the Bizerte incident holds Qatari nationality, and expresses its condolences to the victim's family and relatives.

Table 3
An example of manually collected supporting authority tweet and a relevant rumor tweet expressing the same claim.
Authority @Moi kuw: A resident who tried to commit suicide by stabbing himself inside a mosque was given first aid; the person was kept, and the necessary legal measures regarding the incident are being taken. [04-

Table 5
Debunking examples (and their English translations) from the cross-domain datasets.

Table 6
Training with different loss functions. Boldfaced and underlined numbers are the best and second best, respectively, per measure.

Table 7
Sample examples that our best model failed to predict correctly. The failure types are, in order: implicit stance, writing style, misleading debunking keywords, misleading relevant keywords, lack of context, and non-MSA Arabic.