1 Introduction

Stance detection is a significant research topic in sentiment analysis and text mining that focuses on the stance (e.g., Favor, Against, or Neutral) expressed in a text toward a given target [1,2,3]. By mining the opinions expressed in text, it can be applied to social opinion analysis [4], rumor detection [5], and other research fields.

Traditional stance detection [3] has a limited range of applications because it requires training and testing on the same target and depends on large amounts of labeled data to perform well. However, topics on social media platforms emerge frequently and in great quantities, and manually labeling new targets is expensive and laborious, which makes it impractical to create a labeled dataset covering all prospective targets [6]. Therefore, the study of zero-shot stance detection for unseen targets is essential and promising [7].

For the zero-shot stance detection task, existing works generally incorporate external knowledge as support for inference [8, 9] or introduce an attention mechanism to capture the relationships between targets [7]. However, none of these approaches explicitly models the transferable knowledge between source and destination targets. Some works employ adversarial learning to make the model learn target-invariant representations [10]. Still, their adversarial learning strategy is unstable and prone to degrading prediction performance when the target distribution is unbalanced. As shown in Table 1, zero-shot stance detection identifies the stance toward an unknown target after training on numerous labeled targets; for example, the test set may contain the target “Feminist Movement”, while the training set contains targets such as “Donald Trump” and “Hillary Clinton”. To generalize effectively to unknown targets, it is essential to learn transferable stance features from the training data, so finding appropriate and effective knowledge transfer methods is crucial. In addition, we find a certain correlation between sentiment information and stance detection [9]. For instance, when a document contains positive words, it often implies a Favor stance. Stance detection can therefore be expected to perform better if sentiment knowledge is acquired concurrently.

Table 1 Examples of zero-shot stance detection

To address the above challenges, we propose an adversarial distillation adaptation model with sentiment contrastive learning. Specifically, since the training and test sets of zero-shot stance detection belong to different targets (domains), a domain adaptation method can be adopted to transfer knowledge. We employ an adversarial discriminative domain adaptation network [11]. By confusing a domain discriminator, the model is encouraged to learn more target-invariant features, which ensures the transferability of information across different targets. Moreover, catastrophic forgetting can occur when the adversarial network is applied to the BERT model [21]. Knowledge distillation [22] can serve as a regularization method that preserves the information learned from the source data while remaining adaptable to the destination data. Supervised contrastive learning is also applied to improve generalization to unknown targets by separating stance category features in the latent space. Given that stance detection is influenced by sentiment information, we employ a cross-attention module to inject the sentiment knowledge encoded by SentiBERT into BERT and adjust the fusion process according to the training loss of stance detection.

The contributions of our work can be summarized as follows:

  1. We apply an adversarial discriminative domain adaptation network with knowledge distillation to solve the target knowledge transfer problem for zero-shot stance detection while improving the stability of the adversarial training.

  2. The proposed model employs supervised contrastive learning to learn enhanced target-invariant representations by learning correlations and differences between data with different stance labels. Sentiment information is extracted to assist in stance detection.

  3. Experimental results on two datasets show that our method obtains competitive results compared to several strong baselines.

2 Related Work

2.1 Zero-Shot Stance Detection

Stance detection aims to identify the attitude of a text toward a given target [1]. Most previous studies concentrated on in-target stance detection, where the training and testing phases share identical target sets [2, 3]. However, labeled data is insufficient when new topics emerge. As a result, some studies explored cross-target stance detection [14,15,16], which involves training the model on one target and testing it on another related target. Xu et al. [16] presented a self-attentive model that transfers shared features learned from the source target to the destination target. Wei et al. [15] further exploited the hidden topics between targets as transferred knowledge. In contrast to the cross-target setting, zero-shot stance detection does not require a prior assumption of target correlation. It is a more general setting that can deal with the reality that new targets appear irregularly.

For zero-shot stance detection, Allaway et al. [7] created a dataset containing many targets and proposed a topic-grouped attention model that implicitly captures the relationships between targets by generating generalized topic representations. Liu et al. [8] proposed a commonsense knowledge-enhanced graph model based on GCN and BERT, which utilizes text information and relationship graph structure information to increase the generalization and reasoning capabilities of the model. Liang et al. [12] proposed a hierarchical contrastive learning model based on a pretext task that distinguishes the types of stance expressions to aid zero-shot stance detection.

2.2 Domain Adaptation

Domain adaptation can effectively deal with the problem of insufficient labeled data. It compensates for the absence of label information in the destination domain by using the abundant label information in the source domain. The purpose of domain adaptation is to reduce domain differences and effectively transfer knowledge. Inspired by the generative adversarial network (GAN) [17], adversarial loss methods have been commonly applied to domain adaptation. The domain adversarial neural network (DANN) [18] introduced a gradient reversal layer to confuse the domain discriminator and enable the feature extractor to acquire domain-invariant knowledge. Adversarial discriminative domain adaptation (ADDA) [11] uses an adversarial framework that combines discriminative models, unshared weights, and a GAN loss.

Allaway et al. [10] regarded each target as a domain and modeled zero-shot stance detection as a domain adaptation problem, which successfully learned the target invariant representation. Inspired by the above works, we explore employing a more robust and efficient ADDA framework to handle zero-shot stance detection.

3 Methods

In this section, we introduce our proposed adversarial distillation adaptation network with sentiment contrastive learning for zero-shot stance detection (ADSC) in detail. As shown in Fig. 1, the model consists of two main parts.

  1. Pretraining: we pretrain the source encoder with sentiment information and the classifier on the source labeled data while designing stance contrastive learning.

  2. Adversarial distillation domain adaptation: we initialize the target encoder with the source encoder's parameters and train it via adversarial learning and knowledge distillation. The dotted box indicates that the parameters are fixed.

Fig. 1 Overview of the ADSC model

3.1 Task Description

Suppose we are given a set of labeled data \({D}_{s}={\{({x}_{s}^{i},{t}_{s}^{i},{y}_{s}^{i})\}}_{i=1}^{{N}_{s}}\) from source targets and a set of unlabeled data \({D}_{d}={\{({x}_{d}^{i},{t}_{d}^{i})\}}_{i=1}^{{N}_{d}}\) from a destination target (unknown target), where \(x\) is a document, \(t\) and \(y\) are its corresponding target and stance label, respectively, and \(N\) is the number of examples. The purpose of zero-shot stance detection is to train the model according to the labeled data of multiple source targets to predict the stance labels of the unknown target examples.

3.2 Encoder with Sentiment Information

Considering that the stance of a text is influenced by sentiment information, we learn the sentiment knowledge of the text to increase prediction accuracy. Following Zhou et al. [19], we exploit a sentiment-aware language model (SentiBERT) to extract sentiment knowledge.

The SentiBERT framework includes sentiment masking and several pretraining objectives. We first mask some tokens, including ordinary words, sentiment words, and emoticons. Sentiment words and emoticons are masked with a higher probability than ordinary words to emphasize the sentiment information of the sentence, so that sentiment information can be learned by recovering the masked tokens. The pretraining objectives require the encoder to reconstruct masked sentiment tokens and predict the sentiment rating of the whole sentence.

Specifically, the masked (corrupted) text \(\widehat{x}\) is input to the BERT encoder to obtain the representation \({h}_{i}\) of each word and the final state \({h}_{[CLS]}\) as the sentence representation. A softmax function is applied to \({h}_{i}\) to predict the word probability, sentiment polarity, and emoticon probability of each word separately. The overall sentiment score of the text \(\widehat{x}\) is predicted using a softmax layer on \({h}_{[CLS]}\). All tasks are jointly trained and optimized. SentiBERT performs well in the cross-domain sentiment analysis task after being trained on the Amazon Review dataset and the Yelp 2020 challenge dataset.

Therefore, we adopt a pretrained SentiBERT model and input the given document \(x\) and target \(t\) into the model in the form of “\([CLS]x[SEP]t[SEP]\)” to obtain a hidden vector representation \({h}_{s}\) with sentiment information.

$$\begin{array}{c}\begin{array}{c}{h}_{s}=SentiBERT\left(\left[CLS\right]x\left[SEP\right]t\left[SEP\right]\right)\end{array}\end{array}$$
(1)

SentiBERT can serve as an effective sentiment feature extractor since it has learned sentiment knowledge from large-scale datasets. We fix the parameters of SentiBERT during the training process to keep sentiment information stable.

Moreover, to take advantage of the contextual information, we also adopt a pretrained BERT [13] model to jointly embed document \(x\) and target \(t\) to obtain a hidden vector representation \({h}_{b}\) of each example.

$$\begin{array}{c}\begin{array}{c}{h}_{b}=BERT\left(\left[CLS\right]x\left[SEP\right]t\left[SEP\right]\right)\end{array}\end{array}$$
(2)

Then \({h}_{b}\) and \({h}_{s}\) are concatenated, and the information of both is fused by the cross-attention module. Cross-attention can effectively capture the interdependencies between text and sentiment, facilitating the integration of knowledge and resulting in the generation of more accurate and meaningful features. The hidden state of the [CLS] token is used as the final output \({h}_{a}\):

$$\begin{array}{c}\begin{array}{c}{h}_{a}=CrossAttention\left(\left[{h}_{b},{h}_{s}\right]\right)\left[CLS\right]\end{array}\end{array}$$
(3)

3.3 Stance Contrastive Learning

Contrastive learning allows the feature representation of the anchor to be similar to the positive examples and dissimilar to the negative examples [11, 20]. A superior semantic representation space can be learned from the examples by using the pair-based contrastive loss function. Supervised contrastive learning can bring examples belonging to the same class closely together and push examples of different classes away from each other, effectively improving the quality of feature representation.

To improve the generalization ability of the stance representation, based on the stance label information of the examples, we perform contrastive learning on their hidden vectors. Specifically, given the hidden vectors \(H=\{{{h}_{i}\}}_{i=1}^{{N}_{b}}\) of a batch of examples (where \({N}_{b}\) is the size of the batch), for a specific anchor \({h}_{i}\in H\), if \({h}_{j}\in H\) and \({h}_{i}\) have the same stance label, i.e., \({y}_{j}{=y}_{i}\) (where \({y}_{j}\) and \({y}_{i}\) are the stance labels of \({h}_{j}\) and \({h}_{i}\) respectively), then \({h}_{j}\) is considered to be a positive example of \({h}_{i}\), while other examples \({h}_{k}\in H\) are considered to be negative examples of \({h}_{i}\). The final contrastive loss is calculated over all positive pairs, including (\({h}_{i}\), \({h}_{j}\)) and (\({h}_{j}\), \({h}_{i}\)) in a batch:

$$L_{c}=\frac{1}{N_{b}}\sum\limits_{h_{i}\in H}l\left(h_{i}\right)$$
(4)
$$l\left(h_{i}\right)=-\log\frac{\sum_{j=1}^{N_{b}}1_{[j\ne i]}1_{[y_{i}=y_{j}]}\exp\left(sim\left(\boldsymbol{h}_{i},\boldsymbol{h}_{j}\right)/\tau\right)}{\sum_{k=1}^{N_{b}}1_{[k\ne i]}\exp\left(sim\left(\boldsymbol{h}_{i},\boldsymbol{h}_{k}\right)/\tau\right)}$$
(5)
$$sim\left(\boldsymbol{m},\boldsymbol{n}\right)=\frac{\boldsymbol{m}^{\mathrm{T}}\boldsymbol{n}}{\left\|\boldsymbol{m}\right\|\left\|\boldsymbol{n}\right\|}$$
(6)

where \(1_{[\cdot]}\in \{0,1\}\) is an indicator function that evaluates to 1 iff the condition in brackets holds (e.g., \(1_{[j\ne i]}=1\) iff \(j\ne i\)), \(sim(\boldsymbol{m},\boldsymbol{n})\) is the cosine similarity of vectors \(\boldsymbol{m}\) and \(\boldsymbol{n}\), and \(\tau \) denotes a temperature parameter.
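The stance contrastive loss can be illustrated with a small numerical sketch. The code below is not the authors' implementation; it is a plain-Python transcription of the batch loss and cosine similarity defined above, where the function names and the handling of anchors without any positive example in the batch (skipped here) are assumptions made for illustration.

```python
import math

def cosine(m, n):
    """Cosine similarity sim(m, n) = m^T n / (||m|| ||n||)."""
    dot = sum(a * b for a, b in zip(m, n))
    norm_m = math.sqrt(sum(a * a for a in m))
    norm_n = math.sqrt(sum(b * b for b in n))
    return dot / (norm_m * norm_n)

def stance_contrastive_loss(hidden, labels, tau=0.07):
    """Batch-averaged contrastive loss over all anchors h_i."""
    n_b = len(hidden)
    total = 0.0
    for i in range(n_b):
        # Numerator: examples j != i that share the anchor's stance label.
        pos = sum(math.exp(cosine(hidden[i], hidden[j]) / tau)
                  for j in range(n_b)
                  if j != i and labels[j] == labels[i])
        # Denominator: every other example in the batch.
        denom = sum(math.exp(cosine(hidden[i], hidden[k]) / tau)
                    for k in range(n_b) if k != i)
        if pos == 0.0:
            # Assumption: an anchor with no positive in the batch is skipped,
            # because log(0) is undefined; the text does not cover this case.
            continue
        total += -math.log(pos / denom)
    return total / n_b

# Toy 2-d hidden vectors forming two tight clusters.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = stance_contrastive_loss(vectors, [0, 0, 1, 1])
loss_mixed = stance_contrastive_loss(vectors, [0, 1, 0, 1])
```

When the labels agree with the clusters, the loss is close to zero; when same-label examples lie far apart, it is large. This is the signal that pulls same-stance representations together during pretraining.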

3.4 Training

Since the source and destination data come from distinct targets (domains), a model trained on the source data performs poorly when applied directly to the destination data because of domain bias. To achieve effective domain transfer, predictions must be based on features from which the training (source) and testing (destination) domains cannot be told apart. Therefore, we employ an adversarial learning-based domain adaptation approach to learn domain-invariant information.

In the domain adaptation task, we have obtained the labeled source data \({D}_{s}={\{({x}_{s}^{i},{t}_{s}^{i},{y}_{s}^{i})\}}_{i=1}^{{N}_{s}}\) and the unlabeled destination data \({D}_{d}={\{({x}_{d}^{i},{t}_{d}^{i})\}}_{i=1}^{{N}_{d}}\), and they have identical label distribution spaces. We regard \({E}_{s}\) as the source encoder function and \({E}_{d}\) as the destination encoder function, and they map the input data \(d\) (including text \(x\) and target \(t\)) to the encoder output \(h\). \(C\) denotes the classifier function that converts the encoder output to the stance category. \(D\) denotes the discriminator function that converts the encoder output to the domain category (source or destination). We wish to learn a destination encoder \({E}_{d}\) and a destination classifier \({C}_{d}\) that can accurately predict the stance class of the destination examples in the absence of labels. As a result, we reduce the distance between the source and destination data representations through adversarial training. In this case, it can be considered that the source and destination domains have identical distributions in the mapped space. Then, the source classifier \({C}_{s}\) can be applied directly for stance detection on the destination data without learning a separate destination classifier. So we set \(C={C}_{s}={C}_{d}\).

3.4.1 Pretraining

We train the source encoder with sentiment information \({E}_{s}\) and the classifier \(C\) on the text, target, and label triples \(({x}_{s},{t}_{s},{y}_{s})\), with \({y}_{s}\in \{0,1,2\}\), from the source dataset \({D}_{s}\) in a supervised manner. Furthermore, we enable the encoder to learn a better class representation by minimizing the stance contrastive loss \({L}_{c}\) (see Eq. 4) and improve the performance of the classifier by minimizing the standard cross-entropy loss \({L}_{cls}\). The final loss is the sum of the two losses:

$$\min_{E_{s},C}L_{cls}=-E_{\left(x_{s},t_{s},y_{s}\right)\sim D_{s}}\sum\limits_{k=1}^{K}1_{\left[k=y_{s}\right]}\log\left(C\left(E_{s}\left(x_{s},t_{s}\right)\right)\right)$$
(7)
$$\begin{array}{c}\begin{array}{c}{L}_{all}={L}_{cls}+{L}_{c}\end{array}\end{array}$$
(8)

where \(k\) is the specific category and \(K\) is the number of categories. \({1}_{\left[k={y}_{s}\right]}\in \{0,1\}\) is an indicator function that evaluates to 1 iff \(k={y}_{s}\). The parameters of the source encoder and source classifier are fixed at the end of pretraining.
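As a minimal sketch of the pretraining objective, the code below computes the cross-entropy term on a toy batch and adds a contrastive term to obtain the total loss. It assumes that the classifier emits unnormalized scores (logits) that a softmax turns into class probabilities; the function names and the toy numbers are illustrative and do not come from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(batch_logits, batch_labels):
    """L_cls: batch mean of -log p[y], where p is the softmax of the logits."""
    losses = [-math.log(softmax(logits)[y])
              for logits, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)

def pretraining_loss(batch_logits, batch_labels, contrastive_loss):
    """L_all = L_cls + L_c, with L_c computed elsewhere on the same batch."""
    return cross_entropy(batch_logits, batch_labels) + contrastive_loss

# Toy batch with K = 3 stance classes (labels 0, 1, 2).
toy_logits = [[2.0, 0.1, -1.0], [0.0, 0.0, 0.0]]
toy_labels = [0, 1]
l_all = pretraining_loss(toy_logits, toy_labels, contrastive_loss=0.5)
```

The indicator in the cross-entropy formula simply selects the log-probability of the gold class, which is why the code indexes the softmax output by the label.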

3.4.2 Adversarial Adaptation with Distillation

We initialize the destination encoder \({E}_{d}\) with the parameters of the pretrained source encoder. We fix the source encoder during adversarial training and use it as a reference so that the destination representation matches the source distribution as closely as possible. The destination encoder is trained by optimizing the following loss \({L}_{{adv}_{E}}\):

$$\min_{E_{d}}L_{adv_{E}}=-E_{\left(x_{d},t_{d}\right)\sim D_{d}}\log\left(D\left(E_{d}\left(x_{d},t_{d}\right)\right)\right)$$
(9)

The domain discriminator \(D\) is designed to distinguish whether a feature representation originates from the source or the destination domain. \(D\) is optimized according to the standard supervised loss \({L}_{{adv}_{D}}\), where the labels indicate the domain of origin.

$$\min_{D}L_{adv_{D}}=-E_{\left(x_{s},t_{s}\right)\sim D_{s}}\log\left(D\left(E_{s}\left(x_{s},t_{s}\right)\right)\right)-E_{\left(x_{d},t_{d}\right)\sim D_{d}}\log\left(1-D\left(E_{d}\left(x_{d},t_{d}\right)\right)\right)$$
(10)
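The two adversarial objectives can be sketched numerically as follows. This is an illustration rather than the authors' code: it assumes the discriminator outputs, for each example, the probability that the example comes from the source domain, and the function names and the small clamping constant are hypothetical.

```python
import math

def _clamp(p, eps=1e-12):
    """Keep probabilities away from 0 and 1 so that log() stays finite."""
    return min(max(p, eps), 1.0 - eps)

def discriminator_loss(d_source, d_dest):
    """L_adv_D: source features should score 1, destination features 0.

    d_source: D(E_s(x_s, t_s)) for a batch of source examples.
    d_dest:   D(E_d(x_d, t_d)) for a batch of destination examples.
    """
    src = -sum(math.log(_clamp(p)) for p in d_source) / len(d_source)
    dst = -sum(math.log(1.0 - _clamp(p)) for p in d_dest) / len(d_dest)
    return src + dst

def encoder_adversarial_loss(d_dest):
    """L_adv_E: the destination encoder is rewarded when D mistakes
    destination features for source features (inverted labels)."""
    return -sum(math.log(_clamp(p)) for p in d_dest) / len(d_dest)

# A discriminator that separates the domains well has a low L_adv_D,
# which at the same time means a high L_adv_E for the destination encoder.
good_d = discriminator_loss([0.9, 0.8], [0.1, 0.2])
chance_d = discriminator_loss([0.5, 0.5], [0.5, 0.5])
```

The opposing directions of the two losses are what make the procedure adversarial: lowering one raises the other, and training alternates between them.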

The destination encoder does not share weights with the source encoder, which gives it more flexibility to learn features of the destination domain, while initializing it from the source encoder helps it avoid degenerate solutions. However, as the encoder adapts to new domains during training, the previously learned domain features are gradually forgotten, and the encoder overfits the destination data. Because class labels are inaccessible and the adversarial objective differs from the original task, classification performance can degrade to near-random levels [21].

To improve the stability of adversarial training and prevent mode collapse, we employ a regularization approach to mitigate catastrophic forgetting. Knowledge distillation allows the model to adapt flexibly during adversarial training while retaining class information at high values of the temperature \(t\) [21] (here \(t\) denotes the distillation temperature, not a target). The knowledge distillation loss is as follows:

$$L_{kd}=-t^{2}\times E_{\left(x_{s},t_{s}\right)\sim D_{s}}\sum\limits_{k=1}^{K}softmax\left(f_{k}^{s}/t\right)\times \log\left(softmax\left(f_{k}^{d}/t\right)\right)$$
(11)

where \({f}^{s}=C({E}_{s}({x}_{s},{t}_{s}))\) and \({f}^{d}=C({E}_{d}({x}_{s},{t}_{s}))\). We feed the source data through each encoder and then the classifier to obtain the stance outputs and normalize them with the temperature-scaled softmax function. Thus, the loss function for training the destination encoder is:

$$L=\alpha L_{adv_{E}}+\beta L_{kd}$$
(12)

where \(\alpha \) and \(\beta \) are tuning hyperparameters. The model minimizes the distance between the source and destination representations by alternately updating the destination encoder and the discriminator. We conduct adversarial adaptation by training the destination encoder so that the discriminator cannot accurately predict the domain labels of the source and destination examples from their feature representations.
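The distillation term and the combined destination-encoder loss can be sketched as below. The sketch assumes that \(f^{s}\) and \(f^{d}\) are classifier logits for the same source examples and uses the distillation temperature of 20 reported in the training settings; the default values of `alpha` and `beta` are placeholders, since the text does not report them, and all function names are illustrative.

```python
import math

def softmax_t(logits, t):
    """Temperature-scaled softmax: softmax(f / t)."""
    scaled = [z / t for z in logits]
    shift = max(scaled)
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(source_logits, dest_logits, t=20.0):
    """L_kd: t^2 times the batch-mean cross-entropy between the softened
    source-encoder outputs (teacher) and destination-encoder outputs."""
    total = 0.0
    for f_s, f_d in zip(source_logits, dest_logits):
        p_s = softmax_t(f_s, t)
        p_d = softmax_t(f_d, t)
        total += -sum(ps * math.log(pd) for ps, pd in zip(p_s, p_d))
    return t * t * total / len(source_logits)

def destination_encoder_loss(l_adv_e, l_kd, alpha=1.0, beta=1.0):
    """L = alpha * L_adv_E + beta * L_kd (alpha, beta: placeholder defaults)."""
    return alpha * l_adv_e + beta * l_kd

# The loss is smallest when the destination encoder reproduces the
# source encoder's predictions on the source data.
kd_same = distillation_loss([[2.0, 0.0, -1.0]], [[2.0, 0.0, -1.0]])
kd_diff = distillation_loss([[2.0, 0.0, -1.0]], [[-1.0, 0.0, 2.0]])
```

Because the distillation term is computed on source examples only, it penalizes drift away from what the frozen source encoder and classifier already predict, which is how it counteracts forgetting while the adversarial term pulls the encoder toward the destination domain.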

3.5 Testing

We utilize the destination encoder obtained after adversarial domain adaptation and the classifier with fixed parameters to predict the stance of the destination examples.

$$\begin{array}{c}\begin{array}{c}\widehat{{y}_{d}}=\mathit{arg}\mathit{max}\left(C\left({E}_{d}\left({x}_{d}{,t}_{d}\right)\right)\right)\end{array}\end{array}$$
(13)

4 Experiment

4.1 Datasets

SEM16 [3] is a Twitter dataset that contains six targets for stance detection, including the Feminist Movement (FM), Legalization of Abortion (LA), Donald Trump (DT), Hillary Clinton (HC), Atheism (A), and Climate Change is a Real Concern (CC). Each text in the dataset contains a stance (favor, against, neutral) for a specific target.

WT-WT [23] is a stance detection dataset in the financial domain. The dataset contains four targets: CVS_AET (CA), CI_ESRX (CE), ANTM_CI (AC), and AET_HUM (AH). Each example carries one of four labels: refute (against), support (favor), comment (neutral), or irrelevant. We remove the texts labeled as irrelevant to ensure consistency with the other dataset.

Following [12], we use the data from one target as the test set and the data from the remaining targets as the training set. Table 2 presents the statistics of the two datasets.
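The leave-one-target-out protocol described above can be sketched as follows. The record layout (dictionaries with `text`, `target`, and `stance` keys) and the example rows are placeholders invented for illustration; they are not taken from SEM16 or WT-WT.

```python
def leave_one_target_out(examples, held_out_target):
    """Use one target as the test set and all remaining targets for training."""
    train = [ex for ex in examples if ex["target"] != held_out_target]
    test = [ex for ex in examples if ex["target"] == held_out_target]
    return train, test

# Toy placeholder examples using SEM16 target abbreviations.
toy = [
    {"text": "placeholder tweet 1", "target": "DT", "stance": "against"},
    {"text": "placeholder tweet 2", "target": "HC", "stance": "favor"},
    {"text": "placeholder tweet 3", "target": "FM", "stance": "neutral"},
    {"text": "placeholder tweet 4", "target": "FM", "stance": "favor"},
]
train_set, test_set = leave_one_target_out(toy, "FM")
```

Repeating the split once per target yields one zero-shot experiment per target, with the stance labels of the held-out target used only for evaluation.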

Table 2 The statistics of the SEM16 and WT-WT datasets

4.2 Experimental Implementation

4.2.1 Training Settings

We employ the pretrained SentiBERT model provided by Zhou et al. [19] and the pretrained uncased BERT as encoders, with a maximum sequence length of 85. The batch size is 32. In the pretraining phase, the source encoder and classifier are trained for 3 epochs using the Adam optimizer [24] with a learning rate of 5e−5, \({\beta }_{1}\) = 0.9 and \({\beta }_{2}\) = 0.999. In the adversarial domain adaptation phase, we also use the unlabeled data from the destination domain to train the destination encoder and discriminator for 3 epochs with a learning rate of 1e−5. The temperature value \(t\) for knowledge distillation is set to 20. To increase the stability of the adversarial training, we apply gradient clipping with a gradient norm of 1.0 to the destination encoder and a clip value of 0.01 to the discriminator [21]. The temperature parameter for the contrastive loss is 0.07.
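The two clipping operations mentioned above can be illustrated with a short sketch. It operates on flat lists of numbers rather than real model tensors, and the function names are hypothetical. The text does not state whether the 0.01 clip value for the discriminator applies to gradients or to weights, so the second function is written generically for either.

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    """Rescale a gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def clip_by_value(values, clip=0.01):
    """Clamp every entry to the interval [-clip, clip]."""
    return [max(-clip, min(clip, v)) for v in values]

# A gradient of norm 5 is rescaled to norm 1; its direction is preserved.
encoder_grads = clip_by_norm([3.0, 4.0], max_norm=1.0)
# Entries outside [-0.01, 0.01] are clamped; small entries are unchanged.
discriminator_values = clip_by_value([0.5, -0.5, 0.005], clip=0.01)
```

Norm clipping bounds the size of an update without changing its direction, whereas value clipping bounds each entry independently and can therefore change the direction.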

4.2.2 Evaluation Metric

For the SEM16 dataset, following [10], we report the \({F}_{avg}\): the average of F1 for favor and against. For the WT-WT dataset, following [23], we report the Macro F1 scores of each target.
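Both metrics can be computed from per-class F1 scores, as in the sketch below. The string labels and the convention of returning an F1 of 0 for a class with no true positives are assumptions made for illustration.

```python
def f1_for_class(gold, pred, cls):
    """F1 score of a single class from gold and predicted label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0  # assumed convention when precision or recall is undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f_avg(gold, pred, favor="favor", against="against"):
    """F_avg: mean of the F1 scores for the favor and against classes."""
    return (f1_for_class(gold, pred, favor)
            + f1_for_class(gold, pred, against)) / 2

def macro_f1(gold, pred, classes):
    """Macro F1: unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)

gold = ["favor", "against", "neutral", "favor"]
pred = ["favor", "against", "favor", "against"]
score_favg = f_avg(gold, pred)
score_macro = macro_f1(gold, pred, ["favor", "against", "neutral"])
```

In \({F}_{avg}\), neutral examples are not scored as a class of their own; they affect the result only through the precision and recall of the favor and against classes.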

4.3 Baselines

To demonstrate the validity of the proposed model, we compare the ADSC with several strong baselines.

  • BiCond [2] A model that utilizes two BiLSTM layers to encode topic and text separately.

  • CrossNet [16] A BiCond-based model that adds a topic-specific self-attention layer.

  • TOAD [10] A BiCond-based model with adversarial learning.

  • BERT [13] A powerful pretrained language model for NLP tasks.

  • BERT-GCN [8] A BERT-based model using GCN for node information aggregation.

  • TGA Net [7] A topic-group attention model.

  • TPDG [14] A GCN-based model that builds target-adaptive pragmatic dependency graphs.

In addition, we designed several variants of the ADSC model to conduct ablation studies to verify the validity of different components.

  1. “w/o \({L}_{c}\)” denotes without stance contrastive learning loss.

  2. “w/o SentiBERT” denotes that SentiBERT is not utilized to extract sentiment information.

  3. “w/o \({L}_{kd}\)” denotes without knowledge distillation loss.

4.4 Main Results

The results of the comparison experiments are shown in Table 3. Our proposed ADSC model achieves competitive and stable performance on most of the targets, which supports the effectiveness of our approach for this task. Specifically, BiCond and CrossNet perform the worst overall, and BERT and BERT-GCN also perform poorly, since they do not account for the test targets being unseen and thus do not learn transferable information. Although TOAD also adopts an adversarial strategy, it is generally inferior to our method. This suggests that our adversarial domain adaptation network with knowledge distillation enhances the stability of adversarial training while ensuring the effective transfer of target knowledge. In contrast to the attention-based model, our method generalizes the stance representation learned from known targets to unseen targets by means of contrastive learning.

Table 3 Experimental results on two datasets

4.5 Ablation Study

We further conduct ablation studies to analyze the impact of the different components of ADSC. As shown in Table 4, removing stance contrastive learning (“w/o \({L}_{c}\)”) significantly decreases the model’s performance. This suggests that supervised contrastive learning during the pretraining phase helps the encoder learn better class representations, improving generalizability. Removing sentiment information (“w/o SentiBERT”) also reduces model performance, implying that the model may learn the latent relationship between sentiment and stance and judge the stance with the help of sentiment information. For example, the model may learn a strong association between positive sentiment words and support stances and a weak association between negative sentiment words and support stances. Removing knowledge distillation (“w/o \({L}_{kd}\)”) degrades performance as well, which indicates that some source information is forgotten during adversarial training. Regularization through knowledge distillation is therefore useful for improving performance.

Table 4 Experimental results of the ablation study

4.6 Analysis of Contrastive Learning

To further analyze the effectiveness of stance contrastive learning in the model, we use t-SNE [25] to visualize the intermediate-layer embeddings. The visualization results without and with contrastive learning are shown in Fig. 2. The representation distributions obtained without contrastive learning overlap considerably, especially for the favor and against stances. This suggests that contrastive learning can separate the representations of different stances and learn a better latent space, further supporting its effectiveness.

Fig. 2 Visualization of intermediate embeddings. The left figure is the visualization with contrastive learning, and the right figure is the visualization without contrastive learning. Purple dots indicate favor examples, yellow dots indicate against examples, and green dots indicate neutral examples

4.7 Analysis of Adversarial Domain Adaptation

To further understand the influence of adversarial domain adaptation on the zero-shot stance detection task, we employ t-SNE to visualize the feature distribution encoded by the destination encoder. Domain invariance is indicated by the degree of overlap between the features of the two domains. As shown in Fig. 3, we use the destination encoder to encode both the source and destination data. Domain adaptation makes the overlap between the domains more prominent. This suggests that adversarial domain adaptation aligns the source and destination feature distributions closely, resulting in target-invariant features.

Fig. 3 Visualization of the distribution of features. The source domain features are represented by 1, and the destination domain features are represented by 0

4.8 Case Study

We conduct a case study to illustrate the validity and perform error analysis. We select three cases from the test data of SEM16 and compare our results to the predictions of BERT and TOAD. Table 5 reports these results.

Table 5 Three cases of the predictions by BERT, TOAD, and our model

In the first case, TOAD with adversarial learning and our model predict the stance correctly, while BERT predicts it incorrectly. This is primarily because BERT does not learn transferable knowledge for unknown targets, whereas adversarial domain adaptation approaches can learn target-invariant information and increase generalization ability. In the second case, only our method makes the correct prediction. This suggests that depending only on contextual information is insufficient and that adding sentiment information strengthens the model's comprehension of sarcastic texts. In the third case, all three methods make incorrect predictions. We speculate that the models do not capture the hidden relationship between “NBC” and “Donald Trump”, and that it is difficult to make correct predictions for sentences that contain implicit ideas or require deeper understanding. Domain knowledge would thus be beneficial to the model. In the future, we will explore the introduction of commonsense knowledge about the destination domain, which may improve the model's generalizability.

5 Conclusion

This paper proposes an adversarial distillation adaptation framework (ADSC) with sentiment contrastive learning to perform zero-shot stance detection. We employ an adversarial discriminative domain adaptation network to transfer stance knowledge from training data to unknown targets, use stance contrastive learning to increase the model's generalizability, introduce sentiment information to aid stance detection, and add knowledge distillation to prevent catastrophic forgetting during training. The results on two benchmark datasets show that our model achieves competitive performance on some unseen targets. In future work, we will introduce some domain knowledge to improve the performance of the stance detection model.