1 Introduction

Argument pair extraction (APE) is a widely researched area in the field of argument mining. It involves identifying arguments from dialogical documents that present two sides of a debate and pairing them accordingly. APE is a composite task that entails discovering arguments from raw text and linking related arguments. There is strong real-world demand for the APE task, and it has been studied for years in applications such as finding paired arguments in online forums [56], persuasive essays [10, 53], and student essays [46]. Recently, several works have applied APE to analyze the peer-review process [5, 15, 22, 29]. Peer-review content, consisting of reviews and rebuttals, often holds significant academic value but can be challenging to comprehend. APE can effectively structure the original document, making it easier to understand. Cheng et al. [15] compiled labeled data from peer reviews and present the APE task as shown in Table 1.

Table 1 An example of review–rebuttal passage

Argument pair extraction from dialogical documents is a difficult task with two challenges. We introduce the challenges and our solutions as follows.

  • Challenge of lacking context: When presenting an argument, individuals often omit the underlying context and provide an abbreviated textual representation. This lack of background information makes it challenging to model the argument’s semantics. For instance, argument pairs with extension or reply relationships may share few common words, making them look very different on the surface. To address this issue, incorporating the context of arguments is crucial for enhancing their semantic understanding. Current works of Bao et al. [3], Cheng et al. [17], and Bao et al. [5] have proposed many methods for the APE task, but they often suffer from inadequate context information. Bao et al. [4] construct a homogeneous graph with sentences as nodes and employ a graph neural network (GNN) model to introduce context into the semantic representation of sentences, but such a graph cannot model detailed argument relations. Glaese et al. [24], Kojima et al. [34], and Ouyang et al. [45] find that popular GPT models often suffer from a forgetting problem when processing lengthy texts, causing the model to lose the long-distance context of an argument. To address the lack of context, we introduce the heterogeneous graph attention network (Heter-GAT) model, a module of HGMN. Firstly, we present a novel graph definition designed for dialogical argumentation. This graph representation encompasses multiple node types, each corresponding to a specific semantic granularity, such as entity nodes, sentence nodes, and topic nodes. Then, the Heter-GAT model is proposed to learn and extract meaningful information from the text graph.

  • Challenge of complex argument structure: The argumentative logic in dialogical documents is complex, with numerous viewpoint transitions, making it difficult to determine whether the core semantics of two arguments align. For example, two arguments may share many keywords yet have vastly different semantics due to logical turns. In such cases, common text matching models can easily misjudge the relationship based on the textual similarity of the arguments. Current works of Cheng et al. [15, 16] on the APE task rely heavily on textual similarity and are prone to misinterpreting the relation between arguments. Cheng et al. [16] enhance the capability of the text matching model by capturing detailed signals between arguments: the authors employ a 2D-GRU model to facilitate more comprehensive interaction of underlying information between the argument texts. Nonetheless, this approach is still unable to capture the complex logical structure of arguments. In our HGMN, a multi-granularity graph matching model is proposed to convert the text matching problem into a graph matching problem. An argument can be mapped to multiple subgraphs of the large heterogeneous graph, each of which records the structural information of the text at a different semantic granularity, e.g., the argument-topic subgraph, the argument-entity subgraph, and the argument-argument subgraph.

Overall, to address the APE task, we propose the heterogeneous graph matching networks (HGMN) model, which consists of the Heter-GAT model and the multi-granularity graph matching model. Our contributions are as follows:

  1.

    In the modeling of semantic expression of arguments, we have designed a heterogeneous graph structure suitable for argumentative texts and have proposed a heterogeneous graph learning model. This better incorporates contextual information and structural information from the text into the modeling of argumentative text. The improvement enriches the fundamental features for the argument extraction task and enhances the APE task’s performance.

  2.

    In the modeling of argument relation, we innovatively transform the text matching problem into a graph matching problem and design a multi-granularity graph matching network. The generated matching signals are able to capture semantic structure information and complex logical correspondences between arguments at multiple levels, thereby improving the accuracy of the APE task.

  3.

    Experimental results show that HGMN significantly outperforms current models on APE tasks. Further analysis of experiments reveals the parameter sensitivity and the effectiveness of the HGMN’s important modules.

2 Related work

2.1 Argumentation on dialogical documents

Argumentation has been a promising research topic in recent years and has many sub-directions. Many studies focus on monological argumentation, i.e., argumentation within a single passage. Swanson et al. [56] and Stab and Gurevych [53] focused on extracting arguments from online forum posts and persuasive essays, respectively, and Persing and Ng [46] extracted arguments from students’ essays. These pioneering works proposed a two-step approach for argument extraction, consisting of argument mining and argument similarity calculation. Recently, there has been a growing body of work on argumentation, including argument mining [51, 58, 59], argumentation structure parsing [1, 35, 42, 54], argument quality scoring [25, 52, 62], argument relation detection [28, 49], and argument generation [29, 50].

Several studies investigate argumentation on dialogical documents. The paper [20] mines arguments in online discussion forums and judges whether a certain chain of arguments can lead to persuasion. Morio and Fujita [40] propose a pointer network to predict arguments’ roles as well as their relations. The paper [11] performs argument mining at multiple levels, including the micro-level and the macro-level, when parsing argument structure in online discussion forums. Yuan et al. [68] construct an argumentation knowledge graph and use a GCN to learn path representations, so as to leverage external knowledge to enhance interactive arguments. Cheng et al. [15] present the new APE task, which is more challenging than earlier argument relation prediction tasks. In the APE task, a document contains two roles, for example, review and rebuttal; each role has multiple arguments, and the goal is to mine the arguments of each role and find the corresponding arguments between the two roles. Cheng et al. [15] also present a joint-training model, which uses a CRF model for argument mining and a classification model for argument matching. The paper [16] explores the APE task further by using a 2D-GRU model to let the sentences of different arguments fully interact at the bottom of the model, which enhances the accuracy of argument matching. Bao et al. [3] improve the APE task by adopting a graph model to import context information into sentences and by using a sequence labeling task instead of a classification task for argument matching. The model in Bao et al. [5] adopts the idea of machine reading comprehension: an argument mining (AM) query is used to identify all arguments in the two documents, and then each identified argument is used as a new query to extract relevant arguments from the other document.

2.2 Graph neural networks on NLP tasks

GNN models are often applied to NLP tasks. By constructing a text graph, context information can be added to sentences to enhance modeling. With the introduction of GNNs, tasks like text classification [48, 66], text summarization [65, 67], and question answering [47, 60] have been refined to a great extent. Recently, several works [13, 37] have applied graph matching models to the text matching task. These works let the node-level information of two text graphs interact with each other to obtain a “cross graph embedding,” which captures the structural matching information of both texts and refines the matching result. In addition, some works have applied GNNs to argumentation tasks: Morio and Fujita [41] use a syntactic graph convolutional network for argument component identification; Huang et al. [30] construct a heterogeneous text graph and apply an attention network to argument persuasiveness prediction; Li and Cheng [36] combine a graph convolutional network (GCN) and the pre-trained model BERT to classify deceptive reviews; and Sun et al. [55] use an enhanced GCN to reason over complex semantic relations among entities in a document.

In our work, we propose heterogeneous graph matching networks (HGMN) on arguments to solve the APE task. The heterogeneous graph attention network is proposed to model argument text with sufficient context information, and the multi-granularity graph matching model is proposed to extract the argument pairs.

3 Task definition

Following current works [3, 15, 17], the target of the APE task is to automatically extract related arguments in pairs from a raw dialogical discourse containing two sides of views; some arguments have no related argument on the other side. The review passage is denoted as \(X^v=(s^{v}_1,s^{v}_2,...,s^{v}_m)\), consisting of m sentences, and the rebuttal passage is denoted as \(X^b=(s^{b}_1,s^{b}_2,...,s^{b}_n)\), consisting of n sentences. APE has two sub-tasks. The first sub-task extracts the arguments from \(X^v\) and \(X^b\), where each argument consists of several consecutive sentences; the resulting argument lists are \(\{a^v_1,a^v_2,...\}\) and \(\{a^b_1,a^b_2,...\}\), respectively. The second sub-task identifies the related argument pairs between \(X^v\) and \(X^b\), i.e., \(P=\{(a^v_1,a^b_1),(a^v_2,a^b_2),...\}\).
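To make the input and output of the two sub-tasks concrete, the following minimal sketch shows one possible data representation; the sentences and span indices are hypothetical and do not come from the dataset.

```python
# Hypothetical illustration of the APE task's input/output structure.
# Sentences and (start, end) span indices are invented for illustration only.

review_passage = [                       # X^v = (s^v_1, ..., s^v_m)
    "The experiments lack a comparison with recent baselines.",    # s^v_1
    "In particular, ESNs are never evaluated.",                     # s^v_2
    "The writing is clear overall.",                                # s^v_3
]
rebuttal_passage = [                     # X^b = (s^b_1, ..., s^b_n)
    "Thanks for the comments.",                                     # s^b_1
    "We added a comparison with Echo State Networks in Table 3.",   # s^b_2
]

# Sub-task 1 (argument mining): arguments are spans of consecutive sentences,
# given here as inclusive 0-based (start, end) indices.
review_arguments = [(0, 1)]      # a^v_1 covers the first two review sentences
rebuttal_arguments = [(1, 1)]    # a^b_1 covers the second rebuttal sentence

# Sub-task 2 (argument pairing): related arguments across the two sides.
argument_pairs = [((0, 1), (1, 1))]      # P = {(a^v_1, a^b_1)}
```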

4 Model

In learning text node embeddings (Sect. 4.1), we utilize the heterogeneous graph attention network to compute a robust latent representation of the text, which incorporates contextual information into the modeling. The input of this module is the passage text, and the output is the text node embeddings. We design a global text graph to encode long-distance relationships and a local dependency graph to capture syntactic structure. Then, a heterogeneous graph neural network model (Heter-GAT) is proposed to learn the information contained within the text graph (Fig. 1).

Furthermore, the argument mining (Sect. 4.2) utilizes the sentence node’s latent representation as input and trains a conditional random field (CRF) model to label whether a sentence is an argument.

Finally, the argument matching (Sect. 4.3) process takes the extracted arguments as input and produces the matched argument pairs as output. We propose a multi-granularity graph matching model to do the argument matching.

Fig. 1

Overview of our model. a is the framework, which comprises three parts from left to right: learning text node embeddings, argument mining, and argument matching. b is the Heter-GAT model and c is the multi-granularity graph matching model, which are the sub-modules of HGMN

4.1 Learning text node embedding


Argument text presents a significant challenge to natural language processing (NLP) models due to its complex representation. Existing approaches [4, 15, 16] have limited ability to capture contextual information. We employ (1) global text graph construction, (2) a local sentence dependency graph, and (3) a heterogeneous graph attention network to capture the contextual and syntactic information of sentences. Note that with the raw text as the input to Big Bird [69], the text embedding Emb is the CLS token embedding from the Big Bird output layer.

With the help of the text graphs, our GNN model can capture contextual information when modeling long text. We provide an example in Fig. 2 to illustrate how long-distance contextual information is involved in our model. For example, sentence A is far away from sentences B and C in the passage, which makes it difficult for a text model to find their relations. When building the graph, sentences A, B, and C share the same topic, so we can connect them in the text graph; sentences B and C then become two-hop neighbors of sentence A. Graph neural networks, such as GCN, GAT, and GraphSAGE, are able to aggregate multi-hop neighbor information into sentence A. As a result, when modeling the semantics of sentence A, we can refer to the semantics of the long-distance sentences B and C.

Fig. 2

Illustration of the principle of capturing the long-distance contextual information in HGMN

4.1.1 Global text graph

The graph structure is illustrated in Fig. 3(1). Its nodes can connect with each other across the passage and link to external knowledge bases such as Wikipedia.

  • Node definition: A node is a fundamental unit in a graph structure, and the node types in this study include sentence s, entity \(\epsilon\), and topic \(\tau\). The entity nodes \(\epsilon\) are formed by converting sentence words into 1-gram/2-gram phrases and looking the phrases up in WordNet and the Wikipedia API TagMe [21]. Additionally, their one-hop synonym neighbors are added to the graph as entity nodes. The topic nodes \(\tau\) are extracted with the Latent Dirichlet Allocation (LDA) [27] model, which identifies the topic words of each paragraph.

  • Edge definition: For a sentence-entity edge \(e_{s\epsilon }\) and a sentence-topic edge \(e_{s\tau }\), an edge is created when the entity or topic word is contained within the sentence, and the edge weight is a weighted sum of the text embedding similarity and the word’s TF-IDF score. The sentence-entity and sentence-topic edge weights are as follows:

    $$\begin{aligned} w^e_{s\epsilon }=\, & {} \alpha \cdot {\text {cos}}(Emb_s,Emb_\epsilon ) + \beta \cdot tfidf(\epsilon ), \end{aligned}$$
    (1)
    $$\begin{aligned} w^e_{s\tau }=\, & {} \alpha \cdot {\text {cos}}(Emb_s,Emb_\tau ) + \beta \cdot tfidf(\tau ), \end{aligned}$$
    (2)

    where \(w^e_{s\epsilon }\) is the edge weight of \(e_{s\epsilon }\), \(w^e_{s\tau }\) is the edge weight of \(e_{s\tau }\), and \(\alpha\) and \(\beta\) are hyperparameters. For a topic-topic edge \(e_{\tau \tau }\) or an entity-entity edge \(e_{\epsilon \epsilon }\), an edge is built when the two words have a relation in the Wikipedia knowledge graph, with the embedding similarity as its weight. For a sentence-sentence edge \(e_{ss}\), the criterion is that the sentences are in the same paragraph or have an embedding similarity higher than \(\gamma\); the edge weight is the sentences’ embedding cosine similarity. A small sketch of the sentence-entity/topic edge weighting is given below.
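As an illustration, the edge weighting of Eqs. (1)–(2) can be computed as follows. This is a minimal sketch that assumes the sentence and node embeddings have already been produced by the Big Bird encoder and the TF-IDF scores have been precomputed; the function name and tensor sizes are illustrative, not the paper’s implementation.

```python
import torch
import torch.nn.functional as F

def sentence_node_edge_weight(emb_s: torch.Tensor, emb_node: torch.Tensor,
                              tfidf_node: float,
                              alpha: float = 0.8, beta: float = 0.2) -> float:
    """Weight of a sentence-entity or sentence-topic edge (Eqs. 1-2):
    a weighted sum of embedding cosine similarity and the node word's TF-IDF score."""
    cos = F.cosine_similarity(emb_s.unsqueeze(0), emb_node.unsqueeze(0)).item()
    return alpha * cos + beta * tfidf_node

# Toy usage: random vectors stand in for Big Bird [CLS] embeddings.
w = sentence_node_edge_weight(torch.randn(768), torch.randn(768), tfidf_node=0.31)
```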

Fig. 3

Illustration of the two modules of text graph: (1) global text graph and (2) local dependency graph

4.1.2 Local dependency graph

The local dependency graph is employed to structurally record the syntax information of each sentence, as illustrated in Fig. 3(2). This graph can model the complex logic in arguments. A dependency analysis tool [12] is adopted to extract the dependency tree of each sentence, and the graph is constructed from the dependency tree. Each word in the dependency tree is a node, and a dependency relation forms an edge between words \(w_1\) and \(w_2\) with the weight \(w^e_{w_1w_2}\) defined as follows:

$$\begin{aligned} w^e_{w_1w_2}= \frac{tfidf(w_1)+ tfidf(w_2)}{2}, \end{aligned}$$
(3)
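A small sketch of the local dependency graph construction is given below, assuming spaCy as the dependency parser and networkx for the graph; the paper’s dependency tool [12] may differ, and the TF-IDF lookup table here is a stand-in.

```python
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")   # any dependency parser can be substituted

def dependency_graph(sentence: str, tfidf: dict) -> nx.Graph:
    """Words are nodes, dependency arcs are edges, and each edge weight follows
    Eq. (3): the mean TF-IDF score of the two connected words."""
    g = nx.Graph()
    for token in nlp(sentence):
        if token.dep_ == "ROOT":
            continue
        w1, w2 = token.text.lower(), token.head.text.lower()
        weight = (tfidf.get(w1, 0.0) + tfidf.get(w2, 0.0)) / 2.0
        g.add_edge(w1, w2, weight=weight, dep=token.dep_)
    return g

# Toy usage with a hypothetical TF-IDF table.
g = dependency_graph("The ablation study omits the entity subgraph.",
                     tfidf={"ablation": 0.42, "study": 0.18, "subgraph": 0.35})
```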

4.1.3 Heterogeneous graph attention network (Heter-GAT)

Learning a good latent embedding from a heterogeneous graph is challenging. To address this, we propose Heter-GAT to train the node embeddings on the text graph. The model diagram is shown in Fig. 4. The heterogeneous text graph in this study contains multiple subgraphs, each with distinct edge meanings. When learning node embeddings on the heterogeneous graph, it is important to consider each type of edge separately in order to preserve the unique information contained within each subgraph. The text graph can be divided into five subgraphs based on the different semantic relations represented by its edge types. Heter-GAT incorporates two loss functions. The first, the graph reconstruction loss, aims to learn the information contained within a single subgraph. The second, the graph align loss, aligns the embedding spaces of nodes across the different subgraphs.

Fig. 4

Model of heter-GAT

  • Graph reconstruction loss aims to learn each subgraph’s representation. Some unsupervised graph learning methods [26] use existing node pairs with links as positive instances and randomly sampled node pairs as negative instances to train the model, while others [32] employ an autoencoder framework in which the encoder embeds the graph and the decoder reconstructs the graph structure. We adopt the random negative sampling method due to its robustness when training subgraphs of various sizes. To enhance the capacity of the neural network, the embedding layer is replaced with a graph attention network (GAT) [61] layer. The latent representation of node i is \(h_i\), and the graph attention (GAT) layer is as follows:

    $$\begin{aligned} z_{ij}= \,& {} \text {LeakyReLU}(W_a[W_q h_i; W_kh_j]),\end{aligned}$$
    (4)
    $$\begin{aligned} atten_{ij}=\, & {} \frac{\text {exp}({z_{ij}}\cdot w_{ij})}{\sum _{l\in e}\text {exp}({z_{il}} \cdot w_{il})},\end{aligned}$$
    (5)
    $$\begin{aligned} \mu _i= \,& {} \sigma \left(\sum _{j\in e}{\text {atten}}_{ij}W_vh_j\right), \end{aligned}$$
    (6)

    where i and j are node indices, \(atten_{ij}\) and \(w_{ij}\) are the attention weight and edge weight, respectively. The edge weight \(w_{ij}\) can prevent overfitting of the GAT model. \(W_a\), \(W_q\), \(W_k\), and \(W_v\) are trainable weights, and \(\mu _i\) is the output latent embedding of node i. Then with edge set E, the graph reconstruction loss \(L_{GC}\) is as follows:

    $$\begin{aligned} L_{GC} = \sum _{(i,j)\in E}{ \left\| \mu _i-\mu _j \right\| ^2_F}, \end{aligned}$$
    (7)
  • Graph align loss aims to co-train the subgraphs through their joint nodes. Since each subgraph is trained separately, their node embedding spaces may become inconsistent, which makes it challenging to generate a comprehensive graph embedding. Moreover, by aligning the embeddings, related subgraphs can supplement each other’s information. For example, sentence \(s_i\) is contained in both subgraph \(g_{s\tau }\) and subgraph \(g_{s\epsilon }\). In \(g_{s\tau }\), the sentence node \(s_i\) has learned the information of its neighboring topic nodes, which can add useful information to \(s_i\)’s neighboring entity nodes \(\epsilon\) in \(g_{s\epsilon }\). Therefore, we propose the align loss to make the joint nodes’ embeddings consistent across subgraphs.

    $$\begin{aligned} L_{GA} = \sum _{(i,j)\in N^{(x\rightarrow y)}}{ \left\| \mu ^{(x)}_i-\mu ^{(y)}_j \right\| ^2_F}, \end{aligned}$$
    (8)

    where \(N^{(x\rightarrow y)}\) is the set of nodes shared by subgraph x and subgraph y. The total loss for learning text node embeddings is obtained by minimizing the two losses jointly; a compact sketch of the attention layer and the two losses follows Eq. (9).

    $$\begin{aligned} L_{Emb} = L_{\text {GC}} + L_{\text {GA}}, \end{aligned}$$
    (9)
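The following condensed PyTorch sketch mirrors Eqs. (4)–(8) for a single subgraph: an edge-weighted attention layer and the two losses. It assumes dense adjacency and edge-weight matrices and omits negative sampling, multi-head attention, and other details of the full Heter-GAT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeWeightedGATLayer(nn.Module):
    """One attention layer over a single subgraph (Eqs. 4-6).
    `adj` is the 0/1 adjacency matrix and `w` holds the edge weights w_ij."""
    def __init__(self, dim: int):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)
        self.W_a = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, h, adj, w):
        n = h.size(0)
        q, k = self.W_q(h), self.W_k(h)
        pair = torch.cat([q.unsqueeze(1).expand(n, n, -1),
                          k.unsqueeze(0).expand(n, n, -1)], dim=-1)
        z = F.leaky_relu(self.W_a(pair).squeeze(-1))       # Eq. (4)
        z = z * w                                          # scale by edge weight w_ij
        z = z.masked_fill(adj == 0, -1e9)                  # restrict attention to neighbors
        atten = torch.softmax(z, dim=-1)                   # Eq. (5)
        return torch.sigmoid(atten @ self.W_v(h))          # Eq. (6), with sigma = sigmoid here

def graph_reconstruction_loss(mu, edges):
    """Eq. (7): pull the embeddings of linked nodes together.
    `edges` is a LongTensor of shape [E, 2]; negative sampling is omitted."""
    i, j = edges[:, 0], edges[:, 1]
    return ((mu[i] - mu[j]) ** 2).sum()

def graph_align_loss(mu_x, mu_y, shared):
    """Eq. (8): align embeddings of nodes shared by subgraphs x and y.
    `shared[k] = (i, j)` gives the index of the same node in each subgraph."""
    i, j = shared[:, 0], shared[:, 1]
    return ((mu_x[i] - mu_y[j]) ** 2).sum()
```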

4.2 Argument mining

After obtaining sentence embeddings from Sect. 4.1, the sequence labeling model CRF is adopted to mine the arguments from each side, since an argument always consists of consecutive sentences. The CRF model labels each sentence as begin/inside/end/single/outside (BIESO) of an argument. In the CRF model, the probability of predicting the sentence sequence s as the label sequence y is:

$$\begin{aligned} p(y\mid s) = \frac{\text {exp}({\text {score}}(s,y))}{\sum _{y'}\text {exp}({\text {score}}(s,y'))}. \end{aligned}$$
(10)

where the score(s, y) is defined as a linear function in traditional CRF models. As shown in the following equation, it is calculated by the sum of transition scores along the label sequence y and the scores from the neural networks:

$$\begin{aligned} {\text {score}}(s,y)=\sum ^{n}_{i=0}A_{y_i,y_{i+1}}+\sum ^{n}_{i=1}F_{\theta _1}(s,y_i). \end{aligned}$$
(11)

where \(A_{y_i,y_{i+1}}\) represents the transition parameter between two labels, and \(F_{\theta _1}(s,y_i)\) indicates the score of \(y_i\) obtained from the neural network encoder parameterized by \(\theta _1\). \(y_0\) and \(y_{n+1}\) represent the “START” and “END” labels, respectively. We aim to minimize the negative log-likelihood loss \(L_{AM}\) for our dataset \(D_1\).

$$\begin{aligned} L_{AM}(D_1)=-\sum _{(s,y)\in D_1}\text {log}(p(y\mid s)), \end{aligned}$$
(12)

where \(D_1\) is the dataset of sentence tags and the candidates for y are the BIESO labels. Finally, we decode with the Viterbi algorithm to obtain the most likely tag sequence.

$$\begin{aligned} y^{*}=\text {argmax}_{y\in Y}\,p(y\mid s). \end{aligned}$$
(13)
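A self-contained sketch of the sentence-level CRF in Eqs. (10)–(13) is shown below, written in plain PyTorch for clarity; START/END transitions and batching are omitted, and the emission scores would in practice come from a linear layer over the sentence node embeddings of Sect. 4.1.

```python
import torch
import torch.nn as nn

class SentenceCRF(nn.Module):
    """Minimal linear-chain CRF over sentence-level BIESO tags (Eqs. 10-13)."""
    def __init__(self, num_tags: int = 5):          # B, I, E, S, O
        super().__init__()
        self.trans = nn.Parameter(torch.randn(num_tags, num_tags))  # A_{y_i, y_{i+1}}

    def _score(self, emissions, tags):
        # score(s, y) = sum_i A_{y_i, y_{i+1}} + sum_i F_theta1(s, y_i)   (Eq. 11)
        s = emissions[torch.arange(len(tags)), tags].sum()
        return s + self.trans[tags[:-1], tags[1:]].sum()

    def _log_partition(self, emissions):
        # log of the sum over all tag sequences, computed with the forward algorithm
        alpha = emissions[0]
        for t in range(1, emissions.size(0)):
            alpha = torch.logsumexp(alpha.unsqueeze(1) + self.trans, dim=0) + emissions[t]
        return torch.logsumexp(alpha, dim=0)

    def nll(self, emissions, tags):
        # L_AM = -log p(y | s)   (Eqs. 10 and 12)
        return self._log_partition(emissions) - self._score(emissions, tags)

    def viterbi(self, emissions):
        # y* = argmax_y p(y | s)   (Eq. 13)
        score, back = emissions[0], []
        for t in range(1, emissions.size(0)):
            total = score.unsqueeze(1) + self.trans        # [prev_tag, next_tag]
            best, idx = total.max(dim=0)
            back.append(idx)
            score = best + emissions[t]
        path = [int(score.argmax())]
        for idx in reversed(back):
            path.append(int(idx[path[-1]]))
        return path[::-1]

# Toy usage: emission scores for 4 sentences over 5 BIESO tags.
crf = SentenceCRF()
emissions = torch.randn(4, 5)          # e.g. linear(sentence_node_embeddings)
loss = crf.nll(emissions, torch.tensor([0, 1, 2, 4]))    # tags: B I E O
tags = crf.viterbi(emissions)
```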

4.3 Argument match

This section proposes the multi-granularity graph matching networks to identify argument relations. Existing works [3, 15, 16, 17] have typically employed text matching models for argument matching, which fail to capture structural logic information. To address this issue, raw text matching is converted into argument text graph matching, and the argument text graphs incorporate structural information to model the logic. Additionally, arguments contain multiple granularities of semantics, such as topics and entities. Capturing the semantic correlations at different granularities helps to better identify the relationships between arguments. In the heterogeneous text graph, different subgraphs correspond to different granularities of arguments. Therefore, we refine the matching to the subgraph level and use the matching between subgraphs to represent the matching of semantics at different granularities. The model diagram is shown in Fig. 5.

Fig. 5

Model of multi-granularity graph matching networks

Firstly, an argument \(a_i\) maps its text to the nodes of the heterogeneous text graph and produces several subgraphs. In constructing the subgraphs, an effort is made to extract the most relevant nodes from the original graph while avoiding the inclusion of irrelevant nodes. Two principles guide the construction of these subgraphs:

  • In the global text graph, the first-order neighbors of the sentences are extracted to produce three types of subgraphs: the sentence-topic graph \(G^{s\tau }\), the sentence-entity graph \(G^{s\epsilon }\), and the sentence-sentence graph \(G^{ss}\). The different granularities of the graph correspond to different semantic levels.

  • In the local sentence dependency graph, the whole subgraph \(G^{ww}\) of the sentence is extracted.

Moreover, when comparing two arguments, we match subgraphs of the same type. Existing graph matching models [13, 37] employ pooling to obtain the embedding representation of each graph and then calculate the similarity between the embeddings of the two subgraphs. However, this approach overlooks the bottom-level interactions between nodes during matching. Inspired by Ling et al. [37] and Tay et al. [57], we adopt a co-attention mechanism, which calculates the similarity between every pair of nodes in the two subgraphs and uses a pooling method to produce the graph-level match score. This approach captures more detailed matching signals. With the node embeddings \(\mu\) from Sect. 4.1, the node similarity matrix M is as follows:

$$\begin{aligned} m_{ij}=F(\mu ^p_i)^\top \cdot F(\mu ^q_j), \end{aligned}$$
(14)

where F is a one-layer feed-forward layer, and p and q are the subgraph indices. For each node i of subgraph p, we use attention pooling of subgraph q to produce its latent embedding \(\overline{\mu ^p_i}\):

$$\begin{aligned} \overline{\mu ^p_i}=\sum _{j=1}^{\Vert q\Vert }\frac{{\text {exp}}(m_{ij})}{\sum _{k=1}^{\Vert q\Vert }{\text {exp}}(m_{ik})}\mu ^q_j, \end{aligned}$$
(15)

The embedding \(\overline{\mu ^q_j}\) can be deduced similarly. The subgraph’s embedding is computed by averaging its node embeddings. We use the match embedding \(U_{G}^{pq}\) to represent the relation between subgraph p and subgraph q, which is obtained by concatenating the graph embeddings and feeding them into the MLP layer.

$$\begin{aligned} \overline{\mu ^p}=\, & {} {\text {mean}}_i(\overline{\mu ^p_i}), \end{aligned}$$
(16)
$$\begin{aligned} \overline{\mu ^q}= \,& {} {\text {mean}}_j(\overline{\mu ^q_j}), \end{aligned}$$
(17)
$$\begin{aligned} U_{G}^{pq}=\, & {} {\text{concat}}\left(\overline{\mu ^p}, \overline{\mu ^q}\right), \end{aligned}$$
(18)

Similarly, the matching embeddings of four subgraphs (Sect. 4.1) can be calculated, which are \(U_G^{ww}\), \(U_G^{s\epsilon }\), \(U_G^{ss}\), and \(U_G^{s\tau }\). The four subgraphs’ matching embeddings are concatenated and fed into a linear layer to make predictions.

$$\begin{aligned} U_{\text{pair}}=\, & {} {\text{concat}}(U_G^{ww},U_G^{s\epsilon },U_G^{ss},U_G^{s\tau }). \end{aligned}$$
(19)
$$\begin{aligned} L_2(D_2)= & {} -\sum _{(U_{\text{pair}},z)\in D_2 }\left[ z\,\text {log}\,p(z=1\mid U_{\text{pair}})+(1-z)\,\text {log}\,p(z=0\mid U_{\text{pair}})\right] . \end{aligned}$$
(20)

where \(L_2(D_2)\) is the loss and \(D_2\) is the labeled dataset of argument pairs. In addition, the labeled data contain far more positive instances than negative ones, which introduces severe training bias. To balance the label ratio, we add negative instances via negative sampling: for each review argument, we randomly select k rebuttal arguments and add them to dataset \(D_2\) as negatives. After minimizing the loss function, the predicted relevance of each pair is \(p(z\mid U_{\text{pair}})\). When predicting the related argument of a given argument, we select the argument on the other side with the highest matching score, as in Eq. (21).

$$\begin{aligned} z^{*}=\text {argmax}_{z\in \{0,1\}}\,p(z\mid U_{\text{pair}}). \end{aligned}$$
(21)
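A compact sketch of the co-attention subgraph matching in Eqs. (14)–(20) is given below. The subgraph type keys and the final sigmoid classifier are illustrative simplifications; the real model matches the \(G^{ww}\), \(G^{s\epsilon }\), \(G^{ss}\), and \(G^{s\tau }\) subgraph pairs described above and trains with the sampled negatives of Eq. (20).

```python
import torch
import torch.nn as nn

class SubgraphCoAttentionMatch(nn.Module):
    """Co-attention matching of two same-type subgraphs (Eqs. 14-18)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)                    # F(.) in Eq. (14)

    def forward(self, mu_p, mu_q):                         # [n_p, d], [n_q, d]
        m = self.proj(mu_p) @ self.proj(mu_q).t()          # Eq. (14): node similarity matrix
        attn_p = torch.softmax(m, dim=1) @ mu_q            # Eq. (15): attend over q for each p-node
        attn_q = torch.softmax(m.t(), dim=1) @ mu_p        # symmetric direction
        # Eqs. (16-18): mean-pool each side and concatenate into the match embedding
        return torch.cat([attn_p.mean(dim=0), attn_q.mean(dim=0)])

class MultiGranularityMatcher(nn.Module):
    """Concatenate the four subgraph match embeddings and classify the pair (Eqs. 19-20)."""
    def __init__(self, dim: int):
        super().__init__()
        self.types = ["ww", "s_ent", "ss", "s_topic"]      # placeholder subgraph keys
        self.matchers = nn.ModuleDict({k: SubgraphCoAttentionMatch(dim) for k in self.types})
        self.clf = nn.Linear(8 * dim, 1)

    def forward(self, subs_a, subs_b):
        # subs_a/subs_b map a subgraph type to the node embeddings of that argument
        u_pair = torch.cat([self.matchers[k](subs_a[k], subs_b[k]) for k in self.types])
        return torch.sigmoid(self.clf(u_pair))             # p(z=1 | U_pair), trained with BCE (Eq. 20)

# Toy usage with random node embeddings (dim=64, varying node counts per subgraph).
dim = 64
subs_a = {k: torch.randn(5, dim) for k in ["ww", "s_ent", "ss", "s_topic"]}
subs_b = {k: torch.randn(7, dim) for k in ["ww", "s_ent", "ss", "s_topic"]}
score = MultiGranularityMatcher(dim)(subs_a, subs_b)
```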

By designing the multi-granularity graph matching networks and integrating them into the framework, our HGMN model is able to better capture the structural information and the semantic correlations at different granularities, leading to improved performance in identifying argument relations.

5 Experiments

We address the following questions in the experiments.

  • Q1: Does the HGMN model outperform existing methods?

  • Q2: How can the GPT-3.5 models be applied to the APE task, and does the HGMN model outperform the GPT-3.5 models and the GPT4 model?

  • Q3: Are the key modules in the HGMN model designed appropriately?

  • Q4: What is the contribution of each HGMN model module to performance?

  • Q5: How do the hyperparameters affect the HGMN model’s performance?

  • Q6: Does the model have interpretability?

  • Q7: How can the good cases and bad cases of HGMN be explained?

5.1 Experiment setup

5.1.1 Experimental dataset

The argument pair extraction experiments are conducted on the Review–Rebuttal dataset, which was first proposed by Cheng et al. [15] and labeled from papers’ peer-review processes. The dataset contains 4764 ICLR review–rebuttal passage pairs from 2013 to 2020. The paper [16] refines the dataset into two versions, RR-submission-v2 and RR-passage (Table 2). The average word count of an input sample is 743.5, and the median word count is 829. In addition, we calculate the GPT token count distribution following the method of OpenAI. The samples in our dataset have a median token count of 1093 and an average token count of 1217. The difference between the two versions is that in RR-submission-v2, a paper’s peer review is present in only one of the train, test, or validation sets, whereas RR-passage has a few papers whose data appear in both the training and test sets. Although the RR-submission-v2 dataset is more reasonable because there is no information leakage from the training set to the test set, RR-passage is also practical: responses from different reviewers do not arrive simultaneously, so early responses can be used by models to judge later ones, which is what the RR-passage dataset simulates. Our experiments are conducted on both datasets.

Table 2 Statistics of Review–Rebuttal (RR) dataset

5.1.2 Implementation details

Our work is implemented in PyTorch, and all models are run on V100 GPUs. We use the Adam optimizer [31] with an initial learning rate of 0.01 and update parameters with a batch size of 10. In the edge weight formula, we set the embedding similarity’s weight higher than the TF-IDF score’s weight, i.e., \(\alpha = 0.8\) and \(\beta = 0.2\). Note that the sentence pairing evaluation in Cheng et al. [15] is not used because argument pairs are extracted directly.
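For reference, a minimal sketch of this training configuration is shown below; the stand-in model is only a placeholder for HGMN’s parameters.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 5)                                    # stand-in for HGMN's parameters
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)    # Adam, initial learning rate 0.01
batch_size = 10
alpha, beta = 0.8, 0.2                                       # edge-weight hyperparameters of Eqs. (1)-(2)
```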

5.1.3 Evaluation metrics

Following the evaluation metrics of Cheng et al. [15], Bao et al. [3], Cheng et al. [16], and Bao et al. [5], we use the F1 score, precision, and recall as the main indicators for performance comparison. Argument mining (AM) and argument pair extraction (APE) are our evaluation tasks. Each model is run five times to get the average score on the test set. The precision, recall, and F1 scores are computed as follows:

$$\begin{aligned} P= & {} \frac{\text {TP}}{{\text {TP}}+{\text {FP}}}, \end{aligned}$$
(22)
$$\begin{aligned} R= & {} \frac{\text {TP}}{{\text {TP}}+{\text {FN}}}, \end{aligned}$$
(23)
$$\begin{aligned} F1= & {} \frac{2\times P \times R}{P+R}, \end{aligned}$$
(24)

where TP is the true-positive count, FP is the false-positive count, and FN is the false-negative count. Our main evaluation target is the argument pair extraction task. For each review argument, the rebuttal argument with the highest score is selected to form the extracted argument pair. In addition, we also evaluate the intermediate task of AM, which helps to analyze the performance of the model.
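The pair-level scores of Eqs. (22)–(24) can be computed with a few lines; the span-pair representation below is hypothetical.

```python
def pair_precision_recall_f1(predicted: set, gold: set):
    """Precision, recall, and F1 over extracted argument pairs (Eqs. 22-24)."""
    tp = len(predicted & gold)                     # true positives
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy usage with hypothetical (review_span, rebuttal_span) pairs.
pred = {((0, 1), (1, 1)), ((3, 4), (5, 6))}
gold = {((0, 1), (1, 1))}
print(pair_precision_recall_f1(pred, gold))        # (0.5, 1.0, 0.666...)
```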

5.2 Models for comparison

In this study, we employ three approaches for comparative analysis. Firstly, we compare our model with current works to ascertain its strengths and limitations. Secondly, we compare our model with the GPT-3.5 models and the GPT4 model. Finally, we vary the sub-modules of the HGMN model to determine the plausibility of each sub-module. In addition, we calculate the model sizes and compare them in Appendix A.

5.2.1 Comparison with existing works

This section aims to address question Q1.

  • PL-H-LSTM-CRF [15]: By training the sequential labeling model and the sentence relation model, the arguments’ relation is predicted in the pipeline.

  • Two-Step [3]: A variation of the PL-H-LSTM-CRF model, where the sentences are first put into arguments and then the relation between arguments is calculated, instead of using the sentence relation model.

  • MT-H-LSTM-CRF [15]: This model introduces a multi-task framework to improve the sequence labeling task and sentence relation task simultaneously.

  • MLMC [16]: Compared to MT-H-LSTM-CRF [15], MLMC adopts the 2D-GRU model to facilitate comprehensive interaction between the arguments of each side from the bottom of the neural network.

  • MGF [3]: This model formulates a homogeneous graph to enhance text representation, and then employs two sequential labeling tasks: one to determine whether a sentence is an argument, and another to identify the argument-argument relation based on the argument result and the sentence from the opposite side.

  • MRC-APE [5]: The model adopts the thought of machine reading comprehension. A query is used to identify all arguments in two documents, and then each identified argument is used as a new query to extract relevant arguments from another document.

  • MRC-APE-Sep [5]: The difference between this model and MRC-APE is that it trains argument mining and argument pair extraction separately, rather than using joint-training.

  • HGMN(Longformer): To provide a fair comparison, we substitute the Big Bird encoder with Longformer in this model, since MRC-APE uses Longformer as its encoder.

5.2.2 Comparison with GPT-3.5 models and GPT4 model

These experiments aim to answer question Q2. The GPT-3.5-turbo, text-davinci-003, and GPT4 models, developed by OpenAI, have been fine-tuned using Reinforcement Learning from Human Feedback (RLHF). This approach offers improved accuracy and reduced instances of hallucination [24, 45]. These models are believed to have acquired some human knowledge, enabling their application to various tasks with task-specific prompts and minimal examples [8, 34].

We follow current works [2, 19, 39] and design various zero-shot prompts, few-shot prompts, and chain-of-thought prompts [34, 64] for experimentation. In the chain-of-thought prompts, we decompose the APE task into a multi-step chain of thoughts to guide the GPT models. To the best of our knowledge, we are the first to study how to model long argument pair extraction using GPT. Experiment setup for zero-shot prompt: Referring to previous work [7], we find that for large language models (LLMs) with huge parameter counts, fine-tuning on small sample sizes generally does not improve performance; therefore, we adopt a zero-shot learning approach. An elaborately designed prompt is necessary to tackle the complex argument pair extraction (APE) task. After testing many prompts, the three that demonstrated the best performance are listed in Table 12. Experiment setup for few-shot prompt: Few-shot learning, a type of in-context learning, provides examples to guide the model toward improved performance [8]. It has been observed that large language models (LLMs) can improve their performance with a few demonstrations (i.e., few-shot samples). According to [39], there are several important principles to consider when using few-shot learning:

  • The distribution of demonstration text should be consistent with real data.

  • The format should align with the test data.

  • The distribution of labels in demonstrations should follow the distribution in real data, even if some labels are incorrect.

In addition to these principles, we have adopted some prompt techniques from other sources. Specifically, we implement the few-shot strategy by selecting demonstration examples with similar length and label distribution to the target example. Furthermore, we calculate token counts using the official OpenAI cookbook. Our dataset samples have a median token count of 1093 and an average token count of 1217. This approach allows for few-shot learning while maintaining the input length within the GPT models’ input limits. The few-shot prompts for the argument pair extraction task and argument mining task can be found in Table 13.

Experiment setup for chain-of-thought prompt: Upon analyzing the experimental results of the few-shot experiment, we find that GPT models exhibit a bias in understanding the concept of an argument. Without properly grasping the notion of an argument, GPT models struggle to perform well in argument matching tasks. When humans tackle complex argument matching, they typically break the task down into two chained thoughts: argument mining and then matching within the identified arguments. The few-shot experiment requires GPT models to output matching sentences end-to-end, skipping the intermediate step of thinking about what constitutes an argument, which increases the difficulty of judgment. Inspired by the chain-of-thought approach proposed in Wei et al. [64] and Kojima et al. [34], we guide the model to think step by step. Specifically, we adopt a few-shot chain-of-thought prompt. We design two thinking steps: In the first step, the model identifies arguments within the passage, and in the second step, the model refers to the identified arguments to extract and combine sentences for argument pairs. An example is shown in Appendix C, Table 13.
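The exact prompts are given in Tables 12 and 13; the sketch below only illustrates the two-step chain-of-thought structure described above. The prompt wording and the legacy `openai.ChatCompletion` interface are assumptions for illustration, not the paper’s exact setup.

```python
import openai   # legacy openai<1.0 interface assumed for illustration

def build_cot_prompt(review: str, rebuttal: str, demo: str) -> str:
    # Hypothetical two-step chain-of-thought prompt: mine arguments first, then pair them.
    return (
        "You are given a peer review and the author rebuttal.\n"
        "Step 1: list the arguments (spans of consecutive sentences) in each passage.\n"
        "Step 2: using the arguments from Step 1, output the related argument pairs.\n\n"
        f"Example:\n{demo}\n\n"
        f"Review:\n{review}\n\nRebuttal:\n{rebuttal}\n\n"
        "Answer with Step 1 and Step 2."
    )

review_text, rebuttal_text, demo_text = "...", "...", "..."   # placeholders for real passages
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": build_cot_prompt(review_text, rebuttal_text, demo_text)}],
    temperature=0,                                # temperature 0, as in the experiments
)
print(response["choices"][0]["message"]["content"])
```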

We conduct experiments using three different OpenAI language models: text-davinci-003, gpt-3.5-turbo, and GPT4. There are over eight experiment settings and our quota is limited, so we randomly selected one-third of the test set samples (456 samples) for each experiment setting. Due to the unstable nature of generative models, we set the temperature to 0 and discard invalid outputs, such as “I cannot find the argument pairs.” Additionally, the results of the GPT models have all converged, indicating that our experiments have thoroughly tested the capability of the GPT models and that the results are reliable.

5.2.3 Comparison with the variants of HGMN

To answer the question Q3, we compare our method with various HGMN variants by replacing some of HGMN’s modules with state-of-the-art models.

Firstly, after extracting the individual arguments (Sect. 4.2), we use graph matching to perform argument matching (Sect. 4.3). In comparison, existing works typically use pre-trained models to calculate similarity [6, 38, 69]. To evaluate the benefits of HGMN’s argument matching module, we propose a series of self-comparison experiments that compare it with multiple pre-trained text matching models.

  • Argument-BERT: Argument text serves as input for the BERT model [18], with the CLS token used to classify argument relations.

  • Argument-Longformer: In the BERT model, some arguments are truncated due to input length constraints. In contrast, Longformer [6] is suitable for long text. The model used in this paper is longformer-large-4096.

  • Argument-Big Bird: Big Bird [69] is suitable for long text input and has an improved transformer mechanism, which leads to better performance on popular evaluation tasks compared with Longformer. The model used in this work is bigbird-roberta-large, which is warm-started from RoBERTa’s checkpoint [38].

Secondly, argument pair extraction is transformed to the argument’s graph matching in our work. We also propose the multi-granularity graph matching model to do the graph matching. We compare the HGMN model and current graph matching models on the APE task.

  • ArgumentGraph-Graph embedding: Mean pooling on nodes to get the graph embedding, concatenating the two graph embeddings, and feeding them into an MLP layer to output the graph matching results.

  • ArgumentGraph-GMN shorttext: GMN [13] aggregates the node embedding of another graph by an attention unit to get the “cross embedding” and calculates the similarity score between the node’s initial embedding and its cross embedding. The similarity score is concatenated with the node’s embedding representation to formulate the feature of the node. The graph embedding is then calculated by attention pooling all nodes’ features, and the graph match is based on the similarity of the graph embeddings.

  • ArgumentGraph-GMN multilevel [37]: Differing from GMN shorttext [13], GMN multilevel [37] uses BiLSTM to aggregate another graph’s node embeddings.

5.3 Results

5.3.1 Results of comparison with current methods

Table 3 lists comparisons with the existing methods, including the state-of-the-art result. Analyzing the results, we have several valuable findings.

Table 3 Comparison with current method

HGMN(Big Bird) is our final model. On the APE task in RR-submission-v2, the F1 of HGMN increases by +1.72% compared with MRC-APE. HGMN outperforms MRC-APE in precision (+2.59%) and recall (+1.02%). The improved precision is attributable to HGMN’s multi-granularity graph matching approach, which considers the syntax and structural information of arguments at multiple levels. Heter-GAT supplements context for the original argument text, resulting in improved recall. MRC-APE uses a reading comprehension model to predict relevance; it benefits in recall but is weak in precision due to its lack of matching details. In contrast, HGMN is more balanced, as it leverages text graphs to enhance the text representation and graph matching to capture matching details. Moreover, the context-aware graph matching enables HGMN to outperform other models such as MGF, MLMC, and MT-H-LSTM-CRF.

On the argument mining task, HGMN achieves a 1.17% higher F1 score than MRC-APE. The improvement is attributed to Heter-GAT, which enhances long-distance dependencies between sentences and merges information from different subgraphs with less noise.

On the RR-passage dataset, the HGMN model surpasses MRC-APE by 2.10% in F1 score. HGMN’s F1 score on RR-passage is higher than on RR-submission-v2, likely because some arguments in the train and test sets belong to the same papers, so the model is exposed to similar examples during testing.

5.3.2 Results of comparison with GPT models


The results in Table 4 show that the few-shot performance is significantly better than the zero-shot performance, and the chain-of-thought technique refines the few-shot results, particularly for the text-davinci-003 model. GPT4’s F1 score is considerably higher than that of gpt-3.5-turbo and text-davinci-003. However, the best F1 score, achieved by the Chain-of-thought & GPT4 combination, remains much lower than that of HGMN. We analyze why the performance of the GPT series of models is inferior to the fine-tuned models as follows.

Table 4 Results of comparison with GPT models

Firstly, some studies [14, 33] have found that OpenAI’s ChatGPT series models underperform compared to fine-tuned models on numerous natural language inference tasks. Moreover, [33] observed that ChatGPT models’ performance declines as tasks become more complex. The state-of-the-art performance in argument pair extraction remains low, indicating the task’s difficulty.

Upon inspecting the samples, we discovered that GPT models have difficulty identifying arguments, possibly because the concept of “argument” learned during the pre-training process differs from that in argumentative texts. This discrepancy makes it challenging for GPT models to learn the correct decision boundaries through few-shot learning, a finding that aligns with the observations reported in Chen et al. [14]. In contrast, fine-tuned models can extract patterns from extensive training data to determine decision boundaries.

Furthermore, many related arguments exhibit nonlinear thinking, which makes the responses challenging to understand. While our model can learn and remember these relationships from training data, GPT models may make errors when attempting to comprehend them, even with the assistance of the chain-of-thought technique [9]. Additionally, some researchers argue that GPT models’ emergent abilities arise from information compression. We hypothesize that under lengthy argumentative text, there may be lossy compression, which obstructs the model’s capacity to capture complex argument relations. The case analysis is in Appendix B, Table 11.

5.3.3 Results of comparison with variants method


From Table 5, we have several valuable findings. The graph matching component of the HGMN model outperforms the text matching models in terms of accuracy and recall, attributable to its structured graph matching mechanism and its integration of external knowledge. In the heterogeneous graph matching component of the HGMN model, we only match subgraphs of the same type. This is equivalent to introducing prior constraints that exclude the noise caused by matching different types of subgraphs. In comparison, models such as BERT, Longformer, and ERNIE-doc rely on the underlying transformer, which can be regarded as modeling a fully connected graph over the words of the argument; this increases noise and results in lower accuracy. The improvement in recall is due to the incorporation of external knowledge, such as entity relationships, and the inclusion of long-distance relationship dependencies in the text. Among the text matching models, the performance of Longformer and ERNIE-doc is better than that of BERT, mainly because BERT limits the input text length, so overly long arguments are truncated, resulting in a loss of performance.

Table 5 Variants of HGMN

The series of graph matching models, such as GMN shorttext and GMN multilevel, exhibit better performance than text matching models like BERT, Longformer, and Big Bird. Among the graph matching models, the F1 score of HGMN is higher than those of GMN shorttext and GMN multilevel, which benefits from our two improvements: matching graphs at multiple levels and matching subgraphs of the same type. In GMN shorttext and GMN multilevel, the different types of subgraphs are not considered separately, and each node aggregates all nodes from the other graph, which introduces noise. For example, the entity nodes may wrongly aggregate word nodes from a sentence syntax graph, which confuses the semantics of the entity nodes. On the contrary, HGMN aligns subgraphs of the same type, which reduces the noise in graph matching.

5.4 Ablation study


To address question Q4, we conduct an ablation study, the results of which are presented in Table 6. This table illustrates the contribution of each HGMN module to the overall performance. The modules of HGMN can be divided into two parts: the graph learning modules for learning text node embeddings (Heter-GAT) and the graph matching modules. Among the graph learning modules, the sentence-sentence graph, which adds long-distance dependency information of the document, has the most significant impact: removing it decreases the F1 score by 2.78%. Further, removing the entity-sentence graph, the sentence-topic graph, or the sentence syntax graph also causes a decrease in the F1 score, which indicates the effectiveness of these subgraphs. In graph matching, the co-attention model is most crucial; removing it causes the F1 score to decrease by 3.96%. Additionally, the results demonstrate that the removal of each subgraph’s matching signal leads to a decrease in the F1 score, suggesting that the matching of each subgraph contributes to the overall performance.

Table 6 Ablation study

5.5 Hyperparameter analysis and model interpretability

This section addresses questions Q5 and Q6. To further analyze the principles of HGMN, we adjust the important parameters of the model and observe their impact on the results.

5.5.1 The number of GAT layers

The results in Fig. 6a and b demonstrate that the two-layer GAT performs best on both the AM task and the APE task. As the number of layers increases, the F1 score of APE decreases due to the over-smoothing problem, a phenomenon in which a high number of GAT layers leads to nodes with similar embeddings, resulting in the loss of distinct features. This problem typically arises when aggregating nodes with different labels or irrelevant nodes. In the argument’s text graph, one-hop or two-hop neighbors are almost always related to the original node, and aggregating their information enhances the APE F1 score, so the two-layer GAT is most effective. In contrast, a GAT with N layers (\(N>2\)) aggregates N-hop neighbors that are semantically far from the original node, which exacerbates the over-smoothing problem.

5.5.2 The negative sampling number in the APE task

When all sentences are used as negative samples, the ratio of positive to negative samples becomes imbalanced at 1:15, resulting in poor results. In the experiment shown in Fig. 6, when the ratio of positive to negative samples is 1:3, 1:4, or 1:5, the model performs well, and the best F1 score is achieved with a ratio of 1:4. This is because if too many instances are labeled as negative, some correct instances and those that are marginally relevant will also be erroneously labeled as negative. Such wrongly labeled samples introduce biases into the model training process, which can lead to the model learning to classify positive samples as negative.

Fig. 6

With different layers of GAT model, a is the F1 score of the argument pair extraction task on RR-submission-v2 dataset and RR-passage dataset respectively, b is F1 score of the argument mining task on RR-submission-v2 dataset and RR-passage dataset respectively. With different negative sample numbers, c is the F1 score of the argument pair extraction task on RR-submission-v2 dataset and RR-passage dataset respectively

5.5.3 The interpretability of model

We calculate the embedding similarity between sentences of review and rebuttal passage in Fig. 7. It allows us to observe that sentences of related arguments exhibit strong matches, while sentences of unrelated arguments have lower match scores. With the assistance of this visualization, we can gain a better understanding of the model’s results by debugging the match score matrix and identifying the underlying reasons.

Fig. 7

Embedding similarity between sentences of the review and rebuttal passages. We manually add red squares to mark the related arguments; for example, sent_0, sent_1, sent_2, and sent_3 in the review passage form an argument and pair with sent_0 and sent_1 in the rebuttal passage

Additionally, the node embeddings obtained from Heter-GAT serve as the foundational features of HGMN, and are also utilized as attention mechanisms to measure the relevance of arguments in the multi-granularity matching model. The relationships between text node embeddings can provide insight into the model’s output. Therefore, to enhance the interpretability of HGMN, we visualize the relation between node embeddings by calculating their cosine similarity, as shown in Fig. 8.

Fig. 8

The presented matrices show the embedding similarity of four argument pairs, with the x-axis and y-axis representing the two arguments’ texts, respectively. The varying colors correspond to the different levels of similarity between the words

5.6 Robustness checking

To evaluate the robustness of our model, we follow established methods for robustness checking in NLP tasks. A widely used strategy for robustness testing involves synthetically perturbing the input without changing the semantics and assessing the model’s ability to correctly predict the label [63]. This approach is often referred to as an adversarial attack. Adversarial attacks typically consist of three components [43]:

  • Employing an iterative search algorithm to effectively identify the best perturbations.

  • Generating these perturbations through a transformation module that modifies the text input from x to x′.

  • Applying a set of constraints to filter out undesirable x′, ensuring the preservation of the original text’s semantics and fluency.

Table 7 Hyperparameters in adversarial attack

We adopt the adversarial attack model of [23], a well-established method designed for pre-trained models and published in EMNLP 2020. Given that our HGMN model is also built upon a pre-trained model, this method is suitable for evaluating its robustness. To facilitate our evaluation, we use the framework provided by [44] and apply it to the APE task. To the best of our knowledge, our study represents the first attempt at conducting robustness checks in this specific area. The setup of robustness checking is shown in Table 7.
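As an illustration of this setup, the sketch below runs a BERT-Attack-style adversarial attack with the TextAttack framework [44]; the model wrapper, input format, and recipe choice are assumptions for illustration, since HGMN’s full graph construction and pairing pipeline are not reproduced here.

```python
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import BERTAttackLi2020
from textattack.datasets import Dataset
from textattack.models.wrappers import ModelWrapper

class DummyHGMN:
    """Placeholder for a trained HGMN; the real model would run graph construction and matching."""
    def predict_pair_probs(self, text):
        return [0.5, 0.5]                       # [p(unrelated), p(related)]

class HGMNWrapper(ModelWrapper):
    """Exposes the model as text-in / class-scores-out, as TextAttack expects."""
    def __init__(self, model):
        self.model = model
    def __call__(self, text_list):
        return [self.model.predict_pair_probs(t) for t in text_list]

wrapper = HGMNWrapper(DummyHGMN())
attack = BERTAttackLi2020.build(wrapper)        # adversarial attack recipe in the spirit of [23]
dataset = Dataset([("review argument ||| rebuttal argument", 1)])   # toy example
Attacker(attack, dataset, AttackArgs(num_examples=1, disable_stdout=True)).attack_dataset()
```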

Table 8 Robustness checking of HGMN model. The indicator is F1 score

The results are presented in Table 8. From the findings, we observe that our model’s F1 score decreases by 9.8% in the Argument Mining task and 7.0% in the Argument Pair Extraction task. Compared to the models in [23], our model demonstrates good robustness. One reason is that the APE dataset contains relatively long texts, with an average token count of 1217. When dealing with the argumentative input, the HGMN focuses on capturing the document-level semantics of long texts, making it less sensitive to perturbations at the word level. Moreover, our graph modeling in HGMN allows for capturing the meaning of document/passage and encoding more abstract semantics. These features exhibit strong robustness and effectively handle adversarial attacks.

5.7 Case study

This section aims to address Q7 by providing a more in-depth analysis of our method. Specifically, we randomly select samples and manually examine the prediction results to gain a deeper understanding of our model’s performance. A selection of these samples is presented in Table 9.

Table 9 A case study on three examples of test dataset

In Case 1, the review argument and rebuttal arguments are related. HGMN’s prediction in this case is accurate, while MRC-APE’s prediction is inaccurate. This can be attributed to HGMN’s ability to construct multiple graphs, such as the entity-sentence graph and the sentence-sentence graph, which enable it to expand the semantic representation of text and capture the essential meaning of the argument, including key words like “comparison” and “experiment.” Additionally, HGMN employs the entity-entity graph, which incorporates alias and abbreviation as edges between nodes. This strategy facilitates the model’s ability to identify relationships between words, such as those between ESNs and Echo State Networks or NBT and NoBackTrack.

In Case 2, the review argument and rebuttal argument are unrelated; HGMN predicts this correctly, while MRC-APE wrongly predicts them as related. Although the two arguments share a high degree of raw text similarity, with common words such as “program synthesis” and related words such as “benchmarks” and “experiment,” their original sentences convey contrasting meanings. This discrepancy highlights the fact that the meaning of a sentence is not solely determined by its constituent words, but also by its syntactic structure. In contrast to MRC-APE, which tends to predict relatedness based on text similarity, HGMN preserves syntactic information through the use of a syntax graph and graph matching, allowing it to identify the differences between the two arguments.

In Case 3, the review argument and rebuttal argument are related, but both HGMN and MRC-APE wrongly predict them as unrelated. The reason for this is that the raw text between the arguments is significantly different, and much background knowledge has been left out. As a result, there is not enough information for the models to establish the relationship between the arguments. To address this issue, besides the argument’s text information, high-level topic information should be considered during the modeling process. This is an area for our future work.

6 Conclusion

The Argument Pair Extraction (APE) task aims to identify related arguments. It has significant potential for practical applications, such as detecting argument pairs in the review–rebuttal process of academic papers, facilitating debates in online forums, and analyzing arguments in student essays, and it has already been applied successfully in such settings. This study introduces an efficient Heterogeneous Graph Matching Networks (HGMN) model for the APE task, yielding promising results based on the following key findings: (1) the proposed heterogeneous graph attention network (Heter-GAT) in HGMN addresses the issue of context deficiency; and (2) the multi-granularity graph matching networks in HGMN transform text matching into graph matching, incorporating complex argument structure information and enhancing the results. Our academic contribution lies in successfully modeling argumentative text pairing with heterogeneous graph neural networks. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods, including existing works in the field and the GPT series of large models. Nonetheless, our study has a limitation: it primarily focuses on lengthy text argumentation, making it less suitable for short-text argumentation. In future research, we plan to investigate generative models as a potential solution to the challenge of short-text argumentation.