KGAnet: a knowledge graph attention network for enhancing natural language inference

Natural language inference (NLI) is a basic task underlying many applications such as question answering and paraphrase recognition. Existing methods have addressed the key issue of how an NLI model can benefit from external knowledge. Inspired by this, we attempt to further explore the following two problems: (1) how to make better use of external knowledge when the total amount of such knowledge is fixed and (2) how to bring external knowledge to the NLI model more conveniently in application scenarios. In this paper, we propose a novel joint training framework that consists of a modified graph attention network, called the knowledge graph attention network, and an NLI model. We demonstrate that the proposed method outperforms the existing method that introduces external knowledge, and we improve the performance of multiple NLI models that previously used no external knowledge.


Introduction
Natural language inference (NLI), also known as recognizing textual entailment, is a challenging and fundamental task in natural language understanding. Its aim is to determine the relationship (entailment, neutral, or contradiction) between a premise and hypothesis.
In the past few years, large annotated datasets, such as the Stanford NLI (SNLI) dataset [1] and the Multi-Genre NLI (MultiNLI) corpus [2], have been released, which has made it possible to train quite complex neural network-based models with large numbers of parameters to better solve NLI problems. These models fall into two main categories: sentence-encoding and inter-sentence models.
Sentence-encoding models build on the Siamese structure [3], encoding the premise and hypothesis into sentence vectors and then comparing the distance between the vectors to obtain the relationship category. Talman et al. [4] used a hierarchical BiLSTM with a max-pooling architecture to encode sentences into vectors. Nie et al. [5] used shortcut-stacked sentence encoders to perform multi-domain semantic matching. Shen et al. [6] applied a hybrid of hard and soft attention with reinforcement learning for modeling sequences. Im and Cho [7] proposed a distance-based self-attention network, which considers word distance using a simple distance mask. Yoon et al. [8] designed dynamic self-attention by modifying the dynamic routing of a capsule network [9] for natural language processing. Encoders include convolutional neural networks (CNNs) [10], recurrent neural network variants [1,5,11], and self-attention networks [12].
In contrast to the above methods, inter-sentence models apply cross-attention to increase the interaction between sentences. Among them, Parikh et al. utilized a decomposition matrix to reduce the number of cross-attention parameters in their proposed model, called DecAtt [13]. Gong et al. [14] introduced the interactive inference network, which achieves a high-level understanding of a sentence pair by hierarchically extracting semantic features from the interaction space. Tan et al. [15] designed four attention functions to match words in corresponding sentences. Chen et al. proposed the enhanced sequential inference model (ESIM) [16]. In addition, Wang et al. obtained inter-sentence relationship features from multiple perspectives (BiMPM) [17]. Finally, Kim et al. [18] proposed a densely connected recurrent and co-attentive network that enables the original and co-attentive feature information from the bottommost word embedding layer to be preserved up to the uppermost recurrent layer.
To handle word relationships that do not occur in the training corpus, Chen et al. proposed the knowledge-based inference model (KIM) [20], which uses word-pair relationship features extracted from WordNet [19]. Their model consists of knowledge-enriched co-attention, local inference collection with external knowledge, and knowledge-enhanced inference composition. Specifically, taking Fig. 1 as an example, after extracting the triple (round, synonymy, circular) from WordNet, the synonymy representation is added to the three components. They also obtained a certain improvement by establishing a co-hyponym relationship between two entities that have the same hypernym but do not belong to the same synset in WordNet. The effectiveness of KIM verifies the validity of external knowledge. However, the following two problems exist. First, compared to subgraphs containing entities and relationships, a triple in the knowledge graph is a very simple structure; there is still great untapped potential in distilling the knowledge graph (sufficiency). Second, this method cannot be used directly in other models; the NLI field needs a general approach to introducing knowledge that does not require handling graph data or heavily modifying the internal structure of each model (applicability).
Inspired by KIM [20] and the use of graph networks to extract knowledge graphs in other fields (elaborated in related work), we propose a framework to provide external knowledge to NLI models, that is, to integrate the information of the whole subgraph using a graph attention network (GAT) [21], and to train the GAT and NLI model jointly.
Originally, GAT was applied to homogeneous networks such as citation networks [22]. To make GAT more suitable for extracting information from knowledge graphs, we add a relationship-importance attribute to it and call the result the knowledge graph attention network (KGAnet).
In our experiments, we demonstrate that KGAnet outperforms the previous work and improves the performance of multiple NLI models that lack external knowledge. Finally, we verify the benefit of adding the relationship attribute.

Graph convolutional networks
A CNN [23] is a powerful neural network in computer vision and natural language processing because it can extract the spatial features of Euclidean-structured data. However, much data is non-Euclidean in structure, such as social networks and molecular structures. Graph convolutional networks are used to process such data: their function is to fuse each node with its surrounding information to obtain richer and more accurate node representations. In recent years, many researchers have studied graph convolution, and the methods fall into two main categories: spectral methods and non-spectral methods.
Non-spectral methods The non-spectral methods [24] directly define the convolution on a graph using an adjacency matrix to summarize the node features of all spatial neighboring matrices. The main challenge of non-spectral methods comes from the dynamic size of the neighborhood. Duvenaud et al. [25] proposed a CNN that runs directly on the original molecular graphs and learns a specific weighted matrix for each node. To enable traditional CNNs to be directly used with graph input, they select fixed-size neighbors and normalize these nodes [26]. Recently, Hamilton et al. [27] introduced the task of inductive node classification with the aim of classifying nodes that have not been seen during training. This method takes a fixed-size neighborhood of the node and then uses a specific aggregator (such as the mean operator or LSTM [28]) to learn the neighborhood features. It has achieved impressive results on several large-scale inductive benchmark tests.
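As a rough illustration of the fixed-size-neighborhood idea with a mean aggregator, consider the following simplified, GraphSAGE-style sketch; the function names, sampling size, and feature values are illustrative assumptions, not the implementation of [27]:

```python
import numpy as np

def mean_aggregate(node_feats, adj_list, node, sample_size=2, rng=None):
    """Summarize a node's sampled neighborhood with a mean operator and
    concatenate it with the node's own features (GraphSAGE-style sketch)."""
    rng = rng or np.random.default_rng(0)
    neighbors = adj_list[node]
    # Sample a fixed-size neighborhood (with replacement if it is too small).
    sampled = rng.choice(neighbors, size=sample_size,
                         replace=len(neighbors) < sample_size)
    neigh_mean = node_feats[sampled].mean(axis=0)
    return np.concatenate([node_feats[node], neigh_mean])

feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
adj = {0: [1, 2], 1: [0], 2: [0]}
print(mean_aggregate(feats, adj, 0))  # [1.  0.  0.5 1. ]
```

Because the mean is permutation-invariant, the sampled order of neighbors does not change the result; an LSTM aggregator, by contrast, would be order-sensitive.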
Spectral methods Bruna et al. [29] first proposed the definition of graph convolution in the Fourier domain. Because this definition involves an eigendecomposition of the graph Laplacian, the computation is quite expensive. Subsequently, smooth parametric spectral filters [30], Chebyshev polynomials [31], and Cayley polynomials [32] were introduced to improve the computational efficiency. Then, Kipf and Welling [33] greatly simplified the convolution operation using a first-order Chebyshev polynomial and achieved good results on node classification. Although the above methods significantly improve the computational efficiency, they depend on the structure of the graph itself and cannot be directly applied to graphs with different structures. A GAT [21] introduces an attention mechanism on top of graph convolutional networks and computes the weights of the nodes in the central node's neighborhood according to node similarity, which allows the model to accept inputs of different sizes.

Fig. 1 Example from the SNLI [1] dataset. The right graph is a subgraph extracted from WordNet [19]. Words in red are entities appearing in the subgraph, and the edge in red indicates the relationship between them (color figure online)

Neural Computing and Applications
However, in a knowledge graph, to enrich the entity representations, it is clearly not enough to consider only the similarity of entities. In contrast to general graph-structured data, the more important a relationship is to the central entity, the more important the connected neighbor entity is. It is therefore reasonable for the relationship to participate in the calculation of the weights between entities.

Using graph convolutional networks to introduce external knowledge for other tasks
Story generation In the field of story generation, Jian et al. [34] aligned the entities that appear in a story with ConceptNet [35] and used graph attention to generate entity representations that enhance the model's understanding of the entire story.
Dialogue system Hao et al. proposed a CCM [36] (commonsense knowledge-aware conversational model). Given a user post, the model retrieves the relevant knowledge graph from ConceptNet and then uses the static graph attention mechanism to increase the semantic information of the post. Then, during the word generation process, the model reads the retrieved knowledge graph and produces better generation through the dynamic graph attention mechanism.
CommonsenseQA Bill et al. proposed a knowledge-aware graph network module named KagNet [37], a text reasoning framework for commonsense answering questions, which effectively utilizes ConceptNet to perform explainable reasoning.
Recommendation Xiong et al. termed the hybrid structure of the knowledge graph and the user-item graph the collaborative knowledge graph (CKG). They developed a knowledge graph attention network (KGAT) [38], which achieves high-order relation modeling in an explicit and end-to-end manner under the graph neural network framework.
A common feature of these works is that they all use graph neural networks with edge information to inject external knowledge into their respective fields. Inspired by them, in this paper we use a similar structure to design a general way of introducing external knowledge into the NLI domain.

Using external knowledge to enhance natural language inference
In the NLI field, much work has been devoted to studying the effectiveness of external knowledge and how to introduce it. KIM [20] was the first NLI model to introduce external knowledge: Chen et al. [20] enhanced the model with external knowledge in its co-attention, local inference collection, and inference composition components. They were the first to explore how to introduce external knowledge and proved its effectiveness in the NLI field.
Wang et al. [39] proposed the ConSeqNet system, which combines text and graph models and uses ConceptNet as an external knowledge source to solve natural language inference (NLI) problems. They also compared the effects of three external knowledge sources (WordNet, DBpedia, and ConceptNet) and explored the diversity of external knowledge.
Annervaz et al. [40] used attention to automatically align entities and words and used convolutional neural network-based methods to extract external knowledge to reduce the scale of the attention space. They demonstrated that model training with the help of external knowledge can converge with fewer labeled samples.
These works on introducing external knowledge into NLI perform well. Unlike them, our work verifies the effectiveness and flexibility of graph convolutional networks in the NLI field.

KGAnet framework
In this section, we define a KGAnet and introduce the combination of KGAnet and NLI models.

KGAnet
Our aim is to enable the NLI model to fully access external knowledge, so for each entity we use KGAnet to generate a representation vector that fuses the features of the entity's neighbors and relationships. In the knowledge graph, we call a first-order neighbor entity $j$ of an entity $i$ a neighbor entity, and we call $i$ the core entity. When processing a knowledge graph, one layer of KGAnet combines the information of the first-order subgraph into the representation of the core entity; likewise, a $k$-layer KGAnet combines the input features of the $k$-order subgraph into the representation. In this study, we only consider first-order neighbors; the effect of $k$-layer neighbors will be considered in future work. The KGAnet inputs are a set of entity features $h = \{h_1, h_2, \ldots, h_N\}$ and a set of relationship features $r = \{r_{11}, r_{12}, \ldots, r_{NN}\}$. Specifically, $h_i \in \mathbb{R}^F$ is the representation of entity $i$, and $r_{ij} \in \mathbb{R}^F$ is the relationship between entities $i$ and $j$. The output is a new set of entity features $h' = \{h'_1, h'_2, \ldots, h'_N\}$, where $N$ is the number of entities in the graph, $F$ is the dimension of the input entity features, and $F'$ denotes the output feature dimension.
First, we calculate the degree of importance of each neighboring entity with respect to the core entity. The weight calculation involves two parts: the importance of the relationship and the similarity between the entities. We define the relationship importance and the entity similarity, respectively, as

$$I_{ij} = W_r^{\top} r_{ij}, \qquad S_{ij} = W_a^{\top}\left(W_h h_i \odot W_h h_j\right).$$

In these formulas, $W_r \in \mathbb{R}^F$, $W_h \in \mathbb{R}^{F' \times F}$, and $W_a \in \mathbb{R}^{F'}$ are linear transformations. We use $W_h$ to filter entity features, $W_r$ to filter relationship features, and $W_a$ to calculate the correlation of entity features. $I_{ij}$ is a score function that determines the importance of the relationship $r_{ij}$, and $S_{ij}$ is a score function for the similarity between core entity $i$ and entity $j$. When both $I_{ij}$ and $S_{ij}$ are small, we want $E_{ij}$ to be small; when both are large, we want $E_{ij}$ to be large. We therefore multiply the two scores to obtain the weight $E_{ij}$ between the entities:

$$E_{ij} = I_{ij} \cdot S_{ij}.$$

No term corresponding to $I_{ij}$ is involved in the GAT [21] computation. Entity $i$ and its neighbor entities form a subgraph, but entities $i$ and $j$ in $E_{ij}$ may not actually be connected. We therefore use the adjacency matrix $A$ of the graph as a mask, erasing the entries of $E_{ij}$ whose relationships do not exist:

$$E_{ij} \leftarrow \begin{cases} E_{ij}, & A_{ij} = 1 \\ -\infty, & A_{ij} = 0. \end{cases}$$

Here $A_{ij} = 1$ if there is a relationship between entities $i$ and $j$, and $A_{ij} = 0$ otherwise. When $A_{ij} = 0$, $E_{ij}$ is set to negative infinity so that it becomes a value close to 0 after normalization.
To make the weights of all neighbor entities of $i$ comparable, we normalize $E_{ij}$ with a softmax:

$$\alpha_{ij} = \frac{\exp(E_{ij})}{\sum_{k \in N_i} \exp(E_{ik})}.$$

After $\alpha_{ij}$ is obtained, we use it to aggregate all neighbor entities according to their importance:

$$h'_i = \sigma\left(\sum_{j \in N_i} \alpha_{ij} M_j\right), \qquad M_j = W_h h_j,$$

where $N_i$ is the neighborhood of node $i$ in the graph and $M_j$ is the neighbor feature after filtering. The new feature $h'_i$ of entity $i$ fuses the information about its surrounding entities and relationships, and $\sigma$ is the sigmoid function. During training, the parameters $W_r$, $W_a$, and $W_h$ are all learned: $W_r$ and $W_a$ learn to provide reasonable scores for the relationship importance and entity similarity, and $W_h$ acquires the ability to select beneficial features from neighbor entities.
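A single KGAnet layer can be sketched in numpy as follows. The exact functional forms of the importance score $I_{ij}$ and similarity score $S_{ij}$ are our assumptions, chosen only to be consistent with the stated shapes of $W_r$, $W_h$, and $W_a$; all sizes are toy values:

```python
import numpy as np

def softmax_masked(E, A):
    """Row-wise softmax over existing edges; non-edges act like -inf."""
    E = np.where(A > 0, E, -1e9)           # mask non-edges before the softmax
    E = E - E.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(E) * (A > 0)
    return w / np.maximum(w.sum(axis=1, keepdims=True), 1e-12)

def kganet_layer(h, r, A, W_h, W_r, W_a):
    """One KGAnet layer (sketch). h: (N,F) entities, r: (N,N,F) relations,
    A: (N,N) adjacency, W_h: (F',F), W_r: (F,), W_a: (F',)."""
    M = h @ W_h.T                               # filtered entity features, (N,F')
    I = r @ W_r                                 # relationship importance I_ij
    S = (M[:, None, :] * M[None, :, :]) @ W_a   # entity similarity S_ij (assumed form)
    alpha = softmax_masked(I * S, A)            # E_ij = I_ij * S_ij, masked, normalized
    return 1.0 / (1.0 + np.exp(-(alpha @ M)))   # h'_i = sigma(sum_j alpha_ij M_j)

rng = np.random.default_rng(0)
N, F, Fp = 4, 3, 5
h, r = rng.normal(size=(N, F)), rng.normal(size=(N, N, F))
A = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]])
out = kganet_layer(h, r, A,
                   rng.normal(size=(Fp, F)), rng.normal(size=F), rng.normal(size=Fp))
print(out.shape)  # (4, 5)
```

Note how the mask is applied before normalization, so each row of attention weights sums to 1 over the node's actual neighbors only.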

Basic components of NLI models
At present, cross-attention is widely used in inter-sentence NLI models and is regarded as a basic component of such models. Cross-attention can align words that are related between sentences: if the input word vectors contain an association, this association feature will be mined by the cross-attention. Below, we briefly introduce the basic framework shared by current inter-sentence models and mainly describe what cross-attention is and what it does. Figure 2 shows a typical inter-sentence model structure. The premise $a = \{w_a^1, w_a^2, \ldots, w_a^{l_a}\}$ and hypothesis $b = \{w_b^1, w_b^2, \ldots, w_b^{l_b}\}$ are the inputs of the model, where $w_a^i$ denotes the $i$th word in the premise and $w_b^j$ the $j$th word in the hypothesis. Cross-attention finds the degree of association of each word pair across the two sentences. Using these degrees of association, each sentence uses the other sentence's word information to enhance its own word representations, completing a soft alignment of the two sentences. In this step, the relationships of the entities are exploited to enhance the interaction between the two sentences. In the cross-attention layer, the similarity of words between the sentences is computed as

$$b_{ij} = (w_a^i)^{\top} w_b^j.$$

Then, we align the similar words in the two sentences; each hypothesis word is softly aligned over the premise as

$$\tilde{w}_b^j = \sum_{i=1}^{l_a} \frac{\exp(b_{ij})}{\sum_{k=1}^{l_a} \exp(b_{kj})}\, w_a^i, \qquad \forall j \in [1, 2, \ldots, l_b]. \tag{11}$$

Next, we down-sample each sentence through max pooling and average pooling to obtain vectors $V_a$ and $V_b$:

$$V_a = [V_{a,\mathrm{ave}}; V_{a,\mathrm{max}}], \qquad V_b = [V_{b,\mathrm{ave}}; V_{b,\mathrm{max}}]. \tag{14}$$

We concatenate $V_a$, $V_b$, their difference, and their element-wise product into a vector $V$:

$$V = [V_a; V_b; V_a - V_b; V_a \odot V_b]. \tag{16}$$

Finally, we feed $V$ into a multilayer perceptron (MLP) classifier. The MLP consists of two feedforward layers with Hidden Size and 3 neurons, respectively. The activation function of the last layer is the softmax, and the output is the relationship between the two sentences.
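The cross-attention alignment and pooling steps above can be sketched as follows; dot-product similarity is assumed for the word-pair scores, and all shapes are toy values:

```python
import numpy as np

def cross_attend(a, b):
    """Soft-align each hypothesis word over the premise (Eq.-11 style).
    a: (l_a, d) premise word vectors, b: (l_b, d) hypothesis word vectors."""
    sim = a @ b.T                            # (l_a, l_b) word-pair similarities
    attn = np.exp(sim - sim.max(axis=0))     # softmax over premise positions
    attn = attn / attn.sum(axis=0)
    return attn.T @ a                        # (l_b, d): premise-aware hypothesis

def pool(x):
    """Concatenate average and max pooling over the word axis (Eq.-14 style)."""
    return np.concatenate([x.mean(axis=0), x.max(axis=0)])

a = np.random.default_rng(0).normal(size=(5, 4))   # premise, 5 words
b = np.random.default_rng(1).normal(size=(3, 4))   # hypothesis, 3 words
V_a, V_b = pool(a), pool(b)
V = np.concatenate([V_a, V_b, V_a - V_b, V_a * V_b])  # Eq.-16-style feature vector
print(cross_attend(a, b).shape, V.shape)  # (3, 4) (32,)
```

The final vector `V` is what would be fed into the MLP classifier; its dimension is eight times the pooled sentence dimension because each sentence contributes both average- and max-pooled halves.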

KGAnet and NLI models
To provide external knowledge, it is convenient to concatenate the output of KGAnet with the input of the NLI model and jointly train them. The combined method is shown in Fig. 3. The vector $Q_k \in \mathbb{R}^Q$ of each word is composed of a pre-trained $D$-dimensional word vector $w_k$ and the new entity feature $h'_i$ obtained by KGAnet:

$$Q_k = [w_k; W_q h'_i], \tag{17}$$

where $W_q$ is a linear transformation.
Considering that not every word will find a related entity in the graph, some words in a sentence will correspond to empty entities, to which we assign fixed default values. After $Q_k$ is obtained, it is used as the new input of the NLI model and is passed to the various encoders. For instance, in DecAtt [13] the encoder is a fully connected layer, whereas in ESIM [16] and BiMPM [17] the encoder is a bidirectional LSTM [28].
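A minimal sketch of how the joint input could be composed, assuming Eq. 17 concatenates the word vector with a linearly transformed entity feature and that empty entities receive zero vectors (both are our assumptions):

```python
import numpy as np

D, F_out = 6, 4   # word-vector dim and KGAnet entity-feature dim (toy sizes)

def build_input(w_k, h_entity, W_q):
    """Compose the NLI-model input Q_k from a pre-trained word vector w_k and
    a KGAnet entity feature (an Eq.-17-style concatenation, assumed form)."""
    return np.concatenate([w_k, W_q @ h_entity])

rng = np.random.default_rng(0)
w_k = rng.normal(size=D)
W_q = rng.normal(size=(F_out, F_out))
h_prime = rng.normal(size=F_out)   # word with a matching entity in the graph
h_empty = np.zeros(F_out)          # word with no matching entity (assumed zero vector)
print(build_input(w_k, h_prime, W_q).shape)  # (10,)
```

With a zero entity feature, the knowledge half of `Q_k` is all zeros and the word keeps only its pre-trained GloVe information, so such words degrade gracefully to the knowledge-free setting.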

Experimental setup
In the experiment, all of our models use the Adam optimizer [41]. The word vector is a pre-trained GloVe [42] 3 300-dimensional vector. For words that are out-of-vocabulary, we assign them a 300-dimensional zero vector. Table 1 shows the hyper-parameters used for all experiments.

Effect of hyperparameters
We design an experiment with three hyperparameters

Dataset
In the experiments, the performance of all models was evaluated on the SNLI dataset [1] and the MultiNLI corpus [2]. In the SNLI dataset, there are three possible relationships between a premise and hypothesis: contradiction, neutral, and entailment. This dataset consists of a 549,367-sample training set, a 9842-sample development set, and a 9824-sample test set. The MultiNLI dataset consists of a 392,703-sample training set. Its test and development sets are split into 10,001-sample in-domain (matched) and 10,001-sample cross-domain (mismatched) sets.
To keep close to the scale of the knowledge graph used in the previous work [20], we chose WordNet [19] to provide the knowledge graph data and preprocessed it as follows: (1) we selected 14,216 entities, namely the entities in WordNet that appear in the SNLI and MultiNLI training sets, together with their neighbors; (2) we selected six relationships in WordNet: entailment, member meronymy, synonymy, antonymy, hypernymy, and similarity. We then used TransE [43], a graph embedding technique, to embed the 14,216 entities and their relationships as 300-dimensional vectors.
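TransE scores a triple (head, relation, tail) by how well the relation vector translates the head embedding onto the tail embedding; a minimal scoring sketch with illustrative (untrained) vectors:

```python
import numpy as np

def transe_score(head, rel, tail):
    """TransE energy ||h + r - t||; lower means the triple is more plausible."""
    return np.linalg.norm(head + rel - tail)

# Toy 3-d embeddings (illustrative values, not trained WordNet vectors).
round_e   = np.array([0.2, 0.1, 0.0])
synonymy  = np.array([0.0, 0.0, 0.1])
circular  = np.array([0.2, 0.1, 0.1])
unrelated = np.array([0.9, -0.5, 0.3])

# A true triple such as (round, synonymy, circular) should score lower
# (i.e., be more plausible) than a corrupted one.
print(transe_score(round_e, synonymy, circular) <
      transe_score(round_e, synonymy, unrelated))  # True
```

Training minimizes this energy for observed triples against corrupted ones with a margin-based loss, which is how the 300-dimensional entity and relationship vectors used as KGAnet inputs are obtained.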

Performance of KGAnet
In view of the problems we attempt to address in Sect. 1, we evaluated two aspects of our approach: sufficiency and applicability. The sufficiency is demonstrated by comparing the testing accuracies and training processes of KGAnet and KIM [20], which is the previous method for acquiring external knowledge. The applicability is reflected by the improvements obtained by adding KGAnet to other NLI models without external knowledge.

Sufficiency
KIM [20], which enriches ESIM [16] with external knowledge, was the first powerful model to use a knowledge graph in the NLI field, so we use it as the comparison model for KGAnet, which we combine with ESIM. Table 2 shows that, on the SNLI dataset, the joint framework of KGAnet and ESIM outperforms KIM by 0.28% in test accuracy. Table 3 shows that, on the MultiNLI dataset, the joint framework outperforms KIM by 0.1% in accuracy on the mismatched set. Figure 4 displays the training performance of KIM and our model (ESIM + KGAnet) on the SNLI dataset. Our model shows a strong generalization capability early in the training period. Although both models start to converge at nearly the same time, our model is much more accurate than KIM on the development set during epochs 0-3.
Similarly, Fig. 5 shows that on the MultiNLI dataset, KGAnet's matched accuracy is always higher than that of KIM. Moreover, overfitting occurs later than for KIM. Therefore, with respect to both training performance and test accuracy, our method feeds more knowledge to the NLI model.

Applicability
To verify the applicability of KGAnet, we added it to multiple NLI models. We used DecAtt [13], BiMPM [17], and ESIM [16] as example NLI models, as shown in Tables 2 and 3. They are all previously proposed NLI models that perform very well but do not incorporate external knowledge. When we incorporate KGAnet, they all improve to some extent. On the SNLI dataset, DecAtt's test accuracy increased by 0.8% to 86.53%, BiMPM's test accuracy increased by 0.5%, and ESIM's test accuracy increased by 1.0%. Moreover, on the MultiNLI dataset, DecAtt's mismatched-set accuracy increased by 0.8% to 74.96%, BiMPM's mismatched-set accuracy increased by 0.6%, and ESIM's mismatched-set accuracy increased by 0.7%. These results verify that other NLI models without external knowledge can also benefit from KGAnet.

Comparison of GAT and KGAnet
In Sect. 3.1, we considered the importance of entity relationships based on GAT when we designed KGAnet. Therefore, we individually added GAT and KGAnet to the same base model (ESIM) to verify whether the relationship is important. The comparison results are shown in Tables 4 and 5. The results show that KGAnet yields a substantially larger improvement than GAT.
Moreover, Fig. 6 shows the attention weights of the three models for an example from the SNLI test set. ESIM + KGAnet predicts it successfully, whereas ESIM + GAT and ESIM fail on this instance. The attention-weight map shows the degree of association between the words of the premise and hypothesis in the NLI model; word pairs with a higher degree of association are indicated by darker cells. Whether it is necessary to consider the importance of the relationship can be determined by observing the differences between the attention maps. We hence have the following findings. First, in Fig. 6b, c, man is matched with male, but in Fig. 6a they are not related. We found in WordNet that man and male have the common neighbor lover. This indicates that GAT and KGAnet capture not only direct relationships but also the relationships between two indirectly connected nodes.
Second, there are no direct or indirect relationships between ⟨is, uses⟩, ⟨is, laying⟩, ⟨lying, laying⟩, ⟨sideways, laying⟩, and ⟨round, horizontal⟩ in WordNet, which indicates that in Fig. 6b, GAT produces a lot of noise. In contrast, in Fig. 6c, only the word pair ⟨lying, laying⟩ remains. We believe it is probable that the relationship-importance function in KGAnet filters out information about unimportant neighbor entities in the subgraph.
Third, the orange outline indicates the contribution of ⟨round, circular⟩, which has an association only in Fig. 6c. This demonstrates that in KGAnet, the synonymy relationship between ⟨round, circular⟩ is indeed added to the calculation and has a certain influence on the word-pair relationships.
In summary, because it considers the importance of the relationship, KGAnet reduces noise, strengthens the association of word pairs, and greatly improves the performance of the NLI model.

Conclusion and future work
In this paper, we proposed a novel framework to incorporate external knowledge into an NLI model, i.e., by jointly training the proposed KGAnet model and NLI models.
Compared with the previous work (KIM), we use a single graph neural network layer to introduce external knowledge, and the NLI model itself requires no extra processing of the graph data. Our experiments verify that KGAnet exploits external knowledge more sufficiently and applies more flexibly to multiple NLI models.
Although KGAnet has achieved improvements to some extent, some problems remain. For instance, KGAnet greatly increases the input dimensionality of the NLI model, which means we must reduce the width of the NLI model in our experiments. Of course, if hardware resources are not limited, increasing the scale of the entire graph may yield better results. In addition, only a single-layer KGAnet was considered in this study. In future work, we plan to explore the influence of a k-layer KGAnet and further possibilities for introducing external knowledge.