Introduction

Intelligent human–robot interaction provides a convenient way for humans to communicate with robots. Question answering over knowledge base (KBQA) is one of the key technologies of intelligent human–robot interaction. It aims to answer users’ natural language questions over a given knowledge base through cognitive computing [6]. The development of the semantic web and advances in information acquisition technology have promoted the construction and application of large-scale knowledge graphs (KGs), e.g., Freebase [2] and DBpedia [13]. The massive amount of information contained in knowledge graphs further drives the research and application of KBQA. Accordingly, recent years have witnessed an increasing demand for conversational question answering agents that allow users to query a large-scale knowledge base (KB) in natural language [1].

KBQA is a long-standing problem that aims to answer a user’s natural language question using a structured knowledge base. A typical KB can be viewed as a knowledge graph consisting of entities, properties, and the relations between them [18]. Historically, KBQA methods fall into two mainstream branches [7]. The first branch, the semantic parsing method (SP-based method), parses the natural language question into a logical form that can be used to query the knowledge base, e.g., SPARQL, \(\lambda \)-DCS [14], or \(\lambda \)-calculus. However, SP-based methods depend heavily on data annotation and hand-crafted templates. The second branch treats KBQA as an information retrieval problem (IR-based method). It encodes the question and each candidate answer as high-dimensional vectors in a continuous semantic space, and a ranking model is used to predict the correct answers. Recently, deep learning has also driven an upward trend for IR-based methods, ranging from simple neural embedding-based models [4], to attention-based recurrent models [9], and further to memory-augmented neural controller architectures [5, 7, 11].

Fig. 1

An example of a multi-relation question over the knowledge graph from WorldCup2014 [31]. The rounded rectangles represent entities in the KG and the solid arrows represent relations between entities. The dotted arrows represent the attention flow in the reasoning process: the entity “L_MESSI” is the first part to focus on, the phrase “play professional in” next, and “country” finally

More recent work [25, 32, 33] focuses on enhancing the reasoning capability for multi-hop questions. Specifically, a multi-hop question involves multiple relations and requires several inference steps to reach the final answer. For example, in Fig. 1, the question “which country does L_MESSI play professional in?” involves more than one relation (i.e., “plays_in_club” and “is_in_country”). Due to the variety and complexity of semantic information and the large scale of knowledge graphs, multi-hop question answering over a knowledge base remains a challenging task, and how to improve the knowledge representation is still an open question. Generally, two challenges need to be addressed.

First, the triplets have implicit relationships, since some of them share entities or relations. In human reasoning, we often find associated information from related notions. Take the knowledge base in Fig. 1 as an example: “FC_Barcelona” and “Real_Madrid_CF” share the same relation “is_in_country” and tail entity “Spain”, which reinforces our memory that the two clubs are located in the same country. Therefore, the graph context between triplets needs to be modeled to improve the representation of entities and relations [17]. However, previous work only considers individual triplets and local information, and the explicit graph context of the knowledge base has not been fully explored.

Second, a multi-hop question carries more complicated semantic information. The tokens of the question influence the triplet selection differently in each reasoning step. Taking the question in Fig. 1 as an example, the entity “L_MESSI” is the first part that should be focused on, the phrase “play professional in” next, and “country” finally. Accordingly, the model should dynamically pay attention to different parts of the question during reasoning. However, current models often treat the question as a whole and ignore the priority information within it.

Considering the aforementioned challenges, we enhance the key-value memory neural network with KG embedding and question-aware attention, named QA2MN (Question-Aware Memory Network for Question Answering), to improve the representation of the tokens in the question as well as the entities and relations in the knowledge base. Specifically, to address the first challenge, we utilize a KG embedding model to pre-train the embeddings of entities and relations. Because triplets are modeled and scored independently in general KG embedding models, we integrate graph context into the scoring function to enrich the semantic representation of entities and relations. To address the second challenge, we use question-aware attention to update the focus on the question during the reasoning process, dynamically shifting attention to different parts of the question in each reasoning step.

To summarize, our contributions are threefold: (i) we incorporate graph context information into the KG embedding model to enhance the representation of entities and relations; (ii) we propose a question-aware attention mechanism in the reasoning process to enhance the query update mechanism of the key-value memory neural network; (iii) we achieve state-of-the-art Hits@1 accuracy on two representative datasets, and the ablation study demonstrates the interpretability of QA2MN.

The rest of the paper is structured as follows. We first review related work in “Related work”. The background is presented in “Background”, and the detailed model follows in “Proposed model”. Experimental setups and results are reported in “Experiments”. Finally, we conclude the paper and discuss future work in “Conclusion”.

Related work

Traditional SP-based models depend heavily on predefined templates instead of exploring the inherent information in the knowledge graph [1, 24]. Yih et al. [30] proposed a query graph method that effectively leverages graph information by pruning the semantic parsing space. For multi-hop questions, Xu et al. [29] used a key-value memory neural network to store graph information and proposed a new query update mechanism that removes, during updating, the keys and values already located by the query, so that the model can better attend to the content that needs reasoning in the next step. SP-based methods produce a logical-form representation of the natural language question, and a query operation then retrieves the final answer. However, SP-based methods more or less rely on feature engineering and data annotation. In addition, they require researchers to master the syntax and logical structures of the data, which poses additional difficulties for non-expert researchers.

IR-based methods treat KBQA as an information retrieval problem by modeling questions and candidate answers with a ranking algorithm. Bordes et al. [4] first employed embedding vectors to encode the question and knowledge graph into a high-dimensional semantic space. Hao et al. [9] presented a cross-attention-based neural network model to consider the mutual influence between the representations of questions and the corresponding answer aspects, where an attention mechanism dynamically learns the relevance between the answer and the words in the question to effectively improve matching performance. Chen et al. [7] proposed a bidirectional attentive memory network to capture the pairwise correlation between the question and the knowledge graph information, while simultaneously improving the query expression through the attention mechanism. However, these models cannot adequately handle multi-relation questions because they lack multi-hop reasoning ability. Zhou et al. [33] proposed an interpretable, hop-by-hop reasoning process for multi-hop question answering, in which the model predicts the complete reasoning path leading to the final answer. However, considering the cost of data collection, it can scarcely be generalized to other domains. Therefore, weak supervision with only the final answer labeled is better suited to current needs. IR-based methods convert the graph query operation into a data-driven, learnable matching problem and can directly obtain the final answer via end-to-end training. Their advantage is that they reduce the dependence on hand-crafted templates and feature engineering, although they are criticized for poor interpretability.

Recent work [19, 32] has also formulated multi-hop question answering as a sequential decision problem. Zhang et al. [32] treated the topic entity as a latent variable and handled multi-hop reasoning with variational inference. Qiu et al. [19] performed path search with weak supervision to retrieve the final answer and proposed a potential-based reward shaping strategy to alleviate the delayed and sparse reward problem.

The aforementioned work mainly focuses on reasoning ability. Other works [10, 20, 28] take advantage of the structural and relational information preserved in KG embedding representations to advance the KBQA task. [10, 28] used knowledge graph embedding to handle simple one-hop questions, and Saxena et al. [20] leveraged knowledge graph embedding to perform multi-hop KBQA. However, the learned embeddings were only required to be compatible within each individual fact, without considering the graph context information. To bridge this gap, we pre-train the knowledge graph embedding with graph context, use it to initialize QA2MN, and allow it to be fine-tuned during training.

Table 1 The important symbols and their definitions used in the paper

Background

Task description

For a given structured knowledge graph \({\mathscr {G}}\) with entity set \({\mathscr {E}}\) and relation set \({\mathscr {R}}\), each triplet \(T = (h, r, t) \in {\mathscr {G}}\) represents an atomic fact, where \(h\in {\mathscr {E}}\), \(t\in {\mathscr {E}}\), and \(r\in {\mathscr {R}}\) denote the head entity, the tail entity, and the relation between them, respectively. Given a natural language question X, the task is to reason over \({\mathscr {G}}\) and predict an answer Y for X. Generally, the possible answers include (i) an entity from the entity set \({\mathscr {E}}\), (ii) the numerical result of an arithmetic operation, such as SUM or COUNT, and (iii) one of the possible Boolean values, such as True or False [6]. In this paper, we mainly focus on the first case, i.e., entity-centric natural language questions. To facilitate understanding, we summarize the important symbols used in the paper in Table 1.

Fig. 2

A simple illustration of KG embedding. \(W_{e2r}\) is a projection matrix from the entity space to the relation space. Please refer to “KG embedding with graph context” for more details

Preliminary

KG embedding

KG embedding converts the symbolic representation of knowledge triples into continuous semantic spaces by embedding entities and relations into high-dimensional vectors [26]. It can effectively improve downstream tasks such as KG completion [3], relation extraction [27], and KBQA [20].

For a triple (h, r, t), KG embedding first maps it into a continuous hidden representation \((E_h,E_r,E_t)\). Then, a scoring function \(\psi (\cdot )\) assigns a score to each possible triple to measure its plausibility, as illustrated in Fig. 2. Triplets that exist in \({\mathscr {G}}\) tend to receive higher scores than those that do not. To learn the entity and relation representations, an optimization method is used to maximize the total plausibility of the observed triplets.

Fig. 3

The architecture of the memory neural network and the key-value memory neural network. K is the number of reasoning hops, q is the query vector, and o is the output vector. Please refer to “KG reasoning” for more details

Memory neural network

The memory neural network [22] is well known for its multi-hop reasoning ability and has been successfully applied to many natural language processing tasks, such as question answering [7] and reading comprehension [22]. A memory neural network is often stacked with multiple layers; each layer has two independent embedding matrices that transform the supporting facts into an input memory representation and an output memory representation. As shown in Fig. 3a, given the query vector, the network first finds the supporting memories from the input memory representation and then produces output features as a weighted sum over the output memory representation.

The key-value memory neural network generalizes the standard memory network by dividing the memory arrays into two parts, i.e., key slots and value slots, as shown in Fig. 3b. The model learns to use the query to address relevant memories via the keys, whose values are subsequently returned for output computation. Compared to the flat representation in the standard memory network, the key-value architecture gives more flexibility to encode prior knowledge via functional separation and is more applicable to complex structured knowledge sources [16, 29].

Fig. 4

The architecture of QA2MN, which consists of three components: (i) KG embedding, (ii) question encoder, and (iii) KG reasoning; see “Proposed model” for details. The highlighted markers denote, respectively, the head and tail entities of the concerned triplet, the head-related context for “L_MESSI”, and the tail-related context for “FC_Barcelona”

Proposed model

The proposed QA2MN has three main components, i.e., KG embedding, question encoder, and KG reasoning; Fig. 4 illustrates the architecture. First, we exploit the graph context information in the knowledge base by pre-training a KG embedding model. Then, we use a bidirectional Gated Recurrent Unit (\(\text {BiGRU}\)) to encode the question into a continuous hidden representation. Finally, we use a question-aware key-value memory network to reason over the knowledge graph.

KG embedding with graph context

We adopt a translational distance model [15] to train the embeddings of entities and relations. For each fact \(T_i = (h_i, r_i, t_i) \in {\mathscr {G}}\), we apply the translational distance constraint to the entities and the relation by the following equation:

$$\begin{aligned} W_{e2r}E_{h_i} + E_{r_i} = W_{e2r}E_{t_i}, \end{aligned}$$
(1)

where \(E_{h_i}\in \mathbf {R}^{d_{ent}}\), \(E_{r_i}\in \mathbf {R}^{d_{rel}}\), and \(E_{t_i}\in \mathbf {R}^{d_{ent}}\) are the embeddings of the head entity, the relation, and the tail entity, respectively, and \(W_{e2r}\in \mathbf {R}^{d_{rel}\times d_{ent}}\) is a projection matrix from the entity space to the relation space. In our implementation, \(d_{ent}\) is equal to \(d_{rel}\). Then, we obtain the translational distance score by

$$\begin{aligned} d_i^{t} = \Vert W_{e2r}E_{h_i} + E_{r_i} - W_{e2r}E_{t_i}\Vert , \end{aligned}$$
(2)

where \(\Vert \cdot \Vert \) denotes the \(l_2\) norm.
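As a concrete illustration of Eq. (2), the following minimal sketch computes the projected translational distance with PyTorch; the tensor names (`e_h`, `e_r`, `e_t`, `W_e2r`) and the random initialization are illustrative only, not the paper's released implementation.

```python
import torch

def trans_distance(e_h, e_r, e_t, W_e2r):
    """Translational distance of Eq. (2): ||W_e2r e_h + e_r - W_e2r e_t||_2.

    e_h, e_t: (d_ent,) entity embeddings; e_r: (d_rel,) relation embedding;
    W_e2r: (d_rel, d_ent) projection from the entity to the relation space.
    """
    return torch.norm(W_e2r @ e_h + e_r - W_e2r @ e_t, p=2)

# Toy example with d_ent = d_rel = 100, as in the paper's experimental setting
d = 100
e_h, e_r, e_t = torch.randn(d), torch.randn(d), torch.randn(d)
W_e2r = torch.randn(d, d)
score = trans_distance(e_h, e_r, e_t, W_e2r)  # lower distance = more plausible triple
```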

To exploit the implicit context information of the knowledge graph, we integrate graph context into the distance scoring to improve the representation of entities and relations. For a triplet \(T_i\), we consider two kinds of context information: (i) head-related context: all triples that share the same head as \(T_i\), i.e., \(C_h(T_i) = \{T_j \mid T_j=(h_j, r_j, t_j) \in {\mathscr {G}}, h_j=h_i\}\); (ii) tail-related context: all triples that share the same tail as \(T_i\), i.e., \(C_t(T_i) = \{T_j \mid T_j=(h_j, r_j, t_j) \in {\mathscr {G}}, t_j=t_i\}\).

First, we integrate the head-related context with \(E_{h_i}\) by taking the average over the triplets in \(C_h(T_i)\)

$$\begin{aligned} \tilde{E}_{h_i}&=\frac{\sum _{(h_j,r_j, t_j) \in C_h(T_i)}\hat{E}_{h_j}}{|C_h(T_i)|} \nonumber \\ \hat{E}_{h_j}&= E_{t_j} - W_{e2r}^{-1} E_{r_j}, \end{aligned}$$
(3)

where \(|C_h(T_i)|\) is the number of head-related context triplets and \(W_{e2r}^{-1}\) is the inverse of the projection matrix. Then, we compute the distance between the head-related context representation and \(E_{h_i}\) by

$$\begin{aligned} d_i^{C_h} = \Vert E_{h_i}-\tilde{E}_{h_i}\Vert . \end{aligned}$$
(4)

In the same way, we compute the tail-related context representation \(\tilde{E}_{t_i}\) as the average over the triplets in \(C_t(T_i)\)

$$\begin{aligned} \tilde{E}_{t_i}&= \frac{\sum _{(h_j,r_j, t_j) \in C_t(T_i)}\hat{E}_{t_j}}{|C_t(T_i)|} \nonumber \\ \hat{E}_{t_j}&= E_{h_j} + W_{e2r}^{-1} E_{r_j}, \end{aligned}$$
(5)

where \(|C_t(T_i)|\) is the number of tail-related context triplets. Correspondingly, the distance between the tail-related context representation and \(E_{t_i}\) is computed by

$$\begin{aligned} d_i^{C_t} = \Vert E_{t_i}-\tilde{E}_{t_i}\Vert . \end{aligned}$$
(6)
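The head- and tail-related context distances of Eqs. (3)–(6) can be sketched as follows, assuming PyTorch tensors, integer-indexed triples, and a square invertible \(W_{e2r}\) (which holds here since \(d_{ent}=d_{rel}\)); the function and variable names are illustrative.

```python
import torch

def context_distances(i, triples, E_ent, E_rel, W_e2r):
    """Graph-context distances d_i^{C_h} and d_i^{C_t} (Eqs. 3-6) for triple i.

    triples: list of (h, r, t) index tuples; E_ent: (n_ent, d) entity
    embeddings; E_rel: (n_rel, d) relation embeddings; W_e2r: (d, d).
    """
    h_i, r_i, t_i = triples[i]
    W_inv = torch.linalg.inv(W_e2r)

    # Head-related context C_h(T_i): triples sharing the head h_i (Eq. 3)
    head_ctx = [E_ent[t_j] - W_inv @ E_rel[r_j]
                for (h_j, r_j, t_j) in triples if h_j == h_i]
    # Tail-related context C_t(T_i): triples sharing the tail t_i (Eq. 5)
    tail_ctx = [E_ent[h_j] + W_inv @ E_rel[r_j]
                for (h_j, r_j, t_j) in triples if t_j == t_i]

    d_ch = torch.norm(E_ent[h_i] - torch.stack(head_ctx).mean(dim=0))  # Eq. (4)
    d_ct = torch.norm(E_ent[t_i] - torch.stack(tail_ctx).mean(dim=0))  # Eq. (6)
    return d_ch, d_ct
```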

Question encoder

We use a \(\text {BiGRU}\) [8] to encode the question, preserving both token-level and sequence-level information. Given a question \(X = [x_{1},x_{2},...,x_M]\), where M is the total number of tokens in X, we feed X into the \(\text {BiGRU}\) encoder, which is computed as follows:

$$\begin{aligned}&\overrightarrow{h}_{x_i} = \text {GRU}(E_{x_i}, \overrightarrow{h}_{x_{i-1}}) \nonumber \\&\overleftarrow{h}_{x_i} = \text {GRU}(E_{x_i}, \overleftarrow{h}_{x_{i+1}}), \end{aligned}$$
(7)

where \(\text {GRU}\) is the standard Gated Recurrent Unit, \(E_{x_i} \in \mathbf {R}^{d_{emb}}\) is the embedding of token \(x_i\), \(\overrightarrow{h}_{x_i}\in \mathbf {R}^{d_{hid}/2}\) and \(\overleftarrow{h}_{x_i}\in \mathbf {R}^{d_{hid}/2}\) are the hidden representations, \(d_{emb}\) is the token embedding size, and \(d_{hid}\) is the hidden size. We then obtain the hidden representation for each token, \(H_{X} = [h_{x_1},h_{x_2},...,h_{x_M}]\), where \(h_{x_i}\) is the concatenation of \(\overrightarrow{h}_{x_i}\) and \(\overleftarrow{h}_{x_i}\), i.e., \(h_{x_i}=[\overrightarrow{h}_{x_i}, \overleftarrow{h}_{x_i}]\).
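A minimal PyTorch sketch of the question encoder in Eq. (7) is given below; the class name and the default dimensions (100, matching the experimental setup) are illustrative.

```python
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """BiGRU encoder of Eq. (7): token ids -> per-token hidden states H_X.

    The two directions are concatenated, so each h_{x_i} has size d_hid,
    consistent with h_{x_i} = [forward h_{x_i}, backward h_{x_i}].
    """
    def __init__(self, vocab_size, d_emb=100, d_hid=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.bigru = nn.GRU(d_emb, d_hid // 2, bidirectional=True,
                            batch_first=True)

    def forward(self, token_ids):            # (batch, M)
        x = self.embed(token_ids)            # (batch, M, d_emb)
        h_x, _ = self.bigru(x)               # (batch, M, d_hid)
        return h_x                           # H_X = [h_{x_1}, ..., h_{x_M}]
```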

KG reasoning

The vanilla key-value memory neural network focuses on representing the knowledge triplets in the memory slots. It often encodes the question as a single vector and ignores the priority information, which is sufficient for single-relation questions but inadequate for complex multi-hop questions. To improve the reasoning ability of the key-value memory neural network, we introduce QA2MN to dynamically pay attention to different parts of the question in each reasoning step. In our implementation, QA2MN consists of five parts, i.e., key hashing, key addressing, value reading, query updating, and answer prediction.

Key hashing

Key hashing uses the question to select a list of candidate triplets to fill the memory slots. In our implementation, we first detect the core entity mentioned in the question and find its neighboring entities within K relation hops. Then, we extract all triplets in \({\mathscr {G}}\) that contain any of those entities as the candidate triplets, denoted as \(T_C =\{T_1,T_2, ...,T_N\}\), where N is the number of candidate triplets. All entities in \(T_C\) are extracted as candidate answers, denoted as \(A_C = \{A_1,A_2,...,A_L\}\), where L is the number of candidate answers. For each candidate triplet \(T_i = (h_i, r_i, t_i) \in T_C\), we store the head and relation in the i-th key slot, which is denoted as

$$\begin{aligned} \varPhi _K(k_i) = W_k(W_{e2r}E_{h_i} + E_{r_i}). \end{aligned}$$
(8)

Correspondingly, the tail is stored in the i-th value slot, denoted as

$$\begin{aligned} \varPhi _V(v_i) = W_v E_{t_i}, \end{aligned}$$
(9)

where \(W_k \in \mathbf {R}^{d_{hid}\times d_{rel}}\) and \(W_v \in \mathbf {R}^{d_{hid}\times d_{ent}}\) are trainable parameters.
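The key hashing step of Eqs. (8)–(9) can be sketched as below, assuming pre-computed embedding matrices and integer-indexed candidate triplets; the helper name `fill_memory_slots` is illustrative.

```python
import torch

def fill_memory_slots(cand_triples, E_ent, E_rel, W_e2r, W_k, W_v):
    """Map candidate triplets T_C to key/value memory slots (Eqs. 8-9).

    cand_triples: list of (h, r, t) index tuples around the core entity;
    W_k: (d_hid, d_rel) and W_v: (d_hid, d_ent) are trainable projections.
    """
    keys, values = [], []
    for h, r, t in cand_triples:
        keys.append(W_k @ (W_e2r @ E_ent[h] + E_rel[r]))   # Phi_K(k_i), Eq. (8)
        values.append(W_v @ E_ent[t])                      # Phi_V(v_i), Eq. (9)
    return torch.stack(keys), torch.stack(values)          # each of shape (N, d_hid)
```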

At the z-th reasoning hop, QA2MN reasons over the memory slots by (i) computing the relevance probability between the query vector \(q_z \in \mathbf {R}^{d_{hid}}\) and the key slots, (ii) reading from the value slots, and (iii) updating the query representation based on the value reading output and the question hidden representation.

Key addressing

Key addressing computes the relevance probability distribution between \(q_z\) and \(\varPhi _K(k_i)\) in the key slots

$$\begin{aligned} p_i^{qk} = \text {softmax}(q_z \varPhi _K(k_i)). \end{aligned}$$
(10)

Value reading

The value reading component reads out the values of the value slots by taking a weighted sum over them with \(p_i^{qk}\)

$$\begin{aligned} o_z = \sum _{i=1}^{N} p_i^{qk} \varPhi _V(v_i). \end{aligned}$$
(11)

Query updating

The value reading output is used to update the query representation and shift the query focus for the next reasoning hop. First, we compute the attention distribution between the value reading output \(o_z\) and the hidden representation of each token in the question

$$\begin{aligned} p_i^{vq} = \text {softmax}(o_z h_{x_i}). \end{aligned}$$
(12)

Then, we update the query vector by summing the value reading output \(o_z\) and the weighted sum over the question tokens with \(p_i^{vq}\)

$$\begin{aligned} q_{z+1} = o_z + \sum _{i=1}^{M} p_i^{vq} h_{x_i}. \end{aligned}$$
(13)
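A single reasoning hop, combining key addressing, value reading, and the question-aware query update of Eqs. (10)–(13), can be sketched as follows (unbatched, with illustrative names):

```python
import torch

def reasoning_hop(q_z, keys, values, H_X):
    """One QA2MN reasoning hop over the memory slots (Eqs. 10-13).

    q_z: (d_hid,) query vector; keys, values: (N, d_hid) memory slots;
    H_X: (M, d_hid) per-token question representations.
    """
    p_qk = torch.softmax(keys @ q_z, dim=0)   # Eq. (10): key addressing
    o_z = p_qk @ values                       # Eq. (11): value reading
    p_vq = torch.softmax(H_X @ o_z, dim=0)    # Eq. (12): question-aware attention
    q_next = o_z + p_vq @ H_X                 # Eq. (13): query updating
    return q_next, o_z
```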

Answer prediction

We initialize the query \(q_1\) with the self-attention of the question representation \(H_{X}\)

$$\begin{aligned} q_1 = \sum _{i=1}^{M}{\text {softmax}(h_{x_M}^\top h_{x_i})h_{x_i}}, \end{aligned}$$
(14)

where \(h_{x_M}=[\overrightarrow{h}_{x_M},\overleftarrow{h}_{x_1}]\) is the integrated representation of the question, and \(\top \) is the transposition operator. After Z hops of reasoning over the memories, the final value representation \(o_Z\) is used to make the final prediction over all candidate answers. We compute the matching score between \(o_Z\) and the candidate answers and normalize it into the range (0, 1)

$$\begin{aligned} P(y) = \text {softmax}(W_p o_{Z}), \end{aligned}$$
(15)

where \(W_p \in {\mathbb {R}}^{L\times d_{hid}}\) is a trainable parameter. The candidate answers are then ranked by their scores.
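Putting the pieces together, a sketch of the query initialization (Eq. 14), Z reasoning hops, and answer prediction (Eq. 15) is shown below; it reuses the `reasoning_hop` sketch above and, for brevity, approximates the integrated question vector by the last hidden state rather than concatenating the two directions' final states as in the paper.

```python
import torch

def answer_prediction(H_X, keys, values, W_p, Z=3):
    """Initialize q_1 by self-attention, run Z hops, and score the candidates.

    W_p: (L, d_hid) projection onto the L candidate answers.
    """
    h_last = H_X[-1]                                  # approximation of h_{x_M}
    q = torch.softmax(H_X @ h_last, dim=0) @ H_X      # Eq. (14): self-attentive init
    for _ in range(Z):
        q, o = reasoning_hop(q, keys, values, H_X)    # see the sketch above
    return torch.softmax(W_p @ o, dim=0)              # Eq. (15): P(y) over candidates
```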

Training

The training process is divided into two stages. We first pre-train the KG embedding for several epochs, and then we optimize the parameters of QA2MN and the KG embedding iteratively. We combine the three distance scores stated in Eqs. (2), (4), and (6) as the loss function for KG embedding training

$$\begin{aligned} L_{KGE} = \sum _{T_i \in {\mathscr {G}}}\left( d_i^{t} + d_i^{C_h} + d_i^{C_t}\right) . \end{aligned}$$
(16)

For QA2MN optimization, we use cross-entropy to define the loss function. Given an input question X, we denote y as the gold answer and P(y) as the predicted answer distribution. We compute the cross-entropy loss between y and P(y) by

$$\begin{aligned} L_{QA} = -\sum _{X}{y\cdot \log P(y)}. \end{aligned}$$
(17)
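The two training objectives can be sketched as follows; `kge_loss` sums the distances of Eq. (16) over the triples, and `qa_loss` realizes the cross-entropy of Eq. (17) via PyTorch's `cross_entropy`, which expects the unnormalized scores \(W_p o_Z\) and applies the softmax of Eq. (15) internally. The function names are illustrative.

```python
import torch
import torch.nn.functional as F

def kge_loss(distances):
    """Eq. (16): sum of d_i^t + d_i^{C_h} + d_i^{C_t} over all triples.

    distances: iterable of (d_t, d_ch, d_ct) scalar tensors, one per triple.
    """
    return sum(d_t + d_ch + d_ct for d_t, d_ch, d_ct in distances)

def qa_loss(scores, gold_idx):
    """Eq. (17): cross-entropy between the gold answer and P(y).

    scores: (L,) unnormalized candidate scores W_p o_Z; gold_idx: gold answer index.
    """
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([gold_idx]))
```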

Experiments

Dataset

We evaluate QA2MN and the baselines on PathQuestion [33] and WorldCup2014 [31], two representative datasets for complex multi-hop question answering.

Table 2 Dataset statistics of PQ, PQL, and WC
  • PathQuestion (PQ): It is a manually generated dataset built from predefined templates, and its knowledge base is adopted from a subset of FB13 [21]. PathQuestion-Large (PQL) is more challenging, with fewer training instances and a larger-scale knowledge base adopted from Freebase [2]. Both contain two-hop relation questions (2H) and three-hop relation questions (3H).

  • WorldCup2014 (WC): The dataset is based on a knowledge base about the soccer players that participated in the 2014 FIFA World Cup. It contains single-relation questions (1H), two-hop relation questions (2H), and conjunctive questions (C); M denotes the mixture of 1H and 2H. The statistics of PathQuestion and WorldCup2014 are listed in Table 2.

The complete-KG setting in the original datasets is too idealized, since the model has sufficient supporting information to answer the questions. However, missing links are common in practical applications, so the model should also be able to work in an incomplete-KG setting. Following [33], we simulate an incomplete KG, named PQ-50, by randomly removing half of the triples from the PQ-2H dataset.
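A minimal sketch of this incomplete-KG simulation, assuming the KG is given as a list of triples (the function name and the fixed seed are illustrative):

```python
import random

def make_incomplete_kg(triples, keep_ratio=0.5, seed=0):
    """Simulate the PQ-50 setting by randomly keeping only half of the KG triples."""
    rng = random.Random(seed)
    return rng.sample(triples, int(len(triples) * keep_ratio))
```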

Evaluation metric

Following [19], we measure model performance by Hits@1, i.e., the percentage of examples for which the predicted answer exactly matches a gold answer. When a question has multiple possible answers, the prediction is considered correct if it matches any one of them.
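For reference, Hits@1 under this protocol can be computed as in the short sketch below (illustrative helper name; gold answers are given as sets because a question may have several acceptable answers):

```python
def hits_at_1(predictions, gold_answers):
    """Hits@1: fraction of questions whose top-ranked answer is a gold answer."""
    correct = sum(pred in gold for pred, gold in zip(predictions, gold_answers))
    return correct / len(predictions)

# Example: hits_at_1(["Spain", "Italy"], [{"Spain"}, {"Germany", "France"}]) -> 0.5
```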

Implementation detail

We use ADAM [12] to optimize the trainable parameters. Gradients are clipped when their norm exceeds 10. We partition the datasets in the proportion 8:1:1 for training, validation, and testing. The batch size is set to 48. The relation hop K is set to 3 and the reasoning hop Z is set to 3. The learning rate is initialized to \(10^{-3}\) and exponentially annealed within the range [\(10^{-5}\), \(10^{-3}\)] with a decay rate of 0.96. The entity embedding dimension and the relation embedding dimension are set to 100. The token embedding dimension and hidden size are also set to 100. To improve model generalization, a dropout mechanism is adopted that randomly masks 10% of the memory slots.
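The optimizer and schedule described above could be set up roughly as follows; this is an illustrative sketch with a stand-in model and dummy data, not the authors' training script (the learning-rate floor is enforced by clamping the decay multiplier at \(10^{-5}/10^{-3}\)).

```python
import torch

model = torch.nn.Linear(100, 100)                    # stand-in for the QA2MN parameters
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: max(0.96 ** epoch, 1e-2))  # floor at 1e-5

for epoch in range(100):
    x, y = torch.randn(48, 100), torch.randn(48, 100)  # dummy batch of size 48
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10)  # clip at norm 10
    optimizer.step()
    scheduler.step()                                   # exponential decay per epoch
```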

For the pre-training of the KG embedding, we use the same optimizer and embedding dimensions as above, set the batch size to 64, and pre-train the KG embeddings for 20 epochs.

Baseline

We compare QA2MN against six baselines, including the current state-of-the-art models, listed as follows:

  • Seq2Seq [23]. It is an encoder–decoder model, adopting an LSTM to encode the input question sequence and another LSTM to decode the answer path.

  • MemNN [22]. It is an end-to-end memory network that stores the KG triplets in memory arrays by bag-of-words representation.

  • KV-MemNN [16]. It uses a key-value memory neural network to generalize the original memory network by dividing the memory arrays into two parts. For each triplet, the head and the relation are stored in the key slot, and the tail is stored in the value slot.

  • IRN [33]. It proposes an interpretable, hop-by-hop reasoning process to predict the complete intermediate relation path. The answer module chooses the corresponding entity from the KB at each hop, and the last selected entity is taken as the answer.

  • IRN-weak [33]. IRN requires labeling the complete paths from topic entities to gold answers, which needs extra annotation of the dataset. IRN-weak is a variant of IRN that only utilizes supervision from the final answer.

  • SRN [19]. SRN formulates multi-relation question answering as a sequential decision problem. The model performs path search over the knowledge graph to obtain the answer and proposes a potential-based reward shaping strategy to alleviate the delayed and sparse reward problem caused by weak supervision.

Table 3 Hits@1 accuracy of QA2MN and the baselines on the two datasets

Experimental result

The results are shown in Table 3. QA2MN outperforms or is comparable to all the baselines on the two datasets, which demonstrates that QA2MN is effective and robust across different datasets and questions. Seq2Seq shows the worst performance on the two datasets, indicating that multi-hop question answering is a challenging problem and that the vanilla Seq2Seq model is not good at complex reasoning. KV-MemNN always outperforms MemNN, confirming that the key-value architecture of KV-MemNN gives more flexibility to encode the triplets in the KG and is more applicable to the multi-hop reasoning problem. From further observations, we draw the following conclusions:

(1) QA2MN is robust on both simple and complex questions.

We classify datasets with fewer hops and a larger data scale as simple questions, including PQ-2H and WC-1H. Correspondingly, complex questions have more hops and a smaller data scale, including PQ-3H, PQL-3H, WC-2H, and WC-C.

As can be seen from Table 3, QA2MN performs similarly to the prior state-of-the-art models on simple questions, since predicting the correct answer is less challenging when the answer is directly connected to the core entity. For complex questions, IRN and SRN significantly lag behind QA2MN, showing that multi-hop reasoning is a challenging task even for prior state-of-the-art models. IRN initializes the question by summing the token embeddings into a single vector, which loses the priority information in the question. The action space of SRN grows exponentially as the number of reasoning hops increases. Therefore, a performance drop is inevitable for IRN and SRN as the questions become more complex. On the other hand, the highest scores on PQ-3H, PQL-3H, WC-2H, and WC-C reveal that QA2MN is able to precisely focus on the proper position of the question and infer the correct entity from the candidate triplets. The results thus suggest that QA2MN is more robust when facing complex multi-hop questions.

(2) QA2MN is effective in the incomplete KG setting.

In the incomplete KG setting, only half of the original triples are retained. Current models like IRN require a path between the core entity and the answer entity. In contrast, QA2MN uses a dropout mechanism to randomly mask triplets in the memory slots to prevent over-fitting. QA2MN can implicitly explore the observed and unobserved paths around the core entity, which greatly improves its robustness in the incomplete setting. Therefore, even when there is no path between the core entity and the answer entity, QA2MN can still predict the answer.

(3) QA2MN meets the current demand with weakly supervised learning.

IRN outperforms IRN-weak because IRN uses full supervision along the whole intermediate relation and entity path. However, fully supervised methods require a large amount of data annotation, which is costly and impractical in most cases [19]. That is to say, weakly supervised or unsupervised methods are better suited to the current demand.

QA2MN and SRN achieve the best and second-best performance on the two datasets, which confirms that weakly supervised methods have great potential to explore the inherent semantic information in knowledge graphs.

Ablation study

To further verify the significance of the graph context-based knowledge graph embedding and the question-aware query update mechanism, we conduct a model ablation to explore the following two questions: (i) is KG embedding necessary for model training? (ii) is the question-aware query update mechanism helpful for reasoning over the knowledge graph? We use two ablation models to answer them.

  • QA2MN\(\backslash \)KE. It removes the pre-training of KG embedding.

  • QA2MN\(\backslash \)QA. It removes the question-aware query update mechanism and replaces it with standard key-value memory neural network.

Table 4 Hits@1 accuracy of ablation models on PQ and PQL

We evaluate the ablation models on the PQ and PQL datasets and include KV-MemNN for comparison. As shown in Table 4, compared with QA2MN, the performance drops noticeably after removing either of the two components, which confirms that both the knowledge graph embedding and the question-aware query update mechanism are effective for improving model performance.

QA2MN\(\backslash \)QA always outperforms KV-MemNN, which shows that the KG embedding adds context information from the knowledge base to improve the representation of entities and relations. QA2MN\(\backslash \)KE also outperforms KV-MemNN, which confirms that the question-aware query update mechanism helps the model deal with more complex questions. To account for the performance improvement, we visualize the weight distributions over the question during the reasoning process in the next subsection.

Visualization analysis

To illustrate how QA2MN allocates attention hop-by-hop in the reasoning process, we choose a test example from PathQuestion and visualize the attention distributions over the question in each reasoning step.

Fig. 5

Attention weight heat-map of the question “what is the archduke_johann_of_austria -s mother -s father -s religious belief ?”. The columns are the tokens in the question and the rows are the attention weights in each reasoning step

Figure 5 shows the attention heat-map of the question “what is the archduke_johann_of_austria -s mother -s father -s religious belief ?”. The question contains one core entity (i.e., “archduke_johann_of_austria”) and three relations (i.e., “mother”, “father”, and “religious belief”). To answer the question, three triplets, i.e., (archduke_johann_of_austria, parents, maria_louisa_of_spain), (maria_louisa_of_spain, parents, charles_iii_of_spain), and (charles_iii_of_spain, religion, catholicism), are needed for the reasoning. From Fig. 5, we find that QA2MN can focus on the correct position during the reasoning process as humans do: the question-aware attention detects the relation “mother” initially, then turns to “father”, and finally focuses on “religious belief”.

Previous work often uses a bag-of-words representation or an RNN/LSTM/GRU to encode the question into a single vector, resulting in the loss of the inherent priority information in the sentence. In the reasoning process, this integrated vector is used to retrieve and rank the candidate triplets, and it is challenging for such a coarse-grained semantic representation to support complex reasoning. Figure 5 intuitively illustrates the fine-grained information brought by question-aware attention, which is also the main reason for the performance improvement. That is to say, question-aware attention can effectively exploit the priority within the question and utilize the fine-grained information for precise reasoning.

Conclusion

Multi-hop question answering over knowledge bases is a challenging task, and two main aspects need to be addressed. First, multi-hop questions have more varied and complicated semantic information. Second, the triplets have implicit relations, since some of them share heads or tails. We propose QA2MN to dynamically focus on different parts of the question across reasoning steps. In addition, KG embedding is incorporated to learn the representations of entities and relations and to extract the context information in the knowledge graph. Extensive experiments demonstrate that QA2MN achieves state-of-the-art performance on two representative datasets.

In applications, there are more complex questions that require arithmetic functions or Boolean logical operations. Furthermore, users may ask sequences of questions, which leads to the co-reference resolution problem. We will explore these problems in future work.