Question-Aware Memory Network for Multi-hop Question Answering in Human-Robot Interaction

Knowledge graph question answering is an important technology for intelligent human-robot interaction; it aims to automatically answer a human's natural language question over a given knowledge graph. For multi-relation questions, which show higher variety and complexity, the tokens of the question carry different priorities for triple selection across the reasoning steps. Most existing models take the question as a whole and ignore this priority information. To solve this problem, we propose a question-aware memory network for multi-hop question answering, named QA2MN, which dynamically updates the attention on the question during the reasoning process. In addition, we incorporate graph context information into the knowledge graph embedding model to increase its ability to represent entities and relations. We use the pre-trained embeddings to initialize QA2MN and fine-tune them during training. We evaluate QA2MN on PathQuestion and WorldCup2014, two representative datasets for complex multi-hop question answering. QA2MN achieves state-of-the-art Hits@1 accuracy on both datasets, which validates the effectiveness of our model.


Introduction
Intelligent human-robot interaction provides a convenient way for communication between humans and robots [1,2,3,4]. Question answering over knowledge base (KBQA) is one of the important technologies of intelligent human-robot interaction. It aims at using a given knowledge base to answer users' natural language questions by cognitive computing [5]. The development of the semantic web and improvements in information acquisition technology promote the establishment and application of large-scale knowledge graphs (KGs), e.g., Freebase [6], DBpedia [7], etc. The massive information contained in knowledge graphs further promotes the research and application of KBQA. Therefore, recent years have witnessed an increasing demand for conversational question answering agents that allow users to query a large-scale knowledge base (KB) in natural language [8].
Answering a user's natural language question with a structured knowledge base is a long-standing problem. A typical KB can be viewed as a knowledge graph consisting of entities, properties, and the relations between them [9,10]. Historically, KBQA can be divided into two mainstreams [11]. The first branch, the semantic parsing method (SP-based method), tries to parse the natural language question into a logical form that can be used to query the knowledge base, e.g., SPARQL, λ-DCS [12] and λ-calculus. However, SP-based methods heavily depend on data annotation and hand-crafted templates. The second branch treats KBQA as an information retrieval problem (IR-based method). This approach encodes the question and each candidate answer as high-dimensional vectors in a continuous semantic space, and a ranking model is used to predict the correct answers. Recently, deep learning has also led an upward trend for IR-based methods. These approaches range from simple neural embedding based models [13], to attention based recurrent models [14], and then to memory-augmented neural controller architectures [11,15,16].
More recent work [18,19,20,21] focuses on enhancing the reasoning capability for multi-hop questions. Lei et al. [22,23] proposed a series of works on multi-turn conversational recommendation under the human-computer interaction mode, which promoted the application and development of this mode in NLP-related tasks such as recommendation and question answering. Specifically, a multi-hop question contains multiple relations and needs more inference steps to reach the final answer. For example, in Figure 1, consider the question "which country does L MESSI play professional in ?", where more than one relation (i.e., "plays in club" and "is in country") is involved.

[Figure 1. An example of multi-hop reasoning over a knowledge graph [17]. The rounded rectangles represent the entities in the KG and the solid arrows represent the relations between entities. The dotted arrows represent the attention flow in the reasoning process. The entity "L MESSI" is the first part to focus on, the phrase "play professional in" next and "country" finally.]

Due to the variety and complexity of knowledge and semantic information, multi-hop question answering over knowledge bases is still a challenging task. Generally, there are two challenges that need to be addressed.
First, a multi-hop question has more complicated semantic information. The tokens of the question have different influence on triple selection in each reasoning step. Take the question in Figure 1 for example: the entity "L MESSI" is the first part that should be focused on, the phrase "play professional in" next, and "country" finally. Accordingly, the model should dynamically pay attention to different parts of the question during reasoning. However, current models often take the question as a whole and ignore the priority information in it.
Second, the triplets have implicit relationships, as some of them share entities or relations. Following the way humans think, we often find associated information from context. For example, "FC Barcelona" and "Real Madrid CF" share the same tail entity "Spain", which would strengthen our memory that the two clubs are located in the same country. So the implicit graph context between triplets needs to be modeled to improve the representation of entities and relations [24]. However, previous work only considers individual triplets and local information; the implicit graph context of the knowledge base has not been fully explored.
Considering the aforementioned challenges, we propose an architecture with question-aware attention that dynamically attends to different parts of the question in the reasoning process. We implement the architecture with a key-value memory neural network, named QA2MN (Question-Aware Memory Network for Question Answering), to update the attention on the question in a timely manner during reasoning. To improve the representation of entities, we utilize a KG embedding model to pre-train the embeddings of entities and relations. Since triplets are modeled and scored independently in general KG embedding models, we integrate graph context into the scoring function to enrich the semantic representation.
To summarize, our contributions are three-fold: (i) we propose a novel architecture with question-aware attention in the reasoning process and implement it as QA2MN to improve the query update mechanism; (ii) we incorporate graph context information into the KG embedding model to improve the representation of entities and relations; (iii) we achieve state-of-the-art Hits@1 accuracy on two representative datasets, and the ablation study demonstrates the interpretability of QA2MN.
The rest of the paper is structured as follows. We first give a review of related work in Section 2. Then the background is presented in Section 3 and the detailed approach follows in Section 4. Experimental setups and results are reported in Section 5. Finally, we end the paper with the conclusion and future work in Section 6.

Related work
Traditional SP-based models heavily depend on predefined templates instead of exploring the inherent information in the knowledge graph [25,8]. Yih et al. [26] propose a query graph method that effectively leverages the graph information by pruning the semantic parsing space, which simplifies the difficulty of semantic matching. For multi-hop questions, Xu et al. [27] use a key-value memory neural network to store the graph information and propose a new query update mechanism that removes the keys and values already located by the query when updating, so the model can better attend to the content that needs reasoning in the next step. SP-based methods produce a logic form representation of the natural language question, and a query operation follows to get the final answer. However, SP-based methods more or less rely on feature engineering and data annotation. In addition, they demand that researchers master the syntax and logic structures of the data, which poses additional difficulties for non-expert researchers.
The IR-based methods treat KBQA as an information retrieval problem by modeling questions and candidate answers with a ranking algorithm. Bordes et al. [13] first employed embedding vectors to encode the question and knowledge graph into a high-dimensional semantic space. Hao et al. [14] presented a novel cross-attention based neural network model to consider the mutual influence between the representation of questions and the corresponding answer aspects, where an attention mechanism was used to learn the dynamic relevance between the answer and the words in the question to effectively improve matching performance. Chen et al. [11] proposed a bidirectional attentive memory network to capture the pairwise correlation between the question and the knowledge graph information and simultaneously improve the query expression via the attention mechanism. However, these models are insufficient for multi-relation questions due to the lack of multi-hop reasoning ability. Zhou et al. [18] proposed an interpretable, hop-by-hop reasoning process for multi-hop question answering. The model predicts the complete reasoning path up to the final answer. However, considering the cost of data collection, it is scarcely possible to generalize to other domains. So weak supervision¹, with only the final answer labeled, is better suited to current needs. The IR-based method converts the graph query operation into a data-driven, learnable matching problem and can directly obtain the final answer through end-to-end training. Its advantage is that it reduces the dependence on hand-crafted templates and feature engineering, while the method is blamed for poor interpretability.
Recent work [19,28] also formulates multi-hop question answering as a sequential decision problem. Zhang et al. [19] treat the topic entity as a latent variable and handle multi-hop reasoning with variational inference.

¹ Full supervision means annotating the complete answer path up to the final answer; weak supervision means only the final answer is labeled; unsupervised learning means no label is needed. For example, for the question "which country does L MESSI play professional in ?", full supervision would annotate the complete answer path (L MESSI, plays in club, FC Barcelona), (FC Barcelona, is in country, Spain) and "Spain", while weak supervision only provides the final answer "Spain".
Qiu et al. [28] performs path search with weak supervision to retrieve the final answer. The model proposes a potential-based reward shaping strategy to alleviate the delayed and sparse reward problem.

Task Description
For a given structured knowledge graph G, with entity set E and relation set R, each triplet T = (h, r, t) ∈ G represents an atomic fact, where h ∈ E, t ∈ E and r ∈ R denote the head entity, the tail entity and the relation between them. Given a natural language question X, the task is to reason over G and predict the answer Y. Generally, the possible answers include (i) an entity from the entity set E, (ii) the numerical result of an arithmetic operation, such as SUM or COUNT, and (iii) one of the possible boolean values, such as True or False [5]. In this paper, we mainly focus on the first problem of entity-centric natural language questions. To facilitate understanding, we summarize the important symbols used in the paper in Table 1.

Preliminary

KG Embedding
KG embedding converts the symbolic representation of knowledge triples in a KG into continuous semantic spaces by embedding entities and relations into high-dimensional vectors [29]. It can effectively improve downstream tasks such as KG completion [30,31], relation extraction [32] and KBQA [33].

For each e ∈ E and r ∈ R, KG embedding first maps them into continuous hidden representations E_e and E_r. Then, a scoring function ψ(E_h, E_r, E_t) assigns a score to a possible triple (h, r, t) to measure its plausibility. Triplets existing in G tend to receive higher scores than those that do not. To learn these entity and relation representations, an optimization method is used to maximize the total plausibility of the observed triplets.
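As a concrete illustration, a TransE-style translational scoring function (one common choice; the scoring function actually used in this paper is described in Section 4) can be sketched as follows:

```python
import numpy as np

def transe_score(E_h, E_r, E_t):
    """TransE-style plausibility: the smaller the distance
    || E_h + E_r - E_t ||, the more plausible the triple,
    so we return the negated distance as the score."""
    return -float(np.linalg.norm(E_h + E_r - E_t))

# A triple that satisfies h + r ≈ t scores higher than a corrupted one.
E_h = np.array([1.0, 0.0])
E_r = np.array([0.0, 1.0])
E_t = np.array([1.0, 1.0])          # consistent tail
E_t_bad = np.array([3.0, -2.0])     # corrupted tail
```

Training would then push observed triples toward higher scores than corrupted ones, e.g. with a margin ranking loss.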

Memory Neural Network
The memory neural network [34] is well known for its multi-hop reasoning ability and has been successfully applied in many natural language processing applications such as question answering [11] and reading comprehension [34]. A memory neural network is often stacked with multiple layers; each layer has two independent embedding matrices to transform the supporting facts into an input memory representation and an output memory representation. As shown in Figure 2(a), given the query vector, it first finds the supporting memories from the input memory representation and then produces output features by a weighted sum over the output memory representation. The key-value memory neural network generalizes the standard memory network by dividing the memory arrays into two parts, i.e., the key slot and the value slot, as shown in Figure 2(b). The model learns to use the query to address relevant memories with the keys, whose values are subsequently returned for output computation. Compared to the flat representation in the standard memory network, the key-value architecture gives more flexibility to encode prior knowledge via functionality separation and is more applicable to complex structured knowledge sources [27,35].
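A minimal sketch of a single key-value memory read, with toy vectors and hypothetical shapes (keys and values as (N, d) arrays), may make the addressing/reading split concrete:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - np.max(s))
    return e / e.sum()

def kv_memory_read(q, keys, values):
    """One read over a key-value memory: the query addresses the
    keys, and the resulting distribution weights a sum over values."""
    p = softmax(keys @ q)   # relevance over the N memory slots
    return p @ values       # (d,) output feature

# Toy memory with two slots; the query matches the first key.
keys = np.array([[1.0, 0.0], [0.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0]])
q = np.array([5.0, 0.0])
out = kv_memory_read(q, keys, values)   # dominated by the first value
```

The separation lets keys encode how memories are matched while values encode what is returned.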
(a) Memory neural network. Φ A and Φ C denote the input embedding matrix and the output embedding matrix.
(b) Key-value memory neural network. Φ K and Φ V denote the key embedding matrix and the value embedding matrix.

We use a three-stage model for question answering. First, we exploit the graph context information in the knowledge base by pre-training a KG embedding model. Then, we use a Bi-directional Gated Recurrent Unit (BiGRU) to encode the question into a continuous hidden representation. Finally, we use a question-aware key-value memory network to reason over the knowledge graph. The proposed QA2MN has three main components, i.e., KG Embedding, Question Encoder and KG Reasoning. Figure 3 illustrates the architecture.

KG Embedding with Graph Context
We adopt a translational distance model [36] to train the embeddings of entities and relations. For each fact T_i = (h_i, r_i, t_i) ∈ G, we apply a translational distance constraint on the entities and the relation:

W_{e2r} E_{h_i} + E_{r_i} ≈ W_{e2r} E_{t_i},    (1)

where E_{h_i} ∈ R^{d_ent}, E_{r_i} ∈ R^{d_rel} and E_{t_i} ∈ R^{d_ent} are the embeddings of the head entity, relation, and tail entity respectively, and W_{e2r} ∈ R^{d_rel×d_ent} is a projection matrix from the entity space to the relation space. In our implementation, d_ent is equal to d_rel. Then, we obtain the translational distance score by

s_i = || W_{e2r} E_{h_i} + E_{r_i} − W_{e2r} E_{t_i} ||,    (2)

where || · || denotes the l_2 norm. To explore the implicit context information of the knowledge graph, we integrate graph context into the distance scoring to improve the representation of entities. For the triplet T_i, we consider two kinds of context information: (i) head-related context: all the triples that share the same head as T_i, i.e., C_h(T_i) = {T_j | T_j = (h_j, r_j, t_j) ∈ G, h_j = h_i}; (ii) tail-related context: all the triples that share the same tail as T_i, i.e., C_t(T_i) = {T_j | T_j = (h_j, r_j, t_j) ∈ G, t_j = t_i}.

First, we integrate the head-related context with E_{h_i} by averaging over the triplets from C_h(T_i):

Ẽ_{h_i} = (1 / |C_h(T_i)|) Σ_{T_j ∈ C_h(T_i)} W_{e2r}^{-1} (W_{e2r} E_{t_j} − E_{r_j}),    (3)

where |C_h(T_i)| is the number of head-related context triplets and ^{-1} is the inverse operator. Then, we compute the distance between the head-related context representation and E_{h_i} by

s_i^h = || Ẽ_{h_i} − E_{h_i} ||.    (4)

In the same way, we compute the tail-related context representation Ẽ_{t_i} as the average over the triplets in C_t(T_i):

Ẽ_{t_i} = (1 / |C_t(T_i)|) Σ_{T_j ∈ C_t(T_i)} W_{e2r}^{-1} (W_{e2r} E_{h_j} + E_{r_j}),    (5)

where |C_t(T_i)| is the number of tail-related context triplets. Correspondingly, the distance between the tail-related context representation and E_{t_i} is computed by

s_i^t = || Ẽ_{t_i} − E_{t_i} ||.    (6)
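A minimal numpy sketch of the triplet distance and the two graph-context distances described in this subsection follows; the inverse-projection form used to aggregate context triples is our assumption, not a statement of the exact implementation:

```python
import numpy as np

def l2(x):
    return float(np.linalg.norm(x))

def triple_score(Eh, Er, Et, W):
    # Translational distance of one triplet: || W Eh + Er - W Et ||
    return l2(W @ Eh + Er - W @ Et)

def head_context_score(Eh_i, head_context, W):
    """head_context: list of (Er_j, Et_j) from triples sharing head h_i.
    Each context triple implies an estimate of the head embedding;
    the inverse-projection form here is an assumption."""
    W_inv = np.linalg.inv(W)
    estimates = [W_inv @ (W @ Et - Er) for Er, Et in head_context]
    return l2(np.mean(estimates, axis=0) - Eh_i)

def tail_context_score(Et_i, tail_context, W):
    """tail_context: list of (Eh_j, Er_j) from triples sharing tail t_i."""
    W_inv = np.linalg.inv(W)
    estimates = [W_inv @ (W @ Eh + Er) for Eh, Er in tail_context]
    return l2(np.mean(estimates, axis=0) - Et_i)

# With an identity projection, a consistent triple yields zero distance.
W = np.eye(3)
Eh, Er, Et = np.array([1.0, 0, 0]), np.array([0, 1.0, 0]), np.array([1.0, 1.0, 0])
```

All three distances are zero exactly when the triplet and its context agree with the translational constraint.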

Question Encoder
We use a BiGRU [37] to encode the question, keeping both token-level and sequence-level information. Given a question X = [x_1, x_2, ..., x_M], where M is the total number of tokens in X, we feed X into the BiGRU encoder, which is computed as follows:

h⃗_i = GRU(E_{x_i}, h⃗_{i−1}),  h⃖_i = GRU(E_{x_i}, h⃖_{i+1}),  h_i = [h⃗_i ; h⃖_i],

where GRU is the standard Gated Recurrent Unit, E_{x_i} ∈ R^{d_emb} is the embedding of token x_i, h_i is the hidden representation, d_emb is the token embedding size and d_hid is the hidden size. Then we obtain the hidden representations for all tokens, H = [h_1, h_2, ..., h_M].
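The bidirectional encoding can be sketched with a hand-rolled GRU cell in numpy (a toy illustration with random weights; any deep-learning framework's GRU would serve in practice):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, P):
    """One GRU step; P holds the six weight matrices."""
    Wz, Uz, Wr, Ur, Wh, Uh = P
    z = sigmoid(Wz @ x + Uz @ h)             # update gate
    r = sigmoid(Wr @ x + Ur @ h)             # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1.0 - z) * h + z * h_cand

def bigru_encode(X, P_fwd, P_bwd, d_hid):
    """Encode token embeddings X (list of (d_emb,) arrays) into
    per-token states [forward; backward] of size 2*d_hid."""
    h = np.zeros(d_hid); fwd = []
    for x in X:                        # left-to-right pass
        h = gru_step(x, h, P_fwd); fwd.append(h)
    h = np.zeros(d_hid); bwd = [None] * len(X)
    for i in reversed(range(len(X))):  # right-to-left pass
        h = gru_step(X[i], h, P_bwd); bwd[i] = h
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
d_emb, d_hid, M = 4, 3, 5
P_f = [rng.normal(size=(d_hid, d_emb)) if i % 2 == 0
       else rng.normal(size=(d_hid, d_hid)) for i in range(6)]
P_b = [rng.normal(size=(d_hid, d_emb)) if i % 2 == 0
       else rng.normal(size=(d_hid, d_hid)) for i in range(6)]
X = [rng.normal(size=d_emb) for _ in range(M)]
H = bigru_encode(X, P_f, P_b, d_hid)
```

Each token thus receives a state that sees both its left and right context, which the reasoning module later attends over.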

KG Reasoning
The vanilla key-value memory neural network focuses on understanding the knowledge triplets in the memory slots. It often encodes the question as a whole vector and ignores its priority information, which is largely sufficient for single-relation questions but insufficient for complex multi-hop questions. To improve the reasoning ability of the key-value memory neural network, we introduce QA2MN to dynamically pay attention to different parts of the question in each reasoning step. In implementation, QA2MN consists of five parts, i.e., key hashing, key addressing, value reading, query updating and answer prediction.

Key Hashing
Key hashing uses the question to select a list of candidate triplets to fill the memory slots. Specifically, we first detect the core entity mentioned in the question and find its neighboring entities within K relation hops. Then, we extract all triplets in G that contain any of these entities as the candidate triplets, denoted as T_C = {T_1, T_2, ..., T_N}, where N is the number of candidate triplets. All the entities in T_C are extracted as candidate answers, denoted as A_C = {A_1, A_2, ..., A_L}, where L is the number of candidate answers. For each candidate triplet T_i = (h_i, r_i, t_i) ∈ T_C, we store the head and relation in the i-th key slot:

Φ_K(k_i) = W_k (E_{h_i} + E_{r_i}).

Correspondingly, the tail is stored in the i-th value slot:

Φ_V(v_i) = W_v E_{t_i},

where W_k ∈ R^{d_hid×d_rel} and W_v ∈ R^{d_hid×d_ent} are trainable parameters. At the z-th reasoning hop, QA2MN reasons over the memory slots by (i) computing the relevance probability between the query vector q_z ∈ R^{d_hid} and the key slots, (ii) reading from the value slots, and (iii) updating the query representation based on the value reading output and the question hidden representations.
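The candidate-selection step can be sketched as a K-hop expansion from the core entity (a simplified illustration on a toy graph; entity linking is assumed already done):

```python
def candidate_triplets(triples, core, K):
    """Collect all triplets within K relation hops of the core
    entity, plus the candidate-answer entity set."""
    reached = {core}
    selected = set()
    for _ in range(K):
        hits = {(h, r, t) for (h, r, t) in triples
                if h in reached or t in reached}
        selected |= hits
        reached |= {e for (h, _, t) in hits for e in (h, t)}
    answers = {e for (h, _, t) in selected for e in (h, t)}
    return selected, answers

kg = [("L MESSI", "plays in club", "FC Barcelona"),
      ("FC Barcelona", "is in country", "Spain"),
      ("Spain", "capital", "Madrid"),
      ("Germany", "capital", "Berlin")]
T_C, A_C = candidate_triplets(kg, "L MESSI", K=2)
```

Triples unreachable from the core entity within K hops (here the "Germany" fact) never enter the memory, which keeps N manageable.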

Key Addressing
Key addressing computes the relevance probability distribution between q_z and the keys Φ_K(k_i) in the key slots:

p_{qk}^i = softmax(q_z^T Φ_K(k_i)).

Value Reading
The value reading component reads out the value of each value slot by taking the weighted sum over them with p_{qk}^i:

o_z = Σ_{i=1}^{N} p_{qk}^i Φ_V(v_i).

Query Updating
The value reading output is used to update the query representation to shift the query focus for the next reasoning hop. First, we compute the attention distribution between the value reading output o_z and the hidden representation of each token in the question:

p_{vq}^i = softmax(o_z^T h_i).

Then, we update the query vector by summing the value reading output o_z and the weighted sum over the question tokens with p_{vq}^i:

q_{z+1} = o_z + Σ_{i=1}^{M} p_{vq}^i h_i.
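One full reasoning hop (key addressing, value reading, question-aware query updating) can be sketched as follows, with hypothetical shapes: Phi_K and Phi_V as (N, d) slot matrices and H as the (M, d) matrix of question token states:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - np.max(s))
    return e / e.sum()

def reasoning_hop(q_z, Phi_K, Phi_V, H):
    """One QA2MN-style hop (a sketch, not the exact implementation)."""
    p_qk = softmax(Phi_K @ q_z)   # key addressing over N slots
    o_z = p_qk @ Phi_V            # value reading
    p_vq = softmax(H @ o_z)       # question-aware attention over M tokens
    q_next = o_z + p_vq @ H       # query updating
    return q_next, o_z

rng = np.random.default_rng(1)
N, M, d = 6, 4, 8
Phi_K, Phi_V = rng.normal(size=(N, d)), rng.normal(size=(N, d))
H = rng.normal(size=(M, d))
q1 = rng.normal(size=d)
q2, o1 = reasoning_hop(q1, Phi_K, Phi_V, H)
```

Because q_{z+1} mixes the retrieved value with a re-weighted view of the question tokens, the next hop can attend to a different phrase of the question.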

Answer Prediction
We initialize the query q_1 with a self-attention over the question representation:

α_i = softmax(h̄^T h_i),  q_1 = Σ_{i=1}^{M} α_i h_i,

where h̄ denotes the integrated representation of the question and ^T is the transposition operator. After Z hops of reasoning over the memories, the final value representation o_Z is used to perform the prediction over all candidate answers. We compute the matching score between o_Z and the candidate answers and normalize it into the range (0, 1):

P(y) = softmax(W_p o_Z),

where W_p ∈ R^{L×d_hid} is a trainable parameter. Finally, the candidate answers are ranked by their scores.
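The query initialization and the final prediction can be sketched together; taking the mean token state as the self-attention probe is our assumption for illustration:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - np.max(s))
    return e / e.sum()

def init_query(H):
    """Initialize q_1 by self-attention over the token states H (M, d);
    the mean state as the attention probe is an assumption."""
    h_bar = H.mean(axis=0)
    alpha = softmax(H @ h_bar)
    return alpha @ H

def predict(o_Z, W_p):
    """Score the L candidate answers and normalize with softmax."""
    return softmax(W_p @ o_Z)

rng = np.random.default_rng(2)
M, d, L = 5, 8, 7
H = rng.normal(size=(M, d))
W_p = rng.normal(size=(L, d))
q1 = init_query(H)
P = predict(rng.normal(size=d), W_p)
```

The softmax output is a proper distribution over the L candidate answers, so ranking by score is equivalent to ranking by probability.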

Training
The training process can be divided into two stages: we first pre-train the KG embedding for several epochs, then we optimize the parameters of QA2MN and the KG embedding iteratively. We combine the three distance scores stated in Equations (2), (4) and (6) as the loss function for KG embedding training:

L_KG = Σ_{T_i ∈ G} (s_i + s_i^h + s_i^t),

where s_i, s_i^h and s_i^t are the triplet distance and the head- and tail-related context distances. For QA2MN optimization, we use cross-entropy to define the loss function. Given an input question X, we denote y as the gold answer and P(y) as the predicted answer distribution, and compute the cross-entropy loss between them:

L_QA = − Σ_{j=1}^{L} y_j log P(y_j),

where y_j is the one-hot indicator of the gold answer.
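Both losses are simple to sketch; the tuple layout of the KG distance scores below is a hypothetical interface, not the paper's actual code:

```python
import numpy as np

def kg_loss(scores):
    """KG pre-training loss: sum of the triplet distance and the
    head-/tail-context distances over all facts.
    scores: list of (s_i, s_h_i, s_t_i) tuples."""
    return float(sum(s + sh + st for s, sh, st in scores))

def qa_loss(P, gold_idx):
    """Cross-entropy between the one-hot gold answer and the
    predicted distribution P over candidate answers."""
    return float(-np.log(P[gold_idx]))

P = np.array([0.1, 0.7, 0.2])
loss = qa_loss(P, 1)   # -log(0.7)
```

In the iterative stage, gradients from L_QA can also flow into the entity and relation embeddings, fine-tuning them for the QA objective.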

Dataset
PathQuestion [18] and WorldCup2014 [17] are two representative datasets for complex multi-hop question answering; we employ them to evaluate QA2MN and the baselines.
• PathQuestion (PQ): a manually generated dataset built with predefined templates, whose knowledge base is adopted from a subset of FB13 [38]. PathQuestion-Large (PQL) is more challenging, with fewer training instances and a larger-scale knowledge base adopted from Freebase [6]. Both contain two-hop relation questions (2H) and three-hop relation questions (3H). The statistics of the datasets are shown in Table 2.
The complete-KG setting in the original datasets is too idealized, because links are often missing in practical applications. So the model should also be able to work in an incomplete KG setting. Following Zhou et al. [18], we simulate an incomplete KG setting, named PQ-50, by randomly removing half of the triples from the PQ-2H dataset.
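The PQ-50 construction amounts to random subsampling of the triple set, which can be sketched as:

```python
import random

def make_incomplete(triples, keep_ratio=0.5, seed=0):
    """Simulate an incomplete KG (as in PQ-50) by randomly
    keeping only keep_ratio of the triples."""
    rng = random.Random(seed)
    triples = list(triples)
    k = int(len(triples) * keep_ratio)
    return rng.sample(triples, k)

kg = [(f"e{i}", "r", f"e{i+1}") for i in range(10)]
kg_50 = make_incomplete(kg, keep_ratio=0.5)
```

Fixing the seed keeps the removed half reproducible across model runs, so all systems are evaluated on the same incomplete graph.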

Evaluation Metric
Following Qiu et al. [28], we measure the performance of the models by Hits@1, which is the percentage of examples where the predicted answer exactly matches the gold one. When a question has multiple possible answers, the prediction is counted as correct if it matches any one of them.
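The metric is straightforward to compute; a minimal sketch with made-up examples:

```python
def hits_at_1(predictions, gold_answers):
    """Hits@1: fraction of examples whose top-1 prediction matches
    any of the acceptable gold answers for that question."""
    correct = sum(1 for pred, golds in zip(predictions, gold_answers)
                  if pred in golds)
    return correct / len(predictions)

preds = ["Spain", "Berlin", "catholicism"]
golds = [{"Spain"}, {"Germany"}, {"catholicism", "catholic"}]
score = hits_at_1(preds, golds)   # 2 of 3 correct
```

Representing each gold answer set as a Python set handles the multiple-acceptable-answers case directly.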

Implementation Detail
For the training of QA2MN, we use ADAM [39] to optimize the trainable parameters. Gradients are clipped when their norm is larger than 10. We partition the datasets in the proportion of 8:1:1 for training, validation and testing. The batch size is set to 48. The relation hop K is set to 3 and the reasoning hop Z is set to 3. The learning rate is initialized to 10^{-3} and exponentially annealed within the range [10^{-5}, 10^{-3}] with a decay rate of 0.96. The entity embedding dimension and the relation embedding dimension are set to 100. The token embedding dimension and hidden size are also set to 100. To increase model generalization, a dropout mechanism is adopted by randomly masking 10% of the memory slots.
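The annealing schedule can be sketched as a clipped exponential decay; decaying once per epoch is our assumption, as the decay interval is not stated:

```python
def learning_rate(epoch, lr0=1e-3, decay=0.96, lr_min=1e-5):
    """Exponentially annealed learning rate, clipped to the range
    [1e-5, 1e-3] (per-epoch decay is an assumption)."""
    return max(lr_min, lr0 * decay ** epoch)

lrs = [learning_rate(e) for e in range(300)]
```

The rate decreases monotonically from 10^{-3} and flattens at the 10^{-5} floor.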
For the pre-training of KG embedding, we set the same optimizer and embedding dimension as above. We set the batch size to 64 and pre-train KG embeddings for 20 epochs.

Baseline
We compare against six baselines, including the current state-of-the-art models. All of them are listed as follows:
• Seq2Seq [40]. It is an encoder-decoder model, adopting one LSTM to encode the input question sequence and another LSTM to decode the answer path.
• MemNN [34]. It is an end-to-end memory network that stores the KG triplets in memory arrays by bag-of-words representation.
• KV-MemNN [35]. It uses a key-value memory neural network to generalize the original memory network by dividing the memory arrays into two parts. For each triplet, the head and the relation are stored in the key slot, and the tail is stored in the value slot.
• IRN [18]. It proposes an interpretable, hop-by-hop reasoning process to predict the complete intermediate relation path. The answer module chooses the corresponding entity from KB at each hop and the last selected entity is chosen as the answer.
• IRN-weak. IRN needs labels for the complete paths from topic entities to gold answers, which requires extra annotation of the dataset. IRN-weak is a variant of IRN that only utilizes supervision from the final answer.
• SRN [28]. SRN formulates multi-relation question answering as a sequential decision problem. The model performs path search over the knowledge graph to obtain the answer and proposes a potential-based reward shaping strategy to alleviate the delayed and sparse reward problem caused by weak supervision.

Experimental Result

The results are shown in Table 3. QA2MN outperforms or shows comparable performance to all the baselines on the two datasets, which demonstrates that QA2MN is effective and robust in the face of different datasets and questions. Seq2Seq shows the worst performance on both datasets, indicating that multi-hop question answering is a challenging problem and the vanilla Seq2Seq model is not good at the complex reasoning process. KV-MemNN always outperforms MemNN, confirming that the key-value architecture of KV-MemNN gives more flexibility to encode the triplets in the KG and is more applicable to the multi-hop reasoning problem. After further observation, we draw the following conclusions.

(1) QA2MN shows robustness on both simple and complex questions.
We classify simple questions as those from datasets with fewer hops and larger data scale, including PQ-2H and WC-1H. Correspondingly, complex questions have more hops and smaller data scale, including PQ-3H, PQL-3H, WC-2H and WC-C.
As can be seen from Table 3, QA2MN performs similarly to the prior state-of-the-art models on simple questions, since it is less challenging to predict the correct answer when the answer is directly connected to the core entity. For complex questions, IRN and SRN significantly lag behind QA2MN, showing that multi-hop reasoning is challenging even for prior state-of-the-art models. IRN initializes the question representation by adding the token embeddings into a whole vector, which loses the priority information in the question. The action space of SRN grows exponentially as the number of reasoning hops increases. So a performance drop is unavoidable for IRN and SRN as the question becomes more complex. On the other hand, the highest scores on PQ-3H, PQL-3H, WC-2H and WC-C reveal that QA2MN is able to precisely focus on the proper position of the question and infer the correct entity from the candidate triplets. The result suggests that QA2MN is more robust when facing complex multi-hop questions.
(2) QA2MN is robust in the incomplete KG setting.
In the incomplete KG setting, only half of the original triples are retained. Models like IRN require a complete path between the core entity and the answer entity. In contrast, QA2MN uses a dropout mechanism that randomly masks triplets in the memory slots to prevent over-fitting, and it can implicitly explore both observed and unobserved paths around the core entity, which greatly improves its robustness in the incomplete setting. So even when there is no path between the core and answer entities, QA2MN can still predict the answer.
(3) QA2MN meets the current demand with weak-supervision learning. IRN outperforms IRN-weak because IRN uses full supervision along the whole intermediate relation and entity path. However, fully supervised methods need a large amount of data annotation, which is costly and impractical in most cases [28]. That is to say, weakly supervised or unsupervised methods are better suited to the current demand.
QA2MN and SRN achieve the best and second-best performance on the two datasets, which confirms that weakly supervised methods have great potential to explore the inherent semantic information in the knowledge graph.

Ablation Study
To further verify the significance of the question-aware query update mechanism and the knowledge graph embedding, we perform a model ablation to explore the following two questions: (i) is KG embedding necessary for model training? (ii) is the question-aware query update mechanism helpful for reasoning over the knowledge graph? We use two ablated models to answer them.
• QA2MN\KE. It removes the pre-training of the KG embedding.
• QA2MN\QA. It removes the question-aware query update mechanism and replaces it with a standard key-value memory neural network.

We evaluate the ablated models on the PQ and PQL datasets and take KV-MemNN for comparison. As shown in Table 4, compared with QA2MN, the performance drops noticeably after removing either of the two components, which confirms that both the question-aware query update mechanism and the knowledge graph embedding are effective for improving model performance.
QA2MN\QA always outperforms KV-MemNN, which confirms that KG embedding adds context information from the knowledge base to improve the representation of entities and relations. QA2MN\KE outperforms KV-MemNN as well, which confirms that the question-aware query update mechanism helps the model deal with more complex questions. To account for the performance improvement, we visualize the weight distributions over the question during the reasoning process in the next subsection.

Visualization Analysis
To illustrate how QA2MN allocates attention hop-by-hop in the reasoning process, we choose a test example from PathQuestion and visualize the attention distribution over the question in each reasoning step. Figure 4 shows the attention heat-map for the question "what is the archduke johann of austria -s mother -s father -s religious belief ?". The question contains a core entity (i.e., "archduke johann of austria") and three relations (i.e., "mother", "father" and "religious belief"). To answer the question, three triplets, i.e., (archduke johann of austria, parents, maria louisa of spain), (maria louisa of spain, parents, charles iii of spain) and (charles iii of spain, religion, catholicism), are needed to enable the reasoning. From Figure 4, we find that QA2MN focuses on the correct position during the reasoning process, as humans do. The question-aware attention detects the relation "mother" initially; then the attention turns to "father" and finally focuses on "religious belief".
Previous work often uses bag-of-words representations or RNN/LSTM/GRU models to encode the question into an integrated vector, resulting in the loss of the inherent priority information in the sentence. In the reasoning process, the integrated vector is used to retrieve and rank the candidate triplets; it is challenging for such a coarse-grained semantic representation to perform complex reasoning. Figure 4 intuitively illustrates the fine-grained information brought by question-aware attention, which is the main reason for the performance improvement. That is to say, question-aware attention can effectively exploit the priority information of the question and utilize the fine-grained information for precise reasoning.

Conclusion
Multi-hop question answering over knowledge bases is a challenging task. There are two main aspects that need to be addressed. First, multi-hop questions have more varied and complicated semantic information. Second, the triplets have implicit relations, as some of them share heads or tails. We propose QA2MN to dynamically focus on different parts of the question during the reasoning steps. In addition, KG embedding is incorporated to learn the representations of entities and relations and to extract the context information in the knowledge graph. Extensive experiments demonstrate that QA2MN achieves state-of-the-art performance on two representative datasets.
In applications, there are more complex questions that need arithmetic functions or boolean logical operations. Furthermore, users may ask sequential questions continuously, which leads to the co-reference resolution problem. We will explore these problems in future work.