Introduction

Building machines that behave as intelligently as humans has long been a goal pursued by researchers. Because the concept of intelligence is difficult to define precisely, Turing proposed the famous Turing test: if a machine can hold a conversation with humans without being identified as a machine, it is said to be intelligent. The Turing test has long served as a symbol of artificial intelligence, and a question answering (QA) system is itself a Turing-test scenario [1]. If a QA system were as intelligent as a human, it would effectively pass the Turing test [2]. Therefore, the study of QA systems has always received a great deal of attention [3].

Fig. 1 Incorporating copying and prediction mechanisms in generating a natural answer

Traditional knowledge QA provides a precise answer entity for the question raised by the user; for example, the question “What is the capital of China?" is answered with “Beijing". However, returning such a bare answer entity is not a very friendly way to respond [4]. In a real-world environment, users hope to receive a complete answer expressed in natural language sentences, such as “The capital of China is called Beijing." Based on these observations, we note that in the real-world environment, a “natural answer", which is expressed as a complete natural language sentence rather than a single semantic unit, is a promising solution for QA systems [5]. Figure 1 illustrates the major steps in the natural answer generation task: analyzing the question, retrieving the related triple from knowledge graphs (KGs) and generating the natural language answer. Firstly, the system identifies the focus entity (namely China) and the main relation (namely capital) in the question. Then, the entity and the relation are mapped to a triple query over the KGs (namely (China, capital, ?)), and the related fact triple (China, capital, Beijing) is retrieved from the KGs. Finally, based on the natural language question, the retrieved fact triple and the vocabulary table, the system can automatically generate mostly grammatical, sentence-length natural language text. As shown in Fig. 1, the words of the generated answer are obtained in different ways: the common words are predicted using a conditional model (e.g. “is called"); the major phrases are copied from the user question (e.g. “The capital of China"); and the tail entity is copied from the input triple (e.g. “Beijing").

Natural answers can be widely used in knowledge services such as community QA and intelligent customer service. Natural answer generation in knowledge QA therefore has clear practical significance and a strong application background [6]. Existing works have made progress, but they still suffer from some problems: (1) universal answers: some studies generate safe, universally relevant responses with little meaning, e.g., “something" [7] and “I don’t know" [8]; (2) incomplete output answers: it is impractical to retrieve the related fact and then generate a meaningful response from an insufficient input question, especially when the input question is very short. Moreover, we observe that there are still two important research problems (RPs) which are not handled well or are even neglected.

RPs-1: The generated natural answer is required to copy semantic units from the user’s question and the input triple. Existing methods for KGs-based QA (KGs-QA), such as GenQA [9], can retrieve facts from KGs with neural models; unfortunately, they cannot copy semantic units from the input question when generating target answer sequences. Conversely, methods that can generate natural answers cannot interact with KGs [7]. For instance, CopyNet [10] can copy words from the source question when generating target answer sequences, but it cannot copy the entity from fact triples.

RPs-2: The generated natural answer is required to match the question word. For instance, natural answer generation models aim at generating the natural answer “Xiao Ming’s home is in China." for the input question “Where is Xiao Ming’s home?". The generated natural answer “Xiao Ming’s home is in China." is a better answer than “Xiao Ming’s nationality is Chinese." or “Xiao Ming is Chinese.", because the target answer sequence “Xiao Ming’s home is in China." matches the question word (namely, Where) better. That is, a natural answer generation system should generate question-type words (namely, \(in \ China\)) in the target answer sequences.

To deal with the aforementioned two problems, we propose a novel attention-based recurrent neural network (RNN) for natural answer generation, which incorporates multi-level copying mechanisms (question copying and entity copying) and question-aware loss to generate natural answers automatically. The multi-level copying mechanisms and the prediction mechanism are able to copy semantic units from the user question and the input triple, and to predict the common words from the conditional model. In addition, we propose question-aware loss, which optimizes the cross-entropy between the generated answer and the input question and thus helps generate natural answer sequences that match the user question.

In brief, our main contributions are as follows:

  • To generate correct answers expressed in a more natural way, we leverage multi-level copying mechanisms and the prediction mechanism, which are able to copy semantic units (words, phrases and entities) from the source question and the input triple, and to predict the common words from the conditional model.

  • To make the generated target answer sequences correspond to the input question, we propose question-aware loss. With question-aware loss, the model is more likely to generate question-type words in target answer sequences, which contributes to generating target answer sequences that match the question word.

  • Experiments on three response generation tasks show that our model is superior in quality while requiring significantly less testing time. The experimental results show that our method can generate grammatical and contextual natural answers according to user needs.

The remainder of this paper is structured as follows. In “Related work”, we provide a review of related work. “Task description and challenges” discusses task description and challenges that need to be faced. We provide our detailed approach in “Model”. “Experiments” reports experimental settings and experimental results, and “Conclusion” presents our conclusion.

Related work

In a real-world environment, users hope to receive complete answers expressed in natural language sentences. We note that natural answer generation is a promising solution for QA systems.

Template generation methods. With the continuous improvement of computing power and model capacity, it has become possible to convert data with different structures into natural text. Template generation methods are among the earliest methods applied in the field of natural text generation [11]. A series of templates, each containing some fixed constants and variable slots, are designed and constructed in advance to achieve natural text generation [12]. This approach is widely adopted in industrial applications because of its interpretability and controllability, which can ensure the correctness of the output text [13]. However, the template generation technique also has some disadvantages. Firstly, it is difficult to achieve end-to-end optimization because templates must be designed and constructed manually, and human intervention is needed to build high-quality templates. Secondly, the limited number of variable slots in a template places a low upper bound on the information it can convey, which results in a lack of diversity in the generated text. Finally, due to the fixed linguistic structure of the template, the generated text may have problems with fluency and coherence. To overcome these shortcomings, researchers have explored other methods to improve the quality of natural text generation.

Neural generation methods. With the development of deep learning technology, researchers began to use neural sequence generation methods to automatically generate natural text from data with different structures [14, 15]. These methods can significantly improve the performance of text generation by learning a better vocabulary model and the dependencies between contexts through deep neural networks [16]. At present, natural text generation research mainly relies on deep learning technology, including convolutional neural networks [17], gated recurrent units [18] and RNNs [19]. To conduct data-to-text generation tasks, Jiang et al. [20] proposed a novel pipeline-assisted neural network, which integrated traditional pipeline modules and neural generation systems. The results demonstrated that the proposed model outperformed existing solutions. To generate natural text, Tian et al. [21] proposed a deep neural network model to approximate the optimal solutions of large-scale multiobjective optimization problems. According to the experimental results on eight benchmark problems and eight real-world problems, the proposed algorithm could effectively solve sparse large-scale multiobjective optimization problems with 10,000 decision variables using only 100,000 evaluations. Noraset et al. [19] proposed a novel deep learning model for the natural language generation task, which used a dynamic regularizer that updated as training proceeded based on the generative behavior of the RNN language model. The proposed approach improved word-level repetition statistics by a factor of four in the RNN language model. Compared with template generation methods, neural network language models require less manual intervention and can generate richer and more fluent text descriptions. However, due to the limitations of the corpus, neural network models are difficult to control directly [22, 23], which means the output text cannot be guaranteed to be consistent with the input data.

Knowledge-enhanced generation methods. In order to improve the practicality of the generated text, it is necessary to optimize the neural network models. Yu et al. [24] proposed a knowledge-enhanced text generation method, which used knowledge graph embeddings as external knowledge to assist natural text generation. Since commonsense knowledge provides background knowledge, knowledge-enhanced text generation methods can increase the richness and diversity of the generated text. Madotto et al. [25] proposed a novel yet simple end-to-end differentiable model called memory-to-sequence, which addressed the challenge of integrating KGs into end-to-end task-oriented dialogue systems. The proposed model could be trained faster and attained state-of-the-art performance on three different task-oriented dialogue datasets. Koncel et al. [6] introduced a novel graph transforming encoder for graph-to-text generation, which could leverage the relational structure of such KGs without imposing linearization or hierarchical constraints. Automatic and human evaluations showed that the proposed model produced more informative texts, which exhibited better document structure than competitive encoder-decoder methods. Wu et al. [26] proposed a community answer generation method based on KGs. This was the first work that utilized user knowledge and text semantics to improve the performance of community QA. Experiments on real datasets showed that the proposed method was superior to state-of-the-art QA frameworks. Yang et al. [27] proposed a novel multi-task and knowledge-enhanced multi-head interactive attention network for community QA, which not only used external knowledge to learn better representations of questions and answers but also improved representation learning by considering question categorization as an auxiliary task. The results on three widely used community QA datasets demonstrated that the proposed model achieved impressive results compared to other models. Yin et al. [9] proposed an end-to-end neural generative model which could generate answers to simple factoid questions based on fact triples. An empirical study showed that the proposed model could effectively deal with variations of questions and answers, and generated correct and natural answers by referring to the facts in the KGs.

Based on these observations, existing knowledge-enhanced approaches can retrieve structured data from KGs. Unfortunately, they cannot copy related semantic units from the user questions when generating target answer sequences, even though the generated natural answers require such copying. Liu et al. [28] proposed a neural encoder-decoder model with multi-level replication mechanisms to generate natural text. Moreover, He et al. [29] proposed an end-to-end QA system, which incorporated copying and retrieving mechanisms to generate natural answers. Specifically, words, phrases and entities were dynamically predicted from the vocabulary, copied from the given question and retrieved from the corresponding KGs jointly.

Attention generation methods: In recent years, with the introduction of the attention mechanism [30] and the transformer architecture [31] into the field of natural language processing, natural text generation technology has also advanced rapidly [32]. Rush et al. [33] first applied the attention mechanism and the encoder-decoder structure to natural text generation. Researchers at Google addressed long-distance dependencies in sequence-to-sequence models with the transformer architecture [34]. Despite some progress in attention-based natural answer generation methods, there are still many challenges, such as maintaining consistency with the question context, avoiding meaningless responses, and resolving grammatical errors in long sentences.

Our work distinguishes itself from previous work that enhanced neural network-based end-to-end frameworks with external knowledge: we leverage an attention-based RNN to fuse user questions with external knowledge through the attention mechanism. We adopt a decoder with two different mechanisms (namely, the copying mechanism and the prediction mechanism) to fuse the question and knowledge representations jointly and generate a natural response. In addition, we introduce question-aware loss to keep the generated target answer sequences consistent with the question. With the continuous development of deep learning technology, the attention mechanism and the transformer architecture, natural answer generation technology will achieve further breakthroughs in the future.

Fig. 2 Overall structure of the proposed natural answer generation model

Task description and challenges

In this section, we present the task description and the challenges. The purpose of this discussion is to support wise decisions in the design of our system.

Task description

We leverage a question, a fact triple and a vocabulary table to generate a natural answer. In this paper, we focus on how to generate natural answer sequences that are not only grammatical but also contextual. The task can be formalized as follows:

$$\begin{aligned} {P\left( {A\left| {Q,T} \right. } \right) = \prod _{t = 1}^{\left| A \right| }P\left( {{a_t}\left| {{a_{ < t}},Q,T,V} \right. } \right) } \end{aligned}$$
(1)

where \(A = \left( {{a_1},{a_2},...,{a_{\left| A \right| }}} \right) \) is the generated natural answer sequence, Q is the user question, \(T = \left( {h,r,t} \right) \) represents the head entity \(\left( h \right) \), the relation \(\left( r \right) \) and the tail entity \(\left( t \right) \) of the input triple, V is a settled vocabulary table, and \({a_{ < t}}\) represents all previously generated answer words.
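To make the factorization in Eq. (1) concrete, the following sketch shows how an answer would be generated word by word under this model. The functions `encode_question`, `encode_triple` and `decoder_step` are hypothetical placeholders for the encoder and decoder described later in “Model”, and greedy decoding is an illustrative choice.

```python
import torch

# Hypothetical greedy decoding loop realizing the factorization in Eq. (1).
def generate_answer(question_ids, triple, encode_question, encode_triple,
                    decoder_step, bos_id, eos_id, max_len=30):
    h_Q = encode_question(question_ids)      # question representation h_Q
    h_T = encode_triple(triple)              # triple representation h_T
    answer, prev, state = [], bos_id, None
    for _ in range(max_len):
        # one step of P(a_t | a_<t, Q, T, V)
        probs, state = decoder_step(prev, state, h_Q, h_T)
        prev = int(torch.argmax(probs))      # greedy choice of a_t
        if prev == eos_id:
            break
        answer.append(prev)
    return answer
```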

Challenges

In order to generate natural answers that are not only grammatical but also contextual for user questions, some problems should be considered. That is, how to obtain the related semantic units from the input fact triple and the source question, and how to ensure that the generated answer matches the input question.

We conclude challenges as following aspects:

  • Existing approaches are able to generate a short answer response. Unfortunately, in generating the target answer sequences, these approaches cannot copy semantic units from the source question and the input triple. In other words, the natural answer generated by the existing model is not a real answer to the user question, which is worth thinking about.

  • It’s important to note that existing models can generate natural answer sequences. However, it remains a problem whether the generated natural answer corresponds to the input question. Thus, the key challenge for natural answer generation is that the generated answer sequences should match the user question and really meet the user requirement.

Model

In this section, we present a novel attention-based RNN to generate natural answers. As illustrated in Fig. 2, the input to our model is a question and a triple, which are encoded with a bi-directional RNN (Bid-RNN). The encoder creates vector representations of the input question and the input triple (see “Encoder”). Then, the encoded representations are fed to the decoder to generate natural answers. Details of our decoding process are discussed in “Attention-based decoder”. It is worth mentioning that we additionally design question-aware loss to make the generated natural answer match the input question (see “Training”).

Encoder

All discrete input symbols (including the question and the triple) are transformed into numerical vector representations by the encoder. We describe the specific details of question encoding and triple encoding in “Question encoding layer” and “Triple encoding layer”. The encoder is summarized in Algorithm 1, which mainly consists of two parts: the question encoder and the triple encoder. The input has two parts: a question phrased in natural language Q and a fact triple T. The output is the representation of the question, \({\mathbf{{h}}_Q}\), and the representation of the triple, \({\mathbf{{h}}_T}\).

Algorithm 1 Question and triple encoding layers

Question encoding layer

The question encoding layer transforms question sequences into a sequence of concatenated hidden states with two independent RNN (namely Bid-RNN). Bid-RNN model generates \(\left\{ {\overrightarrow{h_1^{}},\overrightarrow{h_2^{}},...,\overrightarrow{h_x^{}},...,\overrightarrow{h_{{L_Q} - 1}^{}},\overrightarrow{h_{{L_Q}}^{}} } \right\} \) and \(\left\{ \overleftarrow{h_{{L_Q}}^{}},\overleftarrow{h_{{L_Q} - 1}^{}},...,\right. \left. \overleftarrow{h_x^{}},...,\overleftarrow{h_2^{}},\overleftarrow{h_1^{}} \right\} \). For a user question \(Q = \left( q_1^{},q_2^{},...,q_x^{},...,\right. \left. q_{{L_Q} - 1}^{},q_{{L_Q}}^{} \right) \), the concatenated representation for hidden states in bi-directions is shown as following:

$$\begin{aligned} {{\mathbf{{h}}_x} = \left[ {\overrightarrow{h_x^{}},\overleftarrow{h_{{L_Q} - x + 1}^{}} } \right] } \end{aligned}$$
(2)

where \({\mathbf{{h}}_x}\) is considered to be the contextual word representation of the input word \({q_x}\). Analogously, the concatenated representation \({\mathbf{{h}}_{{L_Q}}}\) of the last word \(q_{{L_Q}}^{}\) is represented as follows:

$$\begin{aligned} { {\mathbf{{h}}_{{L_Q}}} = \left[ {\overrightarrow{h{}_{{L_Q}}},\overleftarrow{{h_1}} } \right] } \end{aligned}$$
(3)

\({\mathbf{{h}}_{{L_Q}}}\) is used as the representation of the user question, which is represented as \({\mathbf{{h}}_Q}\). That is, the representation of the entire question is given as follows:

$$\begin{aligned} { {\mathbf{{h}}_Q} = {\mathbf{{h}}_{{L_Q}}} = \left[ {\overrightarrow{h{}_{{L_Q}}},\overleftarrow{{h_1}} } \right] } \end{aligned}$$
(4)
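A minimal sketch of such a Bid-RNN question encoder is given below, assuming a PyTorch GRU cell and illustrative dimensions; it returns the per-word states \({\mathbf{{h}}_x}\) of Eq. (2) and the question representation \({\mathbf{{h}}_Q}\) of Eq. (4).

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Sketch of the Bid-RNN question encoder (Eqs. (2)-(4)); the GRU cell
    and the hidden sizes are illustrative choices, not taken from the paper."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.birnn = nn.GRU(emb_dim, hidden_dim, bidirectional=True,
                            batch_first=True)

    def forward(self, question_ids):              # (batch, L_Q)
        emb = self.embed(question_ids)
        # states: concatenated forward/backward hidden states h_x, Eq. (2)
        states, last = self.birnn(emb)             # (batch, L_Q, 2*hidden_dim)
        # h_Q = [forward state of last word; backward state of first word], Eq. (4)
        h_Q = torch.cat([last[0], last[1]], dim=-1)
        return states, h_Q
```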

Triple encoding layer

In this section, we describe the triple encoding in detail. We note that the input is structured data. The head entity, the relation and the tail entity of the input triple T are represented as h, r and t. The triple encoder transforms each element of the input triple into a fixed vector, and the vector is looked up from the KGs vector matrix. To capture more triple information, the KGs vector matrix is pre-trained using ConvKB [35]. The corresponding vector representations of h, r and t are denoted \({\mathbf{{v}}_h}\), \({\mathbf{{v}}_r}\) and \({\mathbf{{v}}_t}\), respectively. For instance, the vector \({\mathbf{{v}}_h}\) of the head entity h is mapped from the KGs vector matrix \(\mathbf{{V}}\). Analogously, the relation vector \({\mathbf{{v}}_r}\) and the vector \({\mathbf{{v}}_t}\) of the tail entity t are looked up from the KGs vector matrix \(\mathbf{{V}}\). As a running example, the concatenation \(\left[ {{\mathbf{{v}}_h},{\mathbf{{v}}_r},{\mathbf{{v}}_t}} \right] \) of \({\mathbf{{v}}_h}\), \({\mathbf{{v}}_r}\) and \({\mathbf{{v}}_t}\) represents the vector of the fact triple T, which is denoted \({\mathbf{{h}}_T}\).
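A possible implementation of this lookup-and-concatenate step is sketched below; the embedding matrix is assumed to be pre-trained (e.g. by ConvKB), and the interface is illustrative.

```python
import torch
import torch.nn as nn

class TripleEncoder(nn.Module):
    """Sketch of the triple encoder: look up pre-trained KG embeddings and
    concatenate them as h_T = [v_h; v_r; v_t]. Names and dimensions are
    illustrative assumptions."""
    def __init__(self, pretrained_kg_matrix):      # (num_symbols, kg_dim) tensor
        super().__init__()
        self.kg_embed = nn.Embedding.from_pretrained(pretrained_kg_matrix,
                                                     freeze=False)

    def forward(self, h_id, r_id, t_id):
        v_h = self.kg_embed(h_id)
        v_r = self.kg_embed(r_id)
        v_t = self.kg_embed(t_id)
        h_T = torch.cat([v_h, v_r, v_t], dim=-1)   # triple representation h_T
        return h_T, (v_h, v_r, v_t)
```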

Attention-based decoder

The decoder is responsible for generating target answer sequences. An attention-based RNN is used to generate natural answer sequences based on the vector representations of the question and the triple. Compared with other decoders, the decoding process of our model differs in the following respects:

Target sequence generation. Our model generates target sequences based on a mixed probabilistic model of the multi-level copying mechanism and the prediction mechanism where the first mechanism copies target words from the question and the triple, and the second mechanism predicts answer words from the decoder’s vocabulary table.

Reading vector representations \({\mathbf{{h}}_Q}\) and \({\mathbf{{h}}_T}\). \({\mathbf{{h}}_Q}\) and \({\mathbf{{h}}_T}\) are integrated into the decoder, which not only integrates “meaning" with vector representations but also feeds corresponding positional information.

At each time step t in the decoding process, considering the hidden state of decoder \({\mathbf{{d}}_t}\), the representation of the entire natural question \({\mathbf{{h}}_Q}\) and the vector of the fact \({\mathbf{{h}}_T}\), the probabilistic function for generating any answer word \({\mathbf{{w}}_t}\) is a “mixture" model, with the probabilistic function defined as follows

$$\begin{aligned} { \begin{array}{l} P\left( {{\mathbf{{w}}_t}\left| {{\mathbf{{d}}_t},{\mathbf{{w}}_{t - 1,}}{\mathbf{{h}}_Q},{\mathbf{{h}}_T}} \right. } \right) \\ \quad = {P_{cop{y_Q}}}\left( {{\mathbf{{w}}_t}\left| {{\mathbf{{d}}_t},{\mathbf{{w}}_{t - 1}},{\mathbf{{h}}_Q}} \right. } \right) \times {P_d}\left( {cop{y_Q}\left| {{\mathbf{{d}}_t},{\mathbf{{w}}_{t - 1}}} \right. } \right) \\ \qquad + \mathrm{{ }}{P_{cop{y_T}}}\left( {{\mathbf{{w}}_t}\left| {{\mathbf{{d}}_t},{\mathbf{{w}}_{t - 1}},{\mathbf{{h}}_T}} \right. } \right) \times {P_d}\left( {cop{y_T}\left| {{\mathbf{{d}}_t},{\mathbf{{w}}_{t - 1}}} \right. } \right) \\ \qquad + \mathrm{{ }}{P_{predict}}\left( {{\mathbf{{w}}_t}\left| {{\mathbf{{d}}_t},{\mathbf{{w}}_{t - 1}},{\mathbf{{c}}_t}} \right. } \right) \times {P_d}\left( {predict\left| {{\mathbf{{d}}_t},{\mathbf{{w}}_{t - 1}}} \right. } \right) \mathrm{{ }} \end{array} }\end{aligned}$$
(5)

where \(cop{y_Q}\), \(cop{y_T}\) and predict respectively stand for the copying mechanism for the source question, the copying mechanism for the input triple and the prediction mechanism for common words, \({P_d}\left( { \cdot \left| \cdot \right. } \right) \) shows the probability model for choosing different mechanism, and \({\mathbf{{c}}_t}\) represents the context vector.
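The mixture in Eq. (5) can be read as a gated combination of three word distributions. The sketch below shows one way to implement this combination, assuming each mechanism produces unnormalized scores over a shared extended vocabulary and a small gating network produces the mode logits.

```python
import torch
import torch.nn.functional as F

def mixture_step(score_copy_q, score_copy_t, score_predict, mode_logits):
    """Sketch of Eq. (5): combine the three per-mode word distributions with
    the mode-selection probability P_d. All inputs are assumed to be
    unnormalized scores over the same extended vocabulary."""
    p_copy_q = F.softmax(score_copy_q, dim=-1)     # P_copyQ(w_t | ...)
    p_copy_t = F.softmax(score_copy_t, dim=-1)     # P_copyT(w_t | ...)
    p_predict = F.softmax(score_predict, dim=-1)   # P_predict(w_t | ...)
    p_mode = F.softmax(mode_logits, dim=-1)        # P_d over {copy_Q, copy_T, predict}
    return (p_mode[..., 0:1] * p_copy_q
            + p_mode[..., 1:2] * p_copy_t
            + p_mode[..., 2:3] * p_predict)
```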

The decoder learns the attention value written as \({\alpha _{tm}}\), which is computed using the following formula:

$$\begin{aligned} { {\alpha _{tm}} = \frac{{{e^{\sigma \left( {{\mathbf{{d}}_{t - 1}},{\mathbf{{h}}_m}} \right) }}}}{{\sum \nolimits _{m'} {{e^{\sigma \left( {{\mathbf{{d}}_{t - 1}},{{\mathbf{{h'}}}_m}} \right) }}} }} }\end{aligned}$$
(6)

where the function \(\sigma \) is used to compute the attentive strength with each source state \({\mathbf{{h}}_m}\). Most of the current decoders use the fixed context vector \(\textbf{c}\) for the prediction mechanism, which often leads to suboptimal results when generating target words from a fixed context vector \(\textbf{c}\). However, the attention mechanism allows for dynamic selection of hidden states at each time step to generate different context vectors. In this paper, we leverage the attention mechanism to dynamically select relevant hidden states, thereby generating different context vectors \(\mathbf{{c}}_t\) and improving the quality of generated results. The context vector \({\mathbf{{c}}_t}\) is computed as follows:

$$\begin{aligned} { \begin{array}{l} {\mathbf{{c}}_t} = \sum \limits _{m = 1}^{{L}} {{\alpha _{tm}}} {\mathbf{{h}}_m}\\ \end{array} }\end{aligned}$$
(7)

According to Formula (6),

$$\begin{aligned} { \begin{array}{l} {\mathbf{{c}}_t} = \sum \limits _{m = 1}^{{L}} {\frac{{{e^{\sigma \left( {{\mathbf{{d}}_{t - 1}},{\mathbf{{h}}_m}} \right) }}}}{{\sum \nolimits _{m'} {{e^{\sigma \left( {{\mathbf{{d}}_{t - 1}},{{\mathbf{{h'}}}_m}} \right) }}} }}{\mathbf{{h}}_m}} \\ \end{array} }\end{aligned}$$
(8)
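A compact sketch of Eqs. (6)–(8) is given below; `attn_mlp` is a hypothetical scoring network standing in for the function \(\sigma \), and the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def attention_context(d_prev, source_states, attn_mlp):
    """Sketch of Eqs. (6)-(8): score each source state h_m against the
    previous decoder state d_{t-1}, normalize with softmax, and take the
    weighted sum as the context vector c_t."""
    # source_states: (batch, L, dim); d_prev: (batch, dim_dec)
    expanded = d_prev.unsqueeze(1).expand(-1, source_states.size(1), -1)
    scores = attn_mlp(torch.cat([expanded, source_states], dim=-1)).squeeze(-1)
    alpha = F.softmax(scores, dim=-1)                              # alpha_tm, Eq. (6)
    c_t = torch.bmm(alpha.unsqueeze(1), source_states).squeeze(1)  # c_t, Eq. (7)
    return c_t, alpha
```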

The decoder layer is summarized in Algorithm 2, which mainly consists of three parts: the major phrases are copied from the user question; the answer entity is copied from the input triple; and the common words are predicted from the vocabulary table. The input has two parts: the representation of the question, \({\mathbf{{h}}_Q}\), and the representation of the triple, \({\mathbf{{h}}_T}\). The output is the target answer words.

Algorithm 2 Decoding algorithm

Multi-level copying mechanism

As shown in Fig. 2, multi-level copying mechanism is proposed to generate target answer sequences, which allows copying from the source question and the input triple simultaneously.

Question copying. The semantics of the user question plays a very important role in natural answer generation. However, some words in the source question are “no-meaning" symbols, and there is no need to analyze them in the encoding and decoding processes. As a running example, the reply to a question such as “Can you read the word ’read’?" is “Of course, read.", which should not consider the meaning of the second word “read". The question copying mechanism is integrated into the decoder; it does not need to consider the meaning of the question sequence and can directly copy sub-sequences of the source question when generating target answer sequences.

In the natural answer generation task, the question copying mechanism copies relevant semantic units from the user question, which refers to removing phrases besides the question word and the verb in the question. Moreover, the relevant semantic units copied by the copying mechanism are placed first in the answer sequence.

The score function for copying the word \({q_x}\) from the input question is computed as follows:

$$\begin{aligned} { {\phi _{cop{y_Q}}}\left( {{\mathbf{{w}}_t} = {\mathbf{{h}}_x}} \right) = N{N_2}\left( {{\mathbf{{h}}_x},{\mathbf{{d}}_t},{\mathbf{{H}}_Q}} \right) }\end{aligned}$$
(9)

where \({\phi _{cop{y_Q}}}\left( \cdot \right) \) is a score function for choosing answer words in the question copying mechanism, \(N{N_2}\left( \cdot \right) \) is a neural network function with a two-layer perceptron, and \({\mathbf{{H}}_Q}\) is an accumulated vector which gathers attentive history for each semantic unit in the source question.

Entity copying. Many studies have shown that most source questions contain the head entity and the relation, while the tail entity hardly appears in the source question [26]. Thus, there is a need to copy the tail entity from the input triple. To achieve this purpose, our study focuses on generating target answer sequences by copying the tail entity from the input triple.

The entity copying mechanism copies the tail entity from the input triples. As shown in Fig. 2, the input question “Where is Xiao Ming’s home?" and the fact triple “(Xiao Ming, nationality, Chinese)" are used as model inputs, and our model aims to generate natural answer sequences “Xiao Ming’s home is in China." Therefore, the tail entity copied by the entity copying mechanism needs to undergo part-of-speech conversion. Moreover, the tail entity copied by the entity copying mechanism is placed at the end of natural answer sequences.

The score function of copying the tail entity from the input triple is represented as follows:

$$\begin{aligned} { {\phi _{cop{y_T}}}\left( {{\mathbf{{w}}_t} = {\mathbf{{v}}_t}} \right) = N{N_2}\left( {{\mathbf{{v}}_t},{\mathbf{{d}}_t},{\mathbf{{H}}_T}} \right) }\end{aligned}$$
(10)

where \({\phi _{cop{y_T}}}\left( \cdot \right) \) is a score function for choosing answer words in the entity copying mechanism, and \({\mathbf{{H}}_T}\) is an accumulated vector which gathers attentive history for each entity in the input triple.
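Both copy scores share the same form: a two-layer perceptron \(N{N_2}\) over the candidate representation, the decoder state and the attentive history. The sketch below illustrates one plausible realization of \(N{N_2}\); the hidden size and the activation function are our assumptions.

```python
import torch
import torch.nn as nn

class CopyScorer(nn.Module):
    """Sketch of the two-layer perceptron NN_2 used in Eqs. (9) and (10):
    score a candidate representation (a question word state h_x or the tail
    entity vector v_t) against the decoder state d_t and the accumulated
    attentive history H. Dimensions are illustrative assumptions."""
    def __init__(self, cand_dim, dec_dim, hist_dim, hidden_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(cand_dim + dec_dim + hist_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, candidate, d_t, history):
        # phi_copy(w_t = candidate) = NN_2(candidate, d_t, H)
        return self.mlp(torch.cat([candidate, d_t, history], dim=-1)).squeeze(-1)
```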

Prediction mechanism

Both verbs and prepositions in the natural answer sequence are generated by the prediction mechanism, which are usually placed in the middle of the relevant semantic units from the user question and the tail entity. To be more specific, some words or phrases need to be predicted from a settled vocabulary (e.g. “is in" should be predicted from the vocabulary table, as shown in Fig. 2). A vocabulary V is defined as follows:

$$\begin{aligned} { V = \left\{ {{v_1},{v_2},...,{v_{n - 1}},{v_n}} \right\} \cup \left\{ {OOVW} \right\} }\end{aligned}$$
(11)

where OOVW indicates any out-of-vocabulary words. Moreover, another two sets (\({V_Q}\) and \({V_T}\)), which cover words and entities in the input question and the input triple, are adopted. That is, we have adopted the vocabulary table \(V \cup {V_Q} \cup {V_T}\) for each target answer sequence. It is worth mentioning that three vocabulary tables V, \({V_Q}\) and \({V_T}\) may overlap.

The common words are usually predicted using a conditional language model. The score function of predicting the common words from the vocabulary table is given as follows:

$$\begin{aligned} \begin{array}{l} {\phi _{predict}}\left( {{\mathbf{{w}}_t} = {\mathbf{{v}}_i}} \right) = {\mathbf{{v}}_i}{\mathbf{{W}}_{predict}}\left[ {{\mathbf{{d}}_t},{\mathbf{{c}}_t}} \right] \\ \qquad \qquad \qquad \qquad \qquad = {\mathbf{{v}}_i}{\mathbf{{W}}_{predict}}\left[ {{\mathbf{{d}}_t},{\mathbf{{c}}_{{t_Q}}},{\mathbf{{c}}_{{t_T}}}} \right] \end{array} \end{aligned}$$
(12)

According to Formula (8),

$$\begin{aligned} { \begin{array}{l} {\mathbf{{c}}_{{t_Q}}} = \sum \limits _{m = 1}^{{L_Q}} {\frac{{{e^{\sigma \left( {{\mathbf{{d}}_{{{t_Q}} - 1}},{\mathbf{{h}}_m}} \right) }}}}{{\sum \nolimits _{m'} {{e^{\sigma \left( {{\mathbf{{d}}_{{{t_Q}} - 1}},{{\mathbf{{h'}}}_m}} \right) }}} }}{\mathbf{{h}}_m}} \\ \end{array}}\end{aligned}$$
(13)
$$\begin{aligned} { \begin{array}{l} {\mathbf{{c}}_{{t_T}}} = \sum \limits _{m = 1}^{{L_T}} {\frac{{{e^{\sigma \left( {{\mathbf{{d}}_{{{t_T}} - 1}},{\mathbf{{h}}_m}} \right) }}}}{{\sum \nolimits _{m'} {{e^{\sigma \left( {{\mathbf{{d}}_{{{t_T}} - 1}},{{\mathbf{{h'}}}_m}} \right) }}} }}{\mathbf{{h}}_m}} \\ \end{array} }\end{aligned}$$
(14)
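One plausible realization of the prediction score in Eq. (12) is sketched below: the decoder state and the two context vectors are projected by \({\mathbf{{W}}_{predict}}\) and scored against every word vector \({\mathbf{{v}}_i}\) of the settled vocabulary. The shapes and the bias-free linear map are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PredictScorer(nn.Module):
    """Sketch of Eq. (12): score every word v_i in the settled vocabulary
    against the decoder state d_t and the context vectors c_tQ and c_tT.
    `vocab_matrix` holds the word vectors v_i; W_predict is a learned map."""
    def __init__(self, vocab_matrix, dec_dim, ctx_dim):
        super().__init__()
        self.vocab_matrix = nn.Parameter(vocab_matrix)            # (|V|, emb_dim)
        self.W_predict = nn.Linear(dec_dim + 2 * ctx_dim,
                                   vocab_matrix.size(1), bias=False)

    def forward(self, d_t, c_tQ, c_tT):
        projected = self.W_predict(torch.cat([d_t, c_tQ, c_tT], dim=-1))
        # phi_predict(w_t = v_i) = v_i * W_predict [d_t, c_tQ, c_tT]
        return projected @ self.vocab_matrix.t()                  # (batch, |V|)
```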
Table 1 Summary statistics of QA datasets used in the paper

Training

Answer-aware loss

Our natural answer generation model is fully differentiable and can be optimized in an end-to-end way by the back-propagation algorithm. Given the source question Q, the input triple T and the target answer sequence W, the objective is to minimize the following negative log-likelihood:

$$\begin{aligned} { {\psi _{A\_loss}} = \frac{{ - 1}}{{\left| W \right| }}\sum \limits _{t = 1}^{\left| W \right| } {\log \left[ {P\left( {{\mathbf{{w}}_t}\left| {{\mathbf{{d}}_t},{\mathbf{{w}}_{t - 1}},{\mathbf{{h}}_Q},{\mathbf{{h}}_T}} \right. } \right) } \right] } + \kappa {L_2} }\nonumber \\ \end{aligned}$$
(15)

where \(\kappa \) is a hyper-parameter for \({L_2}\). Our model is able to generate target answer sequences by optimizing answer-aware loss \({\psi _{A\_loss}}\), which contains a negative log-likelihood for generated natural reply and L2 regularization (\({L_2}\)).
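A minimal sketch of this objective is shown below, assuming the decoder exposes its per-step word distributions; the value of \(\kappa \) is illustrative, not the one used in the paper.

```python
import torch
import torch.nn.functional as F

def answer_aware_loss(step_probs, target_ids, model, kappa=1e-5):
    """Sketch of Eq. (15): average negative log-likelihood of the gold answer
    words plus an L2 penalty on the model parameters."""
    log_probs = torch.log(step_probs + 1e-12)        # (T, |V|) per-step distributions
    nll = F.nll_loss(log_probs, target_ids)          # mean over answer positions
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return nll + kappa * l2
```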

Question-aware loss

It’s important to note that the system can generate a grammatical target answer sequence by optimizing answer-aware loss \({\psi _{A\_loss}}\). However, it remains a problem whether the generated natural answer corresponds to the input question; if it does not, the generated natural answer is not a real answer to the input question and does not meet the user requirement. To address this impediment, we propose a simple but effective optimization technique, called question-aware loss, which makes the generated target answer sequences correspond to the input question. With question-aware loss, the model is more likely to generate question-type words in the target answer sequences (namely, \(in \ China\)), which contributes to generating target answer sequences that match the question word (namely, Where), as shown in Fig. 2. Formally, question-aware loss \({\psi _{Q\_loss}}\) is represented as follows:

$$\begin{aligned} { {\psi _{Q\_loss}} = \mathop {\min }\limits _{{\mathbf{{h}}_n} \in {\mathbf{{Q}}^T}} \mathop {\min }\limits _{{\mathbf{{w}}_t} \in \mathbf{{Y}}} {H_{{\mathbf{{h}}_n},{\mathbf{{w}}_t}}} }\end{aligned}$$
(16)

where \({\mathbf{{Q}}^T} = \left\{ {{\mathbf{{h}}_n}} \right\} _{n = 1}^{\left| {{Q^T}} \right| }\) is the set of question type words, \(\mathbf{{Y}}\) is the set of generated answer words, and \({H_{{\mathbf{{h}}_n},{\mathbf{{w}}_t}}}\) denotes the cross entropy between the question type word \(\mathbf{{h}}{}_n\) and the generated answer word \(\mathbf{{w}}{}_t\). The model uses the minimum cross entropy as question-aware loss \({\psi _{Q\_loss}}\). We focus on generating question type words in the answer sequences by optimizing \({\psi _{Q\_loss}}\). \({\psi _{Q\_loss}}\) is combined with \({\psi _{A\_loss}}\) via a weight coefficient \(\alpha \), and the total loss is as follows:

$$\begin{aligned} { {\psi _{total\_loss}} = {\psi _{A\_loss}} + \alpha {\psi _{Q\_loss}} }\end{aligned}$$
(17)
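The sketch below illustrates one reading of Eqs. (16) and (17): for each question-type word, compute the cross entropy against every generated answer position and keep the minimum, then take the minimum over the type words and combine with \({\psi _{A\_loss}}\). The exact pairing of type words and answer positions is our assumption.

```python
import torch

def question_aware_loss(answer_word_probs, type_word_ids):
    """Sketch of Eq. (16): minimum cross entropy between the set of
    question-type words and the generated answer words. The pairing used
    here is our reading of the paper, so treat it as an assumption."""
    # answer_word_probs: (T, |V|) distribution over the vocabulary at each step
    # type_word_ids: ids of question-type words (e.g. tokens of "in China")
    losses = []
    for w_id in type_word_ids:
        # cross entropy of each generation step against one type word: -log p(w_id)
        ce = -torch.log(answer_word_probs[:, w_id] + 1e-12)   # (T,)
        losses.append(ce.min())                               # min over answer words
    return torch.stack(losses).min()                          # min over type words

def total_loss(a_loss, q_loss, alpha=0.25):
    """Eq. (17): combined loss with weight alpha (0.25 in the experiments)."""
    return a_loss + alpha * q_loss
```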
Table 2 Summary of KGs

Experiments

In this section, we evaluate our natural answer generation model on QA corpuses and KGs datasets. (Note that QA corpuses and KGs datasets are described in more detail in “Experimental data details”.) The evaluation in this section serves two goals: (1) our model is able to generate natural language answers rather than single entity answers; (2) our model can generate natural answer sequences that conform to grammar and context for natural language questions.

Table 3 The sample fact triple, the pattern and the generated Q-A pair

Experimental settings

Experimental data details

We evaluate our model on QA corpuses (SimpleQuestions, WebQuestions and GraphQuestions) and KGs (FB15k). SimpleQuestions [36] is a single-relation KGs-QA corpus composed of questions annotated with corresponding entities from FB15k; it consists of 101,754 questions. WebQuestions [37], including 3778 training questions and 2032 testing questions, was collected via the Google Suggest API and crowdsourcing. Each question is paired with its own answer, and all answers are from FB15k. Moreover, GraphQuestions [38] is composed of 5166 questions constructed based on FB15k. The dataset presents high diversity, covering a wide range of domains including astronomy, medicine, people, etc. Specifically, this QA corpus covers 148 domains, 506 classes, 596 relations, 376 topic entities and 3026 words. We compare several QA datasets constructed from FB15k, with statistics given in Table 1. In addition, for the sake of reproducibility, experiments were conducted on the benchmark KGs, namely FB15k, which has been widely used for evaluating model performance on the QA task. Table 2 provides statistics of the benchmark KGs used.

Based on the QA corpuses, different Q-A patterns have been constructed by two generators: one is responsible for generating question patterns and the other takes charge of generating the corresponding natural answer patterns, e.g. \( What \ is \ the \ capital \ of \ Country? \) \(\rightarrow \) \( The \ capital \ of \ Country \ is \ called \ City. \), where the variables Country and City indicate the country’s name and the city’s name, respectively. Given the KGs fact triples and the Q-A patterns, we can finally obtain specific Q-A pairs, as illustrated in the sketch below. A sample fact triple, pattern and generated Q-A pair are shown in Table 3. In total, we obtain 111K instances, each of which includes a natural language question, a natural answer and a fact triple.
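A minimal sketch of this instantiation step is shown below; the pattern strings and slot names are illustrative, and the real patterns come from the two generators described above.

```python
# Sketch of instantiating a Q-A pattern with a KG fact triple to obtain one
# training instance. Pattern strings and slot names are illustrative only.
def instantiate(pattern_q, pattern_a, triple):
    head, relation, tail = triple
    question = pattern_q.replace("Country", head)
    answer = pattern_a.replace("Country", head).replace("City", tail)
    return {"question": question, "answer": answer, "triple": triple}

pair = instantiate("What is the capital of Country?",
                   "The capital of Country is called City.",
                   ("China", "capital", "Beijing"))
# -> {'question': 'What is the capital of China?',
#     'answer': 'The capital of China is called Beijing.', ...}
```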

Implementation details

In order to keep our model comparable to the baseline models, we keep the same parameter values and experimental environment. Our model is trained end-to-end to minimize the negative joint log-likelihood over the target text vocabulary. We utilize the stochastic gradient descent (SGD) algorithm with a decreasing learning rate (0.05) to optimize the model. In the self-attention layer, dropout [39] is set to 0.25. The dimension of triple embeddings is fixed at 300, and triple embeddings are pre-trained by ConvKB [35]. The word embeddings are initialized with pre-trained BERT word vectors of 300 dimensions. The weight \(\left( \alpha \right) \) of question-aware loss is set to 0.25.

Evaluation metrics

Automatic evaluation metrics. In order to evaluate the performance of different models, automatic evaluation (AE) is adopted. AE for natural answer generation is a challenging and under-researched problem. Most natural answer generation evaluation indicators come from automatic text summarization or machine translation. Common evaluation indicators are listed below with a brief introduction:

  • Bilingual evaluation understudy (BLEU): It uses the number of co-occurring n-grams, which is to calculate relevance between the generated natural answer and the gold natural answer.

  • NIST: Firstly, the amount of information of n-grams in reference text is calculated. And then, based on weighted sum of n-grams, the matching degree of texts is calculated.

  • Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-N (ROUGE-N): It considers overlap of n-grams between the generated natural answer and the gold natural answer. Let us take an example as illustration, ROUGE-1 refers to overlap of unigram.

  • ROUGE-L: It is to measure the longest matching sequence of words, which uses Longest Common Subsequence (LCS). It only needs to be in-sequence match, which reflects sentence-level word order.

  • ROUGE-S: In order to measure word order at the sentence level, it utilizes Skip-Bigram to generate discontinuous word pairs.

  • ROUGE-SU: To evaluate the matching degree between the generated natural answer and the gold natural answer, it uses Skip-Bigram and unigram together for calculating co-occurrence statistics.

  • METEOR: Firstly, WordNet is introduced to supplement the thesaurus. And then, based on chunk alignment, the fluency of sentences is measured.

As mentioned above, these indicators calculate the relevance between the machine-generated text and the original text from different aspects and are suitable for various research tasks. Based on the goal of the specific research task and the features of the evaluation indicators, appropriate metrics are chosen. To test the effect of the proposed method, we select the following metrics: BLEU and ROUGE-N.

BLEU is a measure commonly used for many natural language processing tasks (such as machine translation, dialogue, QA and chatbot tasks). The major reason is that it measures the similarity between the generated natural text and the gold natural text. Its value lies between 0 and 1 and indicates the similarity of the texts; for example, 1 means that the texts are entirely identical, and values close to 0 mean they share little content. BLEU is a relevant measure in task-oriented natural answer generation tasks, so it is included in our evaluation. The modified n-gram precision of answers is defined as

$$\begin{aligned} { \begin{array}{l} {P_{n - gram}} = \frac{{\sum \nolimits _{c \in \left\{ {Candidate} \right\} } {\sum \nolimits _{n - gram \in c} {Coun{t_{clip}}\left( {n - gram} \right) } } }}{{\sum \nolimits _{c' \in \left\{ {Candidate} \right\} } {\sum \nolimits _{n - gram' \in c'} {Count\left( {n - gram'} \right) } } }} \end{array} }\end{aligned}$$
(18)

Moreover, considering the effect of the answer length on precision, Brevity Penalty (BP) is introduced, with BP defined as follows

$$\begin{aligned} { BP = \left\{ \begin{array}{l} 1,\mathrm{{ \qquad \qquad \ \ if \ c > r}}\\ \exp \left( {1 - \frac{r}{c}} \right) ,\mathrm{{ if \ c}} \le r \end{array} \right. }\end{aligned}$$
(19)

In our experiment, we employ \(N \le 4\) and uniform weights \({\omega _n} = \frac{1}{N}\). BLEU is computed as follows:

$$\begin{aligned} { BLEU = BP \cdot \exp \left( {\sum \limits _{n = 1}^N {{\omega _n}\log {P_{n - gram}}} } \right) }\end{aligned}$$
(20)
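For reference, a self-contained sketch of Eqs. (18)–(20) for a single candidate/gold pair is given below; in practice a standard toolkit would normally be used instead.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sketch of Eqs. (18)-(20): clipped modified n-gram precision, brevity
    penalty, and uniform weights 1/N for one candidate/reference token list."""
    weights = [1.0 / max_n] * max_n
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_precisions.append(math.log(max(clipped, 1e-12) / total))
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))          # Eq. (19)
    return bp * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))
```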

ROUGE is regularly used to evaluate differences between texts. Based on the co-occurrence probability of n-grams, ROUGE-N is employed to test the semantic relevance of two texts, with ROUGE-N defined as follows

$$\begin{aligned} { \begin{array}{l} ROUGE - N \\ \qquad = \mathrm{{ }}\frac{{\sum \nolimits _{G \in \left\{ {GoldAnswer} \right\} } {\sum \nolimits _{n - gram \in G} {Coun{t_{match}}\left( {n - gram} \right) } } }}{{\sum \nolimits _{G \in \left\{ {GoldAnswer} \right\} } {\sum \nolimits _{n - gram \in G} {Count\left( {n - gram} \right) } } }} \end{array} }\end{aligned}$$
(21)

In our experiment, we employ it to measure the similarity between the generated answer and the gold answer. The numerator of ROUGE-N represents the number of n-grams which appear in the generated natural answer and the gold natural answer simultaneously. Analogously, its denominator represents the number of n-grams in the gold natural answer. It’s important to note that the value range of N is defined from 1 to 4.
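A corresponding sketch of Eq. (21) for a single candidate/gold pair is given below; only the single-reference case is handled.

```python
from collections import Counter

def rouge_n(candidate, gold, n=1):
    """Sketch of Eq. (21): recall-oriented n-gram overlap between the
    generated answer and the gold answer (single reference)."""
    gold_ngrams = Counter(tuple(gold[i:i + n]) for i in range(len(gold) - n + 1))
    cand_ngrams = Counter(tuple(candidate[i:i + n])
                          for i in range(len(candidate) - n + 1))
    match = sum(min(c, cand_ngrams[g]) for g, c in gold_ngrams.items())
    total = max(sum(gold_ngrams.values()), 1)
    return match / total
```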

Table 4 Experimental results obtained on different datasets

Unreferenced evaluation metrics. The phrase-based answer semantic similarity evaluation (PASSE) indicator, presented in [26], is introduced. Two texts, \({T_1} = \left\{ {{P_{11}},{P_{12}},...,{P_{1i}}} \right\} \) and \({T_2} = \left\{ {{P_{21}},{P_{22}},...,{P_{2j}}} \right\} \), are given, where i and j are the numbers of phrases in the two texts. \({P_{1m}}\) denotes the m-th phrase in text 1; analogously, \({P_{2n}}\) denotes the n-th phrase in text 2. The semantic similarity between a phrase P and a piece of text T is defined as follows:

$$\begin{aligned} { \begin{array}{l} sim\left( {\mathbf{{P}},\mathbf{{T}}} \right) = \mathop {\max }\limits _{{\mathbf{{P}}_k} \in \mathbf{{T}},k = 1,2,...} sim\left( {\mathbf{{P}},{\mathbf{{P}}_k}} \right) \mathrm{{ }}\\ \quad \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \mathrm{{ = }}\mathop {\max }\limits _{{\mathbf{{P}}_k} \in \mathbf{{T}},k = 1,2,...} \frac{{\mathbf{{P}} \cdot {\mathbf{{P}}_k}}}{{\left\| \mathbf{{P}} \right\| \times \left\| {{\mathbf{{P}}_k}} \right\| }}\\ \quad \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \mathrm{{ = }}\mathop {\max }\limits _{{\mathbf{{P}}_k} \in \mathbf{{T}},k = 1,2,...} \frac{{\sum \nolimits _{l = 1}^{\left| P \right| } {{\mathbf{{P}}_l} \times {\mathbf{{P}}_{kl}}} }}{{\sqrt{\sum \nolimits _{l = 1}^{\left| P \right| } {{{\left( {{\mathbf{{P}}_l}} \right) }^2}} } \times \sqrt{\sum \nolimits _{l = 1}^{\left| {{P_k}} \right| } {{{\left( {{\mathbf{{P}}_{kl}}} \right) }^2}} } }} \end{array} }\end{aligned}$$
(22)

where \({\mathbf{{P}}_k}\) is the closest to \(\mathbf{{P}}\) in the text. And then, the semantic similarity between two texts \({T_1}\) and \({T_2}\) is expressed as following:

$$\begin{aligned} { \begin{array}{l} PASSE\left( {{\mathbf{{T}}_1},{\mathbf{{T}}_2}} \right) = \frac{{\sum \nolimits _{{\mathbf{{P}}_{1m}} \in {\mathbf{{T}}_1},m = 1,2,...,i} {sim\left( {{\mathbf{{P}}_{1m}},{\mathbf{{T}}_2}} \right) } }}{i}\\ \qquad \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \mathrm{{ }} \times \frac{{\sum \nolimits _{{\mathbf{{P}}_{2n}} \in {\mathbf{{T}}_2},n = 1,2,...,j} {sim\left( {{\mathbf{{P}}_{2n}},{\mathbf{{T}}_1}} \right) } }}{j} \end{array} }\end{aligned}$$
(23)

In this paper, PASSE is used to calculate the semantic similarity between the generated natural answer and the standard natural answer. As a running example, suppose baseline approaches A1 and A2 each generate their answers. For approach A1, PASSE between each generated answer and the corresponding standard answer is calculated, and all PASSE values are averaged to obtain \(PASS{E_{A1}}\). Analogously, \(PASS{E_{A2}}\) is obtained for approach A2. If \(PASS{E_{A1}} > PASS{E_{A2}}\), we conclude that the natural answers generated by approach A1 are semantically closer to the standard natural answers than those of approach A2, and vice versa.
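The sketch below shows one way to compute PASSE for a pair of texts, assuming the phrases have already been segmented and embedded as vectors; phrase segmentation and embedding follow [26] and are outside the sketch.

```python
import numpy as np

def passe(text1_phrases, text2_phrases):
    """Sketch of Eqs. (22)-(23): phrase-level cosine similarity aggregated in
    both directions. Each argument is a list of phrase vectors."""
    def sim(p, phrases):
        # Eq. (22): max cosine similarity of phrase p against any phrase in the text
        return max(float(np.dot(p, q) /
                         (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12))
                   for q in phrases)
    forward = np.mean([sim(p, text2_phrases) for p in text1_phrases])
    backward = np.mean([sim(q, text1_phrases) for q in text2_phrases])
    return forward * backward                                     # Eq. (23)
```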

Human evaluation metrics. To evaluate the quality of the generated natural answer, automatic evaluation metrics such as BLEU and ROUGE-N measure how close the generated natural answer is to the standard natural answer. However, they still suffer from many limitations. To better evaluate generated answers, we run one further human evaluation as follows:

  • Naturalness. Following [40], the comprehensibility and readability of generated natural answers is measured. Each annotator is asked to rate each generated answer on a scale of 0–5, where (5) means perfectly clear and natural, (3) artificial but understandable, and (1) completely not understandable. We sample 100 answers from each system and have three annotators rate them.

Comparison methods

In order to assess the effectiveness and efficiency of our work, we make a comparison with the following existing models: VHRED [41], ZSQG [40], INLQA [42], QUEST [43], KGM [44], and ZSCEDT [45]. We briefly describe these baseline models below:

  • VHRED is a neural network-based generative architecture, with stochastic latent variables that span a variable number of time steps. It is essentially a conditional variational auto-encoder with the hierarchical encoder, which is able to model dependencies in the generative framework.

  • ZSQG is a neural model for text generation from KGs triples in the “ZeroShot” setup. It utilizes triple occurrences in the natural language corpus paired with an original part-of-speech copy action mechanism to generate natural text.

  • INLQA is a data-driven method for answering natural language questions based on KGs, which considers user interaction in the process of question understanding.

  • QUEST is an unsupervised approach for answering complex natural language questions, which utilizes partial results from different documents to compute the similarity join.

  • KGM is a novel pretraining-based encoder-decoder model, which can enhance the multi-turn dialogue response generation by incorporating external textual knowledge.

  • ZSCEDT is a novel approach for controlling the encoder-decoder transformer-based natural language generation model in zero shot, which enables such control for smaller models. This is done by applying two control knobs which control the generation process by directly manipulating trained natural language generation models.

Fig. 3 Running time comparison of different methods on different datasets

Overall comparisons

Table 4 presents the results of our work and existing methods on different QA corpuses. It is clear from Table 4 that our work outperforms all compared methods, obtaining the highest performance on all metrics (marked in red). Moreover, it can also be seen from Table 4 that incorporating the question-aware loss mechanism into our model contributes to generating better natural answers. Our work is better than the baselines on the BLEU metric, where the BLEU score increases by 0.007 compared with the strongest baseline ZSCEDT on the SimpleQuestions dataset. Analogously, the ROUGE-N score increases by 0.001 compared with the strongest baseline ZSCEDT on the GraphQuestions dataset. These results on three response generation corpuses indicate that our model is superior to competitive baselines on automatic evaluations.

Table 5 The BLEU score on different weights of question-aware loss

In Table 4, the last row is labeled “without question-aware loss”, which indicates that the question-aware loss mechanism is removed in our model. Comparing the performance results of the last two rows in Table 4, our model outperforms the model without question-aware loss mechanism by a significant margin. This result demonstrates the effectiveness of the question-aware loss mechanism. We note that the question-aware loss module significantly boosts the performance of our model. In Table 4, it can be seen from the comparative experiments that our model is capable of generating more natural and accurate natural answer sequences.

Performance analysis

To verify the efficiency of our model, we compared the testing time of our model and different baselines on three datasets, as shown in Fig. 3. Overall, our model has clear advantages under various comparison conditions. We analyze the experimental results and draw the relevant conclusions below.

Figure 3 shows that our model consumes less testing time than the other baselines. The critical factor is that it incorporates multi-level copying mechanisms to generate natural answers, which improves accuracy and reduces generation time. ZSCEDT brings slight improvements over VHRED and INLQA. The critical factor is that ZSCEDT applies two control knobs which control the generation process by directly manipulating trained natural language generation models. VHRED employs a high-dimensional latent variable to enhance the contextual information, which requires more time to obtain the local importance of words. INLQA is a data-driven method; it gains user features through multiple rounds of user interaction, which has a great effect on its operating efficiency. In addition, VHRED brings slight improvements over QUEST and ZSQG. QUEST generates answers from existing documents with huge volumes and therefore runs less efficiently. ZSQG utilizes triple occurrences in the natural language corpus paired with an original part-of-speech copy action mechanism; it takes much time to generate natural text from KGs triples in a “ZeroShot" setup, which affects its efficiency. The above analysis demonstrates that our model is superior to the baseline models in terms of testing time.

Fig. 4 Efficiency comparison of PASSE, ROUGE-N and BLEU

Effectiveness of question-aware loss mechanism

Table 5 reports the BLEU score under different weights of question-aware loss \(\left( \alpha \right) \), where \(\alpha \) is the weight of question-aware loss in Eq. (17). To demonstrate the effectiveness of question-aware loss, we set the weight \(\left( \alpha \right) \) to 0.05/0.1/0.2/0.25/0.3/0.5/1 (the last seven lines in Table 5). Note that, in Table 5, QAL represents question-aware loss, and w/o represents without. It can be seen that our model with question-aware loss achieves a significant improvement on BLEU, which indicates that question-aware loss contributes to generating better natural answers. When the weight of question-aware loss is 0.25 (namely \({\alpha = 0.25}\)), our model obtains the highest performance on BLEU (marked in red).

Performance analysis of PASSE

In this section, PASSE is used to measure semantic similarity between the generated natural answers and the standard natural answers. Thus, some experiments were conducted, which were based on the cumulative number of processed answer pairs and comparison frequency of basic units (word to word, n-gram to n-gram, or phrase to phrase) within a fixed amount of time.

In our research, PASSE is compared with BLEU and ROUGE-N on 24,217 answer pairs. Figure 4 shows the efficiency comparison of PASSE, ROUGE-N and BLEU. As shown in Fig. 4, the efficiency of PASSE, ROUGE-N and BLEU is stable, because both the cumulative number of answer pairs and the comparison frequency of basic units are significantly linearly related to the testing time.

The experimental results show that, in a fixed amount of time, PASSE processes more answer pairs than BLEU and ROUGE-N, as shown in Fig. 4a. We further show that different indicators (PASSE, ROUGE-N and BLEU) lead to different comparison frequencies of basic units (word to word, n-gram to n-gram, or phrase to phrase), as shown in Fig. 4b. As demonstrated in Fig. 4b, the comparison frequency of ROUGE-N and BLEU is far higher than that of PASSE. The major reason is that ROUGE-N and BLEU use n-grams and words as comparison units, whereas PASSE takes phrases as comparison units, which is the key to completing the goal in less time and with fewer comparisons.

Table 6 Performance on naturalness
Table 7 Ablation experiment by removing the main components on SimpleQuestions dataset
Fig. 5 Performance on test data through epochs

Comparison settings. To validate the effectiveness of model components, some important components in our model are removed as follows: (1) w/o QC: without the question copy mechanism; (2) w/o TC: without the triple copy mechanism; (3) w/o PM: without the prediction mechanism; (4) w/o QAL: without the question-aware loss mechanism.

Table 8 Case study

Performances on naturalness

Human evaluation can help to find possible problems of the model in a specific domain or scenario, so as to further optimize the performance of the model. In our experiments, 100 generated answers were randomly sampled from each system for human evaluation, and naturalness was then measured by three annotators. Table 6 compares the naturalness of our system with the baselines. Our system scores up to 0.05 higher on naturalness than the best-performing baseline, which demonstrates that our system can deliver more natural answers than the other baseline systems.

Ablation study for model components

Comparative results. Results of ablation study are shown in Table 7. We note that removing any component brings obvious performance degradation on all evaluation metrics. It shows that each of these components plays a crucial role in the model performance. Specifically, “w/o QC" performs the worst on all of the evaluation metrics, which demonstrates the importance of “the question copy mechanism" for the generated natural answer.

The convergence speed

In order to further explore the convergence speed, we plot the performance on test data over epochs. The curves of BLEU and ROUGE-N with respect to the number of epochs are provided in Fig. 5. In the experiment, we recorded the BLEU score and the ROUGE-N score at different epochs. As shown in Fig. 5, as the number of epochs increases, the curves become flat, which reflects the convergence of the models. Our model has more information to learn, such as the question representations and the knowledge representations, which could slow down convergence. Nevertheless, our model can copy question elements and triple elements simultaneously, which accelerates the convergence speed. As demonstrated in Fig. 5, our model achieves the best performance at almost all epochs. After about 6 epochs, the BLEU score of our model becomes stable and convergent, whereas the BLEU score of VHRED becomes flat after about 7 epochs. In addition, after about 5 epochs, the ROUGE-N score of our model becomes flat, while the ROUGE-N score of VHRED converges after about 6 epochs.

Case study

The task of our model is to generate natural language answers according to the input question and the input triple. For example, as illustrated in Table 8, our model aims at generating the natural answer “Xiao Ming’s home is in China." (NA3) for the input question and the input factual triple. Here, the generated answer is associated with the important part of the question “Xiao Ming’s home", the tail entity of the triple “Chinese", for which a morphological variant needs to be produced (“Chinese" in the triple but “China" in the answer), and the common words (namely, “is in"). It is worth mentioning that the generated answer sequence matches the question word (namely, “Where"). As stated above, the model tends to generate NA3 rather than NA2 or NA1. The major reason is that NA3 contains question-type words in the target answer sequence (namely, “in China"). This demonstrates the importance of “matching the question word" (namely question-aware loss) for the generated natural answer. (Note that QC represents question copy, TC represents triple copy, and PM represents prediction mechanism.)

Conclusion

In this research, we proposed a novel attention-based RNN model for natural answer generation, which incorporates multi-level copying mechanisms and question-aware loss to generate natural answers automatically. To generate natural answers, we leverage multi-level copying mechanisms and the prediction mechanism, which are able to copy semantic units from the user question and the input triple, and to predict the common words from the conditional model. In addition, considering the problem where the generated natural answer is not in context, question-aware loss is introduced to make the generated natural answer match the user question. On the SimpleQuestions, WebQuestions and GraphQuestions response generation tasks, we achieve a new state of the art; for example, our model achieves 0.463 ROUGE-N on the GraphQuestions response generation task. We also compared the testing time of different baselines on the three response generation datasets, and our model consumes less testing time than the other baselines. Experimental results show that the generated natural answer sequences are not only grammatical but also contextual.