Introduction

Recently, question generation (QG) has attracted considerable interest. Its purpose is to generate human-like questions from a given sentence or paragraph [7]. Owing to its complexity and ambiguity, QG is a highly challenging problem in natural language processing. Specifically, unlike one-to-one mapping tasks such as machine translation, QG involves significant diversity in the space of reasonable questions that may be obtained from a given descriptive text [10].

Current studies on QG use paragraphs and answers as input to predict questions [6, 10]. Most QG datasets lack conversational context, whereas humans normally gather information or test knowledge through conversations involving a series of interconnected questions and answers, which alleviates ambiguity [28]. To address this issue, Gao et al. recently identified a new QG challenge called conversational QG (CQG) [11]. In this task, a system should be able to generate a series of interconnected questions based on a given passage and participate in a question-answering-style conversation as the questioner. Table 1 presents an example of this task. In the conversation, a questioner and a respondent talk about a passage, which implies that the content of the passage is important in predicting the conversation. The task thus evaluates the capability of a system to understand the given passage and output proper, coherent questions corresponding to the provided answers, conditioned on the conversation history. For this purpose, the large-scale conversational question-answering (CoQA) dataset [28] was converted into a CQG dataset by filtering out QA pairs with “yes”, “no”, or “unknown” as answers (28.7% of the total QA pairs). Each sample contains one passage along with the conversation history (the previous n turns of QA pairs), the current answer, and the question to be predicted.

Table 1 Example of CQG from CoQA dataset [28]

Gao et al. [11] also presented an end-to-end neural model with coreference alignment and conversation flow modeling. This model outperforms several baseline models but fundamentally requires coreference relations annotated on the conversation history as training input. In the inference phase, the model generates conversational questions guided by these explicit coreference labels.

Traditional coreference-resolution approaches [9, 14] involve complex feature engineering and heavy human labor [33]. Even the best neural coreference-resolution model should compute the scores for all possible spans [21], and this is challenging in dialogue with multiple coreferences [33]. Recently, pretrained language models (e.g., BERT) have been proven effective in coreference resolution [17], as they can capture complex semantic information in long-form language; this is important for multi-turn conversation modeling and the corresponding QG.

In this paper, we propose an advanced CQG framework using text passage fusion and a pretrained language model. We use the BERT model as the encoder and a transformer-based decoder. The conversation history and relevant text passage are fed into a novel BERT2Trans encoder–decoder model to incorporate the semantic information from both the passage and the hidden conversation representations.

During fine-tuning, the BERT model differentiates the input sequence into two segments by a special token “[SEP]” and a learnable segment embedding [3]. However, a long-form text such as a conversation may contain more than two segments with different intentions or semantics [30]; this raises an interesting question: is it helpful to add more segments to BERT for QG in the context of multi-turn conversation? A conversation consists of multiple turns involving two interlocutors [30], who usually speak alternately. Each turn contains the utterances of one speaker and may consist of one sentence or several. To this end, we further propose an effective multi-task learning mechanism so that the segment information of the different turns in the conversation history is fed into the model during training. Thereby, the proposed BERT2Trans can use input sequences with multiple segments.

To validate the proposed framework, we first test it on the CQG dataset [11]. Furthermore, we also test it on document-grounded conversations (DGCs) [23], a task that takes unstructured text passages and multi-turn conversations as input to predict next-turn utterances instead of next-turn questions. Both automatic and human evaluations demonstrate that the proposed model substantially outperforms baseline models and can generate an informative and coherent next-turn utterance (question or dialogue response).

The main contributions of our study are as follows: (1) We propose a novel BERT2Trans encoder–decoder using pretrained language models to encode semantic information from both unstructured text and conversation history, and to predict the next-turn question or utterance. (2) We adopt multi-task learning to feed segment information of different turns in a conversation into the model during training. Thereby, the proposed BERT2Trans can use input sequences with multiple segments. (3) The proposed framework is evaluated on two major datasets, and the results demonstrate that it significantly outperforms baseline models and achieves state-of-the-art performance on two generation tasks.

Related work

The purpose of QG is to generate an appropriate question for a given input context and answer. Duan et al. proposed a retrieval-based method using convolutional neural networks and a generation-based method using recurrent neural networks (RNNs) [8]. They also used the generated questions to improve existing QA systems and concluded that QG and QA can boost each other.

Seq2Seq models using long short-term memory (LSTM) with global attention are widely used for QG [8, 15, 18, 20, 40, 42, 43]. In decoding, the hidden states of the decoder are used to generate attention weights for the encoded representation of the input. These weights are used to generate the context vector, which is subsequently concatenated with the hidden-state vector for classification. This widely used mechanism may vary in the computation of the attention weights (e.g., Luong or Bahdanau attention).

As the context consists of multiple sentences, Du et al. proposed encoding at sentence and paragraph level [5]. At sentence level, the sum or convolution operation followed by max pooling is applied to the token embedding vector to obtain the representation of each sentence. At paragraph level, the sentence-representation vector is fed into a bi-directional LSTM.

Because some tokens are rare but serve as keywords that appear in both the context and the question, a copy mechanism has been proposed to select words directly from the context [40, 42, 43]. For words repeated in the context, the copy scores can naturally be higher than those of other words; Zhao et al. [40] therefore proposed limiting the magnitude of these scores with a maxout pointer. Because question tokens copied from the context tend to be close and relevant to the answer, Zi et al. [43] proposed a position-aware model; they also proposed selecting question words from a restricted vocabulary. In other models, additional input features are used, such as part of speech, named entities, word case, coreference, and dependency [15, 20, 42, 43]. These features are concatenated with the token embedding vector and answer-position signals, and the result is fed into an LSTM encoder.

The purpose of CQG is to generate the interconnected questions in CoQA [11], where the questions depend on the conversation history and a given passage. CFNet [11] is an RNN-based Seq2Seq model with coreference alignment and conversation flow modeling. To obtain a coherent conversation, CFNet adopts a conversation flow mechanism that generates questions about the beginning of a paragraph and gradually shifts focus to its later parts. To correlate the generated question with the dialogue history, the coreference alignment mechanism explicitly aligns referential relationships in the conversation history with the corresponding references in the generated question.

In parallel to question generation in a conversational context, another closely related line of research is DGC generation, in which dialogue responses are generated based on the context of a given document [23]. Seq2Seq models have been widely adopted for dialogue response generation that incorporates external unstructured knowledge; relevant external facts and knowledge facilitate dialogue understanding and response generation [12, 36, 41]. For example, the authors of [27] extended Seq2Seq with an external knowledge embedding to incorporate knowledge from Wikipedia into conversation generation, and the authors of [4] combined a transformer framework with a memory network to retrieve knowledge and generate natural dialogue responses.

In addition, large-scale unsupervised pretrained language models, such as BERT [3], XLNet [37], and RoBERTa [24] have achieved significant performance gains in several NLP tasks. Some recent studies have also proposed using pretrained language models for text generation [39]. In the present study, we use BERT as an encoder that can fuse information by its multiple self-attention layers in both the right-to-left and left-to-right directions. Furthermore, we adopt multi-task learning so that segment information of different turns in a conversation may be fed into the model during training. Thereby, the proposed BERT2Trans can use input sequences with multiple segments.

Methodology

Overview

The proposed pretrained question-generation framework is shown in Fig. 1. A novel encoder–decoder model uses BERT as the encoder to fuse the information from a given text passage with dialogue history. A multi-head attention-based decoder is proposed to incorporate the semantic information from both the encoded knowledge and hidden dialogue states to generate informative and coherent dialogue responses.

Fig. 1

Knowledge-enhanced response generation

We use \(D = \{x_1,\ldots ,x_m\}\), \(P=\{k_1,\ldots ,k_n\}\), and \(Q = \{y_1,\ldots ,y_l\}\) to represent the dialogue history, the related text passage, and the generated question, respectively; \(x_t\), \(k_t\), and \(y_t\) are words from a vocabulary V. Given a dialogue history D and the corresponding text passage P, our goal is to generate a proper and informative next-turn utterance Q (a question, or a response in the case of DGC).

Encoder

As most question-generation models have an encoder–decoder structure [32, 35], we adopt the BERT model [3] to map the input sequence of conversation history and relevant passage into a sequence of continuous representations. We propose three different mechanisms to arrange the dialogue history and relevant passage (a code sketch of the three arrangements follows the list):

1. We consider the passage P a sequence of tokens. The conversation history is concatenated with the relevant text passage, and [SEP] is inserted in between, that is, {[CLS] P [SEP] D [SEP]}.

2. In the first method, the multi-turn utterances of the dialogue history are concatenated into a single token sequence in chronological order, and conversation session segmentation (i.e., turn boundaries) is ignored; the model encodes the dialogue history as one continuous token sequence. However, session segmentation is important for semantic understanding. A multi-turn dialogue normally involves two interlocutors [30], who speak alternately, so intentions can vary greatly between adjoining turns. In CQG and DGC generation, the last-turn utterance is significantly more relevant to the utterance to be generated than those of previous turns. We therefore propose inserting [SEP] between adjoining turns in the conversation history, that is, {[CLS] P [SEP] Turn\(_1\) [SEP] Turn\(_2\) [SEP] ... [SEP] Turn\(_N\) [SEP]}.

3. Using multiple [SEP] tokens in the input sequence differs substantially from the pretraining setup of BERT [3], so the model has difficulty converging during fine-tuning: the self-attention block may pay excessive attention to certain tokens in the input sequence (we elaborate on this in Sect. 4.5). We therefore propose multi-task learning so that the segment information of adjacent turns in the conversation is fed into the model during training. The placement of [SEP] in the input sequence is the same as in BERT pretraining (i.e., {[CLS] P [SEP] D [SEP]}). As shown in Fig. 1, the boundary detection layer is a simple binary classifier consisting of a linear transformation and a softmax function that maps the final hidden vector from BERT to a predicted segmentation label for each token in the conversation history. The starting token of each turn is labeled 1, and the other tokens in the turn are labeled 0. In the inference phase, the model can generate the next-turn question without explicit dialogue session segmentation labels.

In addition to the position and token embeddings, we add a learnt segment embedding to every token, indicating whether the token belongs to the passage text or the dialogue history. As shown in Fig. 1, for each token \(x_i\), the input embedding is the sum of the corresponding token, segment, and position embeddings:

$$\begin{aligned} I(x_i)= E(x_i) + T(x_i) + P(x_i), \end{aligned}$$
(1)

where \(E(x_i)\), \(T(x_i)\), and \(P(x_i)\) are the word, segment, and position embeddings, respectively. The positional embedding is a learnable embedding with supported sequence length up to 512 tokens [3].
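As a concrete illustration, Eq. (1) corresponds to summing three learnable embedding tables. The fragment below is a hypothetical PyTorch sketch; the hidden size of 768 and the two-segment table mirror BERT\(_{base}\) defaults and are assumptions rather than the authors' exact implementation:

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Sum of token, segment, and position embeddings, as in Eq. (1)."""
    def __init__(self, vocab_size: int, hidden: int = 768, max_len: int = 512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)  # E(x_i)
        self.seg = nn.Embedding(2, hidden)           # T(x_i): passage vs. history
        self.pos = nn.Embedding(max_len, hidden)     # P(x_i): learnable positions

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)
```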

The input embedding is then fed into the BERT model to obtain encoded representations of the passage and of the dialogue history.

$$\begin{aligned} [H_p ; H_d]= BERT(I(k_1) \cdots I(k_n), I(x_1) \cdots I(x_m)), \end{aligned}$$
(2)

where \(H_p\) and \(H_d\) are the semantic representations of the passage text and the dialogue history, respectively.

Turn-boundary detection output: Let \(H^i_d\) denote the semantic representation of the i-th token in the dialogue history. The probability that the i-th token is labeled as class c (i.e., the starting token of the next turn) is predicted by the softmax function:

$$\begin{aligned} P(c|x_i) = softmax(W^T_dH^i_d), \end{aligned}$$
(3)

where the parameter \(W_d\) is learnt during model training.
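A minimal sketch of this classifier head is shown below (hypothetical PyTorch code; the hidden size of 768 is the BERT\(_{base}\) default and is assumed here):

```python
import torch.nn as nn

class BoundaryDetector(nn.Module):
    """Linear transformation + softmax over BERT outputs, as in Eq. (3)."""
    def __init__(self, hidden: int = 768, num_classes: int = 2):
        super().__init__()
        self.proj = nn.Linear(hidden, num_classes)   # W_d

    def forward(self, h_d):                          # h_d: (batch, history_len, hidden)
        return self.proj(h_d).softmax(dim=-1)        # P(c | x_i) for each history token
```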

Decoder

After comprehensive fusion in the BERT model, \(H_p\) and \(H_d\) are the semantic representations of the passage text and the dialogue history, respectively. The self-attention sub-layer in the decoder is modified to prevent attention to subsequent positions, ensuring that predictions depend only on the outputs already generated [34]:

$$\begin{aligned} \begin{aligned} D_{m}=\,&MaskedMultiHead(Q=Q,K=Q,\\&\qquad V=Q), \end{aligned} \end{aligned}$$
(4)

where Q is the generated question embedding, and \(D_m\) is the output of masked self-attention. Following the definition proposed in [34], Q, K, and V denote the query, key, and value vectors in multi-head attention, respectively.

As shown in Fig. 1, \(H_d\) is concatenated with \(H_p\) and fused with the response representation in multi-head attention:

$$\begin{aligned} \begin{aligned} D_{pd}=\,&MultiHead(Q=D_{m},K=[H_p;H_d],\\&\qquad V=[H_p;H_d]). \end{aligned} \end{aligned}$$
(5)

The decoder fuses the information from \(H_p\) and \(H_d\), and effectively uses both the conversation context and relevant passage to predict the next-turn utterance. Subsequently, \(D_{pd}\) is fed into a position-wise feedforward network.

$$\begin{aligned} F_{dp}= FFN(D_{pd}). \end{aligned}$$
(6)

As in transformer networks [34], the embedding weight matrix of the encoder (BERT model) is shared with the decoder. The pretrained word embedding has rich semantic information about each word [30] and provides a good initialization point to the decoder word embedding.
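The decoder block of Eqs. (4)–(6) can be sketched roughly as follows. This is a hypothetical PyTorch fragment: the head count, layer sizes, and the use of nn.MultiheadAttention are assumptions, and residual connections and layer normalization are omitted for brevity:

```python
import torch
import torch.nn as nn

class BERT2TransDecoderLayer(nn.Module):
    """Masked self-attention, cross-attention over [H_p; H_d], then a FFN."""
    def __init__(self, hidden: int = 768, heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))

    def forward(self, q_emb, h_p, h_d):
        # Eq. (4): masked self-attention over the question generated so far.
        L = q_emb.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool,
                                       device=q_emb.device), diagonal=1)
        d_m, _ = self.self_attn(q_emb, q_emb, q_emb, attn_mask=causal)
        # Eq. (5): cross-attention over the concatenated passage and history states.
        mem = torch.cat([h_p, h_d], dim=1)           # [H_p; H_d]
        d_pd, _ = self.cross_attn(d_m, mem, mem)
        # Eq. (6): position-wise feedforward network.
        return self.ffn(d_pd)
```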

Copy mechanism

We apply dot-product attention [1] to the decoder hidden states \(F_{dp}\) with encoded representations \([H_p ; H_d]\) to obtain the attended token representations \(F^{'}_{dp}\) and the corresponding attention weight distribution W. The vocabulary distribution M is subsequently calculated using a feedforward neural network:

$$\begin{aligned} F^{'}_{dp}, W= Attention(F_{dp},[H_p ; H_d]), \end{aligned}$$
(7)
$$\begin{aligned} M= softmax(FNN(F^{'}_{dp})). \end{aligned}$$
(8)

Repeating words from the conversation context is an important ability in human language communication [29]. Some tokens (e.g., entity names) are rarely used but are important for context understanding and dialogue generation. The copy mechanism of [13, 29] is therefore introduced to allow both copying words from the input by pointing and generating words from a predefined vocabulary V during decoding; it improves generation accuracy and handles out-of-vocabulary words in the context. In this study, the attended token representations \(F^{'}_{dp}\) and the decoder output \(F_{dp}\) are concatenated to learn the generation probability \(P_{gen}\), which is used to obtain the final distribution of the generated question Q:

$$\begin{aligned} P_{gen}= sigmoid(FNN([F^{'}_{dp};F_{dp}])), \end{aligned}$$
(9)
$$\begin{aligned} Q= P_{gen}M+(1-P_{gen})W, \end{aligned}$$
(10)

where M is the vocabulary distribution from Eq. 8, and W is the attention weight distribution from Eq. 7.
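A rough sketch of Eqs. (7)–(10) is shown below (hypothetical code, not the authors' implementation). Scattering the copy weights onto the vocabulary ids of the encoder tokens is one common way to realize the mixture in Eq. (10); the vocabulary size of 30522 is the BERT\(_{base}\) default and is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CopyHead(nn.Module):
    """Mixes the vocabulary distribution with copy weights, Eqs. (7)-(10)."""
    def __init__(self, hidden: int = 768, vocab: int = 30522):
        super().__init__()
        self.vocab_proj = nn.Linear(hidden, vocab)   # for M in Eq. (8)
        self.gen_gate = nn.Linear(2 * hidden, 1)     # for P_gen in Eq. (9)

    def forward(self, f_dp, enc_states, enc_token_ids):
        # Eq. (7): dot-product attention of decoder states over encoder states.
        scores = torch.bmm(f_dp, enc_states.transpose(1, 2))
        w = F.softmax(scores, dim=-1)                # copy (attention) distribution W
        f_dp_att = torch.bmm(w, enc_states)          # attended representations F'_dp
        # Eq. (8): vocabulary distribution M.
        m = F.softmax(self.vocab_proj(f_dp_att), dim=-1)
        # Eq. (9): generation probability from the concatenation [F'_dp; F_dp].
        p_gen = torch.sigmoid(self.gen_gate(torch.cat([f_dp_att, f_dp], dim=-1)))
        # Eq. (10): scatter copy weights onto the vocabulary ids and mix.
        idx = enc_token_ids.unsqueeze(1).expand(-1, f_dp.size(1), -1).contiguous()
        copy = torch.zeros_like(m).scatter_add(2, idx, (1 - p_gen) * w)
        return p_gen * m + copy
```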

Joint training

We now introduce a turn-boundary detection loss to feed dialogue session segmentation (i.e., boundaries between adjacent turns) information into the model during training:

$$\begin{aligned} Loss_{boundary} = -\frac{1}{N_s}\sum _{n=1}^{N_s} \sum _{i=1}^{m}{\log P(c_i|x_i)}, \end{aligned}$$
(11)

where \(N_s\) is the number of training samples, m is the number of dialogue history tokens, and \(c_i\) is the correct boundary label indicating whether the token \(x_i\) is the start of the next turn in the dialogue history.

$$\begin{aligned} Loss_{gen} = -\frac{1}{N_s}\sum _{i=1}^{N_s}{\log P(Q_i|D_i,P_i)}, \end{aligned}$$
(12)

where \(N_s\) is the number of training samples, and \(D_i\), \(Q_i\), and \(P_i\) are the dialogue history, question, and relevant text passage of the i-th sample, respectively. \(Loss_{gen}\) is the typical negative log-likelihood loss in Seq2Seq learning.

Considering these two components, we define a joint loss function as

$$\begin{aligned} Loss = \lambda _g Loss_{gen} +(1-\lambda _g) Loss_{boundary}, \end{aligned}$$
(13)

where \(\lambda _g\) is a hyperparameter that balances the two losses.
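A minimal sketch of the joint objective in Eqs. (11)–(13) follows (hypothetical code; the \(\lambda _g\) value is a placeholder, and token-level cross-entropy stands in for the two negative log-likelihood terms):

```python
import torch.nn.functional as F

def joint_loss(gen_logits, gen_targets, boundary_logits, boundary_labels,
               lambda_g: float = 0.8):
    """Weighted sum of the generation and turn-boundary losses, Eq. (13)."""
    # Eq. (12): negative log-likelihood of the reference question tokens.
    # gen_logits: (batch, question_len, vocab); gen_targets: (batch, question_len)
    loss_gen = F.cross_entropy(gen_logits.transpose(1, 2), gen_targets)
    # Eq. (11): negative log-likelihood of the per-token boundary labels.
    # boundary_logits: (batch, history_len, 2); boundary_labels: (batch, history_len)
    loss_boundary = F.cross_entropy(boundary_logits.transpose(1, 2), boundary_labels)
    return lambda_g * loss_gen + (1 - lambda_g) * loss_boundary
```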

Experimental study

Datasets

We evaluated the proposed BERT2Trans model on two datasets:

1. CQG dataset. We used a processed version of the dataset in [11]; it contains 66298 training dialogue samples and 8360 test samples. A related document from one of seven different domains [28] is given for each dialogue. The average lengths of the passages, questions, and answers are 332.9, 6.44, and 3.2 tokens, respectively.

2. DGC dataset. We used a processed version of the dataset in [23]. It contains 72922 training samples and 11577 test samples, whose utterance style is that of casual chat. A related document about a popular movie, containing descriptive information about the movie, is given for each dialogue.

Evaluation metrics and configuration

For the CQG dataset, we used BLEU [26] and ROUGE-L [2] to measure question-generation accuracy and to make an aligned comparison with CFNet [11]. These metrics are widely used for measuring textual similarity. The distinct-1 and distinct-2 metrics [22] were also adopted to measure the diversity of the generated questions, and the average entropy [31] was used to measure how informative the generated questions are. For the DGC dataset, we closely followed [23] and used perplexity (PPL) and BLEU [26] for comparison. Moreover, we conducted human evaluation, as it is broadly agreed that objective metrics correlate only weakly with human judgments; human evaluation is a necessity in dialogue generation.
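For reference, the distinct-n scores can be computed as follows; this is the standard formulation from [22], not the authors' evaluation script:

```python
from typing import List

def distinct_n(responses: List[List[str]], n: int) -> float:
    """Number of distinct n-grams divided by the total number of generated words."""
    ngrams, total_words = set(), 0
    for tokens in responses:
        total_words += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / max(total_words, 1)
```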

Table 2 presents the hyperparameters for training the proposed model. We used the pretrained BERT\(_{base}\) model as the encoder, and the BERT2Trans decoder has two layers of attention blocks. All models were trained with the Adam optimizer [19]. The learning rate was set to 1e-4 for the CQG task and 5e-5 for the DGC task. Following [11, 23], we used the three previous turns of QA pairs in the conversation history for the CQG task and three utterances for the DGC task to make aligned comparisons. All models were trained for at most 20 epochs. The best hyperparameter values were selected on the validation set, and results are reported on the test set.

Table 2 Hyperparameters for model training

Experimental results

BERT2Trans+nosep, +sep, and +detect refer to, respectively, no separation between questions and answers in the conversation history, using “[SEP]” to separate questions and answers in the conversation history, and BERT2Trans with the boundary detection mechanism (i.e., the three mechanisms described in Sect. 3.2).

On the CQG dataset, we compared the proposed models with the following baselines: BahdanauAttention [1] and LuongAttention [25] are Seq2Seq models with two different attention mechanisms. PGNet [29] is a Seq2Seq model with a pointer-generator network. MCT [38] is a Seq2Seq model with attention, a copy mechanism, and an answer encoder layer. ER GAN [16] extends the attentional Seq2Seq network with an adversarial discriminator and is trained with the policy gradient method. NQG [6] is a Seq2Seq network that incorporates coreference features via a gated network. In BahdanauAttention, LuongAttention, PGNet, MCT, and NQG, the passage, conversation history, and current answer are concatenated into one input sequence. MSNet [11], CorefNet [11], FlowNet [11], and CFNet [11] are multi-source encoder–decoder models with coreference alignment and conversation flow modeling.

Table 3 shows a comparison of the BLEU (1–4) (i.e., B1, B2, B3, B4) [26] and ROUGE-L (i.e., R-L) [2] scores of the proposed models and those of the baseline models. All the proposed models performed significantly better than the competitive baseline models and achieved state-of-the-art results in CQG [11].

Table 3 Results for the CQG test set

The proposed BERT2Trans+detect outperforms the best baseline (CFNet [11]) by 8.9% and 3.3% in terms of BLEU3 and BLEU4, respectively. As the average question length is only 6.44 tokens, the improvement is distinctly large for BLEU3 and even more pronounced for BLEU2 and BLEU1. The proposed models thus achieve better n-gram similarity between the generated questions and the ground truth. BERT2Trans+nosep is not better than the baseline model (CFNet), which suggests that the separation of multiple turns (questions and answers) in a conversation is important for conversational QG: good model-generated questions should be based on the answer of the last turn. As shown in Table 3, ER GAN also has room for improvement, which is consistent with the findings of a previous study [16].

Table 4 PPL and BLEU4 for the DGC test set. Baseline results from [23]

On the DGC dataset, we compared the proposed models with the following baseline models: Seq2Seq [1], HRED [30], Transformer [34], and Wizard Transformer [4]. ITE+CKAD and ITE+DD were proposed in [23] and consist of an incremental transformer with a deliberation decoder for DGCs.

Table 4 shows the PPL and BLEU4 scores for the DGC dataset; the baseline results are those reported in [23]. Among the baseline models, ITE+DD has the best performance. This model is enhanced by a deliberation decoder [23, 39]. To make an aligned comparison with [23], we also equipped the proposed BERT2Trans+detect model with a similar deliberation decoder. The results demonstrate that the deliberation decoder only improved the PPL score, whereas the proposed boundary detection mechanism significantly improved BLEU4.

The BERT2Trans+detect+DD model outperformed all baseline models in terms of both PPL and BLEU4.

Table 5 Generation diversity results for the CQG test set

Current Seq2Seq models tend to generate generic, non-informative text. We present a comparison of the diversity of the results on the CQG dataset in Table 5. Li et al. [22] proposed distinct-1 and distinct-2 to measure the diversity of generated text; these metrics are computed as the number of distinct unigrams and bigrams divided by the total number of generated words. Table 5 presents the distinct unigrams, distinct bigrams, and total number of generated words as 1gram, 2gram, and |wd|, respectively. The proposed BERT2Trans+detect model substantially increased distinct-1 and distinct-2 (denoted dct-1 and dct-2) over the baseline model (CFNet [11]). This indicates that the proposed models can significantly improve the diversity of the generated questions.

The BERT2Trans+sep model achieved the best distinct-1 score, but it also yielded smaller values for 1gram, 2gram, and |wd|; a high distinct-1 score can be a direct consequence of a relatively small total number of generated words. Overall, as shown in Table 5, the proposed models generated more diverse questions than the baseline model (CFNet [11]).

Table 6 Response information entropy for the CQG test set
Table 7 QG examples
Table 8 Examples of generated dialog responses for DGC
Fig. 2

Visualization of attention weights in the last self-attention layer of BERT encoder

Table 6 shows the average response length and average response information entropy with respect to the maximum likelihood unigram model for the generated questions. Following the definition in [31], we computed the unigram probabilities based on the maximum-likelihood unigram distribution of the training corpus. \(H_w\) (i.e., information entropy per word) is computed as [31]

$$\begin{aligned} H_w = - \sum _{w \in U} p(w)\log p(w). \end{aligned}$$
(14)

In Table 6, \(H_U\) denotes the information entropy per response, and |U| denotes the average response length. It can be observed that the proposed models, particularly BERT2Trans+detect, generated questions with a larger average length and remarkably higher utterance entropy \(H_U\). This indicates their capability to generate more informative questions, in good agreement with the previous experiment on the BLEU and ROUGE-L metrics. The human-generated questions were superior to all questions generated by the neural generative models in terms of response length and information entropy, suggesting that higher entropy is desirable [31].
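A minimal sketch of the entropy computation, following Eq. (14) literally and using maximum-likelihood unigram probabilities from the training corpus (hypothetical code; the back-off for unseen words is an assumption, not part of the original definition):

```python
import math
from collections import Counter
from typing import List

def entropy_per_word(response: List[str], train_counts: Counter) -> float:
    """H_w of Eq. (14): -p(w) log p(w) summed over the words of one response,
    with p(w) the maximum-likelihood unigram probability from the training corpus."""
    total = sum(train_counts.values())
    h = 0.0
    for w in response:
        p = train_counts.get(w, 1) / total   # unseen words backed off to count 1
        h -= p * math.log(p)
    return h
```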

Case study

Table 7 shows three QG examples from the CQG task, where the proposed BERT2Trans+detect model substantially outperformed the baseline model. In case 1, it appears that the baseline model failed to learn the semantics of interconnected questions in the conversation history and generated a question related to the word “daily”, whereas our model generated a coherent question depending on the conversation history and the passage. In case 2, although the answer is “at the age of 5”, the passage and conversation history are about “Kenny’s love of cooking”. The proposed model determined the sentences related to the conversation topic and generated a question related to the time when he cooked his first dish. In case 3, the baseline model only generated a generic question “when?” and completely ignored the story of the tennis ball player. The proposed model generated a specific question using the correct contextual information.

Table 8 shows responses generated for the DGC dataset by the proposed BERT2Trans+detect model and by ITE+DD [23], the baseline with the best automatic-evaluation performance. In case 1, both BERT2Trans+detect and ITE+DD output meaningful responses based on the relevant movie descriptions; however, BERT2Trans+detect produced a response better suited to the conversation context. In case 2, ITE+DD generated a generic response, whereas BERT2Trans+detect produced the correct answer to the question. This indicates that the proposed model captures implicit semantic information in long-form language.

Attention visualization

Figure 2 shows a comparison of the attention weights in the last self-attention layer of the BERT encoder for the three mechanisms for separating questions and answers (or turns) in the conversation history (see Sect. 3.2). As shown in Fig. 2a, the setup of the input sequence is the same as that for pretraining (i.e., \(\{[CLS] \ P \ [SEP]\ D \ [SEP]\}\)), and all words pay considerable attention to “[SEP]”. Figure 2b shows that if multiple “[SEP]” tokens are inserted to separate the questions and answers in the conversation history, the model pays excessive attention to other tokens (i.e., “.”); the additional “[SEP]” misleads the BERT model into learning abnormal attention weights, whereas “[SEP]” itself no longer receives significant attention. Figure 2c, d shows the weights of two self-attention heads of the proposed model. As shown on the x-axis, the conversation history is “Q: who knocked on the door? A: the lawyer Q: when? A: at eight o’clock Q: who did he find? A: the artist A: sadly altered for the worse”. Without explicit indication in the test data, the proposed model learns the correct boundary between adjacent questions and answers, as demonstrated by the large attention weight received by “[SEP]” (Fig. 2a). The last answer (i.e., “sadly altered for the worse”) is the most important information for generating an informative and coherent question. As shown in Fig. 2d, one of the self-attention heads pays special attention to the starting token of the current answer; this is not found with the previous two mechanisms (see Sect. 3.2).

Human evaluation

We also conducted a manual evaluation on both the CQG and DGC datasets as a supplement to the automatic metrics. For the CQG dataset, 93 randomly sampled questions with the relevant passage, conversation history, and current answer were used for human evaluation. Human annotators evaluated the questions generated by the baseline model (CFNet) and those generated by the proposed model in terms of the same metrics as in [11]: “Grammaticality”, “Answerability”, and “Interconnectedness”. “Grammaticality” measures whether the generated question is fluent and grammatically correct. “Answerability” measures whether the generated question can be answered by the current answer [11]. “Interconnectedness” evaluates whether the generated question refers back to the conversation history [11].

Table 9 Human evaluation for the CQG dataset

For the DGC dataset, we randomly selected 100 samples for human evaluation. We used the same metrics as in [23] (i.e., “Fluency”, “Knowledge Relevance”, and “Context Coherence”) to compare the proposed model with the best baseline model (ITE-DD) [23]. “Fluency” measures the fluency of a response, “Knowledge Relevance” determines whether relevant knowledge was used in a response, and “Context Coherence” considers whether the responses are coherent with dialogue history.

Table 10 Human evaluation for DGC dataset

We anonymized the model identities for each generated response. All metrics are on a 1–3 scale (3 being the best). The results are shown in Tables 9 and 10. The proposed models outperformed the baseline models (CFNet and ITE+DD) on all six metrics. Both the proposed and the baseline models achieved high “Grammaticality” and “Fluency” scores, which is consistent with the findings in [11, 23]; both can generate fluent and grammatical next-turn utterances (question or response). In addition, the proposed BERT2Trans+detect model is superior to the baseline models on the “Answerability” and “Knowledge Relevance” metrics, suggesting that BERT2Trans better combines passages with dialogue history and generates appropriate next-turn utterances (question or response) with better knowledge relevance.

Conclusion and future work

In this paper, we proposed a deep pretrained question-generation model that facilitates conversation history understanding and question generation by exploiting the information-fusion ability of pretrained language models. We evaluated the model on the CQG and DGC datasets using multiple metrics. The results demonstrated that the proposed approach can generate more coherent and informative next-turn utterances (question or response).

In future work, we plan to enhance the model with reasoning ability over multiple passages or knowledge sources, so that the most relevant textual knowledge can be identified and the effectiveness of conversation generation further improved. The experiments and our observations of the generated utterances indicate that comprehensively representing the complexity of multi-turn conversation remains challenging for current models.