Multi-turn dialogue-oriented pretrained question generation model

In recent years, teaching machines to ask meaningful and coherent questions has attracted considerable attention in natural language processing. Question generation has found wide applications in areas such as education (testing knowledge) and chatbots (enhancing interaction). Following previous studies on conversational question generation, we propose a pretrained, encoder–decoder model that can incorporate the semantic information from both passage and hidden conversation representations. We adopt BERT as the encoder to combine external text and dialogue history, and we design a multi-head attention-based decoder to incorporate the semantic information from both text and hidden dialogue representations into the decoding process, thereby generating coherent questions. Experiments with conversational question generation and document-grounded dialogue response generation tasks indicate that the proposed model is superior to baseline models in terms of both standard metrics and human evaluations.


Introduction
Recently, question generation (QG) has attracted considerable interest. Its purpose is to generate human-like questions from a given sentence or paragraph [7]. Owing to its complexity and ambiguity, QG is a highly challenging problem in natural language processing. Specifically, unlike one-to-one mapping tasks such as machine translation, QG involves significant diversity in the space of reasonable questions that may be obtained from a given descriptive text [10].
Current studies on QG use paragraphs and answers as input to predict questions [6,10]. Most QG datasets lack conversational context, whereas humans normally gather information or test knowledge through conversations involving a series of interconnected questions and answers, thus alleviating ambiguity [28]. To address this issue, Gao et al. recently identified a new QG challenge called conversational QG (CQG) [11]. In this task, a system should be able to generate a series of interconnected questions depending on a given passage, and participate in a question-answering-style conversation as the questioner. Table 1 presents an example of such a task. In this conversation, a questioner and a respondent talk about a passage, which implies that the content of the passage is important in predicting the conversation. The task here is to evaluate the capability of the system to understand the given passage and output proper and coherent questions corresponding to the provided answers depending on the conversation history. Consequently, a large-scale conversational question-answering (CoQA) dataset [28] has been modified into a CQG dataset by filtering out QA pairs with "yes", "no", or "unknown" as answers (28.7% of the total QA pairs). Each sample contains one passage along with the conversation history (previous n turns of QA pairs), the current answer, and the question to be predicted.
Gao et al. [11] also presented an end-to-end neural model with coreference alignment and conversation flow modeling. In this model, which outperforms several baseline models, a fundamental requirement is coreference relation annotation on conversation history as input for training. In the inference phase, the model can generate conversational questions depending on the explicated coreference labels.
Table 1 Example of CQG from CoQA dataset [28]
Passage: As soon as the Columbia could make the proper landing, Captain Ponsberry went ashore and reported his arrival to the authorities, and also reported the escape of Shamhaven and Peterson. The authorities had already heard of the capture of the Columbia from the Russians, and said that the schooner would have to remain at Nagasaki until the whole case could be adjusted. The Japanese were inclined to favor both the Richmond Importing Company and the owners of the vessel, so it was not likely that our friends would lose much in the end. In the meantime, the Columbia could be put in a dry dock and given the overhauling that she needed. "We shall do all we can to locate Shamhaven and Peterson and get back your money", said an official of the secret service department. But his hands were so full with other matters of greater importance that little attention was paid to the disappearance of the two rascals. "Well, this will tie me up at Nagasaki for some time to come", said Captain Ponsberry to Larry, on the third day after arriving at the Japanese port.
Traditional coreference-resolution approaches [9,14] involve complex feature engineering and heavy human labor [33]. Even the best neural coreference-resolution model must compute the scores for all possible spans [21], and this is challenging in dialogue with multiple coreferences [33]. Recently, pretrained language models (e.g., BERT) have been proven effective in coreference resolution [17], as they can capture complex semantic information in long-form language; this is important for multi-turn conversation modeling and the corresponding QG.
In this paper, we propose an advanced CQG framework using text passage fusion and a pretrained language model. We use the BERT model as the encoder and a transformer as the decoder. The conversation history and relevant text passage are fed into a novel BERT2Trans encoder-decoder model to incorporate the semantic information from both passage and hidden conversation representations.
During fine-tuning, the BERT model should differentiate the input sequences into two segments by a special token "[SEP]" and a learnable segment embedding [3]. However, a long-form text such as a conversation may contain more than two intention or semantic segments [30]; this raises an interesting question: Is it helpful to add more segments into BERT for QG in the context of multi-turn conversation? A conversation consists of multiple turns involving two interlocutors [30], who usually speak alternately. Each turn contains utterances from one speaker. It could be one sentence or multiple sentences. To this end, we further propose an effective multitask learning mechanism so that segment information of different turns in the conversation history may be fed into the model during training. Thereby, the proposed BERT2Trans can use input sequences with multiple segments.
To validate the proposed framework, we first test it on the CQG dataset [11]. Furthermore, we also test it on document-grounded conversations (DGCs) [23]. This task involves unstructured text passages and multi-turn conversations as input to predict next-turn utterances instead of next-turn questions. Both automatic and human evaluations demonstrate that the proposed model substantially outperforms baseline models and can generate an informative and coherent next-turn utterance (question or dialogue response).
The main contributions of our study are as follows: (1) We propose a novel BERT2Trans encoder-decoder using pretrained language models to encode semantic information from both unstructured text and conversation history, and to predict the next-turn question or utterance. (2) We adopt multi-task learning to feed segment information of different turns in a conversation into the model during training. Thereby, the proposed BERT2Trans can use input sequences with multiple segments. (3) The proposed framework is evaluated on two major datasets, and it is demonstrated that it significantly outperforms baseline models and achieves state-of-the-art performance on two generation tasks.

Related work
The purpose of QG is to generate an appropriate question for a given input context and answer. Duan et al. proposed a retrieval-based method using convolutional neural networks and a generation-based method using recurrent neural networks (RNNs) [8]. They also used the generated questions to improve existing QA systems and concluded that QG and QA can boost each other.
Seq2Seq models using long short-term memory (LSTM) with global attention are widely used for QG [8,15,18,20,40,42,43]. In decoding, the hidden states of the decoder are used to generate attention weights for the encoded representation of the input. These weights are used to generate the context vector, which is subsequently concatenated with the hidden-state vector for classification. This widely used mechanism may vary in the computation of the attention weights (e.g., Luong or Bahdanau attention).
As the context consists of multiple sentences, Du et al. proposed encoding at sentence and paragraph level [5]. At sentence level, the sum or convolution operation followed by max pooling is applied to the token embedding vector to obtain the representation of each sentence. At paragraph level, the sentence-representation vector is fed into a bidirectional LSTM.
As some of the tokens are rarely used but are also keywords that can appear in both the context and the conversation, a copy mechanism is proposed for directly selecting words from the context [40,42,43]. For repeated words in the context, the copy scores can be naturally higher than those for others. Therefore, Zhao et al. [40] proposed limiting the magnitude of these scores by a maxout pointer. As the question tokens copied from the context tend to be close and relevant to the answer, Zi et al. [43] proposed a position-aware model. Moreover, they proposed selecting question words from a restricted vocabulary. In other models, more input features are added, such as part of speech, named entities, word case, coreference, and dependency [15,20,42,43]. These features are concatenated with the token embedding vector and answer-position signals, and are subsequently fed into an LSTM encoder.
The purpose of CQG is to generate interconnected questions in CoQA [11], where these questions depend on the conversation history and a given passage. CFNet [11] proposed an RNN-based Seq2Seq model with coreference alignment and conversation flow modeling. To obtain a coherent conversation, CFNet adopts a conversation flow mechanism that generates questions about the beginning of a paragraph and gradually shifts focus to the later parts. To correlate the generated question with the dialog history, the coreference alignment mechanism explicitly aligns the referential relationship in the conversation history with the corresponding reference in the generated question.
In parallel to question generation in a conversational context, another closely related line of research is DGC generation, in which dialogue responses are generated based on the context of a given document [23]. The Seq2Seq model has been widely adopted in dialogue response generation by incorporating external unstructured knowledge. External relevant facts and/or knowledge are another type of information that facilitates dialogue understanding and response generation [12,36,41]. For example, [27] extended Seq2Seq using an external knowledge embedding to incorporate external knowledge from Wikipedia for conversation generation. [4] combined a transformer framework with a memory network to retrieve knowledge and generate natural dialogue responses.
In addition, large-scale unsupervised pretrained language models, such as BERT [3], XLNet [37], and RoBERTa [24] have achieved significant performance gains in several NLP tasks.Some recent studies have also proposed using pretrained language models for text generation [39].
In the present study, we use BERT as an encoder that can fuse information by its multiple self-attention layers in both the right-to-left and left-to-right directions. Furthermore, we adopt multi-task learning so that segment information of different turns in a conversation may be fed into the model during training. Thereby, the proposed BERT2Trans can use input sequences with multiple segments.

Overview
The proposed pretrained question-generation framework is shown in Fig. 1. A novel encoder-decoder model uses BERT as the encoder to fuse the information from a given text passage with dialogue history. A multi-head attention-based decoder is proposed to incorporate the semantic information from both the encoded knowledge and hidden dialogue states to generate informative and coherent dialogue responses.
We use $D = \{x_1, \ldots, x_m\}$, $P = \{k_1, \ldots, k_n\}$, and $Q = \{y_1, \ldots, y_l\}$ to represent the dialogue history, related text passage, and generated question, respectively; $x_t$, $k_t$, and $y_t$ are words from a vocabulary $V$. For a given dialogue history $D$ and the corresponding text passage $P$, our goal is to generate a proper and informative next-turn utterance $Q$ (question, or response in the case of DGC).

Encoder
As most question-generation models have an encoder-decoder structure [32,35], we adopt the BERT model [3] to map the input sequence of conversation history and relevant passage into a continuous representation sequence. We propose three different mechanisms to arrange the dialogue history and relevant passage: (1) We consider the passage $P$ a sequence of tokens. The conversation history is concatenated with the relevant text ...
In addition to the position and token embeddings, we also add a learnt segment embedding to every token, indicating whether the token belongs to the passage text or the dialogue history. As shown in Fig. 1, for each token $x_i$, the input embedding is the sum of the corresponding token, segment, and position embeddings, $E(x_i) + T(x_i) + P(x_i)$, where $E(x_i)$, $T(x_i)$, and $P(x_i)$ are the word, segment, and position embeddings, respectively. The positional embedding is a learnable embedding with a supported sequence length of up to 512 tokens [3]. The input embedding is then fed into the BERT model to obtain encoded representations of the passage and of the dialogue history.
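The embedding sum described above can be illustrated with a minimal sketch. It assumes plain Python lists as toy lookup tables; the names `input_embedding` and `embed_sequence` and the table layout are hypothetical and are not taken from the authors' implementation.

```python
from typing import List

Vector = List[float]


def input_embedding(token_vec: Vector, segment_vec: Vector, position_vec: Vector) -> Vector:
    """Element-wise sum E(x_i) + T(x_i) + P(x_i) for a single token."""
    if not (len(token_vec) == len(segment_vec) == len(position_vec)):
        raise ValueError("all three embeddings must share one dimension")
    return [e + t + p for e, t, p in zip(token_vec, segment_vec, position_vec)]


def embed_sequence(
    token_ids: List[int],
    segment_ids: List[int],
    token_table: List[Vector],
    segment_table: List[Vector],
    position_table: List[Vector],
) -> List[Vector]:
    """Embed a whole input sequence; position i uses row i of position_table."""
    if len(token_ids) != len(segment_ids):
        raise ValueError("token_ids and segment_ids must have equal length")
    if len(token_ids) > len(position_table):
        # BERT supports at most 512 positions; here the table size is the limit.
        raise ValueError("sequence is longer than the position table")
    return [
        input_embedding(token_table[tok], segment_table[seg], position_table[pos])
        for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids))
    ]
```

In a real model the three tables are learned parameters; the sketch only shows how the segment id (passage versus dialogue history) enters every token representation.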
The BERT encoder outputs $H_p$ and $H_d$, the semantic representations of the passage text and the dialogue history, respectively. Turn-boundary detection output: We assume that $H_d^i$ is the semantic representation of the $i$th token in the dialogue history. The probability that the $i$th token is labeled as class $c$ (i.e., the starting token of the next turn) is predicted by a softmax function whose parameter $W_d$ is learnt during model training.
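A minimal sketch of the turn-boundary classifier follows. It assumes a linear projection by $W_d$ with one weight row per class and no bias term, followed by a softmax; the exact parameterization is not recoverable from the text, and the function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def boundary_probabilities(hidden: List[List[float]], weights: List[List[float]]) -> List[List[float]]:
    """Class probabilities for every dialogue-history token.

    hidden[i] plays the role of H_d^i; weights has one row per class
    (an assumed shape for W_d), so the logit of class c is row c dotted with H_d^i.
    """
    probabilities = []
    for vector in hidden:
        logits = [sum(w * h for w, h in zip(row, vector)) for row in weights]
        probabilities.append(softmax(logits))
    return probabilities
```

With two classes, the second probability of each token can be read as the predicted chance that the token starts a new turn.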

Decoder
After comprehensive fusion in the BERT model, $H_p$ and $H_d$ are the semantic representations of the passage text and the dialogue history, respectively. The self-attention sub-layer in the decoder is modified to prevent paying attention to subsequent positions, ensuring that predictions are only made depending on the known outputs [34]. Here, $Q$ is the generated question embedding, and $D_m$ is the output of masked self-attention. Following the definition proposed in [34], $Q$, $K$, and $V$ denote the query, key, and value vectors in multi-head attention, respectively. As shown in Fig. 1, $H_d$ is concatenated with $H_p$ and fused with the response representation in multi-head attention, yielding $D_{pd}$. The decoder thus fuses the information from $H_p$ and $H_d$, and effectively uses both the conversation context and relevant passage to predict the next-turn utterance. Subsequently, $D_{pd}$ is fed into a position-wise feedforward network. As in transformer networks [34], the embedding weight matrix of the encoder (BERT model) is shared with the decoder. The pretrained word embedding has rich semantic information about each word [30] and provides a good initialization point for the decoder word embedding.
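The two attention steps of the decoder can be sketched as follows. This is a simplified, single-head illustration without learned projections, residual connections, or layer normalization, and it assumes that $H_p$ and $H_d$ are concatenated along the sequence axis; all function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax; scores of -inf receive zero weight."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix, causal: bool = False) -> Matrix:
    """Scaled dot-product attention; causal=True hides positions j > i."""
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    outputs = []
    for i, query in enumerate(queries):
        scores = []
        for j, key in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # mask subsequent positions
            else:
                scores.append(sum(q * k for q, k in zip(query, key)) / scale)
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(width)])
    return outputs


def decoder_attention(question_emb: Matrix, h_p: Matrix, h_d: Matrix) -> Tuple[Matrix, Matrix]:
    """Masked self-attention (D_m) followed by attention over [H_p; H_d] (D_pd)."""
    d_m = attention(question_emb, question_emb, question_emb, causal=True)
    memory = h_p + h_d  # list concatenation along the sequence axis (assumed)
    d_pd = attention(d_m, memory, memory)
    return d_m, d_pd
```

Because of the causal mask, the first decoder position can only attend to itself, which is why its masked self-attention output equals its own embedding in this projection-free sketch.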

Copy mechanism
We apply dot-product attention [1] to the decoder hidden states $F_{dp}$ with the encoded representations $[H_p; H_d]$ to obtain the attended token representations and the corresponding attention weight distribution $W$. The vocabulary distribution $M$ is subsequently calculated using a feedforward neural network. Repeating words from the conversation context is an important ability in human language communication [29]. As some tokens (e.g., entity names) are rarely used but are important for context understanding and dialogue generation, the copy mechanism in [13,29] is introduced to allow both copying words from the input by pointing and generating words from a predefined vocabulary $V$ during decoding. It can improve generation accuracy and handle out-of-vocabulary words in the context. In this study, the attended token representations and the decoder output $F_{dp}$ are concatenated to learn the generation probability $P_{gen}$, which is used to obtain the final distribution of the generated question $Q$. Here, $M$ is the vocabulary distribution from Eq. 8, and $W$ is the attention weight distribution from Eq. 7.
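How $P_{gen}$ can mix the two distributions is sketched below. The mixture follows the common pointer-generator formulation, which is an assumption here because the original equation is not shown; `final_distribution` and its arguments are hypothetical names.

```python
from typing import Dict, List


def final_distribution(
    vocab_dist: Dict[str, float],
    attention_weights: List[float],
    source_tokens: List[str],
    p_gen: float,
) -> Dict[str, float]:
    """Mix generation and copying (assumed pointer-generator style).

    vocab_dist is the vocabulary distribution M, attention_weights is the
    attention distribution W over source_tokens, and p_gen is the
    generation probability P_gen.
    """
    if not 0.0 <= p_gen <= 1.0:
        raise ValueError("p_gen must lie in [0, 1]")
    if len(attention_weights) != len(source_tokens):
        raise ValueError("one attention weight is needed per source token")
    # Generation part: scale the vocabulary distribution by P_gen.
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    # Copy part: add (1 - P_gen) times the attention mass of each source token,
    # which also gives probability to out-of-vocabulary source words.
    for token, weight in zip(source_tokens, attention_weights):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final
```

In this sketch a rare entity name that is absent from the vocabulary can still be produced, because it inherits probability mass from the attention weights on the input.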

Joint training
We now introduce a turn-boundary detection loss to feed dialogue session segmentation (i.e., boundaries between adjacent turns) information into the model during training. In this loss, $N_s$ is the number of training samples, $m$ is the number of dialogue history tokens, and $c_i$ is the correct boundary label indicating whether the token $x_i$ is the start of the next turn in the dialogue history.
The generation loss $Loss_{gen}$ is the typical negative log-likelihood loss in Seq2Seq learning, computed over the $N_s$ training samples, where $D_i$, $Q_i$, and $P_i$ are the dialogue history, question, and relevant text passage of the $i$th sample, respectively.
Considering these two components, we define a joint loss function that combines them, where $\lambda_g$ is a hyperparameter.
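The joint objective can be sketched as follows. The text does not show which term $\lambda_g$ weights or how the losses are normalized, so this sketch assumes that $\lambda_g$ scales the generation loss, that the detection loss is added unweighted, and that both are averaged over the training samples; all names are hypothetical.

```python
import math
from typing import List, Sequence


def generation_loss(gold_token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of the gold question tokens for one sample."""
    return -sum(math.log(p) for p in gold_token_probs)


def detection_loss(class_probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Cross-entropy over the dialogue-history tokens of one sample.

    class_probs[i][c] is the predicted probability of class c for token i,
    and labels[i] is the gold boundary label c_i.
    """
    return -sum(math.log(probs[label]) for probs, label in zip(class_probs, labels))


def joint_loss(gen_losses: List[float], detect_losses: List[float], lambda_g: float) -> float:
    """Combine per-sample losses; the placement of lambda_g is an assumption."""
    n_samples = len(gen_losses)
    loss_gen = sum(gen_losses) / n_samples
    loss_detect = sum(detect_losses) / n_samples
    return lambda_g * loss_gen + loss_detect
```

Whatever the exact weighting, the point of the joint loss is that gradients from the boundary labels reach the shared BERT encoder while the decoder is trained to generate the question.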

Datasets
We evaluated the proposed BERT2Trans model on two datasets:
1. CQG dataset. We used a processed version of the dataset in [11]; it contains 66298 training dialog samples and 8360 test samples. A related document from seven different domains [28] is given for each dialogue. The average lengths of passages, questions, and answers are 332.9, 6.44, and 3.2 tokens, respectively.
2. DGC dataset. We used a processed version of the dataset in [23]. This contains 72922 training samples and 11577 test samples, the utterance style of which is that of a casual chat. A related document about popular movies is given for each dialogue, which contains descriptive information about the movie.

Evaluation metrics and configuration
Regarding the CQG dataset, we used the evaluation metrics BLEU [26] and ROUGE-L [2] to measure the question generation accuracy and make an aligned comparison with CFNet [11]. These metrics are widely used for measuring textual similarity. The distinct-1 and distinct-2 metrics [22] were also adopted to measure the degree of diversity of the generated questions. Furthermore, we used the average entropy [31] to measure how informative the generated questions are. Regarding the DGC dataset, we closely followed [23] and used perplexity (PPL) and BLEU [26] for comparison. Moreover, we used human evaluation, as it is broadly agreed that objective metrics correlate only weakly with human evaluation results; human evaluation is therefore a necessity in dialogue generation.
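As an illustration of one of these similarity metrics, the sketch below computes a sentence-level ROUGE-L score from the longest common subsequence (LCS) of two token lists. The default `beta=1.0` (a balanced F-measure) and the function names are assumptions; the evaluation scripts used in the experiments may differ in tokenization and weighting.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, computed row by row."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate: List[str], reference: List[str], beta: float = 1.0) -> float:
    """LCS-based F-measure between a generated and a reference question."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

Unlike BLEU, which counts contiguous n-grams, the LCS rewards words that appear in the same order even when other words intervene.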
Table 2 presents the hyperparameters for training the proposed model.We used the pretrained BERT base as the encoder.The BERT2Trans decoder has two layers of attention blocks.All models were trained using Adam [19] for optimization.The learning rate was set to 1e-4 for the CQG task and 5e-5 for the DGC task.Following [11,23], we used three previous turns of QA pairs in the conversation history for the CQG task, and three utterances for the DGC task to make the aligned comparison.All the models were trained for at most 20 epochs.The best hyperparameter values were selected using the validation set, and the results are reported for the test set.

Experimental results
BERT2Trans+nosep, +sep, and +detect refer to no separation between questions and answers in the conversation history, using "[SEP]" as separation between questions and answers in the conversation history, and BERT2Trans with the boundary detection mechanism, respectively (i.e., the three mechanisms described in Sect. 3.2).
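The difference between the three input arrangements can be sketched as follows. The passage-first layout follows the pretraining-style sequence "[CLS] passage [SEP] dialogue [SEP]"; labeling the first token of every turn as a boundary is an assumption about the +detect targets, and the helper names are hypothetical.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_input(passage: List[str], turns: List[List[str]], separate_turns: bool) -> Tuple[List[str], List[int]]:
    """Return tokens and segment ids (0 = passage, 1 = dialogue history).

    separate_turns=False corresponds to +nosep; separate_turns=True inserts
    "[SEP]" between adjacent turns, as in +sep.
    """
    tokens = [CLS] + passage + [SEP]
    segments = [0] * len(tokens)
    dialogue: List[str] = []
    for index, turn in enumerate(turns):
        dialogue.extend(turn)
        if separate_turns and index < len(turns) - 1:
            dialogue.append(SEP)
    dialogue.append(SEP)  # closing separator of the dialogue segment
    return tokens + dialogue, segments + [1] * len(dialogue)


def boundary_labels(turns: List[List[str]]) -> List[int]:
    """Per-token targets for +detect: 1 marks the first token of a turn (assumed)."""
    labels: List[int] = []
    for turn in turns:
        if turn:
            labels.extend([1] + [0] * (len(turn) - 1))
    return labels
```

In the +detect setting the dialogue tokens are arranged as in +nosep, and the boundary labels are used only as an auxiliary training target, so no separators are required at test time.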
All the proposed models performed significantly better than the competitive baseline models and achieved state-of-the-art results in CQG [11].
The proposed BERT2Trans+detect outperforms the best baseline (CFNet [11]) by 8.9% and 3.3% in terms of BLEU3 and BLEU4, respectively. As the average question length is only 6.44 tokens, the improvement is distinctly large in terms of BLEU3 and even more significant in terms of BLEU2 and BLEU1. The proposed models achieve better n-gram similarity between the generated question and the ground truth. BERT2Trans+nosep is not better than the baseline model (CFNet). This suggests that separation information of multiple turns (questions and answers) in conversations is important for conversational QG. Good model-generated questions should be based on the answers in the last turn. As shown in Table 3, ER GAN also has room to improve, which is consistent with the finding in a previous study [16].
On the DGC dataset, we compared the proposed models with the following baseline models: Seq2Seq [1], HRED [30], Transformer [34], and Wizard Transformer [4]. ITE+CKAD and ITE+DD were proposed in [23] and consist of an incremental transformer with a deliberation decoder for DGCs.
Table 4 shows the PPL and BLEU4 scores for the DGC dataset. The baseline results are reported in [23]. Among the baseline models, ITE+DD has the best performance. This model is enhanced by a deliberation decoder [23,39]. To make an aligned comparison with [23], we also equipped the proposed BERT2Trans+detect model with a similar deliberation decoder. The results demonstrate that the use of the deliberation decoder only improved the PPL score, and the proposed boundary detection mechanism significantly improved BLEU4.
The BERT2Trans+detect+DD model outperformed all baseline models in terms of PPL and BLEU4.
Current Seq2Seq models tend to generate generic, non-informative text. We present a comparison of the diversity levels of the results for the CQG dataset in Table 5. Li et al. [22] proposed distinct-1 and distinct-2 to measure the diversity of the generated text. These metrics are computed as the number of distinct unigrams and bigrams, respectively, divided by the total number of generated words. Table 5 presents the distinct unigrams, distinct bigrams, and total number of generated words as 1gram, 2gram, and |wd|, respectively. The proposed BERT2Trans+detect model substantially increased dct-1 and dct-2 (i.e., distinct-1 and distinct-2) over the baseline model (CFNet [11]). This indicates that the proposed models can significantly improve the diversity of the generated questions. The BERT2Trans+sep model achieved the best distinct-1 score; however, it yielded smaller values for 1gram, 2gram, and |wd|, so its high distinct-1 score may be a direct consequence of the relatively low total number of generated words. As shown in Table 5, the proposed models generated more diverse questions compared with the baseline model (CFNet [11]).
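The distinct-n computation described above can be sketched as follows, assuming that the generated questions are already tokenized and that n-grams are pooled over the whole set of generated questions; the function name is hypothetical.

```python
from typing import List, Set, Tuple


def distinct_n(sentences: List[List[str]], n: int) -> float:
    """Number of distinct n-grams divided by the total number of generated words."""
    total_words = sum(len(sentence) for sentence in sentences)
    if total_words == 0:
        return 0.0
    ngrams: Set[Tuple[str, ...]] = set()
    for sentence in sentences:
        for start in range(len(sentence) - n + 1):
            ngrams.add(tuple(sentence[start:start + n]))
    return len(ngrams) / total_words
```

Because the denominator is the total word count, a model that generates fewer words overall can reach a higher score with the same number of distinct n-grams, which is the effect noted for BERT2Trans+sep.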
Table 7 QG examples
Case 1: Passage: ...just pop pin in to say I let my balloon off with a message on it, hope you got it ok and it made you laugh up there". Five more men aged between 17 and 27 have been found hanged in the area since January 2007. Speaking to the Daily Mail newspaper, Liam Clarke's father, Kevin Clarke, said the seven who had killed themselves appeared to have known each other. "we don't know if it is some weird cult or copycat suicides or if they have had some bizarre pact to kill themselves", Clarke said....
Case 2: Passage: ...Kenny said he developed his love of cooking by watching his mother, and his grandmother, who owned a catering business herself. Kenny helped them both in order to remember their tips: How long to cook chicken so it stays wet, and the right amount of tomatoes to add to a spaghetti dish. At the age of 5, he cooked his first dish of shrimp and broccoli ...
Conversation history: Q: How did he obtain his passion for cooking? A: Watching his mother and grandmother, Q: Did he learn from them? A: Yes, Q: What tips he did get? A: How long to cook chicken, how many tomatoes go in a spaghetti dish
Answer: At the age of 5
Gold: How old was he when he started cooking?
BERT2Trans+detect (proposed): When did he get his first dish of cooking?
Case 3: Passage: ...Kournikova made her WTA first show at 15 years old at the US Open where she finally lost against player Steffi Graf. But she made it to the double quarter finals that same match. In 1996, Kournikova won the rookie of the year award and the next year ... aim in 1999, she made her first career WTA final in Key Biscayne against Venus Williams in a tough 3 set match. She also won her first doubles title with Monica Seles in Tokyo. At present Kournikova is more successful on the net than at the net. She remains the "most searched" and "most download" on the internet, three times more popular than the no. 2 sports figure, Michael Jordan. She is still very young and she seems to have a great future ahead!
Conversation history: Q: Where is she most successful? A: Doubles, Q: More popular than whom? A: Michael Jordan, Q: In 1999, she battled whom in a tough match? A: Venus Williams
Answer: 1996
Gold: When was she rookie of the year?
BERT2Trans+detect (proposed): What year did she win the rookie of the year?

Table 6 shows the average response length and average response information entropy with respect to the maximum likelihood unigram model for the generated questions. Following the definition in [31], we computed the unigram probabilities based on the maximum-likelihood unigram distribution of the training corpus. $H_w$ denotes the information entropy per word, computed as in [31]; $H_U$ denotes the information entropy per response; and $|U|$ denotes the average response length. It can be observed that the proposed models, particularly BERT2Trans+detect, generated questions with larger average length and remarkably enhanced the utterance entropy $H_U$. This indicates their capability to generate questions with higher informativeness, and it is in good agreement with the previous experiment regarding the BLEU and ROUGE-L metrics. The human-generated questions were superior to all questions generated by neural generative models in terms of response length and information entropy, suggesting that a higher entropy is desirable [31].
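The entropy statistics reported in Table 6 can be approximated with the sketch below. The exact formula of [31] is not reproduced in the text, so this sketch assumes that the entropy of a response is the sum of $-\log_2 p(w)$ over its words under the training-corpus unigram model, that $H_w$ is the total number of bits divided by the total number of words, and that words unseen in training are skipped; all names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def unigram_model(corpus: List[List[str]]) -> Dict[str, float]:
    """Maximum-likelihood unigram probabilities estimated from a tokenized corpus."""
    counts = Counter(token for sentence in corpus for token in sentence)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def response_entropy(response: List[str], probs: Dict[str, float]) -> float:
    """Bits carried by one response; tokens unseen in training are skipped (assumed)."""
    return -sum(math.log2(probs[token]) for token in response if token in probs)


def entropy_statistics(responses: List[List[str]], probs: Dict[str, float]) -> Tuple[float, float, float]:
    """Return (H_w, H_U, |U|): bits per word, bits per response, average length."""
    entropies = [response_entropy(response, probs) for response in responses]
    total_words = sum(len(response) for response in responses)
    h_w = sum(entropies) / total_words
    h_u = sum(entropies) / len(responses)
    average_length = total_words / len(responses)
    return h_w, h_u, average_length
```

Under these assumptions, questions built from rarer words, or simply longer questions, obtain a larger $H_U$, which matches the reading of Table 6 given above.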

Case study
Table 7 shows three QG examples from the CQG task, where the proposed BERT2Trans+detect model substantially outperformed the baseline model. In case 1, it appears that the baseline model failed to learn the semantics of interconnected questions in the conversation history and generated a question related to the word "daily", whereas our model generated a coherent question depending on the conversation history and the passage. In case 2, although the answer is "at the age of 5", the passage and conversation history are about "Kenny's love of cooking". The proposed model determined the sentences related to the conversation topic and generated a question related to the time when he cooked his first dish. In case 3, the baseline model only generated a generic question "when?" and completely ignored the story of the tennis player. The proposed model generated a specific question using the correct contextual information.
Table 8 shows the responses generated for the DGC dataset by the proposed BERT2Trans+detect and by ITE+DD [23], which achieves the best performance among the baselines in terms of automatic evaluation. In case 1, both BERT2Trans+detect and ITE+DD output meaningful responses depending on the relevant movie descriptions; however, the BERT2Trans+detect model output a more proper response in the context of the conversation. In case 2, ITE+DD generated a generic response, whereas the BERT2Trans+detect model produced the correct answer to the question. This indicates that the proposed model understands implicit semantic information in long-form language.

Attention visualization
Figure 2 shows a comparison of the attention weights in the last self-attention layer of the BERT model (encoder) for three mechanisms for separating questions and answers (or turns) in the conversation history (see Sect. 3.2). As shown in Fig. 2a, the setup of the input sequence is the same as that for pretraining (i.e., {[CLS] K [SEP] D [SEP]}). All words pay considerable attention to "[SEP]". Figure 2b shows that if multiple "[SEP]" tokens are inserted to separate questions and answers in the conversation history, the model will pay great attention ... Figure 2c, d shows the weights of two self-attention heads. As shown on the x-axis, the conversation history is "Q: who knocked on the door? A: the lawyer Q: when? A: at eight o'clock Q: who did he find? A: the artist A: sadly altered for the worse". Without explicit indication in the test data, the proposed model learns the correct boundary between adjacent questions and answers, as demonstrated by the large attention weight received by "[SEP]" (Fig. 2a). The last answer (i.e., "sadly altered for the worse") is the most important information for generating an informative and coherent question. As shown in Fig. 2d, one of the self-attention heads pays special attention to the starting token of the current answer; this is not found in the previous two mechanisms (see Sect. 3.2).

Human evaluation
We also conducted a manual evaluation for both the CQG and DGC datasets, as a supplement to the evaluation through the metrics. For the CQG dataset, 93 randomly sampled questions with relevant passage, conversation history, and current answer were used for human evaluation. Human annotators evaluated the questions generated by the baseline model (CFNet) and those generated by the proposed model in terms of the same metrics as in [11]: "Grammaticality", "Answerability", and "Interconnectedness". "Grammaticality" measures whether the generated question is fluent and grammatically correct. "Answerability" measures whether the generated question can be answered by the current answer [11]. "Interconnectedness" evaluates whether the generated question can refer back to the conversation history [11].
For the DGC dataset, we randomly selected 100 samples for human evaluation. We used the same metrics as in [23] (i.e., "Fluency", "Knowledge Relevance", and "Context Coherence") to compare the proposed model with the best baseline model (ITE+DD) [23]. "Fluency" measures the fluency of a response, "Knowledge Relevance" determines whether relevant knowledge was used in a response, and "Context Coherence" considers whether the responses are coherent with the dialogue history. We anonymized the model identities for each generated response. All metrics are on a 1-3 scale (3 for the best). The results are shown in Tables 9 and 10. The proposed models outperformed the baseline models (CFNet and ITE+DD) in terms of all six metrics. Both the proposed and the baseline models achieved high "Grammaticality" and "Fluency" scores; this is consistent with the results in [11,23]. It is conceivable that both the baseline and the proposed models can generate fluent and grammatical next-turn utterances (question or response). In addition, the proposed BERT2Trans+detect model is superior to the baseline models in terms of the "Answerability" and "Knowledge Relevance" metrics, suggesting that BERT2Trans can better combine passages and dialogue history, and can generate appropriate next-turn utterances (question and response) with better knowledge relevance.

Conclusion and future work
In this paper, we proposed a deep pretrained question generation model to facilitate conversation history understanding and question generation by using the information fusion ability of pretrained language models. We evaluated the model on CQG and DGC datasets in terms of multiple metrics. The results demonstrated that the proposed approach can generate more coherent and informative next-turn utterances (question or response).
In future work, we plan to enhance the model with reasoning ability for multiple passages or knowledge sources. Thereby, the most relevant textual knowledge content may be distinguished, and the effectiveness of conversation generation may be improved. The experiments and the observations regarding the generated utterances indicate that the comprehensive representation of the complexity of multi-turn conversation remains challenging for current models.

Fig. 1 Knowledge-enhanced response generation
Table 7 (Case 1) Conversation history: Q: Who is Liam Clarke's father? A: Kevin Clarke, Q: Who did he speak to? A: The newspaper, Q: Did the people who did this know each other? A: Yes
Answer: Daily Mail
Gold: Which newspaper did he talk to?

Fig. 2 Comparison of attention weights in the last self-attention layer of the BERT encoder

Table 2
Hyperparameters for model training

Table 4
PPL and BLEU4 for the DGC test set.

Table 5
Generation diversity results for the CQG test set

Table 6
Response information entropy for the CQG test set

Table 7
QG examples

Table 8
The Wolf of Wall Street is a 2013 American biographical black comedy crime film directed by Martin Scorsese and written by Terence Winter, based on the memoir of the same name by Jordan Belfort. It recounts Belfort's perspective on his career as a stockbroker in New York City and how his firm, Stratton Oakmont, engaged in rampant corruption and fraud on Wall Street, which ultimately led to his downfall. Leonardo Dicaprio (who was also a producer) stars as Belfort, with Jonah Hill as his business partner and friend Donnie Azoff, Margot Robbie as his wife Naomi Lapaglia and Kyle Chandler as Patrick Denham, the FBI agent who tries to bring him down.
The Duo try to make it to the moving truck, but Sid's dog Scud sees them and gives chase. Buzz saves Woody from Scud but is left behind, so Woody attempts to rescue him with Andy's remote-controlled car, RC. Thinking that Woody is "killing" RC as well, the other toys attack and toss him off the truck. Having evaded Scud, Buzz and RC pick up Woody and continue after the truck. Upon seeing Woody and Buzz together on RC, the other toys realize their mistake, and try to help them get back aboard, but RC's batteries become depleted, stranding them. ... Woody and Buzz stage another reconnaissance mission to prepare for the new toy arrivals...

Table 9
Human evaluation for the CQG dataset

Table 10
Human evaluation for DGC dataset