1 Introduction

There are many scenarios where conversations need to be conducted based on documents; for example, customer service personnel need to answer user questions based on product description documents. In recent years, this document-grounded dialogue task, which uses unstructured documents as the knowledge source [1,2,3], has attracted extensive attention from the natural language processing research community. One of the core challenges in this task is document information selection, which refers to selecting highly relevant knowledge from complex document sources to help generate reasonable responses [4, 5].

Existing knowledge selection methods that achieve strong results can be categorized into three types: extraction-based methods, generation-based methods, and hybrid methods. Extraction-based methods typically use gold document knowledge labels as supervised signals to extract keywords, semantic units, or fragments from documents [6,7,8]. These approaches can reliably identify useful information, but they struggle to smoothly integrate the selected knowledge into the decoder, making it difficult to generate fluent and natural responses. Generation-based methods acquire implicit knowledge through the interaction of dialogue and document, which is incorporated into the decoding process to increase the generation probability of knowledge-related words at each decoding step [9,10,11]. Hybrid methods combine both extraction and generation approaches. For example, [12, 13] provide both retrieval and generation options at each decoding step, i.e., they can either extract a word or semantic unit from the document as the output of the decoding step or generate a response word based on an implicit document representation.

Taking advantage of different granularities of dialogue history has proven effective across the above three kinds of knowledge selection methods. Based on the granularity of history, these methods can be categorized into three types: (1) models based on word-level history [4, 14]; (2) models based on sentence-level history [15, 16]; (3) models based on multi-granularity history [10, 17,18,19]. These methods enhance the understanding of history semantics by introducing multiple granularities, leading to more accurate document knowledge selection. However, these studies focus only on the dialogue history and overlook the role of multi-level response information in document knowledge selection.

Fig. 1 Influence of semantic unit level response information on knowledge selection

In fact, during the decoding process, the hierarchical information of the already generated part of the response can greatly help determine the relevant document information [20]. As shown in Fig. 1, to continue predicting the phrase ‘close the beach’ in the response (in the bracket), the model needs to focus on the snippet ‘refuses to close the beach’ in the document. The multi-level history information alone is not enough to infer this knowledge, whereas the semantic unit ‘didn’t want to’ in the preceding part of the response facilitates a more direct focus on the document phrase ‘refuses to’. Consequently, it enables the model to locate the key knowledge ‘close the beach’ in the document. Compared to the word-level representation of the response, semantic units aggregate the meaning of multiple words, so they can capture the precise essence of phrases, which enables the effective selection of crucial knowledge from the document and provides support for the decoding process.

Therefore, to improve the accuracy of document knowledge selection, we propose a generation-based model that utilizes multi-level responses for document information selection. More specifically, we select information from the document through the response at two levels, enhancing the model’s ability to locate key information in the document. In addition to the word-level response, we introduce a semantic unit-level response representation, which aggregates the meaning of a word and its neighbors. We propose two kinds of semantic unit segmentation methods, static and dynamic, to explore the function of semantic units from the global and local scope respectively. In the static method, we divide the generated response sequence into fixed-size units to obtain the semantic unit representation at the sentence level. In the dynamic method, n-grams are introduced to dynamically find the semantic unit combination related to each response word. Secondly, we fuse the response-document results at the semantic unit level with the response-document results at the word level through a gate mechanism to obtain a document knowledge-related response representation. Finally, the next word of the response is obtained through a standard decoding process. Our main contributions are:

  1. We propose a dialogue generation model that utilizes multi-granularity responses for document information selection, which improves performance by fusing the word-level and semantic unit-level response selection results.

  2. We propose static and dynamic semantic unit division methods to explore the effect of semantic units from the global and local scope respectively.

  3. Experimental results show that we achieve significant improvements over the baseline models on automatic and human evaluation metrics.

2 Related Work

2.1 History Granularity

In knowledge-based dialogue systems, most research focuses on enhancing knowledge selection by utilizing different granularities of dialogue history. These methods can be divided into three categories:

  1. Models based on word-level history. [4] captures the correlation between word-level history and the document using bidirectional word-level history-document attention for knowledge preselection. [14] concatenates word-level history information with each document sentence, then captures the interaction pattern through multiple self-attention layers; knowledge selection is treated as a sequential decoding process. KAT [21] adds a gate as a controller for the knowledge selection blocks of history and document to improve the selection of key knowledge words.

  2. Models based on sentence-level history. [15] learns the distribution of documents over history sentences using a KL loss during training, then selects document knowledge based on this distribution at test time. [16] introduces latent variables for posterior knowledge inference from the previous dialogue sentence and prior knowledge estimation for the current sentence. CKL [22] builds pseudo labels through the correlation between sentence-level dialogue history / sentence-level documents and the gold response, then achieves implicit selection of key sentences in documents and history through this weak supervision.

  3. Models based on multi-granularity history information. [17] first retrieves multiple document sentences based on recent dialogue sentences, then lets the discourse-level history interact with the documents to select key document knowledge. In [10], cross-attention is employed between history and document at the word level to select document word information corresponding to each history sentence, followed by incremental integration to utilize discourse-level dialogue history. HHKS [19] locates important information in the history and documents by merging both word-level and utterance-level attention over the history. In [18], the history sentences are processed in chronological order from distant to recent, interacting with the document through a word-level non-linear transformation to update the selection of document knowledge; the sentence-level history representations are then fed into a GRU to obtain a discourse-level representation of the history, which is used to predict the response.

These studies focus on matching multi-granularity history information with external documents in the process of knowledge selection. They capture relevant document information from multiple levels and perspectives of the history, providing accurate knowledge for response generation and enhancing the utilization of documents. However, these works rarely analyze and utilize the multi-level response. Additional related work on the application of hierarchical history in dialogue generation tasks is given in Appendix A.

2.2 Semantic Unit Granularity

Compared to word-level responses, the semantic unit level aggregates multiple word meanings to achieve a more accurate expression; compared to sentence-level responses, semantic units often focus on key details within the sentence. Some studies have noted the importance of the semantic unit level and conducted diverse research on it. These methods usually divide sentences into several fixed-length or unfixed-length units to obtain semantic units and improve task performance through their effective use.

[23,24,25] improve the word-level attention calculation by performing local attention within fixed-length or unfixed-length units to enhance the performance of self-attention. DIALKI [26] extracts knowledge by dividing a given long document into paragraphs and contextualizing them using the dialogue context. In [13], when selecting document knowledge, a convolutional network is used to divide the matching results of the history-document interaction into units, obtaining the transition probability between the dialogue history and each knowledge semantic unit. [27] applies one-dimensional convolutional pooling to the context to obtain n-gram information, serving as local semantic unit-level information. [28] reads the word embedding sequence through a sliding convolutional unit and uses max-pooling to obtain the sentence-level history representation, interacting the word- and sentence-level representations with candidate responses to select high-probability responses. [29] proposes a multi-level transformer-based model that divides words/sentences into local units of fixed size and uses an RNN to encode the units at both levels, enhancing the ability to capture relevant context. In addition to the above hierarchy division methods, knowledge labels can also help predict knowledge [2]. For example, Re3G [5] designs training objectives over passage spans and responses via labels, achieving fine-grained knowledge extraction. PostKS [30] improves knowledge prediction by constraining the prior and posterior distributions of knowledge with a Kullback-Leibler divergence loss.

These methods improve understanding of information and enhance model performance by obtaining local semantic unit-level information through unit division methods. However, these methods often focus more on dialogue history or external knowledge and rarely utilize the semantic unit-level information of the response in the dialogue generation task.

3 Model

A dialogue history containing M words is represented as \(his=\{h_{1},h_{2},\ldots ,h_{m},\ldots ,h_{M}\}\), and the document corresponding to the history is denoted as \(doc=\{d_{1},d_{2},\ldots ,d_{n},\ldots ,d_{N}\}\), where \(d_{n}\) indicates the n th word of the document. Note that, for the history and document to interact efficiently during the encoding phase, we use the same setting as DoHA [9], where \(doc = [his, doc]\). We use \(res_{<t}\) to represent the first \(t-1\) words of the generated response, where \(res_{<t} =\{y_{1},y_{2},\ldots ,y_{t-1}\}\) and \(y_{t}\) is the word generated at decoding step t.

Fig. 2 Overview of our model. The left part shows the main structure and the right part shows the working mechanism of static and dynamic semantic unit division

The structure of the whole model is shown in Fig. 2. It consists of four modules: the encoder module (Enc), the history information selection module (HIS), the multi-granularity document information selection module (MDIS), and the response decoding module (Dec). The Enc module encodes the history and document at the word level using self-attention. The HIS module selects information from the history with the first \(t-1\) response words as the query. The MDIS module realizes the interaction between multiple levels of the response and the document to acquire key knowledge. Finally, the Dec module decodes the resulting vector to predict the t th word of the response. The MDIS module is the main contribution of this paper. In this module, we first put forward two division methods to obtain the semantic unit-level representation of the response. Secondly, knowledge in the document is selected by the word-level and semantic unit-level responses, yielding two kinds of knowledge-related response representations. Finally, the two representations are fused by a gate mechanism to get the final response representation.

The whole model adopts the common encoder-decoder framework, with the pre-trained BART model as the backbone. Both the encoder and the decoder are stacks of L network layers, and the hidden dimension is d.

3.1 Encoder

We use the BART [31] encoder, which is composed of L network layers consisting of self-attention and feed-forward networks. By inputting the history his and the document doc into the encoder, we obtain the word-level history representation \(H\in R^{M\times d}\) and document representation \(D\in R^{N\times d}\):

$$\begin{aligned} H = Encoder\left( his\right) , \ R^{M\times d} \end{aligned}$$
(1)
$$\begin{aligned} D = Encoder\left( doc\right) , \ R^{N\times d} \end{aligned}$$
(2)
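
A minimal sketch of Eqs. 1 and 2 with the Hugging Face transformers BART encoder is shown below; the checkpoint name and the example strings are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Sketch of Eqs. (1)-(2): encode history and document with a pretrained
# BART encoder. Checkpoint and inputs are illustrative placeholders.
import torch
from transformers import BartTokenizer, BartModel

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base")

his = "i was surprised the mayor declared the beach safe"
doc = "the mayor refuses to close the beaches despite the attacks"

# Following the DoHA setting, the document input is the concatenation [his, doc].
his_ids = tokenizer(his, return_tensors="pt")
doc_ids = tokenizer(his + " " + doc, return_tensors="pt")

with torch.no_grad():
    H = model.encoder(**his_ids).last_hidden_state  # word-level history, (1, M, d)
    D = model.encoder(**doc_ids).last_hidden_state  # word-level document, (1, N, d)
```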

3.2 HIS

The HIS module implements dialogue history information selection. The previously generated response words at time step t are used as the query, and useful information from the history is obtained through a cross-attention (CA) mechanism.

First, the first \(t-1\) response words are encoded by masked self-attention (SA), yielding the word-level representation \({res}_{<t}^w\):

$$\begin{aligned} {res}_{<t}^w = SA(res_{<t}, res_{<t}, res_{<t} ), \ R^{(t-1)\times d} \end{aligned}$$
(3)

Then, \({res}_{<t}^w\) is used as the query to select information from the history, yielding the history-aware response representation \(Y_{<t}^w\):

$$\begin{aligned} Y_{<t}^w = CA({res}_{<t}^w, H, H), \ R^{(t-1)\times d} \end{aligned}$$
(4)
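
A minimal sketch of Eqs. 3 and 4 using PyTorch's built-in multi-head attention is given below; the dimensions and the use of nn.MultiheadAttention are illustrative assumptions, since the full model builds on the BART backbone rather than standalone attention layers.

```python
# Sketch of Eqs. (3)-(4): masked SA over the generated response prefix,
# then cross attention (CA) from the response to the history H.
import torch
import torch.nn as nn

d, heads, t_minus_1, M = 768, 8, 5, 20
sa = nn.MultiheadAttention(d, heads, batch_first=True)
ca = nn.MultiheadAttention(d, heads, batch_first=True)

res_prev = torch.randn(1, t_minus_1, d)   # embeddings of the first t-1 response words
H = torch.randn(1, M, d)                  # word-level history representation

# Causal mask so each response word only attends to earlier words.
causal = torch.triu(torch.ones(t_minus_1, t_minus_1, dtype=torch.bool), diagonal=1)
res_w, _ = sa(res_prev, res_prev, res_prev, attn_mask=causal)  # Eq. (3)
Y_w, _ = ca(res_w, H, H)                                       # Eq. (4): history-aware response
```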

3.3 MDIS

In the MDIS module, we use multi-granularity responses for document knowledge selection. Based on the word-level history-aware response \(Y_{<t}^w\) obtained from the HIS module, we use static and dynamic division methods to obtain a semantic unit-level response in the SU (semantic unit) Division step. Secondly, we use the word-level and semantic unit-level responses to conduct Document Information Selection (DIS), obtaining response representations related to document information at the two levels. Finally, a gate mechanism fuses the response representations of the two levels into a multi-granularity response representation containing document information. The details of each step are described below.

3.3.1 SU Division

We use static and dynamic methods to divide the response into fixed-length and unfixed-length semantic units respectively.

3.3.1.1 Static Semantic Unit (SSU) division

First, the word-level response of the first \(t-1\) words \(Y_{<t}^w\) is divided into \(t-1\) prefix sequences: \(Q_{1}= \{y_{1}\}\), \(Q_{2} = \{y_{1},y_{2}\}\), \(Q_{3} = \{y_{1},y_{2},y_{3}\}\), ..., \(Q_{t-1} = \{y_{1},y_{2},y_{3},\ldots , y_{t-1}\}\).

Next, for each sequence (taking \(Q_{t-1}\) as an example), fixed-length semantic unit division is conducted with sliding step s and unit size k, so the sequence \(Q_{t-1}\) consists of \(Z=((t-1-k)/s +1)\) semantic units of k words each. The multi-semantic unit representation \((Q_{t-1}^{k})^{SSU} \in R ^{Z \times d}\) of the sequence \(Q_{t-1}\) is obtained by summing the word representations inside each unit, and its global static semantic unit representation \((Y_{t-1}^{k})^{SSU} \in R ^{d}\) is obtained by summing the Z unit representations. The semantic unit representations corresponding to the first \(t-1\) response sequences are then {\((Y_{1}^{k})^{SSU}\), \((Y_{2}^{k})^{SSU}\),..., \((Y_{t-1}^{k}) ^{SSU}\)}.

We choose several units of different sizes \(K=\{k_{1},k_{2},\ldots , k_{L}\}\) to simulate semantic units of different lengths. We then sum the static semantic unit representations of these different sizes to integrate meanings at different semantic scopes. For the sequence \(Q_{t-1}\), its multi-size static semantic unit representation is:

$$\begin{aligned} Y_{t-1} ^{SSU}= \sum _{k \in K} (Y_{t-1}^{k})^{SSU}, \ R^{d} \end{aligned}$$
(5)

We use the static fixed-size unit representation of the first \(t-1\) words as the global semantic unit representation of the \((t-1)\)th word. The static semantic unit representation of the first \(t-1\) words in the response can then be represented as:

$$\begin{aligned} Y_{<t}^{SSU}= [Y_{1}^{SSU},Y_{2}^{SSU},\ldots ,Y_{t-1}^{SSU}], \ R^{(t-1) \times d} \end{aligned}$$
(6)
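
A minimal sketch of the SSU computation (Eqs. 5 and 6) is given below; the stride value and the handling of prefixes shorter than a unit are illustrative assumptions.

```python
# Sketch of static semantic unit (SSU) division (Eqs. 5-6).
import torch

def ssu_representation(Y_w: torch.Tensor, sizes=(2, 3, 4), stride=1) -> torch.Tensor:
    """Y_w: (t-1, d) word-level response; returns the (t-1, d) SSU representation."""
    t_minus_1, d = Y_w.shape
    out = []
    for i in range(1, t_minus_1 + 1):          # prefix Q_i = {y_1, ..., y_i}
        prefix = Y_w[:i]
        rep = torch.zeros(d)
        for k in sizes:                        # multi-size units, summed (Eq. 5)
            if i < k:                          # assumption: prefix shorter than the unit uses the whole prefix
                rep = rep + prefix.sum(dim=0)
                continue
            # fixed-size units of length k with the given stride, each summed over its words,
            # then all Z units summed into one global representation
            units = prefix.unfold(0, k, stride).sum(dim=-1)   # (Z, d)
            rep = rep + units.sum(dim=0)
        out.append(rep)
    return torch.stack(out)                    # Eq. (6): (t-1, d)
```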
3.3.1.2 Dynamic Semantic Unit (DSU) division

The core idea of dynamic semantic unit segmentation is to measure the overall correlation of several n-gram semantic units starting from each word and to select the n-gram unit with the highest correlation as that word's semantic unit.

First, obtain the relevance scores between the response words. For the original response embedding \(Y_{<t}^{emb}\), apply self-attention (SA) to get the attention values \(a_{ij}\) as the word-word semantic correlation scores:

$$\begin{aligned} A=(a_{ij})_{(t-1)\times (t-1)}, \ R^{(t-1)\times (t-1)} \end{aligned}$$
(7)

Then, calculate the K-gram semantic relevance score of each word. For the \((t-1)\)th word, construct a unit of length K with that word as the starting point, and calculate the average attention score between the words within the unit as the K-gram semantic relevance score \(S_{t-1}^{K} \in R^{1}\) for that word (for example, the 3-gram semantic relevance score of \(y_1\) is \(S_{1}^{3}=(a_{12}+a_{23}+a_{13}) / 3\)):

$$\begin{aligned} S_{t-1}^{K}= \frac{\sum _{i=t-1}^{t+K-2}\sum _{j=i+1}^{t+K-1}a_{ij}}{\sum _{g=1}^{K-1}g}, \ R^{1} \end{aligned}$$
(8)

The next step is semantic unit selection. For the \((t-1)\)th word, we choose multiple units of different sizes \(K=\{ k_{1}, k_{2},\ldots , k_{L} \}\) to simulate L semantic units with different sizes. The semantic relevance vector for the \((t-1)\)th word is therefore \(S_{t-1} = \{S^{k_{1}}_{t-1},S^{k_{2}}_{t-1},\ldots ,S^{k_{L}}_{t-1}\} \in R^{L}\), and the semantic relevance matrix for the response containing \(t-1\) words is \(S=\{ S_{1},S_{2},\ldots ,S_{t-1}\} \in R^{(t-1) \times L} \). We take the size corresponding to the maximum value in each row of S, obtaining a dynamic unfixed unit size for each word. For the first \(t-1\) words, the dynamic unit sizes are:

$$\begin{aligned} K^{*} = \{ k^{*}_{1},k^{*}_{2},\ldots ,k^{*}_{t-1}\}, \ R^{t-1} \end{aligned}$$
(9)

Finally, we compute the representation of the dynamic semantic units for the response. Using \(K^{*}\), we sum the embeddings of the words within the corresponding unit for each word in the response, obtaining the unit representation \(Y_{t-1} ^{DSU} \in R^{d}\) for the \((t-1)\)th word:

$$\begin{aligned} Y_{t-1}^{DSU}= \sum _{g=t-1}^{t+k_{t-1}^*-2} Y_{g}, \ R^{d} \end{aligned}$$
(10)

Therefore, the representation of dynamic semantic units for the first \(t-1\) words of the response is denoted as:

$$\begin{aligned} Y_{<t}^{DSU}= \{Y_{1}^{DSU},Y_{2}^{DSU},\ldots ,Y_{t-1}^{DSU}\}, \ R^{(t-1) \times d} \end{aligned}$$
(11)
Fig. 3 Dynamic semantic unit division process. The three parts divided by arrows are obtaining the response word semantic relevance matrix A, calculating the response N-gram relevance matrix S, and acquiring the suitable unfixed-size semantic unit corresponding to each response word

Here we explain the dynamic semantic unit segmentation process with an example. Figure 3 illustrates the process for the response shown in Fig. 1. For ease of observation, we demonstrate the process using the five most recently generated words of the response, ‘refuse to close the beaches’. First, we calculate the masked self-attention of the embeddings for ‘refuse to close the beaches’ to obtain the lower triangular correlation matrix A of word-word relevance scores. Then, using Eq. 8, we obtain the K-gram semantic relevance matrix S for each word; in this example, K = 2, 3, 4. For instance, the 2/3/4-gram semantic units corresponding to ‘refuse’ are ‘refuse to’/‘refuse to close’/‘refuse to close the’, and [refuse, 2-gram] in S records the relevance score of ‘refuse to’. Finally, by taking the maximum value in each row of S, we get the final suitable unit size for each word. For example, by comparing the 2/3/4-gram scores for ‘refuse’, we determine that the highest-scoring semantic unit is ‘refuse to’. In this way, we obtain the dynamic semantic units for the entire response sequence.
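
A minimal sketch of the DSU computation (Eqs. 7-11), following this example, is given below; the single-head relevance matrix and the truncation of units near the end of the prefix are illustrative assumptions.

```python
# Sketch of dynamic semantic unit (DSU) division (Eqs. 7-11).
import torch

def dsu_representation(Y_emb: torch.Tensor, A: torch.Tensor, sizes=(2, 3, 4)) -> torch.Tensor:
    """Y_emb: (t-1, d) response embeddings; A: (t-1, t-1) word-word relevance scores (Eq. 7)."""
    t_minus_1, d = Y_emb.shape
    reps = []
    for i in range(t_minus_1):                 # 0-based index of each response word
        best_score, best_k = float("-inf"), 1
        for k in sizes:
            end = min(i + k, t_minus_1)        # assumption: truncate units that run past the prefix
            idx = list(range(i, end))
            if len(idx) < 2:
                continue
            # Eq. (8): average pairwise relevance inside the K-gram unit
            pairs = [(p, q) for p in idx for q in idx if q > p]
            score = sum(A[p, q].item() for p, q in pairs) / len(pairs)
            if score > best_score:
                best_score, best_k = score, len(idx)   # Eq. (9): keep the best unit size
        # Eq. (10): sum the embeddings inside the selected unit
        reps.append(Y_emb[i:i + best_k].sum(dim=0))
    return torch.stack(reps)                   # Eq. (11): (t-1, d)
```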

3.3.2 DIS

DIS (document information selection) uses the two response granularities to interact with the document, obtaining word-level and semantic unit-level document-related response representations.

Word-level document information selection: the word-level representation of the response is used as a query to select information from the document D, resulting in a new word-level response representation.

$$\begin{aligned} Y_{<t}^W= CA(Y_{<t}^W, D, D), \ R^{(t-1) \times d} \end{aligned}$$
(12)

Semantic unit-level document information selection: the semantic unit representation of the response is used as a query to select information from the document D, resulting in a new semantic unit-level response representation \(Y_{<t}^{SU} \in R^{(t-1) \times d}\).

$$\begin{aligned} Y_{<t}^{SU}= CA(Y_{<t}^{SU}, D, D), \ R^{(t-1) \times d} \end{aligned}$$
(13)

Note that \(Y_{<t}^{SU}\) in Eq. 13 can be either the static or the dynamic semantic unit representation of the response, i.e., \(Y_{<t}^{SSU}\) or \(Y_{<t}^{DSU}\).

3.3.3 Fusion

We apply a gate mechanism to fuse the document-related response representations at the two levels. First, we feed the concatenation of the word-level and semantic unit-level representations into a multilayer perceptron (mlp). Then, we pass the output through a sigmoid function to obtain the weights, denoted as w. Using these weights, a weighted sum of the two representations is performed to obtain the final response representation \(Y_{<t}\):

$$\begin{aligned} w = sigmoid(mlp([Y_{<t}^W,Y_{<t}^{SU}])) \end{aligned}$$
(14)
$$\begin{aligned} Y_{<t} = w*Y_{<t}^W + (1-w)*Y_{<t}^{SU}, \ R^{(t-1) \times d} \end{aligned}$$
(15)
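
A minimal sketch of Eqs. 12-15 is given below; the layer sizes and the use of PyTorch's built-in attention and linear layers are illustrative assumptions.

```python
# Sketch of Eqs. (12)-(15): both response granularities query the document
# via cross attention, and a gate fuses the two results.
import torch
import torch.nn as nn

d, heads, t_minus_1, N = 768, 8, 5, 60
ca_word = nn.MultiheadAttention(d, heads, batch_first=True)
ca_unit = nn.MultiheadAttention(d, heads, batch_first=True)
gate_mlp = nn.Linear(2 * d, 1)                # illustrative one-layer gate

Y_w = torch.randn(1, t_minus_1, d)            # word-level history-aware response (from HIS)
Y_su = torch.randn(1, t_minus_1, d)           # SSU or DSU representation
D = torch.randn(1, N, d)                      # document representation

Y_w_doc, _ = ca_word(Y_w, D, D)               # Eq. (12)
Y_su_doc, _ = ca_unit(Y_su, D, D)             # Eq. (13)

w = torch.sigmoid(gate_mlp(torch.cat([Y_w_doc, Y_su_doc], dim=-1)))  # Eq. (14)
Y = w * Y_w_doc + (1 - w) * Y_su_doc                                  # Eq. (15)
```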

3.4 Decoder

The decoder takes the response representation \(Y_{<t}\) obtained from MDIS and decodes the next word. The representation is passed through an mlp network followed by a softmax function to generate the probability distribution of the next word.

$$\begin{aligned} P(y_{t})=softmax(mlp(Y_{<t})) \end{aligned}$$
(16)

4 Experiment

4.1 Dataset

4.1.1 CMU-DoG

The CMU-DoG dataset [32] comprises over 4,000 human-human dialogues, each averaging 21.43 turns. These dialogues were collected through Amazon Mechanical Turk. The dataset leverages Wikipedia descriptions of diverse movie genres as documents, serving as a reliable foundation for generating dialogues. Each dialogue session focuses on a specific document section, which can either contain the movie's basic information or a description of the plot. This setup ensures that the dialogues are contextually enriched and built upon the details provided in the corresponding document sections.

4.1.2 Wizard of Wikipedia

Wizard of Wikipedia [33] is a large dataset collected by Facebook AI Research for training and evaluating document-driven conversation models. One participant plays the ‘Wizard’ role while the other plays the ‘Apprentice’. The Wizard can use an information retrieval system to answer questions, make statements, and engage in conversation with the Apprentice, whereas the Apprentice has no direct access to the document knowledge and must obtain it through the Wizard. The test set is divided into two subsets, WoW-seen and WoW-unseen: the dialogues of the former contain only topics that appeared in the training set, while those of the latter contain new topics.

4.2 Automated Evaluation

For automatic evaluation, we use BLEU1-4, Rouge-L [34], and Meteor [35] to assess the consistency between the predicted responses and the gold responses, and we use Doc-BLEU and NW to measure the ability to select document knowledge. BLEU1-4 calculates the overlap of continuous n-grams between two texts; as the n-gram size increases, the word-order requirement becomes stricter. For example, 1-gram does not consider word order, while 4-gram requires a perfect match of four consecutive words. Rouge-L measures the longest common subsequence between the generated response and the reference in terms of lexical selection and word order. Meteor measures the similarity between the generated response and the reference. Doc-BLEU evaluates the quality of dialogue responses by the 1-gram match between the predicted response and the document. NW measures the effectiveness of document knowledge selection via the intersection of the generated response and the document. Suppose the token sets of the history, predicted response, and document are H, T, and N respectively, and the stop-word set is S; NW is then computed as \(|((T \cap N) \setminus H) \setminus S|\).
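
A minimal sketch of the NW computation is given below; whitespace tokenization and the stop-word list are illustrative assumptions.

```python
# Sketch of the NW metric, |((T ∩ N) \ H) \ S|: document words that appear in
# the prediction but not in the history and are not stop words.
def nw_score(prediction: str, document: str, history: str, stop_words: set) -> int:
    T = set(prediction.lower().split())   # predicted response tokens
    N = set(document.lower().split())     # document tokens
    H = set(history.lower().split())      # history tokens
    return len(((T & N) - H) - stop_words)

# Hypothetical example strings:
stop = {"the", "to", "a", "of"}
print(nw_score("he refuses to close the beaches",
               "the mayor refuses to close the beaches",
               "the mayor declared the beach safe", stop))  # counts {refuses, close, beaches} -> 3
```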

4.3 Human Evaluation

We evaluate the models from three aspects: (1) fluency of generated responses (Fluency); (2) relevance between generated responses and reference responses (Ref-Rel); (3) relevance of generated responses to the document (Doc-Rel). We compare DoHA, as a benchmark model, with ours. For the three models, we randomly sample 150 response predictions from the test set for evaluation, and each result is evaluated by three annotators. Each criterion is rated on a five-point scale, where 1, 3, and 5 indicate unacceptable, moderate, and perfect performance, respectively. The average score serves as the final result.

4.4 Baseline

DRD [36]. To improve model performance with limited training data, a disentangled response decoder is designed to separate the parts that are highly related to knowledge from the other parameters of the model.

BART [31]. The dialogues and documents are concatenated and fed into the encoder to get a joint encoding representation, which is then fed into the decoder to predict the responses.

CoDR [9]. A BART-based model that takes contextualized documents and dialogue history as the input of the decoder to fuse the different information better.

DoHA [9]. Uses cross attention_context and cross attention_doc separately to let the response interact with the dialogue history and the document respectively, thereby obtaining responses that are coherent with the context and rich in document information.

DIALKI [26]. A knowledge identification model that provides dialogue-contextualized passage encodings and better locates conversation-related knowledge by utilizing the document structure.

KAT [21]. A three-stage dialogue generation architecture in which a knowledge-aware transformer achieves good performance through a dynamic knowledge selection mechanism.

CKL [22]. A contextual knowledge learning model that involves latent vectors to capture differential weighting of context utterances and knowledge sentences during the training process.

HHKS [19]. A hierarchical history-based knowledge selection model for dialogue generation that locates key information in the history and document by merging both word-level and utterance-level attention over the history.

4.5 Implementation Details

The samples in the dataset are input into the model as tuples \((d_i, h_i, y_i)\), where \(h_i\) is the dialogue history, \(d_i\) is the corresponding document, and \(y_i\) is the reference response. The response \(y_i\) is based on \(d_i\) and serves as the next turn of dialogue for \(h_i\). For CMU-DoG, the document is a paragraph from a Wikipedia movie description. For WoW, the document combines the first paragraphs of seven Wikipedia articles. For dynamic semantic unit division, to avoid interference with word-level SA and better meet the needs of semantic unit division, we add a new SA block to learn word-level relevance. To test the scalability of semantic units across different methods, in addition to conducting experiments with BART-base as the backbone, we also extend the static and dynamic semantic units to a stronger model, HHKS. In the experiments using the BART backbone and reproducing DoHA, we follow the same data setting of one history utterance as DoHA; in the experiments adding semantic units to HHKS, our parameter settings are identical to those of HHKS. In addition, considering GPU limitations and the time consumption of the model, we conduct most experiments and analyses with BART as the backbone, including the ablations, human evaluation, and case study.

5 Experiment Results and Analysis

5.1 Main Results

Table 1 Automated metrics of dialogue generation quality on the three datasets. The gray background represents the model results after adding semantic units. Ours-SSU/DSU means adding semantic units based on BART, while HHKS-SSU/DSU means adding semantic units based on HHKS. The bold indicates the best results
Table 2 Automated metrics of document knowledge selection effect on the three datasets; the bold indicates the best result

Dialogue generation quality. We compare the dialogue generation performance of the models on the CMU-DoG and WoW datasets in Table 1. SSU and DSU represent our method with the static semantic unit and the dynamic semantic unit respectively. Among all baselines, the model structure closest to Ours-SSU/DSU is DoHA, which also uses BART as the backbone; our methods add semantic units on top of DoHA. From Table 1, Ours-SSU/DSU surpasses most baselines on most metrics. The comparison with DoHA demonstrates that our methods are effective in most cases, showing that adding semantic units improves the response generation ability of the BART-based backbone. In addition, both of our models built on HHKS perform better than the baseline models across almost all metrics. On the response consistency indicators BLEU1-4, Meteor, and Rouge-L, both HHKS-SSU and HHKS-DSU generate responses that are more consistent with the gold responses. The dialogue quality thus improves after adding semantic units, which indicates that our proposed method scales to different models.

Knowledge selection effect. From the document relevance indicators NW and Doc-BLEU in Table 2, our predictions are more closely aligned with the document information than the baselines, effectively incorporating document knowledge into the dialogue. The fixed semantic unit method SSU achieves a larger performance improvement on CMU-DoG than the unfixed DSU, while on WoW the opposite holds. This is because the document knowledge in the WoW dataset is richer and denser, with more document words reflected in the dialogue, which suits the dynamic selection of the unfixed method. In contrast, in CMU-DoG, document knowledge is relatively sparse, and the fixed process can better integrate information from various parts in a global way.

Table 3 Human evaluation results of the baseline and our models on the datasets

Human evaluation result. The human evaluation results in Table 3 show that our model performs better than the baseline models in terms of fluency, reference relevance, and document relevance.

5.2 Ablation Results

To analyze the influence of various parts of the model on the results, we separately examine the details of the static and dynamic semantic unit methods. For the static semantic unit method, we conduct experiments on the size and number of semantic units (Sect. 5.2.1). For the dynamic semantic unit method, we analyze the effect of varying the sizes of the semantic units (Sect. 5.2.2). To facilitate comparison across models, we present the ablation results as bar charts; the specific metric values are given in Appendix B.

5.2.1 Static Unit Method

To compare the effect of unit number and size, we use n to denote the number of semantic units and k to denote the minimum unit size. For example, n3k2 means the number of units is 3 and the minimum unit size is 2, i.e., we experiment with units of size 2/3/4.

Fig. 4 The left chart depicts the variation in BLEU scores when changing the size of the semantic unit while keeping the number of semantic units constant at n = 1. The right chart fixes the minimum unit size at k = 2 and studies the impact of varying the number of units (n = 1/2/3) on the BLEU scores

5.2.1.1 Unit Size

To observe the effect of different semantic unit sizes, we fix the number of units (n=1) and test semantic units of different sizes (k=2/3/4). From the left chart in Fig. 4, we can see that our model (n1k2, n1k3, n1k4) significantly outperforms the comparison model DoHA on all BLEU scores. This indicates that compared to the word-level DoHA, our model has a better ability to deal with semantic unit-level responses, thus improving the final results. Secondly, we can see that the n1k3 model scores higher than the other two models on BLEU-2, BLEU-3, and BLEU-4 metrics. This suggests that the n1k3 model performs better when dealing with longer word groups. It indicates that the n1k3 model excels in capturing the semantic relationships between words and the overall structure of sentences compared to the other models.

5.2.1.2 Unit Num

To observe the effect of the number of semantic units on the experimental results, we fix the minimum semantic unit size at 2 and test configurations containing 1/2/3 units respectively. From the right chart in Fig. 4, we can see that our model outperforms the DoHA model on all BLEU scores. For different numbers of units (starting from k=2), the performance gap between n1k2 and n2k2 is insignificant, but the n3k2 model performs best, scoring notably higher than the other two models on BLEU-1, BLEU-2, BLEU-3, and BLEU-4. This suggests that when more unit sizes (2/3/4) are integrated, the model can capture the meanings of longer word groups.

5.2.2 Dynamic Unit Method

Fig. 5 Variation in BLEU scores when changing the maximum unit size of the dynamic unit method on the WoW dataset

We explore the impact of unfixed unit combinations by controlling the maximum length of the variable-length units. Note that 2-gram indicates that the variable-length unit can be either 1-gram or 2-gram: we first calculate the average value of all 2-gram relevance scores, select 2-gram for a word if its score is greater than the average, and otherwise choose 1-gram. The 3-gram and 4-gram settings are those proposed in the paper, i.e., for 3-gram the variable-length unit is chosen from 2/3-gram, and for 4-gram from 2/3/4-gram. The results in Fig. 5 show that on the WoW-seen dataset, our model performs significantly better on BLEU-2, 3, and 4 when the maximum unit size is 4-gram. This suggests that dynamically choosing an appropriate semantic unit size can capture the patterns of dialogue, showing better results on topics seen during training. On the WoW-unseen dataset, however, our model with a maximum unit size of 2-gram performs better. In this case, smaller context units (such as 2-gram) may be more useful because the model needs to understand and generate responses more flexibly, rather than relying on specific patterns learned during training.

5.3 Efficiency Analysis

The main contribution of this paper is adding a semantic unit level to the response, so it is necessary to discuss the efficiency of our method. We first analyze the complexity of the two kinds of semantic units theoretically and then use HHKS combined with DSU to further illustrate the impact of semantic units on training.

For the SSU method, the main operations are SSU division (SSUD) and cross attention between SSUs and the document (CA). The complexity of SSUD is related to the response length \(t-1\), the unit number n, and the unit size k. Among them, n and k are fixed constants in the experiments: n lies in [1,3] and k in [2,4]. Therefore, the complexity of SSUD is governed by the response length. Specifically, at time step \(t-1\), the length of the response is \(t-1\); we first split the response into \(t-1\) sequences and divide SSUs for each sequence. As the time step t grows, the complexity of SSUD grows with the square of t, i.e., \(\mathcal {O}(t^2)\). CA is performed for each unit division, and its complexity is also \(\mathcal {O}(t^2)\). When the unit number is 3, the above operations are performed three times, taking up to \(\mathcal {O}(3t^2)\). For the overall SSU method, the time complexity is therefore \(\mathcal {O}(t^2)\).

For the DSU method, the main operations are SA, n-gram relevance calculation (NGRC), n-gram selection (NGS), and cross attention between DSUs and the document (CA). SA and NGRC are performed over the decoding length \(t-1\), and their complexity is \(\mathcal {O}(t)\). In our experiments, the n-gram sets can be [1,2]/[2,3]/[2,3,4], so the complexity of NGRC is at most \(\mathcal {O}(3t)\). In NGS, it is still necessary to first divide the response of length \(t-1\) into \(t-1\) sequences and then perform NGS and CA on each sequence, so the complexities of NGS and CA are both \(\mathcal {O}(t^2)\). The overall time complexity of the DSU method is therefore \(\mathcal {O}(t^2)\).

We take the more complex HHKS-DSU method as an example to illustrate the overhead introduced during training. On an NVIDIA TITAN RTX 3090 GPU with the WoW training set, adding DSU to HHKS increases each epoch by approximately 1 h in time and 934 MB in memory. This 1 h covers the main operations SA, NGRC, NGS, and CA with the document. During model training, the maximum training time is 50 epochs (training is stopped early once the model has converged), which corresponds to an increase of at most approximately 2 days. The introduction of semantic units into the response inevitably increases training time, which is a limitation of this paper.

5.4 Case Study

We list the responses generated by our models and the strongest baseline given the same history in Table 4. The main idea of the dialogue history is that the speaker was surprised at the mayor's decision to declare the beach safe after capturing a tiger shark: although the tiger shark is large and people might mistake it for the ‘great white shark’, it is absurd to declare the beach safe because of this.

The reference response evaluates the dialogue history and further adds that the mayor refused to close the beach in time, emphasizing his decision-making mistakes. DoHA's response is a simple ‘yes’ that changes the topic without mentioning more of the dialogue history; although it is fluent and pushes the dialogue forward, it has no relevance to the document. The static unit response mentions the next attack incident from the document, accurately capturing the document's knowledge while generating a response closely linked to the history and furthering the conversation. The dynamic unit response evaluates the mayor's actions in the dialogue history and explains that he acted this way because the size of the discovered shark was small; it effectively uses relevant information from both the history and the document.

Table 4 A dialogue example from the CMU-DoG dataset

To observe the underlying reasons for the different content of each model, we compare the top-10 document words that each model focused on throughout the response generation process in Table 5. The knowledge chosen by DoHA is not applied in the response, whereas the knowledge selected by our models is applied in the dialogue. In this example, the words ‘son’, ‘shock’, ‘go’, and ‘into’ from the static model and the keyword ‘shark’ from the dynamic model all appear in the final response. In terms of knowledge selection, our static and dynamic models are more effective: the selected knowledge is not only seamlessly connected with the history but also accurately captured from the document and integrated into the dialogue.

Table 5 Top 10 selected knowledge words for example in Table 4
Fig. 6 Visualization of semantic unit level attention scores over the last layer for the static and dynamic methods

In addition, to further examine the effect of our static and dynamic semantic units, we visualize the attention scores of semantic units over document words at a single decoding step in Fig. 6.

In our static unit method, we compare the cross-attention weights over knowledge at the unit level and at the word level. At this point, the model has already generated half of the sentence, ‘I know, scared me so bad. And then another attack happens! The police chief's son goes into’, and the next word to be generated is ‘shock’. The vertical axis shows the five knowledge words with the highest average weight, and the horizontal axis shows the size-two semantic units of the current response. Although both the unit and word levels select the word ‘shock’, looking horizontally at the high-weight response content, phrases in Fig. 6 such as ‘scared me’, ‘, scared’, and ‘attack happens’ are semantically highly correlated with ‘shock’. This indicates that the semantic information of the static units in the first half of the response plays a role in knowledge selection. In this example, the unit-level response information captures accurate knowledge better, complementing the word-level cross-attention and enhancing knowledge selection.

In our dynamic unit method, we conduct a similar analysis. At the current moment, the model has already generated ‘he had to, considering the size of the shark posed a small tiger’, and the next word to be generated is ‘shark’. The vertical axis shows the five knowledge words with the highest average weight, and the horizontal axis shows the current dynamic semantic unit division. We find that several of the preceding units, such as ‘the shark posed’, ‘shark posed’, ‘a small tiger’, and ‘small tiger’, are highly correlated with the knowledge word ‘shark’. The dynamic units in the first half of the response thus play a useful role in knowledge selection.

The above case analysis proves that our method effectively captures the semantic intricacies across different kinds of units, offering a novel viewpoint for interacting with knowledge. This significantly enhances the knowledge selection ability, leading to more fluent and document-aware responses.

6 Conclusion

In this paper, we introduce a document-grounded dialogue response generation model based on multi-granularity responses to provide more accurate document information selection. It interacts with the document using representations of the response at the word level and the semantic unit level to pinpoint key information, thereby producing more appropriate responses. Experimental results on two public datasets consistently indicate that the performance of this model significantly surpasses the baseline models.