1 Introduction

There are many scenarios where conversations need to be conducted based on documents; for example, customer service personnel need to answer user questions based on product description documents. In recent years, this document-grounded dialogue task, which uses unstructured documents as the knowledge source [1,2,3], has attracted extensive attention from the natural language processing research community. One of the core challenges in this task is document information selection, which refers to selecting highly relevant knowledge from complex document sources to help generate reasonable responses [4, 5].

Existing knowledge selection methods that achieve strong results can be categorized into three types: extraction-based methods, generation-based methods, and hybrid methods. Extraction-based methods typically use gold document knowledge labels as supervised signals to extract keywords, semantic units, or fragments from documents [6,7,8]. These approaches can reliably identify useful information, but they struggle to smoothly integrate the selected knowledge into the decoder, making it difficult to generate fluent and natural responses. Generation-based methods acquire implicit knowledge through the interaction of dialogue and document, which is incorporated into the decoding process to increase the generation probability of knowledge-related words at each decoding step [9,10,11]. Hybrid methods combine both extraction and generation approaches. For example, [12, 13] provide both retrieval and generation options at each decoding step, i.e., they can either extract a word or semantic unit from the document as the output of the decoding step or generate a response word based on an implicit document representation.

Taking advantage of different granularities of dialogue history has proven effective across the above three kinds of knowledge selection methods. Based on the granularity of history, these methods can be categorized into three types: (1) models based on word-level history [4, 14]; (2) models based on sentence-level history [15, 16]; (3) models based on multi-granularity history [10, 17,18,19]. These methods enhance the understanding of history semantics by introducing multiple granularities, leading to more accurate document knowledge selection. However, these studies focus only on the dialogue history and overlook the role of multi-level response information in document knowledge selection.

Fig. 1 Influence of semantic unit level response information on knowledge selection

In fact, during the decoding process, the hierarchical information of the already generated part of the response can greatly help determine the relevant document information [20]. As shown in Fig. 1, to continue predicting the phrase ‘close the beach’ in the response (in the bracket), the model needs to focus on the snippet ‘refuses to close the beach’ in the document. The multi-level history information alone is not enough to infer this knowledge, whereas the semantic unit ‘didn’t want to’ in the preceding part of the response facilitates a more direct focus on the document phrase ‘refuses to’. Consequently, it enables the model to locate the key knowledge ‘close the beach’ in the document. Compared to the word-level representation of the response, semantic units aggregate the meaning of multiple words, so they can capture the precise essence of phrases, which enables the effective selection of crucial knowledge from the document and provides support for the decoding process.

Therefore, to improve the accuracy of document knowledge selection, we propose a generation-based model that utilizes multi-level responses for document information selection. More specifically, we select information from the document through the response at two levels, enhancing the model’s ability to locate key information in the document. In addition to the word-level response, we introduce a semantic unit-level response representation, which aggregates the meaning of a word and its neighbors. We propose two kinds of semantic unit segmentation methods, static and dynamic, to explore the function of semantic units from the global and local scope respectively. In the static method, we divide the generated response sequence into fixed-size units to obtain the semantic unit representation at the sentence level. In the dynamic method, n-grams are introduced to dynamically find the semantic unit combination related to each response word. Secondly, we fuse the response-document results at the semantic unit level with the response-document results at the word level through a gate mechanism to obtain a document knowledge-related response representation. Finally, the next word of the response is obtained through a standard decoding process. Our main contributions are:

  1. We propose a dialogue generation model that utilizes multi-granularity responses for document information selection, which improves performance by fusing the word-level and semantic unit-level response selection results.

  2. We propose static and dynamic semantic unit division methods to explore the effect of semantic units from the global and local scope respectively.

  3. Experimental results show that we achieve significant improvements over the baseline models on automatic and human evaluation metrics.

2 Related Work

2.1 History Granularity

In knowledge-based dialogue systems, most research focuses on enhancing knowledge selection by utilizing different granularities of dialogue history. These methods can be divided into three categories:

  1. Models based on word-level history. [4] captures the correlation between word-level history and the document using bidirectional word-level history-document attention for knowledge preselection. [14] concatenates word-level history information with each document sentence, then captures the interaction pattern through multiple self-attention layers; knowledge selection is treated as a sequential decoding process. KAT [21] adds a gate as a controller for the knowledge selection blocks of history and document to improve the selection of key knowledge words.

  2. Models based on sentence-level history. [15] learns the distribution of documents over history sentences using a KL loss during training, then selects document knowledge based on this distribution at test time. [16] introduces latent variables for posterior knowledge inference from the previous dialogue sentence and prior knowledge estimation for the current sentence. CKL [22] builds pseudo labels through the correlation between sentence-level dialogue history / sentence-level documents and the gold response, then achieves implicit selection of key sentences in documents and history through this weak supervision.

  3. Models based on multi-granularity history information. [17] first retrieves multiple document sentences based on recent dialogue sentences, then lets the discourse-level history interact with the documents to select key document knowledge. In [10], cross-attention is employed between history and document at the word level to select document word information corresponding to each history sentence, followed by incremental integration to utilize discourse-level dialogue history. HHKS [19] locates important information in the history and documents by merging both word-level and utterance-level attention over the history. In [18], the history sentences are processed in chronological order from distant to recent, interacting with the document through a word-level non-linear transformation to update the selection of document knowledge; the sentence-level history representations are then fed into a GRU to obtain a discourse-level representation of the history, which is used to predict the response.

These studies focus on matching multi-granularity history information with external documents in the process of knowledge selection. They capture relevant document information from multiple levels and perspectives of the history, providing accurate knowledge for response generation and enhancing the utilization of documents. However, these works rarely analyze and utilize the multi-level response. Additional related work on the application of hierarchical history in dialogue generation tasks is given in Appendix A.

2.2 Semantic Unit Granularity

Compared to word-level responses, the semantic unit level aggregates multiple word meanings to achieve a more accurate expression; compared to sentence-level responses, semantic units often focus on key details within the sentence. Some studies have noted the importance of the semantic unit level and conducted diverse research on it. These methods usually divide sentences into several fixed-length or unfixed-length units to obtain semantic units and improve task performance through their effective use.

[23,24,25] improve the word-level attention calculation by performing local attention within fixed-length or unfixed-length units to enhance the performance of self-attention. DIALKI [26] extracts knowledge by dividing a given long document into paragraphs and contextualizing them using the dialogue context. In [13], when selecting document knowledge, a convolutional network is used to divide the matching results of the history-document interaction into units, obtaining the transition probability between the dialogue history and each knowledge semantic unit. [27] applies one-dimensional convolutional pooling to the context to obtain n-gram information, serving as local semantic unit-level information. [28] reads the word embedding sequence through a sliding convolutional unit and uses max-pooling to obtain the sentence-level history representation, interacting the word- and sentence-level representations with candidate responses to select high-probability responses. [29] proposes a multi-level transformer-based model that divides words/sentences into local units of fixed size and uses an RNN to encode the units at both levels, enhancing the ability to capture relevant context. In addition to the above hierarchy division methods, knowledge labels can also help predict knowledge [2]. For example, Re3G [5] designs training objectives over passage spans and responses via labels, achieving fine-grained knowledge extraction. PostKS [30] improves knowledge prediction by constraining the prior and posterior distributions of knowledge with a Kullback-Leibler divergence loss.

These methods improve understanding of information and enhance model performance by obtaining local semantic unit-level information through unit division methods. However, these methods often focus more on dialogue history or external knowledge and rarely utilize the semantic unit-level information of the response in the dialogue generation task.

3 Model

A dialogue history containing M words is represented as \(his=\{h_{1},h_{2},\ldots ,h_{m},\ldots ,h_{M}\}\), and the document corresponding to the history is denoted as \(doc=\{d_{1},d_{2},\ldots ,d_{n},\ldots ,d_{N}\}\), where \(d_{n}\) indicates the n th word of the document. Note that, for the history and document to interact efficiently during the encoding phase, we use the same setting as DoHA [9], where \(doc = [his, doc]\). We use \(res_{<t}\) to represent the first \(t-1\) words of the generated response, where \(res_{<t} =\{y_{1},y_{2},\ldots ,y_{t-1}\}\) and \(y_{t}\) is the word generated at decoding step t.

Fig. 2 Overview of our model. The left part shows the main structure and the right part shows the working mechanism of static and dynamic semantic unit division

The structure of the whole model is shown in Fig. 2. It consists of four modules: the encoder module (Enc), the history information selection module (HIS), the multi-granularity document information selection module (MDIS), and the response decoding module (Dec). The Enc module encodes the history and document at the word level using self-attention. The HIS module selects information from the history with the first \(t-1\) response words as the query. The MDIS module realizes the interaction between multiple levels of the response and the document to acquire key knowledge. Finally, the Dec module decodes the resulting vector to predict the t th word of the response. The MDIS module is the main contribution of this paper. In this module, we first put forward two division methods to obtain the semantic unit-level representation of the response. Secondly, knowledge in the document is selected by the word-level and semantic unit-level responses, yielding two kinds of knowledge-related response representations. Finally, the two representations are fused by a gate mechanism to get the final response representation.

The whole model adopts the common encoder-decoder framework, with the pre-trained BART model as the backbone. Both the encoder and the decoder are stacks of L network layers, and the hidden dimension is d.

3.1 Encoder

We use the BART [31] encoder, which is composed of L network layers consisting of self-attention and feed-forward networks. By inputting the history his and the document doc into the encoder, we obtain the word-level history representation \(H\in R^{M\times d}\) and document representation \(D\in R^{N\times d}\):

$$\begin{aligned} H = Encoder\left( his\right) , \ R^{M\times d} \end{aligned}$$
(1)
$$\begin{aligned} D = Encoder\left( doc\right) , \ R^{N\times d} \end{aligned}$$
(2)
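
A minimal sketch of Eqs. 1 and 2 with the Hugging Face transformers BART encoder is shown below; the checkpoint name and the example strings are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Sketch of Eqs. (1)-(2): encode history and document with a pretrained
# BART encoder. Checkpoint and inputs are illustrative placeholders.
import torch
from transformers import BartTokenizer, BartModel

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base")

his = "i was surprised the mayor declared the beach safe"
doc = "the mayor refuses to close the beaches despite the attacks"

# Following the DoHA setting, the document input is the concatenation [his, doc].
his_ids = tokenizer(his, return_tensors="pt")
doc_ids = tokenizer(his + " " + doc, return_tensors="pt")

with torch.no_grad():
    H = model.encoder(**his_ids).last_hidden_state  # word-level history, (1, M, d)
    D = model.encoder(**doc_ids).last_hidden_state  # word-level document, (1, N, d)
```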

3.2 HIS

The HIS module implements dialogue history information selection. The previously generated response words at time step t are used as the query, and useful information from the history is obtained through a cross-attention (CA) mechanism.

First, the first \(t-1\) response words are encoded by masked self-attention (SA), yielding the word-level representation \({res}_{<t}^w\):

$$\begin{aligned} {res}_{<t}^w = SA(res_{<t}, res_{<t}, res_{<t} ), \ R^{(t-1)\times d} \end{aligned}$$
(3)

Then, \({res}_{<t}^w\) is used as the query to select information from the history, yielding the history-aware response representation \(Y_{<t}^w\):

$$\begin{aligned} Y_{<t}^w = CA({res}_{<t}^w, H, H), \ R^{(t-1)\times d} \end{aligned}$$
(4)
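
A minimal sketch of Eqs. 3 and 4 using PyTorch's built-in multi-head attention is given below; the dimensions and the use of nn.MultiheadAttention are illustrative assumptions, since the full model builds on the BART backbone rather than standalone attention layers.

```python
# Sketch of Eqs. (3)-(4): masked SA over the generated response prefix,
# then cross attention (CA) from the response to the history H.
import torch
import torch.nn as nn

d, heads, t_minus_1, M = 768, 8, 5, 20
sa = nn.MultiheadAttention(d, heads, batch_first=True)
ca = nn.MultiheadAttention(d, heads, batch_first=True)

res_prev = torch.randn(1, t_minus_1, d)   # embeddings of the first t-1 response words
H = torch.randn(1, M, d)                  # word-level history representation

# Causal mask so each response word only attends to earlier words.
causal = torch.triu(torch.ones(t_minus_1, t_minus_1, dtype=torch.bool), diagonal=1)
res_w, _ = sa(res_prev, res_prev, res_prev, attn_mask=causal)  # Eq. (3)
Y_w, _ = ca(res_w, H, H)                                       # Eq. (4): history-aware response
```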

3.3 MDIS

In the MDIS module, we use multi-granularity responses for document knowledge selection. Based on the word-level history-aware response \(Y_{<t}^w\) obtained from the HIS module, we use static and dynamic division methods to obtain a semantic unit-level response in the SU (semantic unit) Division step. Secondly, we use the word-level and semantic unit-level responses to conduct Document Information Selection (DIS), obtaining response representations related to document information at the two levels. Finally, a gate mechanism fuses the response representations of the two levels into a multi-granularity response representation containing document information. The details of each step are described below.

3.3.1 SU Division

We use static and dynamic methods to divide the response into fixed-length and unfixed-length semantic units respectively.

3.3.1.1 Static Semantic Unit (SSU) division

First, the word-level response of the first \(t-1\) words \(Y_{<t}^w\) is divided into \(t-1\) prefix sequences: \(Q_{1}= \{y_{1}\}\), \(Q_{2} = \{y_{1},y_{2}\}\), \(Q_{3} = \{y_{1},y_{2},y_{3}\}\), ..., \(Q_{t-1} = \{y_{1},y_{2},y_{3},\ldots , y_{t-1}\}\).

Next, for each sequence (taking \(Q_{t-1}\) as an example), fixed-length semantic unit division is conducted with sliding step s and unit size k, so the sequence \(Q_{t-1}\) consists of \(Z=((t-1-k)/s +1)\) semantic units of k words each. The multi-semantic unit representation \((Q_{t-1}^{k})^{SSU} \in R ^{Z \times d}\) of the sequence \(Q_{t-1}\) is obtained by summing the word representations inside each unit, and its global static semantic unit representation \((Y_{t-1}^{k})^{SSU} \in R ^{d}\) is obtained by summing the Z unit representations. The semantic unit representations corresponding to the first \(t-1\) response sequences are then {\((Y_{1}^{k})^{SSU}\), \((Y_{2}^{k})^{SSU}\),..., \((Y_{t-1}^{k}) ^{SSU}\)}.

We choose several units of different sizes \(K=\{k_{1},k_{2},\ldots , k_{L}\}\) to simulate semantic units of different lengths. We then sum the static semantic unit representations of these different sizes to integrate meanings at different semantic scopes. For the sequence \(Q_{t-1}\), its multi-size static semantic unit representation is:

$$\begin{aligned} Y_{t-1} ^{SSU}= \sum _{k \in K} (Y_{t-1}^{k})^{SSU}, \ R^{d} \end{aligned}$$
(5)

We use the static fixed-size unit representation of the first \(t-1\) words as the global semantic unit representation of the \((t-1)\)th word. The static semantic unit representation of the first \(t-1\) words in the response can then be represented as:

$$\begin{aligned} Y_{<t}^{SSU}= [Y_{1}^{SSU},Y_{2}^{SSU},\ldots ,Y_{t-1}^{SSU}], \ R^{(t-1) \times d} \end{aligned}$$
(6)
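
A minimal sketch of the SSU computation (Eqs. 5 and 6) is given below; the stride value and the handling of prefixes shorter than a unit are illustrative assumptions.

```python
# Sketch of static semantic unit (SSU) division (Eqs. 5-6).
import torch

def ssu_representation(Y_w: torch.Tensor, sizes=(2, 3, 4), stride=1) -> torch.Tensor:
    """Y_w: (t-1, d) word-level response; returns the (t-1, d) SSU representation."""
    t_minus_1, d = Y_w.shape
    out = []
    for i in range(1, t_minus_1 + 1):          # prefix Q_i = {y_1, ..., y_i}
        prefix = Y_w[:i]
        rep = torch.zeros(d)
        for k in sizes:                        # multi-size units, summed (Eq. 5)
            if i < k:                          # assumption: prefix shorter than the unit uses the whole prefix
                rep = rep + prefix.sum(dim=0)
                continue
            # fixed-size units of length k with the given stride, each summed over its words,
            # then all Z units summed into one global representation
            units = prefix.unfold(0, k, stride).sum(dim=-1)   # (Z, d)
            rep = rep + units.sum(dim=0)
        out.append(rep)
    return torch.stack(out)                    # Eq. (6): (t-1, d)
```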
3.3.1.2 Dynamic Semantic Unit (DSU) division

The core idea of dynamic semantic unit segmentation is to measure the overall correlation of several n-gram semantic units starting from each word and to select the n-gram unit with the highest correlation as that word's semantic unit.

First, obtain the relevance scores between the response words. For the original response embedding \(Y_{<t}^{emb}\), apply self-attention (SA) to get the attention values \(a_{ij}\) as the word-word semantic correlation scores:

$$\begin{aligned} A=(a_{ij})_{(t-1)\times (t-1)}, \ R^{(t-1)\times (t-1)} \end{aligned}$$
(7)

Then, calculate the K-gram semantic relevance score of each word. For the \((t-1)\)th word, construct a unit of length K with that word as the starting point, and calculate the average attention score between the words within the unit as the K-gram semantic relevance score \(S_{t-1}^{K} \in R^{1}\) for that word (for example, the 3-gram semantic relevance score of \(y_1\) is \(S_{1}^{3}=(a_{12}+a_{23}+a_{13}) / 3\)):

$$\begin{aligned} S_{t-1}^{K}= \frac{\sum _{i=t-1}^{t+K-2}\sum _{j=i+1}^{t+K-1}a_{ij}}{\sum _{g=1}^{K-1}g}, \ R^{1} \end{aligned}$$
(8)

The next step is semantic unit selection. For the \((t-1)\)th word, we choose multiple units of different sizes \(K=\{ k_{1}, k_{2},\ldots , k_{L} \}\) to simulate L semantic units with different sizes. The semantic relevance vector for the \((t-1)\)th word is therefore \(S_{t-1} = \{S^{k_{1}}_{t-1},S^{k_{2}}_{t-1},\ldots ,S^{k_{L}}_{t-1}\} \in R^{L}\), and the semantic relevance matrix for the response containing \(t-1\) words is \(S=\{ S_{1},S_{2},\ldots ,S_{t-1}\} \in R^{(t-1) \times L} \). We take the size corresponding to the maximum value in each row of S, obtaining a dynamic unfixed unit size for each word. For the first \(t-1\) words, the dynamic unit sizes are:

$$\begin{aligned} K^{*} = \{ k^{*}_{1},k^{*}_{2},\ldots ,k^{*}_{t-1}\}, \ R^{t-1} \end{aligned}$$
(9)

Finally, we compute the representation of the dynamic semantic units for the response. Using \(K^{*}\), we sum the embeddings of the words within the corresponding unit for each word in the response, obtaining the unit representation \(Y_{t-1} ^{DSU} \in R^{d}\) for the \((t-1)\)th word:

$$\begin{aligned} Y_{t-1}^{DSU}= \sum _{g=t-1}^{t+k_{t-1}^*-2} Y_{g}, \ R^{d} \end{aligned}$$
(10)

Therefore, the representation of dynamic semantic units for the first \(t-1\) words of the response is denoted as:

$$\begin{aligned} Y_{<t}^{DSU}= \{Y_{1}^{DSU},Y_{2}^{DSU},\ldots ,Y_{t-1}^{DSU}\}, \ R^{(t-1) \times d} \end{aligned}$$
(11)
Fig. 3 Dynamic semantic unit division process. The three parts divided by arrows are obtaining the response word semantic relevance matrix A, calculating the response N-gram relevance matrix S, and acquiring the suitable unfixed-size semantic unit corresponding to each response word

Here we explain the dynamic semantic unit segmentation process with an example. Figure 3 illustrates the process for the response shown in Fig. 1. For ease of observation, we demonstrate the process using the five most recently generated words of the response, ‘refuse to close the beaches’. First, we calculate the masked self-attention of the embeddings for ‘refuse to close the beaches’ to obtain the lower triangular correlation matrix A of word-word relevance scores. Then, using Eq. 8, we obtain the K-gram semantic relevance matrix S for each word; in this example, K = 2, 3, 4. For instance, the 2/3/4-gram semantic units corresponding to ‘refuse’ are ‘refuse to’/‘refuse to close’/‘refuse to close the’, and [refuse, 2-gram] in S records the relevance score of ‘refuse to’. Finally, by taking the maximum value in each row of S, we get the final suitable unit size for each word. For example, by comparing the 2/3/4-gram scores for ‘refuse’, we determine that the highest-scoring semantic unit is ‘refuse to’. In this way, we obtain the dynamic semantic units for the entire response sequence.
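
A minimal sketch of the DSU computation (Eqs. 7-11), following this example, is given below; the single-head relevance matrix and the truncation of units near the end of the prefix are illustrative assumptions.

```python
# Sketch of dynamic semantic unit (DSU) division (Eqs. 7-11).
import torch

def dsu_representation(Y_emb: torch.Tensor, A: torch.Tensor, sizes=(2, 3, 4)) -> torch.Tensor:
    """Y_emb: (t-1, d) response embeddings; A: (t-1, t-1) word-word relevance scores (Eq. 7)."""
    t_minus_1, d = Y_emb.shape
    reps = []
    for i in range(t_minus_1):                 # 0-based index of each response word
        best_score, best_k = float("-inf"), 1
        for k in sizes:
            end = min(i + k, t_minus_1)        # assumption: truncate units that run past the prefix
            idx = list(range(i, end))
            if len(idx) < 2:
                continue
            # Eq. (8): average pairwise relevance inside the K-gram unit
            pairs = [(p, q) for p in idx for q in idx if q > p]
            score = sum(A[p, q].item() for p, q in pairs) / len(pairs)
            if score > best_score:
                best_score, best_k = score, len(idx)   # Eq. (9): keep the best unit size
        # Eq. (10): sum the embeddings inside the selected unit
        reps.append(Y_emb[i:i + best_k].sum(dim=0))
    return torch.stack(reps)                   # Eq. (11): (t-1, d)
```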

3.3.2 DIS

DIS (document information selection) uses the two response granularities to interact with the document, obtaining word-level and semantic unit-level document-related response representations.

Word-level document information selection: the word-level representation of the response is used as a query to select information from the document D, resulting in a new word-level response representation.

$$\begin{aligned} Y_{<t}^W= CA(Y_{<t}^W, D, D), \ R^{(t-1) \times d} \end{aligned}$$
(12)

Semantic unit-level document information selection: the semantic unit representation of the response is used as a query to select information from the document D, resulting in a new semantic unit-level response representation \(Y_{<t}^{SU} \in R^{(t-1) \times d}\).

$$\begin{aligned} Y_{<t}^{SU}= CA(Y_{<t}^{SU}, D, D), \ R^{(t-1) \times d} \end{aligned}$$
(13)

Note that \(Y_{<t}^{SU}\) in Eq. 13 can be either the static or the dynamic semantic unit representation of the response, i.e., \(Y_{<t}^{SSU}\) or \(Y_{<t}^{DSU}\).

3.3.3 Fusion

We apply a gate mechanism to fuse the document-related response representations at the two levels. First, we feed the concatenation of the word-level and semantic unit-level representations into a multilayer perceptron (mlp). Then, we pass the output through a sigmoid function to obtain the weights, denoted as w. Using these weights, a weighted sum of the two representations is performed to obtain the final response representation \(Y_{<t}\):

$$\begin{aligned} w = sigmoid(mlp([Y_{<t}^W,Y_{<t}^{SU}])) \end{aligned}$$
(14)
$$\begin{aligned} Y_{<t} = w*Y_{<t}^W + (1-w)*Y_{<t}^{SU}, \ R^{(t-1) \times d} \end{aligned}$$
(15)
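
A minimal sketch of Eqs. 12-15 is given below; the layer sizes and the use of PyTorch's built-in attention and linear layers are illustrative assumptions.

```python
# Sketch of Eqs. (12)-(15): both response granularities query the document
# via cross attention, and a gate fuses the two results.
import torch
import torch.nn as nn

d, heads, t_minus_1, N = 768, 8, 5, 60
ca_word = nn.MultiheadAttention(d, heads, batch_first=True)
ca_unit = nn.MultiheadAttention(d, heads, batch_first=True)
gate_mlp = nn.Linear(2 * d, 1)                # illustrative one-layer gate

Y_w = torch.randn(1, t_minus_1, d)            # word-level history-aware response (from HIS)
Y_su = torch.randn(1, t_minus_1, d)           # SSU or DSU representation
D = torch.randn(1, N, d)                      # document representation

Y_w_doc, _ = ca_word(Y_w, D, D)               # Eq. (12)
Y_su_doc, _ = ca_unit(Y_su, D, D)             # Eq. (13)

w = torch.sigmoid(gate_mlp(torch.cat([Y_w_doc, Y_su_doc], dim=-1)))  # Eq. (14)
Y = w * Y_w_doc + (1 - w) * Y_su_doc                                  # Eq. (15)
```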

3.4 Decoder

The decoder takes the response representation \(Y_{<t}\) obtained from MDIS and decodes the next word. The representation is passed through an mlp network followed by a softmax function to generate the probability distribution of the next word.

$$\begin{aligned} P(y_{t})=softmax(mlp(Y_{<t})) \end{aligned}$$
(16)

4 Experiment

4.1 Dataset

4.1.1 CMU-DoG

The CMU-DoG dataset [32] comprises over 4,000 human-human dialogues, each averaging 21.43 turns. These dialogues were collected through Amazon Mechanical Turk. The dataset leverages Wikipedia descriptions of diverse movie genres as documents, serving as a reliable foundation for generating dialogues. Each dialogue session focuses on a specific document section, which can either contain the movie's basic information or a description of the plot. This setup ensures that the dialogues are contextually enriched and built upon the details provided in the corresponding document sections.

4.1.2 Wizard of Wikipedia

Wizard of Wikipedia [33] is a large dataset collected by Facebook AI Research for training and evaluating document-driven conversation models. One participant plays the ‘Wizard’ role while the other plays the ‘Apprentice’. The Wizard can use an information retrieval system to answer questions, make statements, and engage in conversation with the Apprentice, whereas the Apprentice has no direct access to the document knowledge and must obtain it through the Wizard. The test set is divided into two subsets, WoW-seen and WoW-unseen: the dialogues of the former contain only topics that appeared in the training set, while those of the latter contain new topics.

4.2 Automated Evaluation

For automatic evaluation, we use BLEU1-4, Rouge-L [34], and Meteor [35] to assess the consistency between the predicted responses and the gold responses, and we use Doc-BLEU and NW to measure the ability to select document knowledge. BLEU1-4 calculates the overlap of continuous n-grams between two texts; as the n-gram size increases, the word-order requirement becomes stricter. For example, 1-gram does not consider word order, while 4-gram requires a perfect match of four consecutive words. Rouge-L measures the longest common subsequence between the generated response and the reference in terms of lexical selection and word order. Meteor measures the similarity between the generated response and the reference. Doc-BLEU evaluates the quality of dialogue responses by the 1-gram match between the predicted response and the document. NW measures the effectiveness of document knowledge selection via the intersection of the generated response and the document. Suppose the token sets of the history, predicted response, and document are H, T, and N respectively, and the stop-word set is S; NW is then computed as \(|((T \cap N) \setminus H) \setminus S|\).
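
A minimal sketch of the NW computation is given below; whitespace tokenization and the stop-word list are illustrative assumptions.

```python
# Sketch of the NW metric, |((T ∩ N) \ H) \ S|: document words that appear in
# the prediction but not in the history and are not stop words.
def nw_score(prediction: str, document: str, history: str, stop_words: set) -> int:
    T = set(prediction.lower().split())   # predicted response tokens
    N = set(document.lower().split())     # document tokens
    H = set(history.lower().split())      # history tokens
    return len(((T & N) - H) - stop_words)

# Hypothetical example strings:
stop = {"the", "to", "a", "of"}
print(nw_score("he refuses to close the beaches",
               "the mayor refuses to close the beaches",
               "the mayor declared the beach safe", stop))  # counts {refuses, close, beaches} -> 3
```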

4.3 Human Evaluation

We evaluate the models from three aspects: (1) fluency of generated responses (Fluency); (2) relevance between generated responses and reference responses (Ref-Rel); (3) relevance of generated responses to the document (Doc-Rel). We compare DoHA, as a benchmark model, with ours. For the three models, we randomly sample 150 response predictions from the test set for evaluation, and each result is evaluated by three annotators. Each criterion is rated on a five-point scale, where 1, 3, and 5 indicate unacceptable, moderate, and perfect performance, respectively. The average score serves as the final result.

4.4 Baseline

DRD [36]. To improve model performance with limited training data, a disentangled response decoder is designed to separate the parts that are highly related to knowledge from the other parameters of the model.

BART [31]. The dialogues and documents are concatenated and fed into the encoder to get a joint encoding representation, which is then fed into the decoder to predict the responses.

CoDR [9]. A BART-based model that takes contextualized documents and dialogue history as the input of the decoder to fuse the different information better.

DoHA [9]. Uses cross attention_context and cross attention_doc separately to let the response interact with the dialogue history and the document respectively, thereby obtaining responses that are coherent with the context and rich in document information.

DIALKI [26]. A knowledge identification model that provides dialogue-contextualized passage encodings and better locates conversation-related knowledge by utilizing the document structure.

KAT [21]. A three-stage dialogue generation architecture in which a knowledge-aware transformer achieves good performance through a dynamic knowledge selection mechanism.

CKL [22]. A contextual knowledge learning model that involves latent vectors to capture differential weighting of context utterances and knowledge sentences during the training process.

HHKS [19]. A hierarchical history-based knowledge selection model for dialogue generation that locates key information in the history and document by merging both word-level and utterance-level attention over the history.

4.5 Implementation Details

The samples in the dataset are input into the model as tuples \((d_i, h_i, y_i)\), where \(h_i\) is the dialogue history, \(d_i\) is the corresponding document, and \(y_i\) is the reference response. The response \(y_i\) is based on \(d_i\) and serves as the next turn of dialogue for \(h_i\). For CMU-DoG, the document is a paragraph from a Wikipedia movie description. For WoW, the document combines the first paragraphs of seven Wikipedia articles. For dynamic semantic unit division, to avoid interference with word-level SA and better meet the needs of semantic unit division, we add a new SA block to learn word-level relevance. To test the scalability of semantic units across different methods, in addition to conducting experiments with BART-base as the backbone, we also extend the static and dynamic semantic units to a stronger model, HHKS. In the experiments using the BART backbone and reproducing DoHA, we follow the same data setting of one history utterance as DoHA; in the experiments adding semantic units to HHKS, our parameter settings are identical to those of HHKS. In addition, considering GPU limitations and the time consumption of the model, we conduct most experiments and analyses with BART as the backbone, including the ablations, human evaluation, and case study.

5 Experiment Results and Analysis

5.1 Main Results

Table 1 Automated metrics of dialogue generation quality on the three datasets. The gray background represents the model results after adding semantic units. Ours-SSU/DSU means adding semantic units based on BART, while HHKS-SSU/DSU means adding semantic units based on HHKS. The bold indicates the best results
Table 2 Automated metrics of document knowledge selection effect on the three datasets; the bold indicates the best result

Dialogue generation quality. We compare the dialogue generation performance of the models on the CMU-DoG and WoW datasets in Table 1. SSU and DSU represent our method with the static semantic unit and the dynamic semantic unit respectively. Among all baselines, the model structure closest to Ours-SSU/DSU is DoHA, which also uses BART as the backbone; our methods add semantic units on top of DoHA. From Table 1, Ours-SSU/DSU surpasses most baselines on most metrics. The comparison with DoHA demonstrates that our methods are effective in most cases, showing that adding semantic units improves the response generation ability of the BART-based backbone. In addition, both of our models built on HHKS perform better than the baseline models across almost all metrics. On the response consistency indicators BLEU1-4, Meteor, and Rouge-L, both HHKS-SSU and HHKS-DSU generate responses that are more consistent with the gold responses. The dialogue quality thus improves after adding semantic units, which indicates that our proposed method scales to different models.

Knowledge selection effect. From the document relevance indicators NW and Doc-BLEU in Table 2, our predictions are more closely aligned with the document information than the baselines, effectively incorporating document knowledge into the dialogue. The fixed semantic unit method SSU achieves a larger performance improvement on CMU-DoG than the unfixed DSU, while on WoW the opposite holds. This is because the document knowledge in the WoW dataset is richer and denser, with more document words reflected in the dialogue, which suits the dynamic selection of the unfixed method. In contrast, in CMU-DoG, document knowledge is relatively sparse, and the fixed process can better integrate information from various parts in a global way.

Table 3 Human evaluation results of the baseline and our models on the datasets

Human evaluation result. The human evaluation results in Table 3 show that our model performs better than the baseline models in terms of fluency, reference relevance, and document relevance.

5.2 Ablation Results

To analyze the influence of various parts of the model on the results, we separately examine the details of the static and dynamic semantic unit methods. For the static semantic unit method, we conduct experiments on the size and number of semantic units (Sect. 5.2.1). For the dynamic semantic unit method, we analyze the effect of varying the sizes of the semantic units (Sect. 5.2.2). To facilitate comparison across models, we present the ablation results as bar charts; the specific metric values are given in Appendix B.

5.2.1 Static Unit Method

To compare the effect of unit number and size, we use n to denote the number of semantic units and k to denote the minimum unit size. For example, n3k2 means the number of units is 3 and the minimum unit size is 2, i.e., we experiment with units of size 2/3/4.

Fig. 4 The left chart depicts the variation in BLEU scores when changing the size of the semantic unit while keeping the number of semantic units constant at n = 1. The right chart fixes the minimum unit size at k = 2 and studies the impact of varying the number of units (n = 1/2/3) on the BLEU scores

5.2.1.1 Unit Size

To observe the effect of different semantic unit sizes, we fix the number of units (n=1) and test semantic units of different sizes (k=2/3/4). From the left chart in Fig. 4, we can see that our model (n1k2, n1k3, n1k4) significantly outperforms the comparison model DoHA on all BLEU scores. This indicates that compared to the word-level DoHA, our model has a better ability to deal with semantic unit-level responses, thus improving the final results. Secondly, we can see that the n1k3 model scores higher than the other two models on BLEU-2, BLEU-3, and BLEU-4 metrics. This suggests that the n1k3 model performs better when dealing with longer word groups. It indicates that the n1k3 model excels in capturing the semantic relationships between words and the overall structure of sentences compared to the other models.

5.2.1.2 Unit Num

To observe the effect of the number of semantic units on the experimental results, we fix the minimum semantic unit size at 2 and test configurations containing 1/2/3 units respectively. From the right chart in Fig. 4, we can see that our model outperforms the DoHA model on all BLEU scores. For different numbers of units (starting from k=2), the performance gap between n1k2 and n2k2 is insignificant, but the n3k2 model performs best, scoring notably higher than the other two models on BLEU-1, BLEU-2, BLEU-3, and BLEU-4. This suggests that when more unit sizes (2/3/4) are integrated, the model can capture the meanings of longer word groups.

5.2.2 Dynamic Unit Method

Fig. 5 Variation in BLEU scores when changing the maximum unit size of the dynamic unit method on the WoW dataset

We explore the impact of unfixed unit combinations by controlling the maximum length of the variable-length units. Note that 2-gram indicates that the variable-length unit can be either 1-gram or 2-gram: we first calculate the average value of all 2-gram relevance scores, select 2-gram for a word if its score is greater than the average, and otherwise choose 1-gram. The 3-gram and 4-gram settings are those proposed in the paper, i.e., for 3-gram the variable-length unit is chosen from 2/3-gram, and for 4-gram from 2/3/4-gram. The results in Fig. 5 show that on the WoW-seen dataset, our model performs significantly better on BLEU-2, 3, and 4 when the maximum unit size is 4-gram. This suggests that dynamically choosing an appropriate semantic unit size can capture the patterns of dialogue, showing better results on topics seen during training. On the WoW-unseen dataset, however, our model with a maximum unit size of 2-gram performs better. In this case, smaller context units (such as 2-gram) may be more useful because the model needs to understand and generate responses more flexibly, rather than relying on specific patterns learned during training.

5.3 Efficiency Analysis

The main contribution of this paper is adding a semantic unit level to the response, so it is necessary to discuss the efficiency of our method. We first analyze the complexity of the two kinds of semantic units theoretically and then use HHKS combined with DSU to further illustrate the impact of semantic units on training.

For the SSU method, the main operations are SSU division (SSUD) and cross attention between SSUs and the document (CA). The complexity of SSUD is related to the response length \(t-1\), the unit number n, and the unit size k. Among them, n and k are fixed constants in the experiments: n lies in [1,3] and k in [2,4]. Therefore, the complexity of SSUD is governed by the response length. Specifically, at time step \(t-1\), the length of the response is \(t-1\); we first split the response into \(t-1\) sequences and divide SSUs for each sequence. As the time step t grows, the complexity of SSUD grows with the square of t, i.e., \(\mathcal {O}(t^2)\). CA is performed for each unit division, and its complexity is also \(\mathcal {O}(t^2)\). When the unit number is 3, the above operations are performed three times, taking up to \(\mathcal {O}(3t^2)\). For the overall SSU method, the time complexity is therefore \(\mathcal {O}(t^2)\).

For the DSU method, the main operations are SA, n-gram relevance calculation (NGRC), n-gram selection (NGS), and cross attention between DSUs and the document (CA). SA and NGRC are performed over the decoding length \(t-1\), and their complexity is \(\mathcal {O}(t)\). In our experiments, the n-gram sets can be [1,2]/[2,3]/[2,3,4], so the complexity of NGRC is at most \(\mathcal {O}(3t)\). In NGS, it is still necessary to first divide the response of length \(t-1\) into \(t-1\) sequences and then perform NGS and CA on each sequence, so the complexities of NGS and CA are both \(\mathcal {O}(t^2)\). The overall time complexity of the DSU method is therefore \(\mathcal {O}(t^2)\).

We take the more complex HHKS-DSU method as an example to illustrate the overhead introduced during training. On an NVIDIA TITAN RTX 3090 GPU with the WoW training set, adding DSU to HHKS increases each epoch by approximately 1 h in time and 934 MB in memory. This 1 h covers the main operations SA, NGRC, NGS, and CA with the document. During model training, the maximum training time is 50 epochs (training is stopped early once the model has converged), which corresponds to an increase of at most approximately 2 days. The introduction of semantic units into the response inevitably increases training time, which is a limitation of this paper.

5.4 Case Study

We list the responses generated by our models and the strongest baseline given the same history in Table 4. The main idea of the dialogue history is that the speaker was surprised at the mayor's decision to declare the beach safe after capturing a tiger shark: although the tiger shark is large and people might mistake it for the ‘great white shark’, it is absurd to declare the beach safe because of this.

The reference response evaluates the dialogue history and further adds that the mayor refused to close the beach in time, emphasizing his decision-making mistakes. DoHA's response is a simple ‘yes’ that changes the topic without mentioning more of the dialogue history; although it is fluent and pushes the dialogue forward, it has no relevance to the document. The static unit response mentions the next attack incident from the document, accurately capturing the document's knowledge while generating a response closely linked to the history and furthering the conversation. The dynamic unit response evaluates the mayor's actions in the dialogue history and explains that he acted this way because the size of the discovered shark was small; it effectively uses relevant information from both the history and the document.

Table 4 A dialogue example from the CMU-DoG dataset

To observe the underlying reasons for the different content of each model, we compare the top-10 document words that each model focused on throughout the response generation process in Table 5. The knowledge chosen by DoHA is not applied in the response, whereas the knowledge selected by our models is applied in the dialogue. In this example, the words ‘son’, ‘shock’, ‘go’, and ‘into’ from the static model and the keyword ‘shark’ from the dynamic model all appear in the final response. In terms of knowledge selection, our static and dynamic models are more effective: the selected knowledge is not only seamlessly connected with the history but also accurately captured from the document and integrated into the dialogue.

Table 5 Top 10 selected knowledge words for example in Table 4
Fig. 6 Visualization of semantic unit level attention scores over the last layer for the static and dynamic methods

In addition, to further examine the effect of our static and dynamic semantic units, we visualize the attention scores of semantic units over document words at a single decoding step in Fig. 6.

In our static unit method, we compare the cross-attention weights over knowledge at the unit level and at the word level. At this point, the model has already generated half of the sentence, ‘I know, scared me so bad. And then another attack happens! The police chief's son goes into’, and the next word to be generated is ‘shock’. The vertical axis shows the five knowledge words with the highest average weight, and the horizontal axis shows the size-two semantic units of the current response. Although both the unit and word levels select the word ‘shock’, looking horizontally at the high-weight response content, phrases in Fig. 6 such as ‘scared me’, ‘, scared’, and ‘attack happens’ are semantically highly correlated with ‘shock’. This indicates that the semantic information of the static units in the first half of the response plays a role in knowledge selection. In this example, the unit-level response information captures accurate knowledge better, complementing the word-level cross-attention and enhancing knowledge selection.

In our dynamic unit method, we conduct a similar analysis. At the current moment, the model has already generated ‘he had to, considering the size of the shark posed a small tiger’, and the next word to be generated is ‘shark’. The vertical axis shows the five knowledge words with the highest average weight, and the horizontal axis shows the current dynamic semantic unit division. We find that several of the preceding units, such as ‘the shark posed’, ‘shark posed’, ‘a small tiger’, and ‘small tiger’, are highly correlated with the knowledge word ‘shark’. The dynamic units in the first half of the response thus play a useful role in knowledge selection.

The above case analysis proves that our method effectively captures the semantic intricacies across different kinds of units, offering a novel viewpoint for interacting with knowledge. This significantly enhances the knowledge selection ability, leading to more fluent and document-aware responses.

6 Conclusion

In this paper, we introduce a document-grounded dialogue response generation model based on multi-granularity responses to provide more accurate document information selection. It interacts with the document using representations of the response at the word level and the semantic unit level to pinpoint key information, thereby producing more appropriate responses. Experimental results on two public datasets consistently indicate that the performance of this model significantly surpasses the baseline models.