1 Introduction

In recent years, dialogue systems have been widely deployed in customer service, online health consultation, chatbots, and other settings. Dialogue classification, which aims to assign predefined labels to an entire dialogue, is a fundamental task underlying many applications, including dialogue theme recognition, customer satisfaction analysis, and service quality assessment [1].

Most existing research on classification in dialogue systems focuses on the intent of the user in each turn of a dialogue [2, 3]. These methods, which take a sentence-level user utterance as input and output a predicted intent, are not well suited to classifying entire dialogues at the document level, because sentences in a dialogue must be understood in the context of all the messages around them. This dependence on extended context requires the classifier to take a large block of utterances as input and classify it as a whole [4].

Fig. 1

An example dialogue from telecom customer service, which should be labeled "consulting" rather than "business cancelling"

An intuitive solution to the above problem is to treat the whole dialogue as a document and apply document classification methods. These methods either concatenate the sentences into one long sequence [5, 6] or combine them hierarchically [7]. The main challenge is that a dialogue may contain multiple semantic topics, some of which are irrelevant to the business concern of the application. Such irrelevant topics act as noise and can be meaningless or even misleading for classification models. In the example in Fig. 1, the customer is consulting about whether the 30-yuan data package has been cancelled; the ground-truth category is business consultation. However, the cancellation topic mentioned in the dialogue might mislead a model into predicting the business cancellation category. Existing models can therefore hardly identify such noise in dialogues and determine the accurate categories.

We propose DialGNN, a generic framework based on heterogeneous graph neural networks for document-level dialogue classification. First, a heterogeneous graph is constructed for each dialogue to represent the latent relationships among the sentences and the words within it; the sentences and words of each dialogue are treated as nodes of different types in the graph. We then combine graph neural networks with pre-trained language models to learn latent representations of the nodes and edges in the dialogue graph. During message passing over the graph, the representations of word nodes and sentence nodes are updated together, which helps the model learn more implicit relationships among words and sentences. To validate the effectiveness of our approach, we conduct experiments on a public dataset and an e-commerce customer service dataset contributed by ourselves. The comparison results show that the proposed method outperforms state-of-the-art methods.

2 Related Work

2.1 Dialogue Classification

Dialogue classification involves assigning predefined labels to dialogues or their segments, such as utterances or turns, based on their functional or intentional significance within the conversation [8, 9].

Most existing studies on dialogue classification focus on sentence-level or utterance-level intent recognition of user statements [10]. These studies commonly employ hierarchical neural networks to model the sequential and structural information within words, characters, and utterances [11]. However, these approaches fail to explicitly account for the transition of speakers during the dialogue, which can impact the interpretation of dialogue acts. For instance, when speaker A poses a question, the subsequent utterance from speaker B is more likely to be an answer. Conversely, if the speaker remains the same, the following act is less likely to be an answer.

Tavabi [12] proposed integrating speaker turn changes into the modeling of dialogue acts. They learned conversation-invariant speaker turn embeddings to represent the speaker turns in a conversation; the learned speaker turn embeddings were then merged with the utterance embeddings for the downstream task of dialogue act classification. They showed that their model outperformed several baselines on three public benchmark datasets.

Another challenge for dialogue classification is that a dialogue may contain multiple semantic topics, some of which are irrelevant to the primary objective or business task of the application [13]. This complexity arises from the natural flow of conversation, where participants may introduce unrelated or tangential subjects alongside the main focus of the dialogue. Consequently, accurately classifying dialogues requires the ability to identify and filter out irrelevant topics, ensuring that the assigned labels reflect the pertinent information and align with the specific objectives of the application.

Kumar [14] addressed this problem by augmenting a small dataset to classify contextualized dialogue acts for exploratory visualization. They collected a new corpus of conversations, CHICAGO-CRIME-VIS, geared towards supporting data visualization exploration, and annotated it for a variety of features, including contextualized dialogue acts. They applied data augmentation techniques such as paraphrasing and back-translation to the training data to increase its diversity and robustness. In experiments with different classifiers, conditional random fields outperformed the other methods.

Guo [15] recognized the importance of removing redundant information from dialogue text and adopted a long-text segmentation method based on resampling, which also addresses the limitation on BERT's input length.

2.2 Heterogeneous Graph Network

For news classification, Kang [16] proposed a heterogeneous graph called the News Classification Graph to represent the relationships among multiple news articles, such as their relevance in time, place, and people. Moreover, they proposed a Joint Heterogeneous graph Network (JHN) to properly embed the News Classification Graph.

For aspect-based sentiment analysis, aiming to capture the sentiment relationships among aspect terms, Niu [17] constructed a heterogeneous graph that models inter-aspect relationships and aspect-context relationships simultaneously.

To combine multiple aspects of a review and make use of the links between a sentence and its words, Yang [18] proposed a dual-level attention-based heterogeneous graph convolutional network, including node-level and type-level attention.

For short text classification, Yang [19] proposed a word-concept heterogeneous graph convolution network to avoid treating introduced concepts as noise and to learn representations with interactive information. Kong [20] considered the lack of labeled data and proposed a heterogeneous graph attention network with an uncertainty-aware mechanism. Furthermore, the lack of context, the sparsity of short-text features, and the inability of word embeddings and external knowledge bases to supplement short-text information also pose challenges for short text classification. Aiming to improve classification accuracy and reduce computational cost, Zhang [21] built a graph convolutional network over texts, words, and POS tags that does not require pre-trained word embeddings as initial node features.

For utterance-level dialogue classification, many graph-based methods have been applied to capture the implicit feature information in the dialogue structure. Qin [6] designed a co-interactive graph interaction layer to capture contextual information and interaction information, both of which are important cues hidden in a dialogue. Shen [22] proposed a neural network based on a directed acyclic graph to better represent the dialogue information flow, combining the advantages of graph neural networks and recurrent neural networks.

For dialogue-level dialogue classification, relevant models are scarce, but graph-based models still attract attention from researchers in this field. Pang [23] regarded speakers, local discourses, and utterances as the main sources of information and modeled them with graphs to construct a multi-factor graph.

3 Methodology

DialGNN encompasses three essential modules that collectively contribute to its functionality: DialGraph Construction, Node Representation, and Heterogeneous Graph Network, as illustrated in Fig. 2.

Fig. 2

The architecture of the DialGNN framework. The forward pass for an example dialogue (with six sentences) contains four stages: initialization of node representations with BERT, construction of the DialGraph, updating of node representations with the heterogeneous graph network, and classification of the dialogue node to perform intent prediction

The DialGraph Construction module plays a crucial role by transforming a given dialogue into a heterogeneous graph. This graph captures the intricate relationships among words, sentences, and the overall dialogue structure. By representing the dialogue in this manner, DialGNN gains a comprehensive understanding of its underlying dynamics.

The Node Representation module initializes the node representations within the DialGraph. This is achieved with BERT-based embeddings, i.e., pre-trained contextual representations that capture rich semantic information. With a pre-trained BERT model initializing the node representations, DialGNN takes contextual semantics into consideration and can effectively combine the contextual information with the structure within dialogues. Through this initialization, the module equips the graph with meaningful and informative node representations.

The final module, the Heterogeneous Graph Network, encodes the heterogeneous graphs generated by DialGraph Construction. It employs graph attention networks to capture relevant dependencies and interactions among nodes within the graph, and it updates the node representations based on these learned relationships, enhancing the graph's ability to handle downstream tasks. The following subsections describe each module in detail.

3.1 DialGraph Construction

There have been several efforts to convert a dialogue into a topological graph [24]. Most of them regard each sentence as a node and construct a homogeneous graph in which edges between nodes are formed by contextual relations; that is, only sentences within a fixed window size are connected. Such methods may fail to capture relations between sentences that are far apart and may ignore the impact of individual words that contribute significantly to the predicted categories.

To this end, we construct a heterogeneous graph named DialGraph with word nodes, sentence nodes, and a dialogue node. The edges between sentence nodes and word nodes represent containment relations. More implicit relations among different sentences, such as co-occurrence, semantic distance, and term frequencies, can then be derived from the relations between sentences and words. Inspired by the use of the [CLS] token in BERT, we add a \(0^{th}\) sentence node as the dialogue node rather than pooling the sentence node embeddings. Formally, the heterogeneous graph DialGraph is defined as follows.

Given a dialogue \(C=\{s_1, s_2, \dots , s_n\}\), the DialGraph is denoted as \(G = \{V, E\}\), where \(V = V_w \cup V_s \cup V_c\) and \(E =\{e_{10}, e_{11},\dots ,e_{mn}\}\) are the node set and the edge set, respectively. Here, \(V_w = \{w_1, w_2, \dots , w_m\}\) denotes the m unique words, \(V_s\) corresponds to the n sentences, and \(V_c\) is the dialogue node. E is a real-valued edge-weight matrix, and \(e_{ij}\ (i \in [1, m], j \in [0, n])\) indicates that the \(j^{th}\) sentence contains the \(i^{th}\) word. Note that the dialogue node \(V_c\) connects to all word nodes.
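To make the construction concrete, the following minimal sketch assembles the node and edge sets for one dialogue. It uses plain Python structures; the function name build_dialgraph, the tokenize argument, and the use of raw counts as provisional edge weights (later replaced by TF-IDF, see Section 3.2) are illustrative assumptions rather than the released implementation.

```python
from collections import defaultdict

def build_dialgraph(sentences, tokenize):
    """Illustrative sketch: build the DialGraph node and edge sets for one dialogue.

    sentences: list of n sentence strings s_1..s_n.
    tokenize:  callable mapping a sentence to a list of word tokens.
    Returns the word vocabulary V_w, the number of sentences n, and the
    containment edges e_ij keyed by (word index i, sentence index j).
    """
    word2id = {}                    # V_w: the m unique words, ids 1..m
    edges = defaultdict(float)      # (i, j) -> provisional weight e_ij
    n = len(sentences)

    for j, sent in enumerate(sentences, start=1):    # sentence nodes j = 1..n
        for token in tokenize(sent):
            i = word2id.setdefault(token, len(word2id) + 1)
            edges[(i, j)] += 1.0    # raw count; replaced by TF-IDF in practice

    for i in word2id.values():      # dialogue node j = 0 connects to every word node
        edges[(i, 0)] += 1.0

    return word2id, n, dict(edges)

# Example: build_dialgraph(["I want to cancel the plan", "Which plan do you mean"], str.split)
```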

The node updates in DialGNN are determined by considering the features of neighboring nodes and the associated edge weights. In this regard, the word nodes update their representations based on the features and edge weights of the corresponding sentence nodes. Similarly, the sentence nodes update their representations by considering the features and edge weights of the word nodes connected to them. Furthermore, the dialogue nodes update their representations by incorporating the features and edge weights of the sentence nodes connected to them. This approach ensures that the node representations in DialGNN are iteratively refined, taking into account the contextual information from neighboring nodes and their respective edge weights.

3.2 Node Representation

We denote \(\mathrm{\textbf{X}_w} \in \mathbb {R}^{m \times d_w}\), \(\mathrm{\textbf{X}_s} \in \mathbb {R}^{n \times d_s}\), and \(\mathrm{\textbf{X}_c} \in \mathbb {R}^{1 \times d_c}\) as the input feature matrices of the word nodes, sentence nodes, and dialogue node, respectively. Here, \(d_w\), \(d_s\), and \(d_c\) refer to the dimensions of the word embeddings, sentence representation vectors, and dialogue representation vector, respectively.

Here, we use BERT-based [25] embeddings to obtain the initial representations of the words, the sentences, and the dialogue. Note that other embedding models and other pre-trained language models can also be utilized.

To incorporate the varying importance of relationships between nodes, we employ TF-IDF (Term Frequency-Inverse Document Frequency) values to initialize the weights of the edges. TF-IDF is a statistical measure commonly used in natural language processing to evaluate the significance of a term in a document relative to a collection of documents. By assigning TF-IDF values as the edge weights, we can capture the importance of the connections between nodes in the graph structure.
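A minimal sketch of this initialization is given below, assuming the Hugging Face transformers library and scikit-learn. The choice of bert-base-chinese, the use of the [CLS] vector as a sentence representation, the whitespace tokenizer, and the mean over sentence vectors as a stand-in for the dialogue-node feature are assumptions for illustration; the paper only requires that some BERT-based embedding and TF-IDF weighting be used.

```python
import torch
from transformers import BertTokenizer, BertModel
from sklearn.feature_extraction.text import TfidfVectorizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # model name is an assumption
bert = BertModel.from_pretrained("bert-base-chinese").eval()

@torch.no_grad()
def init_node_features(sentences):
    """X_s: one 768-d [CLS] vector per sentence; X_c: a simple stand-in for the dialogue node."""
    enc = tokenizer(sentences, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
    x_s = bert(**enc).last_hidden_state[:, 0]      # (n, 768)
    x_c = x_s.mean(dim=0, keepdim=True)            # (1, 768); illustrative only
    return x_s, x_c

def init_edge_weights(sentences):
    """TF-IDF value of word i in sentence j, used to initialize the edge weight e_ij."""
    vectorizer = TfidfVectorizer(tokenizer=str.split, lowercase=False)
    tfidf = vectorizer.fit_transform(sentences)    # sparse (n_sentences, n_words)
    return tfidf, vectorizer.vocabulary_
```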

3.3 Heterogeneous Graph Network

Given the constructed DialGraph with node features \(\mathrm{\textbf{X}_w} \cup \mathrm{\textbf{X}_s} \cup \mathrm{\textbf{X}_c}\), we leverage the graph attention networks [26] to update the representations of nodes.

We refer to \(h_i \in \mathbb {R}^{d_h}, i \in [0, m + n]\) as the hidden states of the input nodes. The graph attention (GAT) layer is designed as follows:

$$\begin{aligned} \alpha _{ij} = \textrm{softmax}(\textrm{LeakyReLU}(W_a[W_qh_i;W_kh_j])) \end{aligned}$$
(1)
$$\begin{aligned} u_i = \sigma \left( \sum _{j \in N_i} \alpha _{ij}W_vh_j\right) \end{aligned}$$
(2)

where \(W_a, W_q, W_k, W_v\) are learnable linear transformation matrices and \(\alpha _{ij}\) is the attention weight between \(h_i\) and \(h_j\). The multi-head attention can be denoted as follows:

$$\begin{aligned} u_i = \parallel _{k=1}^{K}\sigma \left( \sum _{j \in N_i}\alpha _{ij}^kW_v^kh_j\right) \end{aligned}$$
(3)

We also add a residual connection to avoid gradient vanishing. Therefore, the final output can be formulated as follows:

$$\begin{aligned} h_{i}^{'} = u_i + h_i \end{aligned}$$
(4)

Besides, we modify the GAT layer to infuse the scalar edge weights \(e_{ij}\), which are mapped to a multi-dimensional embedding. Hence, Eq. 1 is modified as follows:

$$\begin{aligned} z_{ij} = \textrm{LeakyReLU}(W_a[W_qh_i;W_kh_j;e_{ij}]) \end{aligned}$$
(5)

After each GAT layer, we introduce a feed-forward network consisting of two linear projection layers, as in the Transformer [27]:

$$\begin{aligned} \textrm{FFN}(x) = \textrm{ReLU}(xW_1 + b_1)W_2 + b_2 \end{aligned}$$
(6)
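The following PyTorch sketch implements Eqs. (1)-(6) for a single attention head over a dense adjacency, with the scalar edge weight mapped to an embedding as in Eq. (5). The multi-head concatenation of Eq. (3) is omitted for brevity, \(\sigma\) is taken to be ELU, and the class and parameter names are illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAwareGATLayer(nn.Module):
    """Single-head sketch of the modified GAT layer (Eqs. 1, 2, 4, 5)
    followed by the feed-forward network of Eq. 6."""
    def __init__(self, d_h, d_e, d_ff):
        super().__init__()
        self.w_q = nn.Linear(d_h, d_h, bias=False)
        self.w_k = nn.Linear(d_h, d_h, bias=False)
        self.w_v = nn.Linear(d_h, d_h, bias=False)
        self.edge_emb = nn.Linear(1, d_e)                       # scalar e_ij -> d_e-dim embedding
        self.w_a = nn.Linear(2 * d_h + d_e, 1, bias=False)
        self.ffn = nn.Sequential(nn.Linear(d_h, d_ff), nn.ReLU(), nn.Linear(d_ff, d_h))

    def forward(self, h_q, h_kv, e, mask):
        # h_q: (Nq, d_h) nodes being updated; h_kv: (Nk, d_h) their neighbours
        # e: (Nq, Nk) scalar edge weights; mask: (Nq, Nk) True where an edge exists
        q = self.w_q(h_q).unsqueeze(1).expand(-1, h_kv.size(0), -1)    # (Nq, Nk, d_h)
        k = self.w_k(h_kv).unsqueeze(0).expand(h_q.size(0), -1, -1)    # (Nq, Nk, d_h)
        ee = self.edge_emb(e.unsqueeze(-1))                            # (Nq, Nk, d_e)
        z = F.leaky_relu(self.w_a(torch.cat([q, k, ee], dim=-1))).squeeze(-1)  # Eq. (5)
        z = z.masked_fill(~mask, float("-inf"))      # assumes every query node has >= 1 edge
        alpha = torch.softmax(z, dim=-1)                               # Eq. (1)
        u = F.elu(alpha @ self.w_v(h_kv))                              # Eq. (2), sigma taken as ELU
        return self.ffn(u + h_q)                                       # residual of Eq. (4), FFN of Eq. (6)
```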

3.4 Training and Optimization

During the training stage, the representations of the dialogue node, sentence nodes, and word nodes are updated alternately. Since the dialogue node can be regarded as the \(0^{th}\) sentence connected to all word nodes, it is updated in the same way as the sentence nodes. Thus, one training iteration includes a sentence-to-word update process and a word-to-sentence update process.

In the sentence-to-word update process, the dialogue node and the sentence nodes are updated in the \(t^{th}\) iteration based on their connected word nodes via the GAT and FFN layers as follows:

$$\begin{aligned} U_{s \leftarrow w}^{t+1}&= \textrm{GAT}(H_s^t, H_w^t, H^t_w)\end{aligned}$$
(7)
$$\begin{aligned} H_s^{t+1}&= \textrm{FFN}(U_{s \leftarrow w}^{t+1} + H_s^t) \end{aligned}$$
(8)

where \(H_w^0 = \mathrm{\textbf{X}_w}\), \(H_s^0 = \mathrm{\textbf{X}_s}\), and \(U_{s \leftarrow w}^1 \in \mathbb {R}^{n \times d_h}\). \(\textrm{GAT}(\cdot)\) denotes that \(H_s^t\) is used as the attention query and \(H_w^t\) is used as the key and value.

Then, in the word-to-sentence update process, the word nodes are updated based on the new dialogue node and sentence nodes.

$$\begin{aligned} U_{w \leftarrow s}^{t+1}&= \textrm{GAT}(H_w^t, H_s^{t+1}, H^{t+1}_s)\end{aligned}$$
(9)
$$\begin{aligned} H_{w}^{t+1}&= \textrm{FFN}(U_{w \leftarrow s}^{t+1} + H_w^t) \end{aligned}$$
(10)

Finally, classification of the dialogue node determines the label of the whole dialogue, and the cross-entropy loss is used to optimize the model [28].
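A minimal sketch of one forward pass with these alternating updates and the final dialogue-node classification is shown below. It reuses the EdgeAwareGATLayer sketch from Section 3.3; the class name, the stacking of three layers, and the single-dialogue (unbatched) interface are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DialGNNSketch(nn.Module):
    """Illustrative forward pass: alternating sentence-to-word and word-to-sentence
    updates (Eqs. 7-10), then classification of the dialogue (0-th) node."""
    def __init__(self, d_h, d_e, d_ff, num_classes, num_layers=3):
        super().__init__()
        self.s_from_w = nn.ModuleList([EdgeAwareGATLayer(d_h, d_e, d_ff) for _ in range(num_layers)])
        self.w_from_s = nn.ModuleList([EdgeAwareGATLayer(d_h, d_e, d_ff) for _ in range(num_layers)])
        self.classifier = nn.Linear(d_h, num_classes)

    def forward(self, h_s, h_w, e, mask):
        # h_s: (n+1, d_h) dialogue node (row 0) plus the n sentence nodes
        # h_w: (m, d_h) word nodes; e, mask: (n+1, m) edge weights and adjacency
        for gat_sw, gat_ws in zip(self.s_from_w, self.w_from_s):
            h_s = gat_sw(h_s, h_w, e, mask)           # Eqs. (7)-(8): sentences attend to words
            h_w = gat_ws(h_w, h_s, e.t(), mask.t())   # Eqs. (9)-(10): words attend to sentences
        return self.classifier(h_s[0])                # logits for the dialogue node

# Training step sketch: cross-entropy on the dialogue-node logits.
# logits = model(h_s, h_w, e, mask)
# loss = F.cross_entropy(logits.unsqueeze(0), label)   # label: LongTensor of shape (1,)
```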

4 Experiments

In this section, we perform several experiments to assess and analyze the effectiveness of our proposed dialogue classification approach. Our objectives are to address the following research questions:

  • How does our approach compare with existing methods on the dialogue classification task? (Section 4.3.1)

  • How does the heterogeneous graph information affect the dialogue classification performance? (Section 4.3.2)

  • What are the contributions of each component in our approach? (Section 4.3.3)

4.1 Datasets and Experiment Settings

We use two datasets for our experiments: the China Mobile Dataset (CM) and the E-commerce Customer Service Dataset (ECS). CM is a dataset of phone-call dialogues between customers and service staff, where the goal is to identify the business type requested by the customer. ECS is a dataset of online chat dialogues between customers and sellers, staff, or AI systems, where the goal is to classify the dialogue acts or emotions. Their statistics are shown in Table 1.

Table 1 The statistics of CM and ECS datasets

China Mobile Dataset (CM). This dataset assumes a scenario in which customer service staff answer phone calls from different customers. The aim is to determine which business a call actually requests, given the whole dialogue history. The contents are ASR transcripts of customer service phone calls. The labels are predefined business types. The dataset contains 19,784 labeled conversation segments with 37 human-machine dialogue intent categories. Table 2 shows the business and conversation intention types.

Table 2 Business types and conversation intentions

E-commerce Customer Service Dataset (ECS). ECS is contributed by us to the community. The dialogues take place between a customer and a seller, a staff member, or an AI system. The user goal is relatively straightforward: to complain about an unsatisfactory experience. The labels are event types, such as malicious refunding, counterfeiting, and rights infringement.

Table 3 Samples of ECS dataset

Table 3 displays a representative ECS sample, where each column represents a distinct element. The first column is a unique dialogue key. The second column contains the sequence of sentence IDs associated with the dialogue. The third column is a JSON list whose entries have the keys "id" (the sentence ID within the sequence), "text" (the sentence content), and "member_type" (1 for customer, 2 for customer service staff, 3 for automatic AI customer service). The fourth column indicates the dialogue category, annotated at both coarse and fine levels.
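For illustration, one row of this format could be represented roughly as follows; all field names and values below are placeholders rather than an actual record from the corpus.

```python
# Illustrative shape of one ECS row (placeholders only, not real data)
ecs_row = {
    "dialogue_key": "dlg_000001",                  # column 1: unique dialogue key
    "sentence_ids": [1, 2, 3],                     # column 2: ordered sentence IDs
    "sentences": [                                 # column 3: JSON list of utterances
        {"id": 1, "text": "...", "member_type": 1},   # 1 = customer
        {"id": 2, "text": "...", "member_type": 3},   # 3 = automatic AI customer service
        {"id": 3, "text": "...", "member_type": 2},   # 2 = customer service staff
    ],
    "category": {"coarse": "...", "fine": "..."},  # column 4: coarse and fine labels
}
```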

We adopt two widely used evaluation metrics, accuracy and F1-score, to evaluate the performance of DialGNN. We use categorical cross-entropy as the loss function on both datasets. The learning rate is set to \(5\times 10^{-5}\) and the batch size to 128. The multi-head graph attention network has 3 layers, 2 heads, and 1024 hidden units. For embeddings, we use 768-dimensional BERT and Chinese RoBERTa embeddings as word embeddings. To increase training and inference efficiency, we adopt mini-batch processing, handling a subset of the data simultaneously during training; this allows parallel computation across different graph instances. All code for our experiments is available on GitHub (https://github.com/821code/DialGNN).
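For reference, the settings above can be collected into a single configuration sketch; the key names are illustrative and do not correspond to the released code.

```python
# Hyperparameters reported in this section; key names are illustrative.
config = {
    "learning_rate": 5e-5,
    "batch_size": 128,
    "gat_layers": 3,
    "attention_heads": 2,
    "hidden_units": 1024,
    "embedding_dim": 768,                 # BERT / Chinese RoBERTa embeddings
    "loss": "categorical_cross_entropy",
}
```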

4.2 Baselines

To evaluate the performance of our proposed framework, we compare it with several baseline models that use different sequence encoders.

  • TextRNN [29] is a type of recurrent neural network that can handle text data while taking the order of words into account. TextRNN recursively passes the output of the previous time step as the input of the current time step, thereby transferring context information to the next time step.

  • TextRNN-Att [30] is a text classification model that combines recurrent neural networks (RNNs) and attention mechanisms. TextRNN-Att uses a bidirectional RNN to encode the input text into hidden states, and then applies an attention layer to aggregate the hidden states into a sentence representation. The attention layer assigns different weights to different parts of the text, depending on their relevance to the classification task.

  • TextCNN [31] short for Text Convolutional Neural Network, is a deep learning model designed for text classification and sentiment analysis tasks. TextCNN can handle variable-length sentences and learn complex semantic features from the text.

  • CNN-LSTM [32] is a widely-used model consisting of regional CNN and LSTM. By combining the regional CNN and LSTM components, the CNN-LSTM model can leverage both the local spatial information captured by the CNN and the sequential dependencies captured by the LSTM. This hybrid approach allows the model to effectively extract meaningful features from input data and capture complex relationships within sequential data.

  • BERT [25] is a transformers-based language model, which is pre-trained on large-scale corpus and has achieved remarkable success in many NLP tasks. The use of transformers in BERT enables it to capture contextual dependencies in a more comprehensive manner. The transformer architecture utilizes attention mechanisms to weigh the importance of different words in a sentence based on their relevance to each other. This attention mechanism allows BERT to consider the entire context when representing a word, rather than just relying on its immediate neighbors.

  • RoBERTa [33] is a robustly optimized version of BERT, a pre-trained language model that uses bidirectional transformers to learn contextual representations of text.

  • ERNIE [34] stands for Enhanced Representation through kNowledge IntEgration, which indicates its ability to incorporate various types of knowledge into the pre-training process of language models.

  • Han [7] is a hierarchical attention network with two levels of attention applied at the word level and the sentence level, a structure similar to our graph. The hierarchical attention allows it to effectively model relationships and dependencies between words and sentences: the model captures not only local interactions between words but also the broader interactions and contextual dependencies between sentences.

  • DAG [22] is an acronym for Directed Acyclic Graph. DAG can be used to model the structure and context of a conversation, where each node represents an utterance and each edge represents the dependency or influence between utterances. DAG can capture the information flow and the long-distance dependencies in a conversation.

  • InductGCN [35] constructs a graph based on the statistics of training documents only and represents document vectors with a weighted sum of word vectors. It then conducts one-directional GCN propagation during testing.

  • TextGCN [36] incorporates semantic information and relationships from text data by constructing a text graph and applying graph convolution operations. It can perform text classification without the need for external embeddings, making it a valuable approach for specific text classification tasks.

  • AttentionXML [37] introduced an attention mechanism and a probabilistic label tree (PLT). Attention mechanism ensures that the model captures the subtle and context-dependent associations between text and labels, enhancing classification accuracy.

4.3 Results and Analysis

4.3.1 DialGNN Comparing with Baseline Methods

Table 4 presents the performance of different models on the two datasets, CM and ECS, for dialogue classification. Notably, DialGNN(BERT) denotes the combination of the BERT model with the DialGNN framework and is subsequently referred to simply as DialGNN.

On the CM dataset, DialGNN demonstrates remarkable performance with 70.2% accuracy and 59.3% F1 score. This outcome underscores the DialGNN structure’s innate capacity to capture intricate linguistic patterns embedded within dialogues. Meanwhile, the ECS dataset reveals similar excellence, with an accuracy rate of 60.3% and an equally robust F1 score of 54.9%. This success can be primarily attributed to the indispensable role played by the DialGNN structure in facilitating the modeling of intricate dependencies and contextual information across the sequence of dialogue turns.

The results obtained underscore the inherent limitations of conventional baseline models such as TextRNN, TextCNN, and CNN-LSTM. These models, rooted in sequential processing architectures, consistently display comparatively inferior performance on both the CM and ECS datasets.

Table 4 Comparison with baseline methods on CM and ECS datasets

Compared to BERT, RoBERTa, and ERNIE, DialGNN performs better due to its heterogeneous graph architecture designed for dialogue classification, especially in understanding the flow of conversations, tracking changing topics, and capturing user intents that evolve across sentences.

Furthermore, the underperformance of the DAG model on the ECS dataset can be attributed to a fundamental mismatch between the model's design and the dialogue classification task: the DAG model primarily operates at the sentence level, focusing on classifying individual utterances, while the datasets in question require dialogue-level classification.

InductGCN's performance on both the CM and ECS datasets is lower than DialGNN's. Its limitations may stem from a reduced capacity to capture the nuanced linguistic patterns, contextual dependencies, and user intents in dialogues, which are pivotal for dialogue classification tasks.

Dialogue text contains interactions and transitions between speakers, and TextGCN's design does not explicitly address these complex relationships. It tends to process individual sentences or text blocks and may not adequately capture the context and relationships between speakers, limiting its ability to understand and model the overall semantics of the dialogue.

As for AttentionXML, if there are weak label correlations or label relationships that have minimal impact on classification in a given dataset, the model might not fully leverage the advantages of the attention mechanism. AttentionXML relies on learning label relationships to better capture the relationships between text and labels. If these relationships are not significant in the dataset, the model’s performance could be constrained.

Crucially, DialGNN incorporates nodes at different levels of semantic granularity, allowing for flexible integration of various pre-trained language representation models. The use of BERT within the graph structure endows DialGNN with powerful text representational capabilities.

To further validate the generalizability of DialGNN, Table 5 presents the comparison results of different sequence encoders with and without DialGNN. For all baseline models, combining them with DialGNN achieves significant improvements. Even over the strong BERT baseline, DialGNN gains 5.5% and 6.4% in F1 score on the CM and ECS datasets, respectively.

Table 5 The performance comparisons of baseline models and combining DialGNN

As shown in Table 5, models with DialGNN-seg perform better on the ECS dataset. DialGNN-seg refers to a technique for handling overly long dialogues. The BERT model restricts the maximum input sequence length to 512 tokens for computational reasons. For a dialogue from the ECS dataset with more than 512 tokens, we truncate it to 512 tokens in the basic DialGNN setting. In the DialGNN-seg setting, we instead obtain the initial embeddings by sliding a 512-token context window over the dialogue, so DialGNN-seg incorporates more contextual information into the node embeddings and achieves better performance.
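A sketch of this windowing idea is shown below, assuming a Hugging Face tokenizer and BERT model as in Section 3.2. The stride of 256 tokens and the mean over window-level [CLS] vectors are assumptions for illustration; the paper does not specify how the window outputs are combined into node embeddings.

```python
import torch

@torch.no_grad()
def sliding_window_encoding(dialogue_text, tokenizer, bert, window=512, stride=256):
    """Encode a long dialogue with overlapping 512-token windows instead of truncating it."""
    ids = tokenizer(dialogue_text, add_special_tokens=False)["input_ids"]
    step = window - 2                                  # leave room for [CLS] and [SEP]
    chunks = [ids[i:i + step] for i in range(0, max(len(ids), 1), stride)]
    vecs = []
    for chunk in chunks:
        piece = [tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]
        out = bert(torch.tensor([piece])).last_hidden_state[:, 0]   # window-level [CLS] vector
        vecs.append(out)
    return torch.cat(vecs, dim=0).mean(dim=0)          # one vector summarizing all windows
```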

4.3.2 Comparisons on Graph Designs

Table 6 displays the performance evaluation of various graph designs that utilize pre-trained models on the CM dataset. The comparison groups consist of different design variations, including those with context relation modeling (specifically DialogueGCN, which requires a substantial amount of GPU memory and therefore uses BERT-tiny as its base model), asynchronous initialization, and designs without a dialogue node.

The results of the comparison reveal that alternative graph designs tend to compromise the quality of the latent representations provided by pre-trained models. This suggests that the mentioned designs are not able to effectively capture and incorporate the contextual relationships present in the data. In particular, the performance metrics indicate that these alternative designs result in a degradation of the pre-trained model’s ability to represent and understand the underlying patterns in the China Mobile Dataset.

These findings highlight the importance of preserving the latent representation quality obtained from pre-trained models when designing graph structures for natural language processing tasks. They suggest that the design choices in DialGNN, namely avoiding explicit context relation modeling and asynchronous initialization and including a dialogue node, are important for maintaining the integrity and effectiveness of the pre-trained models on the China Mobile Dataset. With these design elements, the model can leverage the full potential of the pre-trained representations and achieve better performance in capturing the intricacies of the dataset.

Table 6 The results of Different Graph Designs

4.3.3 Ablation Study

To validate the contribution of each component, a series of experiments is designed to observe the performance; the results are summarized in Table 7, where "w/o" indicates that the component is not included in the model.

Table 7 The results of Ablation study on CM Dataset

The results presented in the table demonstrate that the TF-IDF initialization of edge weights significantly enhances the overall performance of the system. This initialization technique, which utilizes the TF-IDF algorithm, provides a valuable foundation for establishing the weights of connections between nodes in the graph structure.

Moreover, both the sentence-to-word updating step and the word-to-sentence updating step have been found to play crucial roles in the functionality of the DialGNN system. These steps are integral in facilitating the flow of information and the exchange of knowledge between sentences and words in the graph. By ensuring a bidirectional and iterative updating process, these steps enable the system to capture and incorporate relevant information from both sentence-level and word-level representations.

The findings underscore the significance of each component in the overall performance of the system. The TF-IDF initialization contributes substantially to the quality of the edge weights, enhancing the accuracy and effectiveness of the system. Additionally, the sentence-to-word updating step and the word-to-sentence updating step are deemed essential for the optimal functioning of DialGNN, enabling seamless information propagation and integration between sentence and word representations.

4.4 Case Study

Fig. 3

Examples of the case study on the CM and ECS datasets. Keywords in the different examples are marked in color

Drawing upon the preceding model analysis, a case study was conducted using samples extracted from the datasets. As depicted in Fig. 3, Table 8 presents the classification outcomes achieved by DialGNN in comparison with other models. Remarkably, DialGNN classifies all of the examples correctly. Its superiority lies in its capacity to go beyond feature-based word-level classification: it comprehends the nuanced semantics of words within the broader context and the underlying dialogue structure.

Table 8 Classification results of different models for Examples. Bold indicates correct classification

Referring to Fig. 3, Example 1 showcases a distinctive dialogue structure. In the conversation, the customer and the customer service staff initially hold different interpretations of the event, which evolve over the course of the discussion. Inspired by BERT, DialGNN introduces a "0th sentence node" as the dialogue node. This addition allows the model to integrate various discourse features across the entire dialogue, leading to a more comprehensive understanding at the macro level. In this case, DialGNN can synthesize the customer's descriptions of the event and thus obtain a deeper understanding. Consequently, it identifies the correct label even when faced with conflicting opinions, which distinguishes it from other models that rely primarily on direct opinion information.

Moving on to Example 2, while other models may be influenced by the frequency of terms like ‘July’, ‘Bill’, ‘Settlement’, and quantities mentioned in intermediate exchanges, they could interpret the core intent as a general ‘Business Processing’. In contrast, DialGNN leverages its dialog structure representation through a graph and an attention mechanism. It recognizes that the conversation’s beginning and end are critical segments since they mark the initiation and conclusion of inquiries. While other models may overlook these subtle cues, DialGNN focuses on the dialog’s core intent concentration areas and accurately identifies it as an ‘Invoice Reissuance’ issue. This underscores DialGNN’s strength in capturing contextual nuances within dialogues.

Finally, Example 3 showcases a dialogue containing vital contextual semantics. While other models tend to reinforce associations between strongly characterized words like ‘move’, ‘area’, and ‘location’ and tagged words such as ‘install’ and ‘moving’, DialGNN takes a distinctive approach. It capitalizes on the powerful contextual semantic features embedded within pre-trained models, seamlessly integrating information across sentences and words through interactive updates within its graph structure. Unlike other models that may categorize based on individual word characteristics, DialGNN excels in understanding words within the context and dialogue background. This deep comprehension is evident in this case, where it discerns the customer’s dissatisfaction with the business office from the event description and interactions with customer service.

5 Conclusion

In summary, our work introduces the innovative DialGNN framework, which leverages heterogeneous graph neural networks to gain a deeper understanding of multi-turn dialogues. Our framework offers versatile compatibility with various encoders and demonstrates the potential to enhance their performance, even when used in conjunction with pre-trained language models. The extensive array of experiments conducted showcases the efficacy of DialGNN in the context of dialogue understanding. The ability of DialGNN to capture nuanced linguistic patterns, contextual dependencies, and evolving user intents across dialogues sets it apart as a robust and adaptable framework with the potential to advance a myriad of real-world applications. Our study serves as a pivotal step in the ongoing evolution of dialogue systems, paving the way for enhanced dialogue comprehension, customer satisfaction analysis, service quality assurance, and dialogue topic categorization.