1 Introduction

In recent years, dialogue systems have been widely deployed in customer service, online health consultation, chatbots, and other settings. Dialogue classification, which aims to assign predefined labels to an entire dialogue, is a fundamental task underlying many applications, including dialogue theme recognition, customer satisfaction analysis, and service quality assessment [1].

Most existing research on classification in dialogue systems focuses on the intent of the user in each turn of a dialogue [2, 3]. These methods, which take a sentence-level user utterance as input and output a predicted intent, are not well suited to classifying entire dialogues at the document level, because sentences in a dialogue must be understood in the context of all the messages around them. This dependence on extended context requires the classifier to take a large block of utterances as input and classify it as a whole [4].

Fig. 1

An example dialogue from telecom customer service, which should be labeled "consulting" rather than "business cancelling"

An intuitive solution to the above problem is to treat the whole dialogue as a document and apply document classification methods. These methods either concatenate the sentences into one long sequence [5, 6] or combine them hierarchically [7]. The main challenge is that a dialogue may contain multiple semantic topics, some of which are irrelevant to the business concern of the application. Such irrelevant topics act as noise and can be meaningless or even misleading for classification models. In the example in Fig. 1, the customer is consulting about whether the 30-yuan data package has been cancelled; the ground-truth category is business consultation. However, the cancellation topic mentioned in the dialogue might mislead a model into predicting the business cancellation category. Existing models can therefore hardly identify such noise in dialogues and determine the accurate categories.

We propose DialGNN, a generic framework based on heterogeneous graph neural networks for document-level dialogue classification. First, a heterogeneous graph is constructed for each dialogue to represent the latent relationships among the sentences and the words within it; the sentences and words of each dialogue are treated as nodes of different types in the graph. We then combine graph neural networks with pre-trained language models to learn latent representations of the nodes and edges in the dialogue graph. During message passing over the graph, the representations of word nodes and sentence nodes are updated together, which helps the model learn more implicit relationships among words and sentences. To validate the effectiveness of our approach, we conduct experiments on a public dataset and an e-commerce customer service dataset contributed by ourselves. The comparison results show that the proposed method outperforms state-of-the-art methods.

2 Related Work

2.1 Dialogue Classification

Dialogue classification involves assigning predefined labels to dialogues or their segments, such as utterances or turns, based on their functional or intentional significance within the conversation [8, 9].

Most existing studies on dialogue classification focus on sentence-level or utterance-level intent recognition of user statements [10]. These studies commonly employ hierarchical neural networks to model the sequential and structural information within words, characters, and utterances [11]. However, these approaches fail to explicitly account for the transition of speakers during the dialogue, which can impact the interpretation of dialogue acts. For instance, when speaker A poses a question, the subsequent utterance from speaker B is more likely to be an answer. Conversely, if the speaker remains the same, the following act is less likely to be an answer.

Tavabi [12] proposed integrating speaker turn changes into the modeling of dialogue acts. They learned conversation-invariant speaker turn embeddings to represent the speaker turns in a conversation; the learned speaker turn embeddings were then merged with the utterance embeddings for the downstream task of dialogue act classification. They showed that their model outperformed several baselines on three public benchmark datasets.

Another challenge for dialogue classification is that a dialogue may contain multiple semantic topics, some of which are irrelevant to the primary objective or business task of the application [13]. This complexity arises from the natural flow of conversation, where participants may introduce unrelated or tangential subjects alongside the main focus of the dialogue. Consequently, accurately classifying dialogues requires the ability to identify and filter out irrelevant topics, ensuring that the assigned labels reflect the pertinent information and align with the specific objectives of the application.

Kumar [14] addressed this problem by augmenting a small dataset to classify contextualized dialogue acts for exploratory visualization. They collected a new corpus of conversations, CHICAGO-CRIME-VIS, geared towards supporting data visualization exploration, and annotated it for a variety of features, including contextualized dialogue acts. They applied data augmentation techniques such as paraphrasing and back-translation to the training data to increase its diversity and robustness. In experiments with different classifiers, conditional random fields outperformed the other methods.

Guo [15] recognized the importance of removing redundant information from dialogue text and adopted a long-text segmentation method based on resampling, which also addresses the limitation on BERT's input length.

2.2 Heterogeneous Graph Network

For news classification, Kang [16] proposed a heterogeneous graph called the News Classification Graph to represent the relationships among multiple news articles, such as their relevance in time, place, and people. Moreover, they proposed a Joint Heterogeneous graph Network (JHN) to properly embed the News Classification Graph.

For aspect-based sentiment analysis, aiming to capture the sentiment relationships among aspect terms, Niu [17] constructed a heterogeneous graph that models inter-aspect relationships and aspect-context relationships simultaneously.

To combine multiple aspects of a review and make use of the links between a sentence and its words, Yang [18] proposed a dual-level attention-based heterogeneous graph convolutional network, including node-level and type-level attention.

For short text classification, Yang [19] proposed a word-concept heterogeneous graph convolution network to avoid treating introduced concepts as noise and to learn representations with interactive information. Kong [20] considered the lack of labeled data and proposed a heterogeneous graph attention network with an uncertainty-aware mechanism. Furthermore, the lack of context, the sparsity of short-text features, and the inability of word embeddings and external knowledge bases to supplement short-text information also pose challenges for short text classification. Aiming to improve classification accuracy and reduce computational cost, Zhang [21] built a graph convolutional network over texts, words, and POS tags that does not require pre-trained word embeddings as initial node features.

For utterance-level dialogue classification, many graph-based methods have been applied to capture the implicit feature information in the dialogue structure. Qin [6] designed a co-interactive graph interaction layer to capture contextual information and interaction information, both of which are important cues hidden in a dialogue. Shen [22] proposed a neural network based on a directed acyclic graph to better represent the dialogue information flow, combining the advantages of graph neural networks and recurrent neural networks.

For dialogue-level dialogue classification, relevant models are scarce, but graph-based models still attract attention from researchers in this field. Pang [23] regarded speakers, local discourses, and utterances as the main sources of information and modeled them with graphs to construct a multi-factor graph.

3 Methodology

DialGNN encompasses three essential modules that collectively contribute to its functionality: DialGraph Construction, Node Representation, and Heterogeneous Graph Network, as illustrated in Fig. 2.

Fig. 2

The architecture of the DialGNN framework. The forward pass for an example dialogue (with six sentences) contains four stages: initialization of node representations with BERT, construction of the DialGraph, updating of node representations with the heterogeneous graph network, and classification of the dialogue node to perform intent prediction

The DialGraph Construction module plays a crucial role by transforming a given dialogue into a heterogeneous graph. This graph captures the intricate relationships among words, sentences, and the overall dialogue structure. By representing the dialogue in this manner, DialGNN gains a comprehensive understanding of its underlying dynamics.

The Node Representation module initializes the node representations within the DialGraph. This is achieved with BERT-based embeddings, i.e., pre-trained contextual representations that capture rich semantic information. With a pre-trained BERT model initializing the node representations, DialGNN takes contextual semantics into consideration and can effectively combine the contextual information with the structure within dialogues. Through this initialization, the module equips the graph with meaningful and informative node representations.

The final module, the Heterogeneous Graph Network, encodes the heterogeneous graphs generated by DialGraph Construction. It employs graph attention networks to capture relevant dependencies and interactions among nodes within the graph, and it updates the node representations based on these learned relationships, enhancing the graph's ability to handle downstream tasks. The following subsections describe each module in detail.

3.1 DialGraph Construction

There have been several efforts to convert a dialogue into a topological graph [24]. Most of them regard each sentence as a node and construct a homogeneous graph in which edges between nodes are formed by contextual relations; that is, only sentences within a fixed window size are connected. Such methods may fail to capture relations between sentences that are far apart and may ignore the impact of individual words that contribute significantly to the predicted categories.

To this end, we construct a heterogeneous graph named DialGraph with word nodes, sentence nodes, and a dialogue node. The edges between sentence nodes and word nodes represent containment relations. More implicit relations among different sentences, such as co-occurrence, semantic distance, and term frequencies, can then be derived from the relations between sentences and words. Inspired by the use of the [CLS] token in BERT, we add a \(0^{th}\) sentence node as the dialogue node rather than pooling the sentence node embeddings. Formally, the heterogeneous graph DialGraph is defined as follows.

Given a dialogue \(C=\{s_1, s_2, \dots , s_n\}\), the DialGraph is denoted as \(G = \{V, E\}\), where \(V = V_w \cup V_s \cup V_c\) and \(E =\{e_{10}, e_{11},\dots ,e_{mn}\}\) are the node set and the edge set, respectively. Here, \(V_w = \{w_1, w_2, \dots , w_m\}\) denotes the m unique words, \(V_s\) corresponds to the n sentences, and \(V_c\) is the dialogue node. E is a real-valued edge-weight matrix, and \(e_{ij}\ (i \in [1, m], j \in [0, n])\) indicates that the \(j^{th}\) sentence contains the \(i^{th}\) word. Note that the dialogue node \(V_c\) connects to all word nodes.
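To make the construction concrete, the following minimal sketch assembles the node and edge sets for one dialogue. It uses plain Python structures; the function name build_dialgraph, the tokenize argument, and the use of raw counts as provisional edge weights (later replaced by TF-IDF, see Section 3.2) are illustrative assumptions rather than the released implementation.

```python
from collections import defaultdict

def build_dialgraph(sentences, tokenize):
    """Illustrative sketch: build the DialGraph node and edge sets for one dialogue.

    sentences: list of n sentence strings s_1..s_n.
    tokenize:  callable mapping a sentence to a list of word tokens.
    Returns the word vocabulary V_w, the number of sentences n, and the
    containment edges e_ij keyed by (word index i, sentence index j).
    """
    word2id = {}                    # V_w: the m unique words, ids 1..m
    edges = defaultdict(float)      # (i, j) -> provisional weight e_ij
    n = len(sentences)

    for j, sent in enumerate(sentences, start=1):    # sentence nodes j = 1..n
        for token in tokenize(sent):
            i = word2id.setdefault(token, len(word2id) + 1)
            edges[(i, j)] += 1.0    # raw count; replaced by TF-IDF in practice

    for i in word2id.values():      # dialogue node j = 0 connects to every word node
        edges[(i, 0)] += 1.0

    return word2id, n, dict(edges)

# Example: build_dialgraph(["I want to cancel the plan", "Which plan do you mean"], str.split)
```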

The node updates in DialGNN are determined by considering the features of neighboring nodes and the associated edge weights. In this regard, the word nodes update their representations based on the features and edge weights of the corresponding sentence nodes. Similarly, the sentence nodes update their representations by considering the features and edge weights of the word nodes connected to them. Furthermore, the dialogue nodes update their representations by incorporating the features and edge weights of the sentence nodes connected to them. This approach ensures that the node representations in DialGNN are iteratively refined, taking into account the contextual information from neighboring nodes and their respective edge weights.

3.2 Node Representation

We denote \(\mathrm{\textbf{X}_w} \in \mathbb {R}^{m \times d_w}\), \(\mathrm{\textbf{X}_s} \in \mathbb {R}^{n \times d_s}\), and \(\mathrm{\textbf{X}_c} \in \mathbb {R}^{1 \times d_c}\) as the input feature matrices of the word nodes, sentence nodes, and dialogue node, respectively. Here, \(d_w\), \(d_s\), and \(d_c\) refer to the dimensions of the word embeddings, sentence representation vectors, and dialogue representation vector, respectively.

Here, we use BERT-based [25] embeddings to obtain the initial representations of the words, the sentences, and the dialogue. Note that other embedding models and other pre-trained language models can also be utilized.

To incorporate the varying importance of relationships between nodes, we employ TF-IDF (Term Frequency-Inverse Document Frequency) values to initialize the weights of the edges. TF-IDF is a statistical measure commonly used in natural language processing to evaluate the significance of a term in a document relative to a collection of documents. By assigning TF-IDF values as the edge weights, we can capture the importance of the connections between nodes in the graph structure.
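A minimal sketch of this initialization is given below, assuming the Hugging Face transformers library and scikit-learn. The choice of bert-base-chinese, the use of the [CLS] vector as a sentence representation, the whitespace tokenizer, and the mean over sentence vectors as a stand-in for the dialogue-node feature are assumptions for illustration; the paper only requires that some BERT-based embedding and TF-IDF weighting be used.

```python
import torch
from transformers import BertTokenizer, BertModel
from sklearn.feature_extraction.text import TfidfVectorizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # model name is an assumption
bert = BertModel.from_pretrained("bert-base-chinese").eval()

@torch.no_grad()
def init_node_features(sentences):
    """X_s: one 768-d [CLS] vector per sentence; X_c: a simple stand-in for the dialogue node."""
    enc = tokenizer(sentences, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
    x_s = bert(**enc).last_hidden_state[:, 0]      # (n, 768)
    x_c = x_s.mean(dim=0, keepdim=True)            # (1, 768); illustrative only
    return x_s, x_c

def init_edge_weights(sentences):
    """TF-IDF value of word i in sentence j, used to initialize the edge weight e_ij."""
    vectorizer = TfidfVectorizer(tokenizer=str.split, lowercase=False)
    tfidf = vectorizer.fit_transform(sentences)    # sparse (n_sentences, n_words)
    return tfidf, vectorizer.vocabulary_
```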

3.3 Heterogeneous Graph Network

Given the constructed DialGraph with node features \(\mathrm{\textbf{X}_w} \cup \mathrm{\textbf{X}_s} \cup \mathrm{\textbf{X}_c}\), we leverage the graph attention networks [26] to update the representations of nodes.

We refer to \(h_i \in \mathbb {R}^{d_h}, i \in [0, m + n]\) as the hidden states of the input nodes. The graph attention (GAT) layer is designed as follows:

$$\begin{aligned} \alpha _{ij} = \textrm{softmax}(\textrm{LeakyReLU}(W_a[W_qh_i;W_kh_j])) \end{aligned}$$
(1)
$$\begin{aligned} u_i = \sigma \left( \sum _{j \in N_i} \alpha _{ij}W_vh_j\right) \end{aligned}$$
(2)

where \(W_a, W_q, W_k, W_v\) are learnable linear transformation matrices and \(\alpha _{ij}\) is the attention weight between \(h_i\) and \(h_j\). The multi-head attention can be denoted as follows:

$$\begin{aligned} u_i = \parallel _{k=1}^{K}\sigma \left( \sum _{j \in N_i}\alpha _{ij}^kW_v^kh_j\right) \end{aligned}$$
(3)

We also add a residual connection to avoid gradient vanishing. Therefore, the final output can be formulated as follows:

$$\begin{aligned} h_{i}^{'} = u_i + h_i \end{aligned}$$
(4)

Besides, we modify the GAT layer to infuse the scalar edge weights \(e_{ij}\), which are mapped to a multi-dimensional embedding. Hence, Eq. 1 is modified as follows:

$$\begin{aligned} z_{ij} = \textrm{LeakyReLU}(W_a[W_qh_i;W_kh_j;e_{ij}]) \end{aligned}$$
(5)

After each GAT layer, we introduce a feed-forward network consisting of two linear projection layers, as in the Transformer [27]:

$$\begin{aligned} \textrm{FFN}(x) = \textrm{ReLU}(xW_1 + b_1)W_2 + b_2 \end{aligned}$$
(6)
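The following PyTorch sketch implements Eqs. (1)-(6) for a single attention head over a dense adjacency, with the scalar edge weight mapped to an embedding as in Eq. (5). The multi-head concatenation of Eq. (3) is omitted for brevity, \(\sigma\) is taken to be ELU, and the class and parameter names are illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAwareGATLayer(nn.Module):
    """Single-head sketch of the modified GAT layer (Eqs. 1, 2, 4, 5)
    followed by the feed-forward network of Eq. 6."""
    def __init__(self, d_h, d_e, d_ff):
        super().__init__()
        self.w_q = nn.Linear(d_h, d_h, bias=False)
        self.w_k = nn.Linear(d_h, d_h, bias=False)
        self.w_v = nn.Linear(d_h, d_h, bias=False)
        self.edge_emb = nn.Linear(1, d_e)                       # scalar e_ij -> d_e-dim embedding
        self.w_a = nn.Linear(2 * d_h + d_e, 1, bias=False)
        self.ffn = nn.Sequential(nn.Linear(d_h, d_ff), nn.ReLU(), nn.Linear(d_ff, d_h))

    def forward(self, h_q, h_kv, e, mask):
        # h_q: (Nq, d_h) nodes being updated; h_kv: (Nk, d_h) their neighbours
        # e: (Nq, Nk) scalar edge weights; mask: (Nq, Nk) True where an edge exists
        q = self.w_q(h_q).unsqueeze(1).expand(-1, h_kv.size(0), -1)    # (Nq, Nk, d_h)
        k = self.w_k(h_kv).unsqueeze(0).expand(h_q.size(0), -1, -1)    # (Nq, Nk, d_h)
        ee = self.edge_emb(e.unsqueeze(-1))                            # (Nq, Nk, d_e)
        z = F.leaky_relu(self.w_a(torch.cat([q, k, ee], dim=-1))).squeeze(-1)  # Eq. (5)
        z = z.masked_fill(~mask, float("-inf"))      # assumes every query node has >= 1 edge
        alpha = torch.softmax(z, dim=-1)                               # Eq. (1)
        u = F.elu(alpha @ self.w_v(h_kv))                              # Eq. (2), sigma taken as ELU
        return self.ffn(u + h_q)                                       # residual of Eq. (4), FFN of Eq. (6)
```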

3.4 Training and Optimization

During the training stage, the representations of the dialogue node, sentence nodes, and word nodes are updated alternately. Since the dialogue node can be regarded as the \(0^{th}\) sentence connected to all word nodes, it is updated in the same way as the sentence nodes. Thus, one training iteration includes a sentence-to-word update process and a word-to-sentence update process.

In the sentence-to-word update process, the dialogue node and the sentence nodes are updated in the \(t^{th}\) iteration based on their connected word nodes via the GAT and FFN layers as follows:

$$\begin{aligned} U_{s \leftarrow w}^{t+1}&= \textrm{GAT}(H_s^t, H_w^t, H^t_w)\end{aligned}$$
(7)
$$\begin{aligned} H_s^{t+1}&= \textrm{FFN}(U_{s \leftarrow w}^{t+1} + H_s^t) \end{aligned}$$
(8)

where \(H_w^0 = \mathrm{\textbf{X}_w}\), \(H_s^0 = \mathrm{\textbf{X}_s}\), and \(U_{s \leftarrow w}^1 \in \mathbb {R}^{n \times d_h}\). \(\textrm{GAT}(\cdot)\) denotes that \(H_s^t\) is used as the attention query and \(H_w^t\) is used as the key and value.

Then, in the word-to-sentence update process, the word nodes are updated based on the new dialogue node and sentence nodes.

$$\begin{aligned} U_{w \leftarrow s}^{t+1}&= \textrm{GAT}(H_w^t, H_s^{t+1}, H^{t+1}_s)\end{aligned}$$
(9)
$$\begin{aligned} H_{w}^{t+1}&= \textrm{FFN}(U_{w \leftarrow s}^{t+1} + H_w^t) \end{aligned}$$
(10)

Finally, classification of the dialogue node determines the label of the whole dialogue, and the cross-entropy loss is used to optimize the model [28].
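A minimal sketch of one forward pass with these alternating updates and the final dialogue-node classification is shown below. It reuses the EdgeAwareGATLayer sketch from Section 3.3; the class name, the stacking of three layers, and the single-dialogue (unbatched) interface are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DialGNNSketch(nn.Module):
    """Illustrative forward pass: alternating sentence-to-word and word-to-sentence
    updates (Eqs. 7-10), then classification of the dialogue (0-th) node."""
    def __init__(self, d_h, d_e, d_ff, num_classes, num_layers=3):
        super().__init__()
        self.s_from_w = nn.ModuleList([EdgeAwareGATLayer(d_h, d_e, d_ff) for _ in range(num_layers)])
        self.w_from_s = nn.ModuleList([EdgeAwareGATLayer(d_h, d_e, d_ff) for _ in range(num_layers)])
        self.classifier = nn.Linear(d_h, num_classes)

    def forward(self, h_s, h_w, e, mask):
        # h_s: (n+1, d_h) dialogue node (row 0) plus the n sentence nodes
        # h_w: (m, d_h) word nodes; e, mask: (n+1, m) edge weights and adjacency
        for gat_sw, gat_ws in zip(self.s_from_w, self.w_from_s):
            h_s = gat_sw(h_s, h_w, e, mask)           # Eqs. (7)-(8): sentences attend to words
            h_w = gat_ws(h_w, h_s, e.t(), mask.t())   # Eqs. (9)-(10): words attend to sentences
        return self.classifier(h_s[0])                # logits for the dialogue node

# Training step sketch: cross-entropy on the dialogue-node logits.
# logits = model(h_s, h_w, e, mask)
# loss = F.cross_entropy(logits.unsqueeze(0), label)   # label: LongTensor of shape (1,)
```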

4 Experiments

In this section, we perform several experiments to assess and analyze the effectiveness of our proposed dialogue classification approach. Our objectives are to address the following research questions:

  • How does our approach compare with existing methods on the dialogue classification task? (Section 4.3.1)

  • How does the heterogeneous graph information affect the dialogue classification performance? (Section 4.3.2)

  • What are the contributions of each component in our approach? (Section 4.3.3)

4.1 Datasets and Experiment Settings

We use two datasets for our experiments: the China Mobile Dataset (CM) and the E-commerce Customer Service Dataset (ECS). CM is a dataset of phone-call dialogues between customers and service staff, where the goal is to identify the business type requested by the customer. ECS is a dataset of online chat dialogues between customers and sellers, staff, or AI systems, where the goal is to classify the dialogue acts or emotions. Their statistics are shown in Table 1.

Table 1 The statistics of CM and ECS datasets

China Mobile Dataset (CM). This dataset assumes a scenario in which customer service staff answer phone calls from different customers. The aim is to determine which business a call actually requests, given the whole dialogue history. The contents are ASR transcripts of customer service phone calls. The labels are predefined business types. The dataset contains 19,784 labeled conversation segments with 37 human-machine dialogue intent categories. Table 2 shows the business and conversation intention types.

Table 2 Business types and conversation intentions

E-commerce Customer Service Dataset (ECS). ECS is contributed by us to the community. The dialogues take place between a customer and a seller, a staff member, or an AI system. The user goal is relatively straightforward: to complain about an unsatisfactory experience. The labels are event types, such as malicious refunding, counterfeiting, and rights infringement.

Table 3 Samples of ECS dataset

Table 3 displays a representative ECS sample, where each column represents a distinct element. The first column is a unique dialogue key. The second column contains the sequence of sentence IDs associated with the dialogue. The third column is a JSON list whose entries have the keys "id" (the sentence ID within the sequence), "text" (the sentence content), and "member_type" (1 for customer, 2 for customer service staff, 3 for automatic AI customer service). The fourth column indicates the dialogue category, annotated at both coarse and fine levels.
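For illustration, one row of this format could be represented roughly as follows; all field names and values below are placeholders rather than an actual record from the corpus.

```python
# Illustrative shape of one ECS row (placeholders only, not real data)
ecs_row = {
    "dialogue_key": "dlg_000001",                  # column 1: unique dialogue key
    "sentence_ids": [1, 2, 3],                     # column 2: ordered sentence IDs
    "sentences": [                                 # column 3: JSON list of utterances
        {"id": 1, "text": "...", "member_type": 1},   # 1 = customer
        {"id": 2, "text": "...", "member_type": 3},   # 3 = automatic AI customer service
        {"id": 3, "text": "...", "member_type": 2},   # 2 = customer service staff
    ],
    "category": {"coarse": "...", "fine": "..."},  # column 4: coarse and fine labels
}
```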

We adopt two widely used evaluation metrics, accuracy and F1-score, to evaluate the performance of DialGNN. We use categorical cross-entropy as the loss function on both datasets. The learning rate is set to \(5\times 10^{-5}\) and the batch size to 128. The multi-head graph attention network has 3 layers, 2 heads, and 1024 hidden units. For embeddings, we use 768-dimensional BERT and Chinese RoBERTa embeddings as word embeddings. To increase training and inference efficiency, we adopt mini-batch processing, handling a subset of the data simultaneously during training; this allows parallel computation across different graph instances. All code for our experiments is available on GitHub (https://github.com/821code/DialGNN).
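For reference, the settings above can be collected into a single configuration sketch; the key names are illustrative and do not correspond to the released code.

```python
# Hyperparameters reported in this section; key names are illustrative.
config = {
    "learning_rate": 5e-5,
    "batch_size": 128,
    "gat_layers": 3,
    "attention_heads": 2,
    "hidden_units": 1024,
    "embedding_dim": 768,                 # BERT / Chinese RoBERTa embeddings
    "loss": "categorical_cross_entropy",
}
```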

4.2 Baselines

To evaluate the performance of our proposed framework, we compare it with several baseline models that use different sequence encoders.

  • TextRNN [29] is a type of recurrent neural network that can handle text data while taking the order of words into account. TextRNN recursively passes the output of the previous time step as the input of the current time step, thereby transferring context information to the next time step.

  • TextRNN-Att [30] is a text classification model that combines recurrent neural networks (RNNs) and attention mechanisms. TextRNN-Att uses a bidirectional RNN to encode the input text into hidden states, and then applies an attention layer to aggregate the hidden states into a sentence representation. The attention layer assigns different weights to different parts of the text, depending on their relevance to the classification task.

  • TextCNN [31] short for Text Convolutional Neural Network, is a deep learning model designed for text classification and sentiment analysis tasks. TextCNN can handle variable-length sentences and learn complex semantic features from the text.

  • CNN-LSTM [32] is a widely-used model consisting of regional CNN and LSTM. By combining the regional CNN and LSTM components, the CNN-LSTM model can leverage both the local spatial information captured by the CNN and the sequential dependencies captured by the LSTM. This hybrid approach allows the model to effectively extract meaningful features from input data and capture complex relationships within sequential data.

  • BERT [25] is a transformers-based language model, which is pre-trained on large-scale corpus and has achieved remarkable success in many NLP tasks. The use of transformers in BERT enables it to capture contextual dependencies in a more comprehensive manner. The transformer architecture utilizes attention mechanisms to weigh the importance of different words in a sentence based on their relevance to each other. This attention mechanism allows BERT to consider the entire context when representing a word, rather than just relying on its immediate neighbors.

  • RoBERTa [33] is a robustly optimized version of BERT, a pre-trained language model that uses bidirectional transformers to learn contextual representations of text.

  • ERNIE [34] stands for Enhanced Representation through kNowledge IntEgration, which indicates its ability to incorporate various types of knowledge into the pre-training process of language models.

  • Han [7] is a hierarchical attention network with two levels of attention applied at the word level and the sentence level, a structure similar to our graph. The hierarchical attention allows it to effectively model relationships and dependencies between words and sentences: the model captures not only local interactions between words but also the broader interactions and contextual dependencies between sentences.

  • DAG [22] is an acronym for Directed Acyclic Graph. DAG can be used to model the structure and context of a conversation, where each node represents an utterance and each edge represents the dependency or influence between utterances. DAG can capture the information flow and the long-distance dependencies in a conversation.

  • InductGCN [35] constructs a graph based on the statistics of training documents only and represents document vectors with a weighted sum of word vectors. It then conducts one-directional GCN propagation during testing.

  • TextGCN [36] incorporates semantic information and relationships from text data by constructing a text graph and applying graph convolution operations. It can perform text classification without the need for external embeddings, making it a valuable approach for specific text classification tasks.

  • AttentionXML [37] introduced an attention mechanism and a probabilistic label tree (PLT). Attention mechanism ensures that the model captures the subtle and context-dependent associations between text and labels, enhancing classification accuracy.

4.3 Results and Analysis

4.3.1 DialGNN Comparing with Baseline Methods

Table 4 presents the performance of different models on the two datasets, CM and ECS, for dialogue classification. Notably, DialGNN(BERT) denotes the combination of the BERT model with the DialGNN framework and is subsequently referred to simply as DialGNN.

On the CM dataset, DialGNN demonstrates remarkable performance with 70.2% accuracy and 59.3% F1 score. This outcome underscores the DialGNN structure’s innate capacity to capture intricate linguistic patterns embedded within dialogues. Meanwhile, the ECS dataset reveals similar excellence, with an accuracy rate of 60.3% and an equally robust F1 score of 54.9%. This success can be primarily attributed to the indispensable role played by the DialGNN structure in facilitating the modeling of intricate dependencies and contextual information across the sequence of dialogue turns.

The results obtained underscore the inherent limitations of conventional baseline models such as TextRNN, TextCNN, and CNN-LSTM. These models, rooted in sequential processing architectures, consistently display comparatively inferior performance on both the CM and ECS datasets.

Table 4 Comparison with baseline methods on CM and ECS datasets

Compared to BERT, RoBERTa, and ERNIE, DialGNN performs better due to its heterogeneous graph architecture designed for dialogue classification, especially in understanding the flow of conversations, tracking changing topics, and capturing user intents that evolve across sentences.

Furthermore, the underperformance of the DAG model on the ECS dataset can be attributed to a fundamental mismatch between the model's design and the dialogue classification task: the DAG model primarily operates at the sentence level, focusing on classifying individual utterances, while the datasets in question require dialogue-level classification.

InductGCN's performance on both the CM and ECS datasets is lower than DialGNN's. Its limitations may stem from a reduced capacity to capture the nuanced linguistic patterns, contextual dependencies, and user intents in dialogues, which are pivotal for dialogue classification tasks.

Dialogue text contains interactions and transitions between speakers, and TextGCN's design does not explicitly address these complex relationships. It tends to process individual sentences or text blocks and may not adequately capture the context and relationships between speakers, limiting its ability to understand and model the overall semantics of the dialogue.

As for AttentionXML, if there are weak label correlations or label relationships that have minimal impact on classification in a given dataset, the model might not fully leverage the advantages of the attention mechanism. AttentionXML relies on learning label relationships to better capture the relationships between text and labels. If these relationships are not significant in the dataset, the model’s performance could be constrained.

Crucially, DialGNN incorporates nodes at different levels of semantic granularity, allowing for flexible integration of various pre-trained language representation models. The use of BERT within the graph structure endows DialGNN with powerful text representational capabilities.

To further validate the generalizability of DialGNN, Table 5 presents the comparison results of different sequence encoders with and without DialGNN. For all baseline models, combining them with DialGNN achieves significant improvements. Even over the strong BERT baseline, DialGNN gains 5.5% and 6.4% in F1 score on the CM and ECS datasets, respectively.

Table 5 The performance comparisons of baseline models and combining DialGNN

As shown in Table 5, models with DialGNN-seg perform better on the ECS dataset. DialGNN-seg refers to a technique for handling overly long dialogues. The BERT model restricts the maximum input sequence length to 512 tokens for computational reasons. For a dialogue from the ECS dataset with more than 512 tokens, we truncate it to 512 tokens in the basic DialGNN setting. In the DialGNN-seg setting, we instead obtain the initial embeddings by sliding a 512-token context window over the dialogue, so DialGNN-seg incorporates more contextual information into the node embeddings and achieves better performance.
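A sketch of this windowing idea is shown below, assuming a Hugging Face tokenizer and BERT model as in Section 3.2. The stride of 256 tokens and the mean over window-level [CLS] vectors are assumptions for illustration; the paper does not specify how the window outputs are combined into node embeddings.

```python
import torch

@torch.no_grad()
def sliding_window_encoding(dialogue_text, tokenizer, bert, window=512, stride=256):
    """Encode a long dialogue with overlapping 512-token windows instead of truncating it."""
    ids = tokenizer(dialogue_text, add_special_tokens=False)["input_ids"]
    step = window - 2                                  # leave room for [CLS] and [SEP]
    chunks = [ids[i:i + step] for i in range(0, max(len(ids), 1), stride)]
    vecs = []
    for chunk in chunks:
        piece = [tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]
        out = bert(torch.tensor([piece])).last_hidden_state[:, 0]   # window-level [CLS] vector
        vecs.append(out)
    return torch.cat(vecs, dim=0).mean(dim=0)          # one vector summarizing all windows
```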

4.3.2 Comparisons on Graph Designs

Table 6 displays the performance evaluation of various graph designs that utilize pre-trained models on the CM dataset. The comparison groups consist of different design variations, including those with context relation modeling (specifically DialogueGCN, which requires a substantial amount of GPU memory and therefore uses BERT-tiny as its base model), asynchronous initialization, and designs without a dialogue node.

The results of the comparison reveal that alternative graph designs tend to compromise the quality of the latent representations provided by pre-trained models. This suggests that the mentioned designs are not able to effectively capture and incorporate the contextual relationships present in the data. In particular, the performance metrics indicate that these alternative designs result in a degradation of the pre-trained model’s ability to represent and understand the underlying patterns in the China Mobile Dataset.

These findings highlight the importance of preserving the latent representation quality obtained from pre-trained models when designing graph structures for natural language processing tasks. They suggest that the design choices in DialGNN, namely avoiding explicit context relation modeling and asynchronous initialization and including a dialogue node, are important for maintaining the integrity and effectiveness of the pre-trained models on the China Mobile Dataset. With these design elements, the model can leverage the full potential of the pre-trained representations and achieve better performance in capturing the intricacies of the dataset.

Table 6 The results of Different Graph Designs

4.3.3 Ablation Study

To validate the contribution of each component, a series of experiments is designed to observe the performance; the results are summarized in Table 7, where "w/o" indicates that the component is not included in the model.

Table 7 The results of Ablation study on CM Dataset

The results presented in the table demonstrate that the TF-IDF initialization of edge weights significantly enhances the overall performance of the system. This initialization technique, which utilizes the TF-IDF algorithm, provides a valuable foundation for establishing the weights of connections between nodes in the graph structure.

Moreover, both the sentence-to-word updating step and the word-to-sentence updating step have been found to play crucial roles in the functionality of the DialGNN system. These steps are integral in facilitating the flow of information and the exchange of knowledge between sentences and words in the graph. By ensuring a bidirectional and iterative updating process, these steps enable the system to capture and incorporate relevant information from both sentence-level and word-level representations.

The findings underscore the significance of each component in the overall performance of the system. The TF-IDF initialization contributes substantially to the quality of the edge weights, enhancing the accuracy and effectiveness of the system. Additionally, the sentence-to-word updating step and the word-to-sentence updating step are deemed essential for the optimal functioning of DialGNN, enabling seamless information propagation and integration between sentence and word representations.

4.4 Case Study

Fig. 3

Examples of the case study on the CM and ECS datasets. Keywords in the different examples are marked in color

Drawing upon the preceding model analysis, a case study was conducted using samples extracted from the datasets. As depicted in Fig. 3, Table 8 presents the classification outcomes achieved by DialGNN in comparison with other models. Remarkably, DialGNN classifies all of the examples correctly. Its superiority lies in its capacity to go beyond feature-based word-level classification: it comprehends the nuanced semantics of words within the broader context and the underlying dialogue structure.

Table 8 Classification results of different models for Examples. Bold indicates correct classification

Referring to Fig. 3, Example 1 showcases a distinctive dialogue structure. In the conversation, the customer and the customer service staff initially hold different interpretations of the event, which evolve over the course of the discussion. Inspired by BERT, DialGNN introduces a "0th sentence node" as the dialogue node. This addition allows the model to integrate various discourse features across the entire dialogue, leading to a more comprehensive understanding at the macro level. In this case, DialGNN can synthesize the customer's descriptions of the event and thus obtain a deeper understanding. Consequently, it identifies the correct label even when faced with conflicting opinions, which distinguishes it from other models that rely primarily on direct opinion information.

Moving on to Example 2, while other models may be influenced by the frequency of terms like ‘July’, ‘Bill’, ‘Settlement’, and quantities mentioned in intermediate exchanges, they could interpret the core intent as a general ‘Business Processing’. In contrast, DialGNN leverages its dialog structure representation through a graph and an attention mechanism. It recognizes that the conversation’s beginning and end are critical segments since they mark the initiation and conclusion of inquiries. While other models may overlook these subtle cues, DialGNN focuses on the dialog’s core intent concentration areas and accurately identifies it as an ‘Invoice Reissuance’ issue. This underscores DialGNN’s strength in capturing contextual nuances within dialogues.

Finally, Example 3 showcases a dialogue containing vital contextual semantics. While other models tend to reinforce associations between strongly characterized words like ‘move’, ‘area’, and ‘location’ and tagged words such as ‘install’ and ‘moving’, DialGNN takes a distinctive approach. It capitalizes on the powerful contextual semantic features embedded within pre-trained models, seamlessly integrating information across sentences and words through interactive updates within its graph structure. Unlike other models that may categorize based on individual word characteristics, DialGNN excels in understanding words within the context and dialogue background. This deep comprehension is evident in this case, where it discerns the customer’s dissatisfaction with the business office from the event description and interactions with customer service.

5 Conclusion

In summary, our work introduces the innovative DialGNN framework, which leverages heterogeneous graph neural networks to gain a deeper understanding of multi-turn dialogues. Our framework offers versatile compatibility with various encoders and demonstrates the potential to enhance their performance, even when used in conjunction with pre-trained language models. The extensive array of experiments conducted showcases the efficacy of DialGNN in the context of dialogue understanding. The ability of DialGNN to capture nuanced linguistic patterns, contextual dependencies, and evolving user intents across dialogues sets it apart as a robust and adaptable framework with the potential to advance a myriad of real-world applications. Our study serves as a pivotal step in the ongoing evolution of dialogue systems, paving the way for enhanced dialogue comprehension, customer satisfaction analysis, service quality assurance, and dialogue topic categorization.