DialGNN: Heterogeneous Graph Neural Networks for Dialogue Classification

Dialogue systems have attracted growing research interest due to their widespread applications in various domains. Existing studies on classification tasks in dialogue systems mainly focus on sentence-level intent recognition of users' utterances. In real-world applications, however, classification of the entire dialogue also benefits many downstream tasks, such as customer satisfaction analysis, service quality assurance, and dialogue topic categorization. In this paper, we propose DialGNN, a heterogeneous graph neural network framework tailored for the problem of dialogue classification, which takes the entire dialogue as input. Specifically, a heterogeneous graph is constructed with nodes at different levels of semantic granularity. The graph framework allows flexible integration of various pre-trained language representation models, such as BERT and its variants, which endows DialGNN with powerful text representation capabilities. Experimental results on two real-world datasets demonstrate the robustness and effectiveness of the proposed DialGNN framework. The implementation of DialGNN and related data are shared at https://github.com/anonymous-auth/DialGNN.


Introduction
In recent years, dialogue systems have been widely applied in customer service, online health consultation, chatbots, etc. Dialogue classification, which aims at assigning predefined labels to an entire dialogue, is a fundamental task for many applications, including dialogue theme recognition, customer satisfaction analysis, and service quality assurance [1].
Most existing research on classification in dialogue systems focuses on the intent of users in each turn within a dialogue [2,3]. These methods, which take a sentence-level user utterance as input and output the predicted intent, are not appropriate for classifying entire dialogues at the document level, because sentences in dialogues are meant to be understood with the help of the context of all messages in the dialogue. This dependence on extended context requires that the classification process regard a large block of utterances as input, which should be classified as a whole [4].

Fig. 1 An example dialogue from telecom customer service, labeled as "consulting" rather than "business cancelling".
An intuitive solution to the above problem is to treat the whole dialogue as a document and use document classification methods. These methods either concatenate sentences into a long sequence [5,6] or combine them hierarchically [7]. The main challenge is that a dialogue may contain multiple semantic topics, some of which are irrelevant to the business of the application task. Such irrelevant topics, regarded as noise, may be meaningless or misleading for classification models. For example, in Fig. 1, the customer is consulting about whether the 30-yuan data package has been cancelled. The ground-truth category should be business consultation. However, the cancellation topic mentioned in the dialogue might mislead models into predicting the business cancellation category. Existing models can therefore hardly identify the noise in dialogues and determine the accurate categories.
We propose DialGNN, a generic framework based on heterogeneous graph neural networks for document-level dialogue classification. First, a heterogeneous graph is constructed for each dialogue to represent the latent relationships among the sentences and words within the dialogue. The sentences and words in each dialogue are regarded as nodes of different types in the graph. Then we combine graph neural networks and pre-trained language models to learn latent representations of the nodes and edges in the dialogue graph. During message passing over the graph, the representations of word nodes and sentence nodes are updated together, which helps to learn more implicit relationships among words and sentences. To validate the effectiveness, we conduct a set of experiments on a public dataset and an e-commerce customer service dataset contributed by ourselves. The comparison results show that the proposed method outperforms the state-of-the-art methods.

Dialogue Classification
Dialogue classification involves assigning predefined labels to dialogues or their segments, such as utterances or turns, based on their functional or intentional significance within the conversation [8,9].
Most existing studies on dialogue classification focus on sentence-level or utterance-level intent recognition of user statements [10]. These studies commonly employ hierarchical neural networks to model the sequential and structural information within words, characters, and utterances [11]. However, these approaches fail to explicitly account for the transition of speakers during the dialogue, which can affect the interpretation of dialogue acts. For instance, when speaker A poses a question, the subsequent utterance from speaker B is more likely to be an answer; conversely, if the speaker remains the same, the following act is less likely to be an answer.
Tavabi [12] proposed integrating the turn changes among speakers when modeling dialogue acts. They learned conversation-invariant speaker turn embeddings to represent the speaker turns in a conversation; the learned speaker turn embeddings were then merged with the utterance embeddings for the downstream task of dialogue act classification. They showed that their model outperformed several baselines on three public benchmark datasets.
Another challenge for dialogue classification is that a dialogue may contain multiple semantic topics, some of which are irrelevant to the primary objective or business task of the application [13]. This complexity arises from the natural flow of conversation, where participants may introduce unrelated or tangential subjects alongside the main focus of the dialogue. Consequently, accurately classifying dialogues requires the ability to identify and filter out the irrelevant topics, ensuring that the assigned labels reflect the pertinent information and align with the specific objectives of the application.
Kumar [14] addressed this problem by augmenting small data to classify contextualized dialogue acts for exploratory visualization. They collected a new corpus of conversations, CHICAGO-CRIME-VIS, geared towards supporting data visualization exploration, and annotated it for a variety of features, including contextualized dialogue acts. They applied data augmentation techniques such as paraphrasing and back-translation to the training data to increase its diversity and robustness. They ran experiments with different classifiers and found that conditional random fields outperformed the other methods.
Guo [15] recognized the importance of removing redundant information from dialogue text and thus adopted a long-text segmentation method based on resampling, which also addresses the limitation on BERT's input length.

Heterogeneous Graph Network
For news classification, Kang [16] proposed a heterogeneous graph called the News Classification Graph to represent the relationships between multiple news articles, such as their relevance in time, place and people. Moreover, they proposed the Joint Heterogeneous graph Network (JHN) to properly embed the News Classification Graph.
For aspect-based sentiment analysis, aiming to capture the sentiment relationships among aspect terms, Niu [17] constructed a heterogeneous graph that models inter-aspect relationships and aspect-context relationships simultaneously.
To combine multiple aspects of a review and make use of the links between a sentence and its words, Yang [18] proposed a dual-level attention-based heterogeneous graph convolutional network, including node-level and type-level attention.
For short text classification, Yang [19] proposed a word-concept heterogeneous graph convolutional network to avoid treating introduced concepts as noise and to learn representations with interactive information. Kong [20] considered the lack of labeled data and, adopting an uncertainty-aware mechanism, proposed a heterogeneous graph attention network. Furthermore, the lack of context, the sparsity of short-text features, and the inability of word embeddings and external knowledge bases to supplement short-text information are additional challenges for short text classification. Aiming to improve classification accuracy and reduce computational difficulty, Zhang [21] built a text-, word- and POS-tag-based graph convolutional network that does not require pre-trained word embeddings as initial node features.

Methodology
DialGNN encompasses three essential modules that collectively contribute to its functionality: DialGraph Construction, Node Representation, and Heterogeneous Graph Network, as illustrated in Fig. 2.
The DialGraph Construction module plays a crucial role by transforming a given dialogue into a heterogeneous graph. This graph captures the intricate relationships among words, sentences, and the overall dialogue structure. By representing the dialogue in this manner, DialGNN gains a comprehensive understanding of its underlying dynamics.
The Node Representation module within DialGNN undertakes the task of initializing the node representations within the DialGraph.This is achieved by employing BERT-based embeddings, which are pre-trained contextual representations capable of capturing rich semantic information.Through this initialization process, the Node Representation module equips the graph with meaningful and informative node representations.
Fig. 2 The architecture of the proposed framework.
The final module, the Heterogeneous Graph Network, is responsible for encoding the heterogeneous graphs generated by DialGraph Construction. It employs graph attention networks to capture relevant dependencies and interactions among nodes within the graph. By updating the node representations based on these learned relationships, the Heterogeneous Graph Network module enhances the graph's ability to handle downstream tasks effectively. The following subsections describe each module in detail.

DialGraph Construction
There have been several efforts to convert a dialogue into a topological graph [22]. They mostly regard each sentence as a node and construct a homogeneous graph where the edges between nodes are formed from contextual relations; that is, only sentences within a fixed window size are connected. Such methods might fail to capture relations among distant sentences and may ignore the impact of words that significantly contribute to the predicted categories.
To this end, we construct a heterogeneous graph named DialGraph with word nodes, sentence nodes and a dialogue node. The edges between sentence nodes and word nodes represent containment relations. More implicit relations among different sentences can then be derived from the relations between sentences and words, such as co-occurrence, semantic distance, and term frequencies. Inspired by the usage of the [CLS] tag in BERT, we add a 0-th sentence node as the dialogue node, rather than using a pooling layer over the sentence node embeddings. Formally, the heterogeneous graph DialGraph is defined as follows.
Given a dialogue $C = \{s_1, s_2, \ldots, s_n\}$, the DialGraph is denoted as $G = \{V, E\}$, where $V = V_w \cup V_s \cup V_c$ and $E = \{e_{10}, e_{11}, \ldots, e_{mn}\}$ represent the node set and the edge set, respectively. Here, $V_w = \{w_1, w_2, \ldots, w_m\}$ denotes the $m$ unique words, $V_s$ corresponds to the $n$ sentences, and $V_c$ is the dialogue node. $E$ is a real-valued edge weight matrix, and $e_{ij}$ ($i \in [1, m]$, $j \in [0, n]$) indicates that the $j$-th sentence contains the $i$-th word. Note that the dialogue node $V_c$ connects to all word nodes.
The node updates in DialGNN are determined by considering the features of neighboring nodes and the associated edge weights.In this regard, the word nodes update their representations based on the features and edge weights of the corresponding sentence nodes.Similarly, the sentence nodes update their representations by considering the features and edge weights of the word nodes connected to them.Furthermore, the dialogue nodes update their representations by incorporating the features and edge weights of the sentence nodes connected to them.This approach ensures that the node representations in DialGNN are iteratively refined, taking into account the contextual information from neighboring nodes and their respective edge weights.
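The construction above can be sketched in a few lines. This is an illustrative reading of the definition only; the function and variable names (e.g. `build_dialgraph`) are not from the paper, and the real implementation would also attach features and edge weights.

```python
# Sketch of DialGraph construction: word nodes, sentence nodes 1..n, and a
# 0-th dialogue node connected to every word node. Names are illustrative.
def build_dialgraph(dialogue):
    """dialogue: list of sentences, each a list of word tokens."""
    words = sorted({w for sent in dialogue for w in sent})   # m unique words
    word_id = {w: i for i, w in enumerate(words)}
    n = len(dialogue)
    nodes = {"word": words, "sentence": list(range(1, n + 1)), "dialogue": [0]}
    # An edge (i, j) means word i occurs in sentence j (containment relation).
    edges = {(word_id[w], j) for j, sent in enumerate(dialogue, start=1) for w in sent}
    # The dialogue node (sentence index 0) connects to all word nodes.
    edges |= {(word_id[w], 0) for w in words}
    return nodes, edges

nodes, edges = build_dialgraph([["hi", "there"], ["hi", "again"]])
```

Because every sentence connects to its words and the dialogue node connects to all words, any two sentences are at most two hops apart, which is how long-distance sentence relations become reachable.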

Node Representation
We denote $X_w \in \mathbb{R}^{m \times d_w}$, $X_s \in \mathbb{R}^{n \times d_s}$, and $X_c \in \mathbb{R}^{1 \times d_c}$ as the input feature matrices of the word, sentence, and dialogue nodes, respectively. Here, $d_w$, $d_s$, and $d_c$ refer to the dimensions of the word embeddings, sentence representation vectors, and dialogue representation vectors, respectively.
Here, we use BERT-based [23] embeddings to get the initialized representations of the words, the sentences and the dialogue.Note that other embedding models and other pre-trained language models can also be utilized.
To incorporate the varying importance of relationships between nodes, we employ TF-IDF (Term Frequency-Inverse Document Frequency) values to initialize the weights of the edges.TF-IDF is a statistical measure commonly used in natural language processing to evaluate the significance of a term in a document relative to a collection of documents.By assigning TF-IDF values as the edge weights, we can capture the importance of the connections between nodes in the graph structure.
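The edge weighting can be illustrated with a standard TF-IDF formulation in which each sentence is treated as a document. This is a sketch under that assumption; the paper does not specify the exact TF-IDF variant, and `tfidf_edge_weights` is an illustrative name.

```python
import math
from collections import Counter

# Standard tf * idf over sentences-as-documents, with add-one smoothing in
# the idf term. One weight per (word, sentence) edge of the DialGraph.
def tfidf_edge_weights(dialogue):
    """dialogue: list of sentences (lists of tokens). Returns {(word, sent_idx): weight}."""
    n = len(dialogue)
    # Document frequency: in how many sentences does each word appear?
    df = Counter(w for sent in dialogue for w in set(sent))
    weights = {}
    for j, sent in enumerate(dialogue):
        tf = Counter(sent)
        for w, c in tf.items():
            weights[(w, j)] = (c / len(sent)) * math.log((1 + n) / (1 + df[w]))
    return weights

w = tfidf_edge_weights([["a", "b", "a"], ["b", "c"]])
```

Words occurring in every sentence (like "b" above) get weight zero, which is exactly the noise-suppressing behavior the edge weights are meant to provide.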

Heterogeneous Graph Network
Given the constructed DialGraph with node features X w ∪ X s ∪ X c , we leverage the graph attention networks [24] to update the representations of nodes.
We refer to $h_i \in \mathbb{R}^{d_h}$, $i \in [0, m+n]$, as the hidden states of the input nodes. The graph attention (GAT) layer is designed as follows:

$$z_{ij} = \mathrm{LeakyReLU}\left(W_a [W_q h_i \,;\, W_k h_j]\right), \tag{1}$$

$$\alpha_{ij} = \frac{\exp(z_{ij})}{\sum_{l \in \mathcal{N}_i} \exp(z_{il})}, \qquad u_i = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W_v h_j\Big),$$

where $W_a$, $W_q$, $W_k$, $W_v$ are learnable linear transformation matrices and $\alpha_{ij}$ is the attention weight between $h_i$ and $h_j$. The multi-head attention can be denoted as

$$u_i = \big\Vert_{k=1}^{K} \, \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^k W_v^k h_j\Big).$$

Furthermore, we add a residual connection to avoid vanishing gradients, so the final output is

$$h_i' = u_i + h_i.$$

Besides, we modify the GAT layer to infuse the scalar edge weights $e_{ij}$, which are mapped to multi-dimensional embeddings. Hence, Equation 1 is modified as

$$z_{ij} = \mathrm{LeakyReLU}\left(W_a [W_q h_i \,;\, W_k h_j \,;\, e_{ij}]\right).$$

After each GAT layer, we introduce a feed-forward network (FFN) with two linear projection layers, as in the Transformer [25].
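The attention computation can be sketched numerically: scores from a learned vector applied to the concatenated query/key projections plus an edge embedding, a softmax over neighbors, a weighted sum of values, and a residual connection. A single-head, dense-neighborhood NumPy sketch follows; it ignores sparsity, multi-head concatenation, and the output nonlinearity for brevity, and the names are illustrative.

```python
import numpy as np

# Minimal single-head GAT-style update with infused edge embeddings.
def gat_update(h_q, h_kv, edge_emb, Wq, Wk, Wv, Wa):
    """h_q: (nq, d) queries; h_kv: (nk, d) keys/values; edge_emb: (nq, nk, de)."""
    q, k, v = h_q @ Wq, h_kv @ Wk, h_kv @ Wv
    nq, nk = q.shape[0], k.shape[0]
    # z_ij = LeakyReLU(W_a [q_i ; k_j ; e_ij]) for every query/key pair
    cat = np.concatenate(
        [np.repeat(q[:, None, :], nk, 1),
         np.repeat(k[None, :, :], nq, 0),
         edge_emb], axis=-1)
    z = cat @ Wa                                  # (nq, nk) raw scores
    z = np.where(z > 0, z, 0.2 * z)               # LeakyReLU
    a = np.exp(z - z.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)             # softmax over neighbors j
    return a @ v + h_q                            # attended values + residual

rng = np.random.default_rng(0)
d, de = 4, 2
h_q, h_kv = rng.normal(size=(3, d)), rng.normal(size=(5, d))
edge_emb = rng.normal(size=(3, 5, de))
out = gat_update(h_q, h_kv, edge_emb,
                 np.eye(d), np.eye(d), np.eye(d), np.zeros(2 * d + de))
```

With a zero scoring vector the attention is uniform over neighbors, so the output reduces to the mean of the value vectors plus the residual, a useful sanity check.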

Training and Optimization
During the training stage, the representations of the dialogue node, the sentence nodes and the word nodes are updated alternately. Since the dialogue node can be regarded as the 0-th sentence connected to all words, the process of updating the dialogue node is the same as that of updating the sentence nodes. Thus, one training iteration includes a word-to-sentence update process and a sentence-to-word update process.
In the word-to-sentence update process, the dialogue node and the sentence nodes are updated in the $t$-th iteration based on their connected word nodes via the GAT and FFN layers:

$$U_{s \leftarrow w}^{t+1} = \mathrm{GAT}(H_s^t, H_w^t, H_w^t), \qquad H_s^{t+1} = \mathrm{FFN}(U_{s \leftarrow w}^{t+1} + H_s^t),$$

where $H_w^0 = X_w$, $H_s^0 = X_s$ and $U_{s \leftarrow w}^1 \in \mathbb{R}^{n \times d_h}$. $\mathrm{GAT}(\cdot)$ denotes that $H_s^t$ is used as the attention query and $H_w^t$ as the key and value. Then, in the sentence-to-word update process, the word nodes are updated through the new dialogue node and sentence nodes:

$$U_{w \leftarrow s}^{t+1} = \mathrm{GAT}(H_w^t, H_s^{t+1}, H_s^{t+1}), \qquad H_w^{t+1} = \mathrm{FFN}(U_{w \leftarrow s}^{t+1} + H_w^t).$$
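The alternating schedule itself is simple and can be sketched independently of the attention internals. Here `gat_ffn` is a hypothetical callable standing in for one GAT-plus-FFN pass (query first, keys/values second); only the update order is the point of this sketch.

```python
# Alternating node updates: sentences (incl. the 0-th dialogue node) attend
# over words, then words attend over the freshly updated sentences.
def iterate(H_s, H_w, gat_ffn, t_steps=2):
    """H_s: sentence/dialogue node states; H_w: word node states."""
    for _ in range(t_steps):
        H_s = gat_ffn(H_s, H_w)   # word-to-sentence pass
        H_w = gat_ffn(H_w, H_s)   # sentence-to-word pass, uses new H_s
    return H_s, H_w

# Toy stand-in for gat_ffn on scalar "states", just to trace the schedule.
f = lambda q, kv: [x + sum(kv) for x in q]
H_s, H_w = iterate([1.0], [2.0], f, t_steps=2)
```

Note that the word update consumes the already-updated sentence states within the same iteration, which is what makes the two passes alternating rather than simultaneous.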
Finally, classification of the dialogue node determines the label of the whole dialogue, and cross-entropy loss is used to optimize the model [26].

Experiments
In this section, we perform several experiments to assess and analyze the effectiveness of our proposed dialogue classification approach. Our objectives are to address the following research questions:

• How does our approach compare with existing methods on the dialogue classification task? (Section 4.3)
• How does the heterogeneous graph information affect the dialogue classification performance? (Section 4.4)
• What are the contributions of each component in our approach? (Section 4.5)

We use two datasets for our experiments: the China Mobile Dataset (CM) and the E-commerce Customer Service Dataset (ECS). CM is a dataset of phone-call dialogues between customers and service staff, where the goal is to identify the business type requested by the customers. ECS is a dataset of online chat dialogues between customers and sellers, staff or AI systems, where the goal is to classify the dialogue acts or emotions. We describe these datasets in more detail in Section 4.1.
We also compare our approach with several baseline models that use different sequence encoders, such as CNN-LSTM, Han and BERT.We describe these models in more detail in Section 4.2.

Datasets
China Mobile Dataset (CM). This dataset assumes a scenario in which customer service staff answer phone calls from different customers. The aim is to determine which business type is actually being requested in a call, given the whole dialogue history. The contents are ASR transcripts of customer service phone calls, and the labels are pre-defined business types. The dataset contains 19,784 labeled conversation segments with 37 different human-machine dialogue intent categories. Table 1 shows the business and conversation intention types.

Complaining: uninformed customization, business usage, business processing, dissatisfaction with business regulations, information security, network problems, marketing problems, cost problems, etc.

E-commerce Customer Service Dataset (ECS)
ECS is contributed by ourselves to the community. The dialogues take place between a customer and a seller, a staff member, or an AI system. The user goal is relatively straightforward: to complain about an unsatisfactory experience. The labels are event types, such as malicious refunding, counterfeiting, and rights infringement. The statistics of the two datasets are given in Table 2. Table 3 displays a representative ECS sample, where each column represents a distinct element. The first column serves as a unique dialogue key. The second column contains the sequence of sentence IDs associated with the dialogue. The third column comprises a JSON list with keys such as "id" (the JSON ID linked to the sequence), "text" (the sentence content), and "member type" (1 for customer, 2 for customer service, 3 for automatic AI customer service). The fourth column indicates the dialogue category, classified into coarse and fine levels.
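The record layout described above can be read into a structured form along these lines. The function and field names are illustrative (the released data may use different keys); only the four-column layout and the member-type codes follow the description.

```python
import json

# Hypothetical parser for one ECS-style row: (key, sentence ids, JSON turn
# list, category). Member-type codes follow the description in the text.
def parse_record(dialogue_key, sentence_ids, turns_json, category):
    turns = json.loads(turns_json)
    speaker = {1: "customer", 2: "customer service", 3: "AI customer service"}
    return {
        "key": dialogue_key,
        "category": category,
        "turns": [
            {"id": t["id"], "text": t["text"], "speaker": speaker[t["member type"]]}
            for t in turns
        ],
    }

rec = parse_record(
    "d001", [1, 2],
    '[{"id": 1, "text": "hello", "member type": 1},'
    ' {"id": 2, "text": "hi", "member type": 2}]',
    "counterfeit")
```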

Baselines
For deeper insight into the effectiveness of the proposed framework, we choose several baseline models as sequence encoders.
• CNN-LSTM [27] is a widely-used model consisting of regional CNN and LSTM.By combining the regional CNN and LSTM components, the CNN-LSTM model can leverage both the local spatial information captured by the CNN and the sequential dependencies captured by the LSTM.This hybrid approach allows the model to effectively extract meaningful features from input data and capture complex relationships within sequential data.
• Han [7] is a hierarchical attention network containing two levels of attention mechanisms, applied at the word level and the sentence level, a hierarchy similar to our graph structure. The hierarchical attention mechanisms allow Han to effectively model relationships and dependencies between words and sentences: the model captures not only the local interactions between words but also the broader interactions and contextual dependencies between sentences.
• BERT [23] is a Transformer-based language model, pre-trained on a large-scale corpus, that has achieved remarkable success in many NLP tasks. The Transformer architecture utilizes attention mechanisms to weigh the importance of different words in a sentence based on their relevance to each other. This allows BERT to consider the entire context when representing a word, rather than relying only on its immediate neighbors.

Main Results
Table 4 presents the comparison results of different sequence encoders with and without DialGNN. For all baseline models, combining with DialGNN achieves significant improvements. Even on the strong baseline BERT, DialGNN gains 5.5% and 6.4% F1 on the CM and ECS datasets, respectively. As shown in Table 4, models with DialGNN-seg perform better on the ECS dataset. DialGNN-seg refers to a technique for handling overly long dialogues. The BERT model restricts the maximum input sequence length to 512 tokens due to computational constraints. For a dialogue from the ECS dataset with more than 512 tokens, we truncate it to 512 tokens in the basic DialGNN setting. In the DialGNN-seg setting, we instead obtain the initial embeddings by sliding a context window of 512 tokens, so DialGNN-seg incorporates more contextual information into the node embeddings and achieves better performance.
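The DialGNN-seg windowing can be sketched as follows. The paper specifies the 512-token window but not the stride, so the overlap of 256 tokens here is an assumption, and the function name is illustrative.

```python
# Sliding-window segmentation for dialogues exceeding BERT's 512-token limit.
# Each chunk is encoded separately; overlapping windows preserve context
# around chunk boundaries. The stride value is an assumption.
def sliding_windows(tokens, window=512, stride=256):
    if len(tokens) <= window:
        return [tokens]          # short dialogues need no segmentation
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break                # this chunk already reaches the end
        start += stride
    return chunks

chunks = sliding_windows(list(range(1000)), window=512, stride=256)
```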

Comparisons on Graph Designs
Table 5 displays the performance of various graph designs that utilize pre-trained models on the China Mobile Dataset. The comparison groups consist of different design variations, including designs with explicit context relation modeling (specifically DialogueGCN, which requires a substantial amount of GPU memory, with BERT-tiny as the base model), designs with asynchronous initialization, and designs without a dialogue node.
The comparison reveals that these alternative graph designs tend to compromise the quality of the latent representations provided by the pretrained models: their performance on the China Mobile Dataset degrades relative to DialGNN. These findings highlight the importance of preserving the latent representation quality of pretrained models when designing graph structures for natural language processing tasks, and they support DialGNN's design choices, which forgo explicit context relation modeling in favor of synchronous initialization and a dedicated dialogue node. With these elements, the model can leverage the full potential of the pretrained representations and better capture the intricacies of the dataset.

Ablation study
To validate the contribution of each component, a series of experiments are designed to observe the performances and the results are summarized in Table 6.
The results in the table show that the TF-IDF initialization of the edge weights significantly enhances overall performance: it provides a valuable foundation for establishing the weights of the connections between nodes in the graph.

Moreover, both the sentence-to-word and the word-to-sentence updating steps play crucial roles in DialGNN. These steps facilitate the bidirectional, iterative flow of information between sentences and words in the graph, enabling the model to capture and incorporate relevant information from both sentence-level and word-level representations.

Together, these findings underscore the contribution of each component to the overall performance of the system.

Table 2
The statistics of CM and ECS datasets.

Table 3
Samples of ECS dataset

Table 4
The performance comparisons of baseline models and combining DialGNN.

Table 5
The results of Different Graph Designs

Table 6
The results of Ablation study on CM Dataset.