1 Introduction

Text classification aims to assign a given document to certain pre-defined classes and is considered a fundamental task in natural language processing (NLP). It covers a large number of downstream tasks, such as topic classification (Zhang et al. 2015) and sentiment analysis (Tai et al. 2015). Traditional text classification methods build text representations using N-grams (Cavnar et al. 1994) or Term Frequency-Inverse Document Frequency (TF-IDF) (Hakim et al. 2014) and apply traditional machine learning models, such as SVM (Joachims 2005), to classify the documents. With the development of neural networks, more deep learning models have been applied to text classification, including convolutional neural networks (CNN) (Kim 2014), recurrent neural networks (RNN) (Tang et al. 2015), attention-based models (Vaswani et al. 2017) and large language models (Devlin et al. 2018).

However, these methods either cannot handle the complex relationships between words and documents (Yao et al. 2019) or cannot efficiently capture context-aware word relations (Zhang et al. 2020). Graph neural networks (GNNs) have been introduced to resolve such obstacles. GNNs operate on graph-structured data, so a graph first needs to be built for text classification. There are two main approaches to constructing graphs: a corpus is either built into one or several corpus-level graphs representing the whole corpus, or into numerous document-level graphs, each of which represents a single document. The corpus-level graph can capture the global structural information of the entire corpus, while the document-level graph can explicitly capture the word-to-word relationships within a document. Both ways of applying graph neural networks to text classification achieve good performance.

This paper mainly focuses on GNN-based text classification techniques, datasets, and their performance. The graph construction approaches for both corpus-level and document-level graphs are addressed in detail. Papers on the following aspects will be reviewed:

  • GNNs-based text classification approaches. Papers that design GNN-based frameworks to enhance the feature representation or directly apply GNNs to conduct sequence text classification tasks will be summarized, described and discussed. GNNs applied for token-level classification (Natural Language Understanding) tasks, including NER, slot filling, etc, will not be discussed in this work.

  • Text classification benchmark datasets and the performance of GNN-based models on them. The text classification datasets and the commonly used metrics adopted by GNN-based text classification models will be summarized and categorized based on task type, together with the model performance on these datasets.

1.1 Related surveys and our contribution

Before 2019, text classification survey papers (Xing et al. 2010; Khan et al. 2010; Harish et al. 2010; Aggarwal and Zhai 2012; Vijayan et al. 2017) focused on covering traditional machine learning-based text classification models. Recently, with the rapid development of deep learning techniques, (Minaee et al. 2021; Zulqarnain et al. 2020; Zhou 2020; Li et al. 2022) review the various deep learning-based approaches. In addition, some papers not only review the SoTA model architectures but also summarize the overall workflow (Jindal et al. 2015; Kadhim 2019; Mirończuk and Protasiewicz 2018; Kowsari et al. 2019; Bhavani and Kumar 2021) or specific techniques for text classification, including word embedding (Selva Birunda and Kanniga 2021), feature selection (Deng et al. 2019; Shah and Patel 2016; Pintas et al. 2021) and term weighting (Patra and Singh 2013; Alsaeedi 2020). Meanwhile, surveys have also covered emerging text classification architectures such as CNNs (Yang et al. 2016) and attention mechanisms (Mariyam et al. 2021). Owing to their powerful ability to represent non-Euclidean relations, GNNs have been used and reviewed in multiple practical fields, e.g. financial applications (Wang et al. 2021), traffic prediction (Liu and Tan 2021), bio-informatics (Zhang et al. 2021), power systems (Liao et al. 2021) and recommendation systems (Gao et al. 2022; Liang et al. 2021; Yang et al. 2021). Moreover, (Bronstein et al. 2017; Battaglia et al. 2018; Zhang et al. 2019; Zhou et al. 2020; Wu et al. 2020) comprehensively review the general algorithms and applications of GNNs, while other surveys focus on specific perspectives, including graph construction (Skarding et al. 2021; Thomas et al. 2022), graph representation (Hamilton et al. 2017), training (Xie et al. 2022), pooling (Liu et al. 2022) and more. However, only (Minaee et al. 2021; Li et al. 2022) briefly introduce certain SoTA GNN-based text classification models, and a recent short review paper (Malekzadeh et al. 2021) covers the concept of GNNs and four SoTA GNN-based text classification models. In contrast, our study focuses explicitly on applying GNN techniques to text classification tasks. We delve into various GNN-related methodologies, including graph construction, node and edge representation, and training approaches commonly employed in text classification. Unlike (Malekzadeh et al. 2021), which reviews only a limited number of models, our survey encompasses around 30 models categorised into corpus-level and document-level approaches, enabling a comprehensive analysis for comparing and contrasting them. Additionally, our study goes beyond merely examining models by providing an in-depth analysis of the metrics and datasets commonly used in GNN-based text classification tasks, aiming to offer valuable insights for future research in similar areas.

The contribution of this survey includes:

  • This is the first survey focused only on graph neural networks for text classification with a comprehensive description and critical discussion on more than twenty GNN text classification models.

  • We categorize the existing GNN text classification models into two main categories with multiple sub-categories, and the tree structure of all the models is shown in Fig. 1.

  • We compare these models in terms of graph construction, node embedding initialization, and graph learning methods. We also compare the performance of these models on the benchmark datasets and discuss the key findings.

  • We discuss the existing challenges and some potential future work for GNN text classification models.

1.2 Text classification tasks

Text classification involves assigning a pre-defined label to a given text sequence. The process typically involves encoding pre-processed raw text into numerical representations and using classifiers to predict the corresponding categories. Typical sub-tasks include sentiment analysis, topic labelling, news categorization, and hate speech detection. Specific frameworks can be extended to advanced applications such as information retrieval, summarising, question answering, and natural language inference. This paper focuses specifically on GNN-based models used for typical text classification.

  • Sentiment analysis is a task that aims to identify the emotional states and subjective opinions expressed in the input text, such as reviews, micro-blogs, etc. This can be achieved through binary or multi-class classification. Effective sentiment analysis can aid in making informed business decisions based on user feedback.

  • Topic classification is a supervised deep learning task to automatically understand the text content and classify it into multiple domain-specific categories, typically more than two. The data sources may be gathered from different domains, including Wikipedia pages, newspapers, scientific papers, etc.

  • Junk information detection involves detecting inappropriate social media content. Social media providers commonly use approaches like hate speech, abusive language, advertising or spam detection to remove such content efficiently.

1.3 Text classification development

Many traditional machine learning methods and deep learning models are selected as baselines for comparison with the GNN-based text classifiers. We mainly summarized those baselines into three types:

Traditional machine learning: In earlier years, traditional methods such as Support Vector Machines (SVM) (Zhang et al. 2011) and Logistic Regression (Genkin et al. 2007) utilized sparse representations like Bag of Words (BoW) and TF-IDF. However, recent advancements (Lilleberg et al. 2015; Yin and Jin 2015; Ren et al. 2016) have focused on dense representations, such as Word2vec, GloVe, and FastText, to mitigate the limitations of sparse representations. These dense representations are also used as inputs for sophisticated methods, such as Deep Averaging Networks (DAN) (Iyyer et al. 2015) and Paragraph Vector (Doc2Vec) (Le and Mikolov 2014), to achieve new state-of-the-art results.

Sequential models: RNNs and CNNs have been utilized to capture the local semantic and syntactic information of consecutive words from input text bodies. Upgraded models, such as LSTM (Graves 2012) and GRU (Cho et al. 2014), have been proposed to address the vanishing or exploding gradient problems of vanilla RNNs. CNN-based structures have been applied to capture N-gram features by using one or more convolution and pooling layers, such as Dynamic CNN (Kalchbrenner et al. 2014) and TextCNN (Kim 2014). However, these models can only capture the local dependencies of consecutive words. To capture longer-term or non-Euclidean relations, improved RNN structures such as Tree-LSTM (Tai et al. 2015) and MT-LSTM (Liu et al. 2015), as well as models incorporating global semantic information such as TopicRNN (Dieng et al. 2016), have been proposed. Additionally, graph-enhanced (Peng et al. 2018) and tree-structure-enhanced (Mou et al. 2015) CNNs have been proposed to learn more global and long-term dependencies.

Attentions and transformers: Attention mechanisms (Bahdanau et al. 2014) have been widely adopted to capture long-range dependencies, such as hierarchical attention networks (Abreu et al. 2019) and attention-based hybrid models (Yang et al. 2016). More attention-based text classification frameworks are summarized by (Minaee et al. 2021). Self-attention-based transformer architectures have achieved state-of-the-art performance on many text classification benchmarks via unsupervised pre-training tasks that generate strong contextual word representations (Devlin et al. 2018; Liu et al. 2019). However, although these large-scale models implicitly store general domain knowledge and are widely used to generate more representative textual representations, they only focus on learning the relations within the input text bodies and ignore global, corpus-level information (Lu et al. 2020; Lin et al. 2021).

1.4 Outline

The outline of this survey is as follows:

  • Section 1 presents the research questions and provides an overview of applying Graph Neural Networks to text classification tasks, along with the scope and organization of this survey.

  • Section 2 provides background information on text classification and graph neural networks and introduces the key concepts of applying GNNs to text classification from a designer’s perspective.

  • Section 3 and Sect. 4 discuss previous work on Corpus-level Graph Neural Networks and Document-level Graph Neural Networks, respectively, and provide a comparative analysis of the strengths and weaknesses of these two approaches.

  • Section 5 introduces the commonly used datasets and evaluation metrics in GNN for text classification.

  • Section 6 reports the performance of various GNN models on a range of benchmark datasets for text classification and discusses the key findings.

  • The challenges for the existing methods and some potential future works are discussed in Sect. 7.

  • In Sect. 8, we present the conclusions of our survey on GNN for text classification and discuss potential directions for future work.

Fig. 1 Categorizing the graph neural network text classification models

2 Backgrounds of GNN

2.1 Definition of graph

A graph in this paper is represented as \(G = (V, E)\), where V and E represent the set of nodes (vertices) and the set of edges of G, respectively. A single node in the node set is represented as \(v_{i} \in V\), and \(e_{ij} = (v_{i},v_{j}) \in E\) denotes an edge between nodes \(v_{i}\) and \(v_{j}\). The adjacency matrix of graph G is represented as A, where \(A \in {\mathbb {R}}^{n \times n}\) and n is the number of nodes in graph G. If \(e_{ij} \in E\), \(A_{ij} = 1\); otherwise \(A_{ij} = 0\). In addition, we use \({{\varvec{X}}}\) and \({{\varvec{E}}}\) to represent the node and edge representations of graph G, where \({{\varvec{X}}} \in {\mathbb {R}}^{n \times m}\) and \({{\varvec{E}}} \in {\mathbb {R}}^{n \times c}\). \({{\varvec{x}}}_i \in {\mathbb {R}}^m\) represents the m-dimensional vector of node \(v_{i}\) and \({{\varvec{e}}}_{ij} \in {\mathbb {R}}^c\) represents the c-dimensional vector of edge \(e_{ij}\) (most recent studies set \(c=1\), i.e. a scalar weight). A then denotes the edge-feature-weighted adjacency matrix.
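To make the notation concrete, the following minimal NumPy sketch builds the adjacency matrix, node features and scalar edge weights of a small hypothetical undirected graph; all values are illustrative only.

```python
import numpy as np

# A toy undirected graph G = (V, E) with n = 3 nodes and edges e_01, e_12.
n, m = 3, 4                        # 3 nodes, 4-dimensional node features
edges = [(0, 1), (1, 2)]

A = np.zeros((n, n))               # adjacency matrix: A_ij = 1 iff e_ij in E
for i, j in edges:
    A[i, j] = A[j, i] = 1.0        # undirected graph, so A is symmetric

X = np.random.rand(n, m)           # node representations X in R^{n x m}
w = {e: 1.0 for e in edges}        # scalar edge weights, i.e. the common c = 1 case
```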

2.2 Traditional graph-based algorithms

Before GNNs were broadly used for representing irregular relations, traditional graph-based algorithms had been applied to model non-Euclidean structures in text classification, e.g. Random Walk (Szummer and Jaakkola 2001; Zhou and Li 2005), Graph Matching (Schenker et al. 2004; Silva et al. 2014) and Graph Clustering (Matsuo et al. 2006), which are well summarized in (Wu et al. 2021). There are three common limitations of these traditional graph-based algorithms. Firstly, most of them focus on capturing graph-level structural information without considering the significance of node and edge features. For example, Random Walk-based approaches (Zhou and Li 2005; Szummer and Jaakkola 2001) mainly use the distance or angle between node vectors to calculate transition probabilities while ignoring the information represented by the node vectors themselves. Secondly, since traditional graph-based algorithms are only suitable for specific tasks, there is no unified learning framework for addressing various practical tasks. For example, Kaur and Kumar (2018) propose a graph clustering method that requires a domain-knowledge-based ontology graph. Lastly, traditional graph-based methods are comparatively time-inefficient; for example, Graph Edit Distance-based graph matching methods have exponential time complexity (Silva et al. 2014).

2.3 Foundations of GNN

To tackle the limitations of traditional graph-based algorithms and better represent non-Euclidean relations in practical applications, Graph Neural Networks were proposed by Scarselli et al. (2008). GNNs provide a unified graph-based framework and simultaneously model the graph structure, node, and edge representations. This section provides the general mathematical definitions of Graph Neural Networks. The general forward process of a GNN can be summarised as follows:

$$\begin{aligned} {{\varvec{H}}}^{ (l)} = {\mathcal {F}} ({{\varvec{A}}},{{\varvec{H}}}^{ (l-1)}) \end{aligned}$$
(1)

where \({{\varvec{A}}} \in {\mathbb {R}}^{n \times n}\) represents the weighted adjacency matrix and \({{\varvec{H}}}^{ (l)} \in {\mathbb {R}}^{n \times d}\) is the updated node representation at the l-th GNN layer, obtained by feeding the \((l-1)\)-th layer node features \({{\varvec{H}}}^{ (l-1)} \in {\mathbb {R}}^{n \times k}\) (k is the dimension of the previous layer's node representations) into the pre-defined graph filter \({\mathcal {F}}\).

The most commonly used graph filtering method is defined as follows:

$$\begin{aligned} {{\varvec{H}}}^{ (l)} = \phi (\tilde{{{\varvec{A}}}}{{\varvec{H}}}^{ (l-1)}{{\varvec{W}}}) \end{aligned}$$
(2)

where \(\tilde{{{\varvec{A}}}} = {{\varvec{D}}}^{-\frac{1}{2}}{{\varvec{AD}}}^{-\frac{1}{2}}\) is the normalized symmetric adjacency matrix, \({{\varvec{A}}} \in {\mathbb {R}}^{n \times n}\) is the adjacency matrix of graph G, and \({{\varvec{D}}}\) is the degree matrix of \({{\varvec{A}}}\), where \(D_{ii} = \Sigma _{j}A_{ij}\). \({{\varvec{W}}} \in {\mathbb {R}}^{k \times d}\) is the weight matrix and \(\phi \) is the activation function. Stacking two GNN layers with the above filter yields a vanilla Graph Convolutional Network (GCN) (Welling and Kipf 2016) framework for text classification:

$$\begin{aligned} {{\varvec{Y}}} = softmax (\tilde{{{\varvec{A}}}} (ReLU (\tilde{{{\varvec{A}}}}{{\varvec{H}}}{{\varvec{W}}}^{ (0)})){{\varvec{W}}}^{ (1)}) \end{aligned}$$
(3)

where \({{\varvec{W}}}^{ (0)}\) and \({{\varvec{W}}}^{ (1)}\) represent the weight matrices of the two GCN layers and \({{\varvec{H}}}\) is the input node feature matrix. The ReLU function is used for non-linearity and softmax is used to generate the predicted categories \({{\varvec{Y}}}\). The notation used for GNNs can be found in Table 1.
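As an illustration, a minimal NumPy sketch of the two-layer GCN forward pass in Eq. (3) is given below. The adjacency matrix, feature dimensions and weights are hypothetical placeholders; in practice \({{\varvec{W}}}^{ (0)}\) and \({{\varvec{W}}}^{ (1)}\) would be learned by back-propagating a cross-entropy loss on the labelled nodes.

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2} used in Eq. (2)."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

def two_layer_gcn(A, H, W0, W1):
    """Vanilla two-layer GCN forward pass of Eq. (3)."""
    A_tilde = normalize_adj(A)
    hidden = np.maximum(0, A_tilde @ H @ W0)            # ReLU
    logits = A_tilde @ hidden @ W1
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)         # row-wise softmax

# Toy usage: 4 nodes with self-loops, 8-dim input features, 3 classes.
rng = np.random.default_rng(0)
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
H = rng.random((4, 8))
W0, W1 = rng.random((8, 16)), rng.random((16, 3))
Y = two_layer_gcn(A, H, W0, W1)                         # predicted class distributions
```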

Table 1 Commonly used notations in Graph Neural Networks

2.4 GNN for text classification

This paper mainly discusses how GNNs are applied in text classification tasks. Before we present the specific applications in this area, we first introduce the key concepts of applying GNNs to text classification from a designer's view. Suppose that to address a text classification task, we need to design a graph \(G = (V,E)\). The general procedures include Graph Construction, Initial Node Representation, Edge Representations, and Training Setup.

2.4.1 Graph construction

Some applications have explicit graph structures, including constituency or dependency graphs (Tang et al. 2020), knowledge graphs (Ostendorff et al. 2019; Marin et al. 2014), and social networks (Dai et al. 2022), which do not require constructing a graph structure and defining corresponding nodes and edges. However, for text classification, the most common graph structures are implicit, which means we need to define a new graph structure for a specific task, such as designing a word-word or word-document co-occurrence graph. In addition, for text classification tasks, graph structures can generally be classified into two types:

  • Corpus-level/document-level: Corpus-level graphs intend to construct the graph to represent the whole corpus, such as Yao et al. (2019); Liu et al. (2020); Lin et al. (2021); Wu et al. (2019), while the document-level graphs focus on representing the non-Euclidean relations existing in a single text body like Chen et al. (2020); Nikolentzos et al. (2020); Zhang et al. (2020). Supposing a specific corpus \({\mathcal {C}}\) contains a set of documents (text bodies) \({\mathcal {C}} = \{D_1,D_2,...,D_j\}\) and each \(D_i\) contains a set of tokens \(D_i = \{t_{i_1},t_{i_2},...,t_{i_k}\}\). The vocabulary of \({\mathcal {C}}\) can be represented as \({\mathcal {D}} = \{t_1,t_2,...,t_l\}\), where l is the length of \({\mathcal {D}}\). For the most commonly adopted corpus-level graph \(G_{corpus} = (V_{corpus},E_{corpus})\), a node \(v_i\) in \(V_{corpus}\) follows \(v_i \in {\mathcal {C}} \cup {\mathcal {D}}\) and the edge \(e_{ij} \in E_{corpus}\) is one kind of relations between \(v_i\) and \(v_j\). Regarding the document level graph \(G_{doc_i} = (V_{doc_i},E_{doc_i})\), a node \(v_{i_j}\) in \(V_{doc_i}\) follows \(v_{i_j} \in D_i\).

After designing the graph scale for the specific tasks, specifying the graph types is also important to determine the nodes and their relations. For text classification tasks, the commonly used graph construction ways can be summarized into:

  • Homogeneous/heterogeneous graphs: homogeneous graphs have a single node type and a single edge type, while heterogeneous graphs have various node and edge types. For a graph \(G = (V,E)\), we use \({\mathcal {N}}^v\) and \({\mathcal {N}}^e\) to represent the number of types of V and E. If \({\mathcal {N}}^v = {\mathcal {N}}^e = 1\), G is a homogeneous graph. If \({\mathcal {N}}^v >1 \) or \( {\mathcal {N}}^e > 1\), G is a heterogeneous graph.

  • Static/dynamic graphs: static graphs are constructed from various external or internal information before training and remain fixed, with the aim of enhancing the initial node representations; examples include dependency or constituency graphs (Tang et al. 2020), co-occurrence between word nodes (Zhang et al. 2020), TF-IDF between word and document nodes (Yao et al. 2019; Wu et al. 2019; Lei et al. 2021), and so on. In contrast, the initial representations or topology of a dynamic graph change during training, without requiring specific domain knowledge or human effort. The feature representations or graph structure can be jointly optimised with the downstream task. For example, Wang et al. (2020) proposed a novel topic-aware GNN text classification model with dynamically updated edges between topic nodes and other nodes (e.g. document, word). Piao et al. (2021) also designed a dynamic edge-based graph to update the contextual dependencies between nodes. Additionally, Chen et al. (2020) propose a dynamic GNN model that jointly updates the edge and node representations simultaneously. We provide more details about the above-mentioned models in Sect. 3 and Sect. 4.

Another widely used pair of graph categories are directed or undirected graphs based on whether the directions of edges are bi-directional or not. For text classification, most of the GNN designs follow the unidirectional way. In addition, those graph-type pairs are not parallel, which means they can be combined.

2.4.2 Initial node representation

Based on the pre-defined graph structure and the specified graph type, selecting appropriate initial node representations is a key step to ensure that the proposed graph structure can effectively learn node representations. According to the node entity type, the existing node representation approaches for text classification can generally be summarized into:

  • Word-level representation: non-contextual word embedding methods such as GloVe (Pennington et al. 2014), Word2vec (Mikolov et al. 2013), and FastText (Bojanowski et al. 2017) are widely adopted by many GNN-based text classification frameworks to represent the node features numerically. However, those embedding methods are restricted to capturing only syntactic similarity and fail to represent the complex semantic relationships between words. They also cannot capture the meaning of out-of-vocabulary (OOV) words, and their representations are fixed. Therefore, some recent studies select ELMo (Peters et al. 2018), BERT (Devlin et al. 2018) or GPT (Radford et al. 2018) to obtain contextual word-level node representations. Notably, even though one-hot encoding is the simplest word representation method, many GNN-based text classifiers use it and achieve state-of-the-art performance. A few frameworks use randomly initialised vectors to represent the word-level node features.

  • Document-level representation: similar to other NLP applications, document-level representations are normally acquired by aggregating word-level representations via some deep learning framework. For example, some researchers extract the last hidden state of an LSTM or use the [CLS] token representation from BERT to represent the input text body numerically. TF-IDF-based document vectors are also a commonly used document-level node representation.

Most GNN-based text classification frameworks will compare the performance between different node representation methods to conduct quantitative analysis, as well as provide reasonable justifications for demonstrating the effectiveness of the selected initial node representation based on a defined graph structure.

2.4.3 Edge features

Well-defined edge features can effectively improve the efficiency and performance of graph representation learning by exploiting more explicit and implicit relations between nodes. Based on the predefined graph types, edge features can be divided into structural features and non-structural features. Structural edge features are acquired from explicit relations between nodes, such as dependency or constituency relations between words, word-word adjacency relations, etc. These relations between nodes are explicitly defined and are also widely employed in other NLP applications. However, the more commonly used edge features are non-structural features, which exist implicitly between nodes and are applied to specific graph-based frameworks. The typical non-structural edge features, first defined by Yao et al. (2019) for GNN-based text classification tasks, include:

  • PMI (point-wise mutual information) measures the co-occurrence between two words in a sliding window W and is calculated as:

    $$\begin{aligned} \text {PMI} (i,j)= & {} log\frac{p (i,j)}{p (i)p (j)}{;} \end{aligned}$$
    (4)
    $$\begin{aligned} p (i,j) = \frac{\#W (i,j)}{\#W}{;}\end{aligned}$$
    (5)
    $$\begin{aligned} p (i) = \frac{\#W (i)}{\#W}{.} \end{aligned}$$
    (6)

    where \(\#W\) is the total number of sliding windows, and \(\#W (i)\) and \(\#W (i,j)\) denote the number of windows containing word i and the number containing both words i and j, respectively.

  • TF-IDF (term frequency-inverse document frequency) is the broadly used weight of the edges between document-level nodes and word-level nodes.

Besides these two widely used implicit edge features, some specific edge weighting methods have been proposed to meet the demands of particular graph structures and to exploit more information from the input text bodies.
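As a concrete illustration of Eqs. (4)-(6), the following sketch computes positive PMI edge weights over sliding windows; the whitespace tokenisation and window size are simplifying assumptions, and real implementations usually apply the same preprocessing as the underlying model.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(docs, window_size=20):
    """Positive PMI between word pairs over sliding windows (Eqs. 4-6)."""
    windows = []
    for doc in docs:
        tokens = doc.split()
        if len(tokens) <= window_size:
            windows.append(tokens)
        else:
            windows += [tokens[i:i + window_size]
                        for i in range(len(tokens) - window_size + 1)]
    n_windows = len(windows)                                 # #W
    word_count, pair_count = Counter(), Counter()
    for w in windows:
        uniq = set(w)
        word_count.update(uniq)                              # #W(i)
        pair_count.update(combinations(sorted(uniq), 2))     # #W(i, j)

    edges = {}
    for (wi, wj), cij in pair_count.items():
        pmi = math.log(cij * n_windows / (word_count[wi] * word_count[wj]))
        if pmi > 0:                                          # keep only positive PMI
            edges[(wi, wj)] = pmi
    return edges

# Example on a tiny hypothetical corpus:
print(pmi_edges(["graph neural networks for text", "text classification with graph"]))
```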

2.4.4 Training setup

After specifying the graph structure and types, the graph representation learning tasks and training settings also need to be determined to decide how to optimise the designed GNNs. Generally, the graph representation learning tasks can be categorized into three levels, including Node-level, Graph-level and Edge-level. Node-level and graph-level tasks involve node or graph classification, clustering, regression, etc., while edge-level tasks include link prediction or edge classification for predicting the existence of the relation between two nodes or the corresponding edge categories.

Similar to other deep learning model training settings, GNNs also can be divided into supervised, semi-supervised and unsupervised training settings. Supervised training provides labelled training data, while unsupervised training utilises unlabeled data to train the GNNs. However, compared with supervised or unsupervised learning, semi-supervised learning methods are broadly used by GNNs designed for text classification applications, which could be classified into two types:

  • Inductive learning adjusts the weights of the proposed GNN based on a labelled training set, learning the overall statistics to induce a general trained model for subsequent processing. The unlabeled set can then be fed into the trained GNN to compute the expected outputs.

  • Transductive learning intends to exploit the labelled and unlabeled sets simultaneously, leveraging the relations between different samples to improve the overall performance.

2.4.5 Evolution of GNNs for text classification

TextGCN (Yao et al. 2019) and Text-Level-GNN (Huang et al. 2019) were the first to frame a text classification task as a node or graph classification task, achieved by constructing graphs based on textual data. Following these works, the field witnessed a proliferation of methodologies exploring various avenues: (1) advancements in graph learning models, (2) improved graph construction strategies, and (3) integration with state-of-the-art text classification methods like BERT (Devlin et al. 2018).

In terms of advancements in graph learning models, SGC (Wu et al. 2019) simplifies the Graph Convolutional Network (GCN) architecture, thereby conserving computational resources; S2GC (Zhu and Koniusz 2020) and NMGC (Lei et al. 2021) mitigate over-smoothing challenges by integrating skip-connection mechanisms; TensorGCN (Liu et al. 2020), TextGTL (Li et al. 2021) and ME-GCN (Wang et al. 2022) direct their efforts towards the acquisition of enriched edge information; and T-VGAE (Xie et al. 2021) employs graph auto-encoder methodologies to enhance representation learning. HGAT (Linmei et al. 2019), ReGNN (Li et al. 2019), HyperGAT (Ding et al. 2020), MLGNN (Liao et al. 2021) and DADGNN (Liu et al. 2021) leverage attention mechanisms for model enhancement. A detailed exposition of these Graph Neural Network (GNN) models can be found in Sects. 3 and 4.

3 Corpus-level GNN for text classification

We define a corpus-level Graph Neural Network as “constructing a graph to represent the whole corpus”; thus, only one or several graphs will be built for the given corpus. We categorize corpus-level GNNs into four subcategories based on the types of nodes shown in the graph.

3.1 Document and word nodes as a graph

Most corpus-level graphs include word nodes and document nodes, with word-document edges and word-word edges. By applying a K-layer GNN (normally K = 2 or 3), word nodes serve as bridges to propagate information from one document node to another.

3.1.1 PMI and TF-IDF as graph edges: TextGCN, SGC, S\(^2\)GC, NMGC, TG-Transformer, BertGCN

TextGCN (Yao et al. 2019) TextGCN builds a corpus-level graph with training document nodes, test document nodes and word nodes. Before constructing the graph, a common preprocessing method (Kim 2014) is applied, and words appearing fewer than five times or included in the NLTK (Bird et al. 2009) stopword list are removed. The edge value between a document node and a word node is TF-IDF, and that between word nodes is PMI. The adjacency matrix of this graph is shown as follows.

$$\begin{aligned} A_{ij} = {\left\{ \begin{array}{ll} \text {PMI} (i,j)&{}i, j\text { are words}, \text {PMI} (i,j)>0{;}\\ \text {TF-IDF}_{i,j}&{}i\text { is document}, j \text { is word}{;}\\ 1&{}i=j{;}\\ 0&{}\text {otherwise}{.} \end{array}\right. } \end{aligned}$$
(7)

A two-layer GCN is applied to the graph, and the dimension of the second layer output equals the number of classes in the dataset. Formally, the forward propagation of TextGCN shows as follows:

$$\begin{aligned} \varvec{Z} = \text {softmax} (\tilde{\varvec{A}} (\text {ReLU} (\tilde{\varvec{A}}\varvec{X}\varvec{W}^{ (0)}))\varvec{W}^{ (1)}) \end{aligned}$$
(8)

where \({\tilde{A}}\) is the normalized adjacency matrix of A and X is the one-hot node embedding matrix. \(W^{ (0)}\) and \(W^{ (1)}\) are the learnable parameters of the model. The representations of training documents are used to calculate the loss, and those of test documents are used for prediction. TextGCN is the first work that treats a text classification task as a node classification problem by constructing a corpus-level graph and has inspired many subsequent works.
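For concreteness, a sketch of assembling the adjacency matrix of Eq. (7) is shown below. It assumes the word-word PMI values and document-word TF-IDF scores have already been computed (e.g. with the pmi_edges helper above and any standard TF-IDF implementation), and that documents are indexed 0..n_docs-1 and occupy the first rows of the matrix; these layout choices are ours, not prescribed by the paper.

```python
import numpy as np

def build_textgcn_adj(n_docs, vocab, pmi, tfidf):
    """Assemble the (n_docs + |vocab|) x (n_docs + |vocab|) adjacency of Eq. (7).

    pmi:   dict mapping (word_i, word_j) -> positive PMI value
    tfidf: dict mapping (doc_id, word)   -> TF-IDF weight
    """
    word_id = {w: n_docs + k for k, w in enumerate(vocab)}  # word nodes follow docs
    n = n_docs + len(vocab)
    A = np.eye(n)                                           # A_ii = 1

    for (wi, wj), value in pmi.items():                     # word-word edges
        i, j = word_id[wi], word_id[wj]
        A[i, j] = A[j, i] = value

    for (doc, word), value in tfidf.items():                # document-word edges
        i, j = doc, word_id[word]
        A[i, j] = A[j, i] = value
    return A
```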

Based on TextGCN, several works follow the same graph construction method and node initialization but apply different graph propagation models.

SGC (Wu et al. 2019) To make GCN efficient, SGC (Simple Graph Convolution) removes the nonlinear activation function in GCN layers; therefore, the K-layer propagation of SGC is shown as follows:

$$\begin{aligned} \varvec{Z} = \text {softmax} (\tilde{\varvec{A}}... (\tilde{\varvec{A}} (\tilde{\varvec{A}}\varvec{X}\varvec{W}^{ (0)})\varvec{W}^{ (1)})...\varvec{W}^{ (K)}) \end{aligned}$$
(9)

which can be reparameterized into

$$\begin{aligned} \varvec{Z} = \text {softmax} (\tilde{\varvec{A}}^K\varvec{X}\varvec{W}) \end{aligned}$$
(10)

and K is 2 when applied to text classification tasks. With a smaller number of parameters and only one feedforward layer, SGC saves computation time and resources while improving performance.
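A minimal sketch of the reparameterised SGC forward pass in Eq. (10) follows; in practice \(\tilde{\varvec{A}}^K\varvec{X}\) is precomputed once, which is exactly what makes SGC cheap. The weight matrix here is a random placeholder rather than a trained parameter.

```python
import numpy as np

def sgc_forward(A, X, W, K=2):
    """SGC (Eq. 10): K propagation steps, then a single linear layer and softmax."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    A_tilde = d_inv_sqrt @ A @ d_inv_sqrt         # symmetric normalization
    S = X
    for _ in range(K):
        S = A_tilde @ S                           # accumulates \tilde{A}^K X
    logits = S @ W
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)
```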

S\(^2\)GC (Zhu and Koniusz 2020) To solve the over smoothing issues in GCN, (Zhu and Koniusz 2020) propose Simple Spectral Graph Convolution (S\(^2\)GC), which includes self-loops using Markov Diffusion Kernel. The output of S\(^2\)GC is calculated as:

$$\begin{aligned} \varvec{Z} = \text {softmax}\left( \frac{1}{K}\Sigma _{k=0}^{K}\tilde{\varvec{A}}^k\varvec{X}\varvec{W}\right) \end{aligned}$$
(11)

which can be generalized into:

$$\begin{aligned} \varvec{Z} = \text {softmax}\left( \frac{1}{K}\Sigma _{k=0}^{K}\left( (1-\alpha )\tilde{\varvec{A}}^k\varvec{X} + \alpha \varvec{X}\right) \varvec{W}\right) \end{aligned}$$
(12)

Similarly, K = 2 on text classification tasks and \(\alpha \) denotes the trade-off between self-information of the node and consecutive neighbourhood information. S\(^2\)GC can also be viewed as introducing skip connections into GCN.

NMGC (Lei et al. 2021) Instead of using the sum over hops as in S\(^2\)GC, NMGC applies min pooling using the Multi-hop neighbour Information Fusion (MIF) operator to address over-smoothing problems. The MIF function is defined as:

$$\begin{aligned} \text {MIF} (K) = \text {min} (\tilde{\varvec{A}}\varvec{X}\varvec{W},\tilde{\varvec{A}}^2\varvec{X}\varvec{W},...,\tilde{\varvec{A}}^K\varvec{X}\varvec{W}) \end{aligned}$$
(13)

NMGC-K firstly applies a MIF (K) layer, then a GCN layer, and K is 2 or 3. For example, when K = 3, the output is:

$$\begin{aligned} \varvec{Z} = \text {softmax} (\tilde{\varvec{A}} (\text {ReLU min} (\tilde{\varvec{A}}\varvec{X}\varvec{W}^{ (0)},\tilde{\varvec{A}}^2\varvec{X}\varvec{W}^{ (0)},\tilde{\varvec{A}}^3\varvec{X}\varvec{W}^{ (0)}))\varvec{W}^{ (1)}) \end{aligned}$$
(14)

NMGC can also be treated as a skip-connection in Graph Neural Networks, making the shallow layer of GNN directly contribute to the final representation.
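A sketch of the MIF operator of Eq. (13) and the NMGC-K output of Eq. (14) follows; the min is taken element-wise across the 1..K hop features, and dense NumPy matrices are assumed purely for readability.

```python
import numpy as np

def mif(A_tilde, X, W, K):
    """Multi-hop Information Fusion (Eq. 13): element-wise min over 1..K hops."""
    hops, S = [], X
    for _ in range(K):
        S = A_tilde @ S                 # \tilde{A}^k X for k = 1..K
        hops.append(S @ W)
    return np.minimum.reduce(hops)      # min pooling across the K hop features

def nmgc(A_tilde, X, W0, W1, K=3):
    """NMGC-K (Eq. 14): a MIF(K) layer followed by a GCN layer."""
    hidden = np.maximum(0, mif(A_tilde, X, W0, K))          # ReLU
    logits = A_tilde @ hidden @ W1
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)
```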

TG-Transformer (Zhang and Zhang 2020) TextGCN treats document nodes and word nodes as the same type of node during propagation. To introduce heterogeneity into the TextGCN graph, TG-Transformer (Text Graph Transformer) adopts two sets of weights for document nodes and word nodes, respectively. To cope with a large corpus graph, subgraphs are sampled from the TextGCN graph using the PageRank algorithm (Page et al. 1999). The input embedding of each node is the sum of three types of embeddings: pretrained GloVe embeddings, node type embeddings, and Weisfeiler-Lehman structural encoding (Niepert et al. 2016). During propagation, self-attention (Vaswani et al. 2017) with graph residual (Zhang and Meng 2019) is applied.

BertGCN (Lin et al. 2021) To combine BERT (Devlin et al. 2018) and TextGCN, BertGCN enhances TextGCN by replacing the document node initialization with the BERT [CLS] output of each epoch and replacing the word input vector with zeros. BertGCN trains BERT and TextGCN jointly by interpolating the output of TextGCN and BERT:

$$\begin{aligned} \varvec{Z} = \lambda \varvec{Z}_{GCN} + (1-\lambda )\varvec{Z}_{BERT} \end{aligned}$$
(15)

where \(\lambda \) is the trade-off factor. To optimize memory during training, a memory bank is used to track the document inputs, and a smaller learning rate is set for the BERT module to maintain the consistency of the memory bank. BertGCN shows that with the help of TextGCN, BERT can achieve better performance.
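The interpolation of Eq. (15) itself is straightforward; a minimal sketch is given below, where the value of \(\lambda \) is a placeholder rather than the setting reported in the paper.

```python
import numpy as np

def bertgcn_predict(z_gcn, z_bert, lam=0.7):
    """Interpolate the class distributions of TextGCN and BERT as in Eq. (15).

    lam = 1 recovers pure TextGCN and lam = 0 pure BERT; 0.7 is a placeholder.
    """
    return lam * z_gcn + (1.0 - lam) * z_bert
```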

3.1.2 Multi-graphs/multi-dimensional edges: tensorGCN, ME-GCN

TensorGCN (Liu et al. 2020) Instead of constructing a single corpus-level graph, TensorGCN builds three independent graphs: a semantic-based graph, a syntactic-based graph, and a sequential-based graph to incorporate semantic, syntactic and sequential information, respectively, and combines them into a tensor graph.

The three graphs share the same set of TF-IDF values for the word-document edges but use different values for the word-word edges. The semantic-based graph extracts semantic features from a trained Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) model and connects words sharing high similarity. The syntactic-based graph uses the Stanford CoreNLP parser (Manning et al. 2014) and constructs edges between words when they have a high probability of having a dependency relation. For the sequential-based graph, the PMI value is applied as in TextGCN.

The propagation includes intra-graph propagation and inter-graph propagation. The model first applies the GCN layer on three graphs separately as intra-graph propagation. Then, the same nodes on three graphs are treated as a virtual graph, and another GCN layer is applied as inter-graph propagation.

ME-GCN (Wang et al. 2022) To fully utilize the corpus information and analyze the rich relational information of the graph, ME-GCN (Multi-dimensional Edge-Embedded GCN) builds a graph with multi-dimensional word-word, word-document and document-document edges. Word2vec and Doc2vec embeddings are first trained on the given corpus, and the per-dimension similarity of the trained embeddings is used to construct the multi-dimensional edges. The trained embeddings also serve as the input embeddings of the graph nodes. During propagation, GCN is first applied on each dimension, and the representations from different dimensions are either concatenated or fed into a pooling method to obtain the final representation of each layer.

3.1.3 Making TextGCN inductive: HeteGCN, InducT-GCN, T-VGAE

HeteGCN (Ragesh et al. 2021) HeteGCN (Heterogeneous GCN) optimizes the TextGCN by decomposing the TextGCN undirected graph into several directed subgraphs. Several subgraphs from the TextGCN graph are combined sequentially as different layers: feature graph (word-word graph), feature-document graph (word-document graph), and document-feature graph (document-word graph). Different combinations were tested and the best model is shown as:

$$\begin{aligned} \varvec{Z} = \text {softmax} (\varvec{A}_{w-d} (\text {ReLU} (\varvec{A}_{w-w}\varvec{X}_{w}\varvec{W}^{ (0)}))\varvec{W}^{ (1)}) \end{aligned}$$
(16)

where \(\varvec{A}_{w-w}\) and \(\varvec{A}_{w-d}\) denote the adjacency matrices of the word-word subgraph and the word-document subgraph, respectively. Since the input of HeteGCN is the word node embeddings without using document nodes, it can also work in an inductive way, while the previous corpus-level graph text classification models are all transductive.

InducT-GCN (Wang et al. 2022) InducT-GCN (InducTive Text GCN) aims to extend the transductive TextGCN into an inductive model. Instead of using the whole corpus to build the graph, InducT-GCN builds a graph on the training corpus only and uses TF-IDF vectors as the document node inputs, aligning with the one-hot word node embeddings. The weights are learned following TextGCN, but InducT-GCN builds virtual subgraphs for prediction on new test documents.

T-VGAE (Xie et al. 2021) T-VGAE (Topic Variational Graph Auto-Encoder) applies a Variational Graph Auto-Encoder to the latent topic of each document to make the model inductive. A vocabulary graph \(A_v\), which connects the words using PMI values, is constructed, while each document is represented by its TF-IDF vector. All the document vectors are stacked into a matrix, which can also be treated as a bipartite graph \(A_d\). Two graph auto-encoder models are applied on \(A_v\) and \(A_d\), respectively. The overall workflow is:

$$\begin{aligned} \varvec{Z}_v= & {} \text {Encoder}_{GCN} (\varvec{A}_v,\varvec{X}_v){;} \end{aligned}$$
(17)
$$\begin{aligned} \varvec{Z}_d= & {} \text {Encoder}_{UDMP} (\varvec{A}_d,\varvec{Z}_v){;}\end{aligned}$$
(18)
$$\begin{aligned} \varvec{A}_v^*= & {} \text {Decoder} (\varvec{Z}_v){;}\end{aligned}$$
(19)
$$\begin{aligned} \varvec{A}_d^*= & {} \text {Decoder} (\varvec{Z}_d,\varvec{Z}_v){.} \end{aligned}$$
(20)

where \(\varvec{X}_v\) is an identity matrix. The \(\text {Encoder}_{GCN}\) and the decoders follow VGAE (Kipf and Welling 2016), while \(\text {Encoder}_{UDMP}\) is a unidirectional message passing variant of \(\text {Encoder}_{GCN}\). The training objective is to minimise the reconstruction error, and \(Z_d\) is used for the classification task.

3.2 Document nodes as a graph

To show the global structure of the corpus directly, some models only adopt document nodes in the non-heterogeneous graph.

knn-GCN (Benamira et al. 2019) knn-GCN constructs a k-nearest-neighbours graph by connecting the documents with their K nearest neighbours using Euclidean distances of the embedding of each document. The embedding is generated in an unsupervised way: either using the mean of pretrained GloVe word vectors or applying LDA (Blei et al. 2003). Both GCN and Attention-based GNN (Thekumparampil et al. 2018) are used as the graph model.

TextGTL (Li et al. 2021) Similar to TensorGCN, TextGTL (Text-oriented Graph-based Transductive Learning) constructs three different document graphs: Semantics Text Graph, Syntax Text Graph, and Context Text Graph, while all the graphs are non-heterogeneous. Semantics Text Graph uses Generalized Canonical Correlation Analysis (Bach and Jordan 2002) and trains a classifier to determine the edge values between two document nodes. Syntax Text Graph uses the Stanford CoreNLP dependency parser (Manning et al. 2014) to construct units and also trains a classifier. Context Text Graph defines the edge values by summing up the PMI values of the overlapping words in two documents. Two GCN layers are applied, and the output of each graph is mixed as the output of this layer and input for the next layer for all three graphs:

$$\begin{aligned} \varvec{H}^{ (1)}= & {} \sigma (\varvec{A}\varvec{H}^{ (0)}\varvec{W}^{ (0)}){;} \end{aligned}$$
(21)
$$\begin{aligned} \varvec{H}^{ (2)}= & {} \sigma (\varvec{A}[\varvec{H}_{sem}^{ (1)}||\varvec{H}_{syn}^{ (1)}||\varvec{H}_{seq}^{ (1)}]\varvec{W}^{ (1)}){;} \end{aligned}$$
(22)
$$\begin{aligned} \varvec{Z} = \text {Pooling}_{mean} (\varvec{H}_{sem}^{ (2)},\varvec{H}_{syn}^{ (2)},\varvec{H}_{seq}^{ (2)}){.} \end{aligned}$$
(23)

where \(H^{ (0)}\) is the TF-IDF vector of the documents. Data augmentation with super nodes is also applied in TextGTL to strengthen the information in graph models.
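A compact sketch of the two-layer propagation in Eqs. (21)-(23) follows, assuming dense adjacency matrices for the three document graphs and shared TF-IDF inputs; the weight matrices are placeholders and the super-node augmentation is omitted.

```python
import numpy as np

def textgtl_forward(graphs, H0, W0, W1):
    """Two-layer TextGTL propagation over the three document graphs (Eqs. 21-23).

    graphs: dict of adjacency matrices {'sem': A_sem, 'syn': A_syn, 'seq': A_seq}
    H0:     TF-IDF document features shared by all three graphs
    """
    relu = lambda x: np.maximum(0, x)
    H1 = {k: relu(A @ H0 @ W0) for k, A in graphs.items()}          # Eq. (21)
    mixed = np.concatenate([H1['sem'], H1['syn'], H1['seq']], axis=1)
    H2 = {k: relu(A @ mixed @ W1) for k, A in graphs.items()}       # Eq. (22)
    return np.mean([H2['sem'], H2['syn'], H2['seq']], axis=0)       # Eq. (23)
```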

3.3 Word nodes as a graph

By omitting document nodes, a graph with only word nodes can still derive good graph-based embeddings for downstream tasks. Since no document nodes are included, this method can easily be adapted as an inductive learning model.

VGCN-BERT (Lu et al. 2020) VGCN-BERT enhances the input embedding of BERT by concatenating it with the graph embedding. It first constructs a vocabulary graph and uses PMI as the edge value. A variant of the GCN layer called VGCN (Vocabulary GCN) is applied to derive the graph word embedding:

$$\begin{aligned} \varvec{X}_{Graph} = \text {ReLU} (\varvec{X}_{BERT}\varvec{A}\varvec{W}^{ (0)})\varvec{W}^{ (1)} \end{aligned}$$
(24)

where the BERT embeddings \(\varvec{X}_{BERT}\) are used as the input. The resulting graph word embeddings are concatenated with the BERT embeddings and fed into BERT as extra information.

3.4 Extra topic nodes in the graph

Topic information of each document can also provide extra information in corpus-level graph neural networks. Several models also include topic nodes in the graph.

3.4.1 Single layer topic nodes: HGAT, STGCN

HGAT (Linmei et al. 2019) HGAT (Heterogeneous GAT) applies LDA (Blei et al. 2003) to extract topic information for each document; the top P topics with the largest probabilities are connected to the document. Instead of using the words directly, HGAT utilizes external knowledge by applying the entity linking tool TAGME to identify entities in the document and connect them. The semantic similarity between entities, computed using pretrained Word2vec with a threshold, is used to define the connectedness between entity nodes. Since the graph is heterogeneous, a HIN (heterogeneous information network) model is implemented, which propagates solely on each sub-graph depending on the node type. An HGAT model is applied by considering type-level attention and node-level attention: for a given node, type-level attention learns the weights of different types of neighbouring nodes, while node-level attention captures the importance of different neighbouring nodes while ignoring the type. By using this dual attention mechanism, HGAT captures type and node information at the same time.

STGCN (Yan et al. 2013) For short text classification, STGCN (Short-Text GCN) applies BTM (a biterm topic model) to obtain topic information, avoiding the data sparsity problem that LDA suffers from on short texts. The graph is constructed following TextGCN, while extra topic nodes are included. Word-topic and document-topic edge values come from BTM, and a classical two-layer GCN is applied. The word embeddings learned by STGCN are concatenated with BERT embeddings, and a bi-LSTM model is applied for the final prediction.

3.4.2 Multi-layer topic nodes: DHTG

DHTG (Wang et al. 2020) To capture different levels of information, DHTG (Dynamic Hierarchical Topic Graph) introduces hierarchical topic-level nodes into the graph, from fine-grained to coarse-grained. The Poisson gamma belief network (PGBN) (Zhou et al. 2015) is used as a probabilistic deep topic model. The first-layer topics are formed from combinations of words, while deeper-layer topics are generated from the previous layer's topics with the weights of PGBN, and these weights serve as the edge values between topic layers. Cosine similarity is chosen as the edge value for topics on the same layer. A two-layer GCN is applied, and the model is learned jointly with PGBN, which makes the topic edges dynamic.

3.5 Critical analysis

Compared with sequential models like CNNs and LSTMs, corpus-level GNNs are able to capture the global structural information of the corpus, with word nodes acting as bridges between document nodes, and show strong performance without using external resources like pre-trained embeddings or pre-trained models. However, the performance improvement is marginal when pretrained embeddings are included. Another issue is that most corpus-level GNNs are transductive, which is not applicable in many real-world settings. Meanwhile, constructing the whole corpus into a graph requires a large amount of memory, especially when the dataset is large.

A detailed comparison of corpus-level GNN is displayed in Table 2.

4 Document-level GNN for text classification

By constructing the graph based on each document, a graph classification model can be used as a text classification model. Since each document is represented by one graph and new graphs can be built for test documents, the model can easily work in an inductive way.

4.1 Local word consecutive graph

The simplest way to convert a document into a graph with words as nodes is by connecting the consecutive words within a sliding window.

4.1.1 Simple consecutive graph models: text-Level-GNN, MPAD, TextING

Text-Level-GNN (Huang et al. 2019) Text-Level-GNN applies a small sliding window and constructs a graph with a small number of nodes and edges for each document, which saves memory and computation time. The edge values are trainable and shared across graphs when they connect the same two words, which also brings in global information.

Unlike corpus-level graph models, Text-Level-GNN applies a message passing mechanism (MPM) (Gilmer et al. 2017) instead of GCN for graph learning. For each node, the neighbour information is aggregated using max-pooling over the trainable edge-weighted neighbour features as the AGGREGATE function, and a weighted sum with the node's own representation is then used as the COMBINE function. Sum-pooling followed by an MLP classifier is applied as the READOUT function to obtain the representation of each graph. The propagation is shown as:

$$\begin{aligned} \varvec{h}^{ (l+1)}_i= & {} (1-\alpha ) (max_{n\in {\mathcal {N}}_i}e_{ni}\varvec{h}^{ (l)}_n)+ \alpha \varvec{h}^{ (l)}_i{;} \end{aligned}$$
(25)
$$\begin{aligned} \varvec{z_i}= & {} \text {softmax} (\varvec{W}\Sigma _i\varvec{h}_i+\varvec{b}){.} \end{aligned}$$
(26)

where \(\varvec{h}^{ (l)}_i\) is the representation of the i-th word node at layer l and \(e_{ni}\) is the edge weight from node n to node i. A two-layer MPM is applied, and the input of each graph is pretrained GloVe vectors.
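A sketch of the propagation in Eqs. (25)-(26) is shown below; the neighbour lists and the dictionary of shared edge weights (keyed here by node-id pair) are assumed data structures, and in the actual model the edge weights and classifier are trained jointly.

```python
import numpy as np

def mpm_layer(H, neighbors, edge_w, alpha=0.2):
    """One message-passing step of Eq. (25): max-pool edge-weighted neighbour
    features, then combine with the node's own representation."""
    H_new = np.empty_like(H)
    for i in range(H.shape[0]):
        msgs = np.stack([edge_w[(n, i)] * H[n] for n in neighbors[i]])
        H_new[i] = (1 - alpha) * msgs.max(axis=0) + alpha * H[i]
    return H_new

def readout(H, W, b):
    """Graph-level READOUT of Eq. (26): sum-pool node features, linear + softmax."""
    logits = W @ H.sum(axis=0) + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```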

MPAD (Nikolentzos et al. 2020) MPAD (Message Passing Attention Networks) connects words within a sliding window of size 2 but also includes an additional master node connecting all nodes in the graph. The edge only shows the connectedness of each pair of word nodes and is fixed. A variant of Gated Graph Neural Networks is applied where the AGGREGATE function is the weighted sum and the COMBINE function is GRU (Chung et al. 2014). Self-attention is applied in the READOUT function.

To learn the high-level information, the master node is directly concatenated with the READOUT output, working as a skip connection mechanism. Each layer’s READOUT results are concatenated to capture multi-granularity information to get the final representation. Pretrained Word2vec is used as the initialization of word nodes input.

TextING (Zhang et al. 2020) To simplify MPAD, TextING removes the master node from the document-level graphs, which makes the graph sparser. Compared with Text-Level-GNN, TextING has fixed edges. Similar AGGREGATE and COMBINE functions are applied following the Gated Graph Neural Networks (GGNN) (Li et al. 2016), with a weighted sum and a GRU. However, for the READOUT function, soft attention is used and both max-pooling and mean-pooling are applied to make sure that "every word plays a role in the text and the keywords should contribute more explicitly".

4.1.2 Advanced graph models: MLGNN, TextSSL, DADGNN

MLGNN (Liao et al. 2021) MLGNN (Multi-level GNN) builds the same graph as TextING but introduces three levels of MPM: bottom-level, middle-level and top-level. In the bottom-level MPM, the same method as Text-Level-GNN is applied with pretrained Word2vec as the input embeddings, but the edges are non-trainable. A larger window size is adopted at the middle level, and Graph Attention Networks (GAT) (Veličković et al. 2018) are applied to learn distant word node information. In the top-level MPM, all word nodes are connected and multi-head self-attention (Vaswani et al. 2017) is applied. By applying three different levels of MPM, MLGNN learns multi-granularity information well.

DADGNN (Liu et al. 2021) DADGNN (Deep Attention Diffusion GNN) constructs the same graph as TextING but uses attention diffusion to overcome the over-smoothing issue. Pretrained word embedding is used as the input of each node and an MLP layer is applied. Then, the graph attention matrix is calculated based on the attention to the hidden states of each node. The diffusion matrix is calculated as

$$\begin{aligned} \varvec{T} = \Sigma _{n=0}^{\infty }\epsilon _n\varvec{A}^n \end{aligned}$$
(27)

where A is the graph attention matrix and \(\epsilon _n\) are learnable coefficients. \(A^n\) plays the role of connecting n-hop neighbours, and (Liu et al. 2021) use \(n \in [4,7]\) in practice. A multi-head diffusion matrix is applied for layer propagation.
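A sketch of the truncated diffusion sum of Eq. (27) follows; the coefficients \(\epsilon _n\) are learnable in DADGNN, so the geometric decay used here is only an illustrative stand-in, and the series is cut off at max_hop in line with the \(n \in [4,7]\) range mentioned above.

```python
import numpy as np

def attention_diffusion(A_att, max_hop=7, eps=None):
    """Truncated attention diffusion matrix T = sum_n eps_n * A^n (Eq. 27)."""
    n_nodes = A_att.shape[0]
    if eps is None:
        eps = [0.5 ** k for k in range(max_hop + 1)]   # placeholder decay, not learned
    T = np.zeros((n_nodes, n_nodes))
    A_power = np.eye(n_nodes)                          # A^0
    for k in range(max_hop + 1):
        T += eps[k] * A_power
        A_power = A_power @ A_att                      # A^{k+1}
    return T
```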

TextSSL (Piao et al. 2021) To solve the word ambiguity problem and show the word synonymity and dynamic contextual dependency, TextSSL (Sparse Structure Learning) simultaneously learns the graph using intra-sentence and inter-sentence neighbours. The local syntactic neighbour is defined as the consecutive words, and trainable edges across graphs are also included by using Gumbel-softmax. By applying sparse structure learning, TextSSL manages to select edges with dynamic contextual dependencies.

4.2 Global word co-occurrence graph

Similar to the TextGCN graph, document-level graphs can also use PMI as the word-word edge values.

4.2.1 Only global word co-occurrence: DAGNN

DAGNN (Wu et al. 2019) To address the long-distance dependency, hierarchical information and cross-domain learning challenges in domain-adversarial text classification tasks, (Wu et al. 2019) propose DAGNN (Domain-Adversarial Graph Neural Network). Each document is represented by a graph with content words as nodes and PMI values as edge values, which can capture long-distance dependency information. Pretrained FastText is chosen as the input word embeddings to handle the out-of-vocabulary issue and a GCN model with skip connection is used to address the over-smoothing problem. The propagation is formulated as:

$$\begin{aligned} \varvec{H}^{ (l+1)} = (1-\alpha )\tilde{\varvec{A}}\varvec{H}^{ (l)} + \alpha \varvec{H}^{ (0)} \end{aligned}$$
(28)

To learn the hierarchical information of documents, DiffPool (Ying et al. 2018) is applied to assign each document into a set of clusters. Finally, adversarial training minimises the loss on source tasks and maximises the differentiation between source and target tasks.
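A minimal sketch of the skip-connected propagation in Eq. (28) follows, assuming a pre-normalised adjacency matrix and FastText-initialised features \(H^{ (0)}\); the number of layers and \(\alpha \) are placeholder values.

```python
import numpy as np

def dagnn_propagate(A_tilde, H0, n_layers=4, alpha=0.1):
    """Skip-connected propagation of Eq. (28): each layer mixes the smoothed
    features with the initial representation H^(0) to limit over-smoothing."""
    H = H0
    for _ in range(n_layers):
        H = (1 - alpha) * (A_tilde @ H) + alpha * H0
    return H
```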

4.2.2 Combine with extra edges: ReGNN, GFN

ReGNN (Li et al. 2019) ReGNN (Recursive Graphical Neural Network) uses PMI together with consecutive words as the word edges to capture global and local information. The graph propagation function is the same as GGNN while additive attention (Bahdanau et al. 2015) is applied in aggregation. Pretrained GloVe is the input embedding of each word node.

GFN (Dai et al. 2022) GFN (Graph Fusion Network) builds four types of graphs using the word co-occurrence statistics, PMI, the similarity of pretrained embedding and Euclidean distance of pretrained embedding. Although four corpus-level graphs are built, the graph learning happens on each document’s subgraphs, making the method a document-level GNN. For each subgraph, each type of graph is learned separately using the graph convolutional method, and then a fusion method of concatenation is used. After an MLP layer, average pooling is applied to get the document representation.

4.3 Other word graphs

Some other ways of connecting words in a document have been explored.

HyperGAT (Ding et al. 2020) HyperGAT (Hypergraph Attention Networks) builds hypergraphs for each document to capture the high-level interactions between words. Two types of hyperedges are included: sequential hyperedges connecting all words in a sentence, and semantic hyperedges connecting the top-K words after obtaining the topic of each word using LDA. Like traditional hypergraph propagation, HyperGAT follows the same two updating steps but adds an attention mechanism to highlight the key information: node-level attention is applied to learn hyperedge representations, and edge-level attention is used to update node representations.

IGCN (Tang et al. 2020) Contextual dependency helps in understanding a document, and the graph neural network is no exception. IGCN constructs the graph with the dependency graph to show the connectedness of each pair of words in a document. Then, the word representation learned from Bi-LSTM using POS embedding and word embedding is used to calculate the similarity between each pair of nodes. Attention is used for the output to find the important relevant semantic features.

GTNT (Mei et al. 2021) Words with higher TF-IDF values should connect to more word nodes, with this in mind, GTNT (Graph Transformer Networks based Text representation) uses sorted TF-IDF value to determine the degree of each node and applies the Havel-Hakimi algorithm (Hakami 1962) to determine the edges between word nodes. A variant of GAT is applied during model learning. Despite the fact that GAT’s attention score is mutual for two nodes, GTNT uses relevant importance to adjust the attention score from one node to another. Pretrained Word2vec is applied as the input of each node.

4.4 Critical analysis

Most document-level GNNs connect consecutive words as edges in the graph and apply a graph neural network model, which makes them similar to CNNs, where the receptive field enlarges as the graph model goes deeper. Also, the major differences among document-level GNNs lie in the details of the graph models, e.g. different pooling methods and different attention calculations, which diminishes the impact of the contributions of these works. Compared with corpus-level GNNs, document-level GNNs adopt more complex graph models and also suffer from out-of-memory issues when the number of words in a document is large. A detailed comparison of document-level GNNs is displayed in Table 2.

4.5 Comparison between corpus-level and document-level GNN

A comparison of the learning frameworks of corpus-level and document-level GNNs is shown in Fig. 2. A comprehensive comparison between corpus-level GNN and document-level GNN can be found in Table 3.

Table 2 Models’ detailed comparison in terms of whether using external resources, how to construct the edge and node input, and whether transductive learning or inductive learning
Table 3 A Comparison between Corpus-level GNN and Document-level GNN
Fig. 2 Corpus-level GNN usually builds a single graph per corpus and learns the node representations, while document-level GNN usually builds one graph per document and learns the graph representation

5 Datasets and metrics

5.1 Datasets

Table 4 Commonly used text classification datasets by GNN-based models

There are many popular text classification benchmark datasets, but this paper mainly focuses on the datasets used by GNN-based text classification applications. Based on the purpose of the applications, we divide the commonly adopted datasets into three types: Topic Classification, Sentiment Analysis and Others. Most of these text classification datasets contain a single target label for each text body. The key information of each dataset is listed in Table 4.

5.1.1 Topic classification

Topic classification models aim to classify input text bodies from diverse sources into predefined categories. News categorization is a typical topic classification task that extracts key information from news articles and classifies them into the corresponding topics. The input text bodies are normally paragraphs or whole documents, especially for news categorization, while there are also some short-text classification datasets from certain domains such as micro-blogs, bibliographies, etc. Some typical datasets are listed below:

  • Ohsumed (Joachims 1998) is acquired from the MEDLINE database and further processed by Yao et al. (2019), who selected certain documents (abstracts) and filtered out documents belonging to multiple categories. The remaining documents are classified into 23 cardiovascular diseases. The statistics of the Ohsumed dataset processed by Yao et al. (2019) are presented in Table 4; this version is directly employed by other related works.

  • R8 / R52 are two subsets of the Reuters-21578 dataset which contain 8 and 52 news topics from Reuters financial news services, respectively.

  • 20NG is another widely used news categorization dataset that contains 20 newsgroups. It was originally collected by Lang (1995), but the collection procedures are not explicitly described.

  • AG News (Zhang et al. 2015) is a large-scale news categorization dataset compared with other commonly used datasets, constructed by selecting the top-4 largest categories from the AG corpus. Each news topic contains 30,000 samples for training and 1,900 samples for testing.

  • Database systems and logic programming (DBLP) is a topic classification dataset that classifies computer science paper titles into six topics (Mei et al. 2021). Different from paragraph- or document-based topic classification datasets, DBLP aims to categorise scientific paper titles into the corresponding categories, so the average input length is much shorter than that of the other datasets.

  • DBpedia (Lehmann et al. 2015) is a large-scale multilingual knowledge base that contains 14 non-overlapping categories. Each category contains 40,000 samples for training and 5,000 samples for testing.

  • WebKB (Craven et al. 1998) is a web page topic classification dataset with long documents.

  • TREC (Li and Roth 2002) is a question topic classification dataset that categorises question sentences into 6 question categories.

5.1.2 Sentiment analysis

The purpose of sentiment analysis is to analyse and mine the opinions in textual content, which can be treated as a binary or multi-class classification problem. The sources of existing sentiment analysis tasks include movie reviews, product reviews or user comments, social media posts, etc. Most sentiment analysis datasets aim to predict people's opinions from one or two input sentences, and the average length of each input text body is around 25 tokens.

  • Movie review (MR) (Pang and Lee 2005) is a binary sentiment classification dataset for movie reviews, which contains equally distributed positive and negative examples. Each review contains only one sentence.

  • Stanford sentiment treebank (SST) (Socher et al. 2013) is an upgraded version of MR which contains two subsets SST-1 and SST-2. SST-1 provides five fine-grained labels, while SST-2 is a binary sentiment classification dataset.

  • Internet movie database (IMDB) (Maas et al. 2011) is also an equally distributed binary classification dataset for sentiment analysis. Different from other short text classification datasets, the average number of words in each review is around 221.

  • Yelp 2014 (Tang et al. 2015) is a large-scale binary category-based sentiment analysis dataset for longer user reviews collected from Yelp.com.

GNN-based text classifiers also use certain binary sentiment classification benchmark datasets. Most of them are gathered from shorter user reviews or comments (normally one or two sentences) from different websites, including Amazon Alexa Reviews (AAR), Twitter US Airline (TUA), and Youtube comments (SenTube-A and SenTube-T) (Uryupina et al. 2014).

5.1.3 Other datasets

There are some datasets targeting other tasks, including hate speech detection, grammaticality checking, etc. For example, ArangoHate (Arango et al. 2019) is a hate speech detection dataset, a sub-task of intent detection, which contains 2920 hateful documents and 4086 normal documents obtained by resampling the merged datasets from Davidson et al. (2017) and Waseem (2016). In addition, Founta et al. (2018) propose another large-scale hate language detection dataset, FountaHate, which classifies tweets into four categories with 53,851, 14,030, 27,150, and 4,965 samples of normal, spam, hateful and abusive tweets, respectively. Since there is no officially provided training/testing splitting ratio for the above datasets, the numbers represented in Table 4 follow the ratio (train/development/test of 85:5:10) defined by Lu et al. (2020).

5.1.4 Dataset summary

Since an obvious limitation of corpus-level GNN models is their high memory consumption (Zhang and Zhang 2020; Huang et al. 2019; Ding et al. 2020), datasets with a smaller number of documents and smaller vocabulary sizes, such as Ohsumed, R8/R52, 20NG or MR, are widely used so that corpus-level graphs can be feasibly built and evaluated. For document-level GNN-based models, larger datasets like AG News can be adopted without the memory consumption problem. From Table 4, we can see that most related works mainly focus on applying GNNs to topic classification and sentiment analysis, which means the role of GNNs in other text classification tasks, such as spam detection, intent detection and abstractive question answering, needs to be further explored.

5.2 Evaluation methods

5.2.1 Performance metrics

In evaluating and comparing the performance of the proposed models with other baselines, accuracy and F1 are the most commonly used metrics for overall performance analysis, ablation studies, and breakdown analysis. We use TP, FP, TN and FN to represent the number of true positive, false positive, true negative and false negative samples, and N for the total number of samples.

  • Accuracy and error rate are basic evaluation metrics adopted by many GNN-based text classifiers such as Li et al. (2021); Liu et al. (2016); Wang et al. (2020); Yao et al. (2019); Zhang and Zhang (2020). Most of the related papers run all baselines and their models five or ten times and report the mean \(\pm\) standard deviation of accuracy for more convincing results. They can be defined as:

    $$\begin{aligned} Accuracy = \frac{ (TP+TN)}{N}; \end{aligned}$$
    (29)
    $$\begin{aligned} ErrorRate = 1-Accuracy = \frac{ (FP+FN)}{N}. \end{aligned}$$
    (30)
  • Precision, recall and F1 are metrics for measuring performance, especially on imbalanced datasets. Precision measures the relevancy of the results, while recall measures how many truly relevant results are retrieved. F1 is the harmonic mean of precision and recall. These three measurements can be defined as:

    $$\begin{aligned} Precision = \frac{TP}{ (TP+FP)}; \end{aligned}$$
    (31)
    $$\begin{aligned} Recall = \frac{TP}{ (TP+FN)}; \end{aligned}$$
    (32)
    $$\begin{aligned} F1 = \frac{2 \times Precision \times Recall}{ (Precision + Recall)}. \end{aligned}$$
    (33)

Only a few papers utilise recall or precision alone to evaluate the performance (Mei et al. 2021). More commonly, precision and recall are used together with F1 or accuracy to evaluate and analyse the performance from different perspectives, e.g. Li et al. (2019); Linmei et al. (2019); Lu et al. (2020); Xie et al. (2021). In addition, based on different application scenarios, different F1 averaging methods are adopted by these papers to measure the overall F1 score of multi-class (number of classes C) classification tasks, including the following (a short computation sketch follows the list):

  • Macro-F1 applies the same weights to all categories to get overall \(F1_{macro}\) by taking the arithmetic mean.

    $$\begin{aligned} F1_{macro}= \frac{1}{C}\Sigma _{i=1}^{C} F1_{i} \end{aligned}$$
    (34)
  • Micro-F1 is calculated by considering the overall \(P_{micro}\) and \(R_{micro}\). It can be defined as:

    $$\begin{aligned} F1_{micro}=\frac{2 \times P_{micro} \times R_{micro}}{ (P_{micro} + R_{micro})} \end{aligned}$$
    (35)

    where:

    $$\begin{aligned} P_{micro} = \frac{\Sigma _{i \in C} TP_i}{\Sigma _{i \in C} (TP_i+FP_i)}, \quad R_{micro} = \frac{\Sigma _{i \in C} TP_i}{\Sigma _{i \in C} (TP_i+FN_i)} \end{aligned}$$
    (36)
  • Weighted-F1 is the weighted mean of the per-category F1 scores, where the weight \(W_i\) is related to the number of occurrences of the corresponding ith class. It can be defined as:

    $$\begin{aligned} F1_{weighted}= \Sigma _{i=1}^{C} F1_{i} \times W_i \end{aligned}$$
    (37)
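
In practice, these averaged scores are usually computed with an off-the-shelf library rather than by hand. The following minimal scikit-learn sketch, using toy multi-class labels purely for illustration, mirrors Eqs. (29) and (34)-(37):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_recall_fscore_support

y_true = [0, 2, 1, 2, 0, 1, 2, 2]   # toy ground-truth labels (C = 3 classes)
y_pred = [0, 2, 1, 0, 0, 2, 2, 2]   # toy model predictions

acc = accuracy_score(y_true, y_pred)                          # Eq. (29)
macro_f1 = f1_score(y_true, y_pred, average="macro")          # Eq. (34): unweighted mean over classes
micro_f1 = f1_score(y_true, y_pred, average="micro")          # Eqs. (35)-(36): global TP/FP/FN counts
weighted_f1 = f1_score(y_true, y_pred, average="weighted")    # Eq. (37): support-weighted mean
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
```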

5.2.2 Other evaluation aspects

Since two limitations of GNN-based models are time and memory consumption, in addition to the commonly used quantitative performance comparison, reporting and comparing the GPU or CPU memory consumption and the training time efficiency of the proposed models is also adopted by many related studies to demonstrate their practicality in real-world applications. In addition, based on the novelties of the various models, specific evaluation methods are conducted to demonstrate the proposed contributions.

  • Memory consumption (Ding et al. 2020; Huang et al. 2019; Liu et al. 2021) reports the memory consumption of different models to comprehensively evaluate the proposed models from the computational efficiency aspect.

  • Time measurement (Ragesh et al. 2021; Pasa et al. 2021) compares the training time of the proposed models and baselines on different benchmarks. Given doubts about the efficiency of applying GNNs to text classification, it is an effective way to demonstrate that the models can balance performance and time efficiency.

  • Parameter sensitivity is commonly conducted in GNN studies to investigate the effect of different hyperparameters, e.g. varying sliding window sizes or embedding dimensions, and to present the model sensitivity via line charts, e.g. Linmei et al. (2019); Ding et al. (2020); Liu et al. (2021).

  • Number of labelled documents is a widely adopted evaluation method for GNN-based text classification models (Li et al. 2021; Wang et al. 2020; Linmei et al. 2019; Mei et al. 2021; Yao et al. 2019; Ragesh et al. 2021; Ding et al. 2020), which analyses the performance trend under different proportions of training data to test whether the proposed model can work well with limited labelled training data.

  • Vocabulary size is similar to the number of labelled documents, but it investigates the effect of using different vocabulary sizes during the GNN training stage, as adopted by Wang et al. (2020).

5.2.3 Metrics summary

For general text classification tasks, accuracy, precision, recall, and the various F1 measures are commonly used evaluation metrics for comparison with other baselines. However, for GNN-based models, reporting classification performance alone cannot effectively reflect the multiple aspects of the proposed models. Hence, many papers conduct additional evaluations to analyse GNN-based classifiers from multiple views, including time and memory consumption, model sensitivity and dataset quantity.

6 Performance

Table 5 Performance Table

While different GNN text classification models may be evaluated on different datasets, some datasets are commonly used across many of these models, including 20NG, R8, R52, Ohsumed and MR. The accuracy of various models on these five datasets is presented in Table 5. Some of the results are reported as the average accuracy and standard deviation over ten runs, while others only report the average accuracy. Several conclusions can be drawn:

  • Models that use external resources usually achieve better performance than those that do not, especially models with BERT and RoBERTa (Lin et al. 2021; Ye et al. 2020).

  • Under the same setting, such as using GloVe as the external resource, Corpus-level GNN models (e.g. TG-Transformer (Zhang and Zhang 2020), TensorGCN (Liu et al. 2020)) typically outperform Document-level GNN models (e.g. TextING (Zhang et al. 2020), TextSSL (Piao et al. 2021)). This is because Corpus-level GNN models can work in a transductive way and make use of the test input, whereas Document-level GNN models can only use the training data.

  • The advantage of Corpus-level GNN models over Document-level GNN models only applies to topic classification datasets and not to sentiment analysis datasets such as MR. This is because sentiment analysis depends on the order of words in a text, which most Corpus-level GNN models cannot capture.

7 Challenges and future work

7.1 Model performance

With the development of pre-training, pre-trained models (Devlin et al. 2018; Liu et al. 2019) and prompt learning methods (Gao et al. 2021; Liu et al. 2021) achieve great performance on text classification. GNN-based text classification models that do not follow this pre-training style cannot reach such good performance. For both corpus-level and document-level GNN text classification models, researching how to combine GNNs with these pre-trained models to further improve performance is a promising direction for future work. Meanwhile, more advanced graph models can be explored, e.g. more heterogeneous graph models on word and document graphs, to improve model performance.

7.2 Graph construction

Most GNN text classification methods use a single, static-valued edge type to construct graphs based on document statistics. This applies to both corpus-level and document-level GNNs. However, to better explore the complex relationships between words and documents, more dynamic edges and hyperedges can be utilized. Dynamic edges can be learned from various sources, such as the graph structure, document semantic information, or other models, as sketched below, and hyperedges can provide a more expressive representation of the complex relationships between nodes in the graph.
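
As one illustration of what such dynamic edges could look like, the PyTorch sketch below learns input-dependent edge weights from node semantics with a GAT-style scoring function. The module name, the masking scheme, and the simple weighted aggregation are assumptions made for illustration, not a method proposed in the surveyed papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicEdgeLayer(nn.Module):
    """Learn dynamic (input-dependent) edge weights from node features.

    A GAT-style sketch: a candidate edge mask (e.g. from co-occurrence) restricts
    which pairs may connect, and the actual edge weights are learned end-to-end.
    """

    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim, bias=False)
        self.attn = nn.Linear(2 * hidden_dim, 1, bias=False)

    def forward(self, x, mask):
        # x:    (N, in_dim) node features, e.g. word embeddings
        # mask: (N, N) 1 where an edge is allowed, 0 otherwise;
        #       should include self-loops so every row has at least one allowed edge
        h = self.proj(x)                                                  # (N, H)
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)      # (N, N, 2H)
        scores = self.attn(pairs).squeeze(-1)                             # (N, N) learned edge scores
        scores = scores.masked_fill(mask == 0, float("-inf"))
        edge_weights = F.softmax(scores, dim=-1)                          # dynamic edges
        return edge_weights @ h                                           # weighted neighbourhood aggregation
```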

7.3 Application

While corpus-level GNN text classification models have demonstrated good performance without using external resources, these models are mostly transductive. To apply them in real-world settings, an inductive learning approach should be explored. Although some inductive corpus-level GNNs have been introduced, the large amount of space required to construct the graph and the inconvenience of incremental training still present barriers to deployment. Improving the scalability of online training and testing for inductive corpus-level GNNs represents a promising area for future work.

8 Conclusion

This survey article introduces how graph neural networks have been applied to text classification in two different ways, corpus-level GNN and document-level GNN, with a detailed structural figure. The details of these models have been introduced and discussed, along with the datasets commonly used by these methods. Compared with traditional machine learning and sequential deep learning models, graph neural networks can explore the relationships between words and documents in the global structure (corpus-level GNN) or within the local document (document-level GNN) and perform well. A detailed performance comparison is conducted to investigate the influence of external resources, model learning methods, and the types of datasets. Furthermore, we outline the challenges for GNN text classification models and potential future work.